
CHAPTER 2 Data - National Chiao Tung Universityspeed.cis.nctu.edu.tw/~ydlin/course/cn/mcn_writeup/1-in-1-old/... · Chapter 2 Data-link layer ... A technique called byte- or bit-stuffing




Chapter 2 Data-link layer Problem Statement

To transmit data effectively and efficiently over physical links from one node to one or more others, there is much more to do than simply modulating a bit stream into signals. Transmission impairments, such as crosstalk between two adjacent wire pairs, can unexpectedly change the transmitted signal and hence result in errors. The transmitter may transmit faster than the receiver can handle. If multiple stations share a common transmission medium, an arbitration mechanism is required to determine who may transmit. The transmitter also has to somehow indicate the destination, and usually needs to identify itself so that the receiver knows the source. These problems are addressed by a set of functions above the physical layer. In the Open Systems Interconnection (OSI) seven-layer model, a specific layer, named the data-link layer, provides the service of controlling data communications over a physical link. This layer provides solutions to the above problems; in addition, it exempts upper layers from the duty of controlling the parameters of the physical network. These services greatly simplify upper-layer protocol design and make it virtually independent of physical transmission characteristics.

Throughout this chapter, we intend to equip readers with fundamental background about (1) services and functions provided in the data-link layer, (2) real-world examples of popular data-link protocols, and (3) open source implementation in Linux.

Frankly, there are too many real-world examples to choose from. Some are legacy or much less popular nowadays; some are in the mainstream; still others are under development. It is nearly impossible to enumerate all of them, so we subjectively offer a list of well-known data-link protocols in Table 2.1. Among these protocols, we introduce PPP because it is widely used in dial-up services. Network devices, say routers, also run PPP to carry various network-layer protocols over point-to-point links among them. Ethernet technology occupies more than 95 percent of all local area networks, and it is also poised to become ubiquitous in the MAN and WAN; it is undoubtedly a technology we have to know. Wireless links allow greater mobility to make life easy. More and more devices, such as notebooks, Personal Digital Assistants (PDAs), cellular phones, and so on, are equipped with the


capability to access the Internet. In contrast with desktop PCs, which usually use wired links, these devices are mobile, and hence wireless links are preferred. In this chapter we choose one typical example of a wireless local area network, IEEE 802.11, and one of a wireless personal area network, Bluetooth.

                     PAN/LAN/MAN                        WAN
Legacy or minor      Token Bus (802.4), Token Ring      X.25, Frame Relay,
                     (802.5), DQDB (802.6), HIPPI,      ATM
                     SMDS, Fibre Channel,
                     Isochronous (802.9), Demand
                     Priority (802.12), ATM, FDDI,
                     ISDN
Mainstream or        Ethernet (802.3), Resilient        Point-to-Point
under development    Packet Ring (802.17), Wireless     Protocol (PPP),
                     PAN/LAN/MAN (802.15/11/16),        HDLC, DOCSIS
                     Bluetooth, HIPERLAN, HomeRF

Table 2.1 Well-known data-link protocols

Section 2.1 provides a general introduction to the functions of the data-link layer: framing, addressing, error control, flow control, and medium access control. We primarily explain the why and how in this section, and leave technical details of specific protocols to later sections where possible. Section 2.2 introduces the Point-to-Point Protocol (PPP), a standard protocol that carries multi-protocol upper-layer packets over a point-to-point link. We present the open-source implementation so that readers can see how the protocol operates in a real system. Section 2.3 then introduces the dominant LAN technology, Ethernet. Having evolved for more than twenty years, Ethernet is rich in physical specifications; however, this section focuses on its data-link-layer functions and leaves the physical details to further reading. We also provide open Verilog code so that readers can become familiar with the implementation. Section 2.4 discusses wireless LANs. The nature of wireless media, such as mobility and lack of reliability, imposes design considerations different from those of wired media. Two typical examples, IEEE 802.11 and the Bluetooth technology, are introduced in this section. Section 2.5 illustrates general concepts of device drivers in Linux. We go deeply into the Ethernet and PPP driver implementations, and list a map indicating the source code of the other drivers for readers to


study further. In Section 2.6, we point out common pitfalls and fallacies. Since there is much more to learn than this book can cover, we refer readers to further reading in Section 2.7.

2.1 General Issues

We have presented possible problems of physical communication in the prelude. Sandwiched between the physical layer and the network layer, the data-link layer provides control over physical communications and services to the upper network abstraction. The major functions in this layer that address these problems include:

Major Functions

Framing: Control information comes along with the bit stream itself to specify the destination node, indicate the upper-layer protocol, check for possible errors, and so on. For convenience, data are sent and processed in units of frames. A typical frame contains two main parts: control information and data. The control information is consulted during frame processing by the data-link protocols. The data part comes from upper layers and is encapsulated with the control information into a whole frame. The data-link layer service should somehow delimit the bit stream into frames and convert frames back into a bit stream. Notice that the two terms, packets and frames, are often used interchangeably; to be specific, here we refer to the data unit of the data-link layer as a frame.

Addressing: We need an address when writing a letter to our friends, and a phone number when dialing them up. Addressing is needed for the same reason in the data-link layer. The identities of the stations involved are indicated by addresses, often presented in numeric form of some length.

Error control: Data transmitted over physical media are subject to errors. The errors must be detected by the receiver, which may somehow inform the transmitter so that the transmitter knows to retransmit the data.

Flow control: The transmitter may send at a rate faster than the receiver can handle. In this situation, the receiver has to discard frames, forcing the transmitter to retransmit the dropped frames. However, this is inefficient. Flow control provides a way for the receiver to slow down the transmitter.

Medium access control: There must be an arbitration mechanism when multiple stations want to transmit data over a shared medium. For a good arbitration mechanism, access to the shared medium should be fair, and the


utilization of the shared medium must remain high when many stations intend to transmit simultaneously.

This section has raised the general functions of the data-link layer. After these preliminaries, we exemplify their operation in popular data-link protocols in later sections.

2.1.1 Framing

Frame Delimiting

Because data are transmitted as a raw bit stream in the physical layer, the data-link layer must somehow tell where a frame begins and ends. It must also convert frames into a raw bit stream for physical transmission. This is called framing. There are many ways to delimit frames. Depending on the basic unit of a frame, which can be a byte (or octet) or a bit (called byte-oriented or bit-oriented frames, respectively), we may use special sentinel characters or a special bit pattern to mark the frame boundary. We introduce how framing is achieved with examples of bit-oriented HDLC frames and legacy byte-oriented BISYNC frames. There are still other ways to delimit frames. For example, some Ethernet systems use special physical encoding to mark the frame boundary, while others identify the boundary simply by the presence or absence of a signal [1].

A bit-oriented frame can specify a special bit pattern, say 01111110 in HDLC, while a byte-oriented frame can use special characters, say SOH (start of header) and STX (start of text), to mark the beginning of the frame header and data. An ambiguity arises when normal data characters or bits happen to be the same as the special characters or pattern. A technique called byte- or bit-stuffing is used to resolve the ambiguity, as illustrated in Fig. 2.1. In a byte-oriented frame, a special escape character, DLE (data link escape), precedes a special character to indicate that the next character is normal data. Of course, DLE itself is also a special character, so two consecutive DLEs represent a normal data character with the same value as DLE. For bit-oriented frames, the flag pattern 01111110 is used in HDLC. Whenever five consecutive 1's appear in the normal data bits, a 0 is stuffed after them so that the pattern 01111110 never appears within the data. Both the transmitter and the receiver follow the same rule, and hence the ambiguity is resolved.
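The bit-stuffing rule is simple enough to sketch in a few lines. The following is an illustrative sketch of ours, not taken from any particular implementation:

```python
def bit_stuff(bits: str) -> str:
    """Insert a 0 after every run of five consecutive 1's (HDLC rule)."""
    out, run = [], 0
    for b in bits:
        out.append(b)
        run = run + 1 if b == "1" else 0
        if run == 5:          # five 1's in a row: stuff a 0
            out.append("0")
            run = 0
    return "".join(out)

def bit_unstuff(bits: str) -> str:
    """Remove the 0 that follows every run of five consecutive 1's."""
    out, run, skip = [], 0, False
    for b in bits:
        if skip:              # this bit is a stuffed 0; drop it
            skip, run = False, 0
            continue
        out.append(b)
        run = run + 1 if b == "1" else 0
        if run == 5:
            skip = True
    return "".join(out)

data = "0111111011111010"
stuffed = bit_stuff(data)
assert "111111" not in stuffed        # six 1's in a row can no longer occur
assert bit_unstuff(stuffed) == data   # the receiver recovers the original bits
```

Because six consecutive 1's can never appear in the stuffed data, the flag 01111110 is unambiguous on the wire.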

A different approach is used in Ethernet. For example, 100BASE-X can use special encoding to mark the boundary because after 4B/5B encoding (see Section 1.1.1), there are 32 (= 2^5) code groups that can be transmitted over the physical

[1] Ethernet uses the term 'stream' to refer to the physical encapsulation of a frame. Strictly speaking, special encoding or the presence of a signal delimits a stream, not a frame. However, we do not bother with the details here.


media, while only 16 of them represent actual data. The other code groups can be used as control codes. These codes are uniquely recognizable by the receiver and hence delimit a frame within the bit stream. Some other Ethernet systems, say 10BASE-T, have no signal between frames; they can recognize the frame boundary simply by the presence or absence of a signal.

Figure 2.1 (a) Byte stuffing and (b) bit stuffing

Frame Format

A frame is divided into fields that include various kinds of control information for processing, plus the data from the network layer. Note that the data from the network layer contain both control information of higher layers and the actual data; higher-layer control information is not interpreted and is treated as normal data by the data-link layer. Typical control fields, other than the data field, are listed below:

Address: usually indicates the source or destination address. The receiver knows the frame is for it if the destination address matches its own. It can also respond to the source by filling the destination address of the outgoing frame with the source address of the incoming frame.

Length: may indicate the length of the whole frame or that of the data field.

Type: the type of network-layer protocol is encoded in this field. The data-link protocol reads this code to determine which network-layer module, say Internet Protocol (IP), should be invoked to process the data field further.

Error detection code: a function of the content of the frame. The transmitter

[Figure 2.1: (a) a byte-oriented frame delimited by SOH (start of header), DLE STX (start of text), and ETX (end of text), with DLE doubled to escape data bytes equal to DLE; (b) a bit stream beginning with the flag 01111110, in which a 0 bit is stuffed after each run of five consecutive 1's.]


computes the function and embeds the resulting value in the frame to be transmitted. Upon receiving the frame, the receiver computes the function in the same way to see whether the two results match. If they do not, the content was changed during transmission. Two common functions are the checksum and the Cyclic Redundancy Check (CRC).

2.1.2 Addressing

Global or Local Address

An address is an identifier that distinguishes one station from another in communications. Although a name is easier to remember, a numerical address is more compact for lower-layer protocols, such as those in the data-link layer. We leave the concept of a name as an identifier to Chapter 5 (see Domain Name System).

An address can be globally unique or locally unique. A globally unique address is unique worldwide, while a locally unique address is unique only within a local site. In general, a locally unique address consumes fewer bits but requires the administrator's effort to ensure it is locally unique. Since the bit overhead of an address is trivial, globally unique addresses are preferred nowadays: the administrator simply adds a station at will, without worrying about address conflicts.

Address Length

How long should an address be? A longer address takes more bits to transmit and is harder to refer to or remember. On the contrary, a short address may not be enough for global uniqueness. For a locally unique address, 8 or 16 bits should be enough; for a globally unique address, many more bits are necessary. A very popular addressing format adopted by IEEE 802 is 48 bits long. We leave it as an exercise for readers to discuss whether 48 bits are enough.

IEEE 802 MAC Address

We introduce a popular data-link address format specified in the IEEE 802 standards. It is an excellent example because it is widely adopted in many data-link protocols, including Ethernet, Fiber Distributed Data Interface (FDDI), Token Ring, wireless LANs, etc.

While IEEE 802 specifies the use of either 2-byte or 6-byte addresses, most implementations adopt 6-byte (48-bit) addresses. To make sure an address is globally unique, it is partitioned into two main parts:


Organization-Unique Identifier (OUI) and Organization-Assigned Portion, each occupying three bytes. The OUI part is administered by the IEEE. A manufacturer can contact the IEEE to apply for its own OUI [2] (see http://standards.ieee.org) for a fee, and is then in charge of the uniqueness of the Organization-Assigned Portion. In theory, there are in total 2^48 (around 2.8 x 10^14) addresses that can be assigned. This number is large enough for global uniqueness. The format of the address is illustrated in Fig. 2.2.

Figure 2.2 IEEE 802 address format
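As a quick illustration of this format (a sketch of ours, not part of any standard API): because Ethernet transmits each byte least significant bit first, the "first bit transmitted" is the low-order bit of the first address byte, so a multicast address can be recognized by testing that bit:

```python
def is_multicast(mac: str) -> bool:
    """Ethernet sends each byte LSB first, so the first bit on the wire is
    the low-order bit of the first byte; a 1 there means multicast."""
    first_byte = int(mac.replace(":", "-").split("-")[0], 16)
    return bool(first_byte & 0x01)

assert is_multicast("ff-ff-ff-ff-ff-ff")        # broadcast: all bits are 1
assert not is_multicast("00-32-4f-cc-30-58")    # an ordinary unicast address
assert is_multicast("01:00:5e:00:00:01")        # an IP multicast group address
```

The same test in hardware is cheap, which is one reason the unicast/multicast flag was placed in the first bit transmitted: a receiver can decide whether to use group filtering before the rest of the address has even arrived.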

The first bit in transmission order is reserved to indicate whether the address

is unicast or multicast [3]. A unicast address is destined for a single station, while a multicast address is destined for a group of stations. A special case of multicast is broadcast, in which all bits of the address are 1's; it is destined for all stations as far as a frame can reach in the data-link layer. Also note that the transmission order of the bits within each byte of the address may differ from the order stored in memory. In Ethernet, the transmission order is least significant bit (LSB) first in each byte, called little-endian. In other protocols, such as FDDI and Token Ring, the transmission order is most significant bit (MSB) first in each byte, called big-endian. The address is often written in hexadecimal form, separated by dashes or colons, e.g., 00-32-4f-cc-30-58.

2.1.3 Error control

Frames are subject to errors during transmission. The errors should be detected, and the transmitter may be asked to retransmit the frame. In this subsection, we introduce how to detect errors and what actions follow when errors are detected.

In error detection, the transmitter appends additional bits, computed as a function of the frame content, to the frame. The receiver performs exactly the same calculation on the frame content to see whether the two results match. If there is a

[2] See http://standards.ieee.org/regauth/oui/oui.txt for how OUIs have been assigned. [3] The second bit can indicate whether the address is globally unique or locally unique. It is seldom used, so we ignore it here.



mismatch, the frame is considered to have suffered errors during transmission. We illustrate two common error-detection functions: the checksum and the cyclic redundancy check (CRC).

Error Detection Code

The checksum computation simply divides the frame content into blocks of m bits and takes the sum of these blocks. Another powerful technique is the cyclic redundancy check. Although it is slightly more complicated than the checksum, it is very easy to implement in hardware. Suppose there are m bits in the frame content. The transmitter generates a sequence of k bits, the frame check sequence (FCS), so that the whole frame, having m+k bits, is divisible by a predetermined bit pattern, called the generator. The receiver divides in the same way and checks whether the remainder is zero. A nonzero remainder indicates errors during transmission.
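As a concrete illustration of the checksum idea, the following sketch uses the 16-bit one's-complement variant adopted by the Internet protocols (an illustrative sketch of ours, not any protocol's reference code):

```python
def checksum16(data: bytes) -> int:
    """16-bit one's-complement checksum (the variant used by IP/TCP/UDP):
    sum the 16-bit blocks, fold carries back in, then complement."""
    if len(data) % 2:                # pad an odd-length message with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # end-around carry
    return ~total & 0xFFFF

msg = b"\x12\x34\x56\x78"
fcs = checksum16(msg)
# The receiver sums the same blocks plus the checksum; an error-free
# frame yields 0xFFFF (all ones).
total = 0
for i in range(0, len(msg), 2):
    total += (msg[i] << 8) | msg[i + 1]
    total = (total & 0xFFFF) + (total >> 16)
assert (total + fcs) & 0xFFFF == 0xFFFF
```

Note that the sum is unchanged if two 16-bit blocks are swapped, so the checksum misses some error patterns; this weakness is one reason data-link protocols prefer the CRC.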

We show the CRC procedure to generate the FCS with the following example:

frame content F = 11010001110 (11 bits)
generator B = 101011 (6 bits)
FCS = (5 bits, to be computed)

The procedure goes as follows:

Step 1: Shift F left by 5 bits (i.e., multiply it by 2^5), appending five 0's and yielding 1101000111000000.

Step 2: Divide the result of Step 1 by B. (Notice that the computation is entirely in modulo-2 arithmetic.) The process is as follows:

              11100000111
         ________________
101011 ) 1101000111000000
         101011
         ------
          111110
          101011
          ------
           101011
           101011
           ------
            000000
                110000
                101011
                ------
                 110110
                 101011
                 ------
                  111010
                  101011
                  ------
                   10001   <- the remainder (FCS)


Step 3: The remainder 10001 is appended to the original frame content, yielding the frame 1101000111010001, which is then transmitted. The receiver divides the incoming frame by the generator 101011 to verify it. We leave the verification on the receiver side as an exercise.
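The three steps can be reproduced in software. The following sketch (our own, for illustration) performs the modulo-2 division with integer XORs and checks it against the worked example:

```python
def mod2_div(bits: str, generator: str) -> str:
    """Remainder of bits / generator in modulo-2 (XOR) arithmetic."""
    g, k = int(generator, 2), len(generator) - 1
    rem = int(bits, 2)
    for shift in range(len(bits) - k - 1, -1, -1):
        if rem >> (shift + k) & 1:     # leading bit is 1: "subtract" (XOR)
            rem ^= g << shift
    return format(rem, f"0{k}b")

def make_fcs(frame: str, generator: str) -> str:
    """Steps 1 and 2: append k zeros, divide, and keep the remainder."""
    return mod2_div(frame + "0" * (len(generator) - 1), generator)

fcs = make_fcs("11010001110", "101011")
assert fcs == "10001"                  # matches the worked example
# Step 3: transmit content + FCS; the receiver's division leaves remainder 0
assert mod2_div("11010001110" + fcs, "101011") == "00000"
```

The receiver-side check is the same division applied to the whole received frame: any nonzero remainder flags an error.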

With careful design of the generator, the CRC is mathematically proven to detect many kinds of errors, including:
1. any single-bit error
2. any double-bit error
3. any burst error whose burst length is less than the length of the FCS

The CRC computation can easily be implemented in hardware with exclusive-OR gates and shift registers. Suppose we write the generator in the form a_n a_(n-1) a_(n-2) ... a_1 a_0, where the bits a_n and a_0 are required to be 1. A general circuit architecture that implements the CRC computation is plotted in Fig. 2.3. The frame content is shifted into this circuit bit by bit, and the final bit pattern in the shift registers is the FCS, i.e., C_(n-1) C_(n-2) ... C_1 C_0, assuming the registers are cleared to zero before the computation begins. For very high-speed networking, circuits that compute CRC bits in parallel have been designed to speed up the computation.

Figure 2.3 CRC circuit diagram
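A bit-serial software model of this circuit (an illustrative sketch; variable names follow the figure's notation) shows that shifting in the frame content indeed leaves the FCS in the registers:

```python
def crc_shift_register(frame_bits: str, generator: str) -> str:
    """Bit-serial model of the shift-register CRC circuit: the feedback bit
    is the incoming frame bit XOR C(n-1); when it is 1, the generator taps
    a(n-1)..a0 are XORed into the registers as they shift."""
    n = len(generator) - 1             # number of registers C0..C(n-1)
    taps = int(generator[1:], 2)       # a(n-1)...a0 with the leading a(n) dropped
    reg = 0                            # registers cleared before computation
    for b in frame_bits:
        feedback = int(b) ^ (reg >> (n - 1) & 1)
        reg = (reg << 1) & ((1 << n) - 1)   # shift; C(n-1) leaves the register
        if feedback:
            reg ^= taps
    return format(reg, f"0{n}b")

# Shifting in the 11-bit frame content of the worked example leaves the FCS
assert crc_shift_register("11010001110", "101011") == "10001"
```

Note that the circuit needs no explicitly appended zeros: feeding the feedback from C(n-1) back into the taps implicitly multiplies the message by 2^k before dividing.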

Error Control Approaches

The receiver can respond to the error state of an incoming frame in the following ways:
1. silently discard it
2. send a positive acknowledgement when the incoming frame is correct
3. send a negative acknowledgement when the incoming frame is incorrect

The transmitter may retransmit a frame that was incorrectly received, or simply ignore the errors. In the latter case, higher-layer protocols, say TCP, can handle the retransmission.

2.1.4 Flow control

Flow control addresses the problem of a fast transmitter and a slow receiver.


When the receiver is overwhelmed, flow control provides a way for the receiver to tell the transmitter, "Hey! You transmit too fast. Please wait!" The simplest method is called stop-and-wait: the transmitter transmits one frame, waits for the acknowledgement from the receiver, and then transmits the next. This method results in very low utilization of the transmission link.

Sliding Window Protocol

An improvement is the sliding window protocol. The transmitter can transmit a window of frames without acknowledgements. As acknowledgements return from the receiver, the transmitter can move the window forward and transmit more frames. To match each outgoing frame with a returned acknowledgement, each frame is labeled with a sequence number. The range of sequence numbers should be large enough that a sequence number does not reappear too soon; otherwise ambiguity arises, since we have no means to tell whether a sequence number represents an old or a new frame.
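The transmitter-side bookkeeping of a sliding window can be sketched as follows (an illustrative sketch; class and method names are ours):

```python
class SlidingWindowSender:
    """Transmitter-side bookkeeping only: frames within the window may be
    sent without waiting; an acknowledgement slides the window forward."""
    def __init__(self, window_size: int):
        self.window_size = window_size
        self.base = 0          # oldest unacknowledged frame
        self.next_seq = 0      # sequence number of the next frame to send

    def can_send(self) -> bool:
        return self.next_seq < self.base + self.window_size

    def send(self) -> int:
        assert self.can_send()
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def ack(self, next_expected: int) -> None:
        """The receiver acknowledges all frames before next_expected."""
        self.base = max(self.base, next_expected)

# Window of 9; 4 frames transmitted, then the first 3 are acknowledged
s = SlidingWindowSender(9)
for _ in range(4):
    s.send()
s.ack(3)                                # window slides forward by 3
assert s.next_seq - s.base == 1         # one frame still outstanding
assert s.base + s.window_size - s.next_seq == 8  # 8 frames may now be sent
```

Before the acknowledgement only 5 more frames could be sent; after it, 8 can, i.e., the window has slid forward by 3.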

For example, suppose the window size of the transmitter is 9, meaning the transmitter can have up to 9 unacknowledged frames outstanding. Suppose the transmitter has transmitted 4 frames and receives an acknowledgement indicating that the first three frames were successfully received. The window then slides forward by 3, so 3 more frames can be transmitted than before. Sliding window flow control is also a very important technique in the Transmission Control Protocol (TCP); we strongly recommend that readers pay attention to the related discussion in Chapter 4.

Other Approaches

There are still other methods of flow control. For example, the mechanisms in Ethernet include back pressure and the PAUSE frame. However, understanding these methods requires knowledge of how the protocols operate, so we leave these flow control techniques to later sections.

2.1.5 Medium access control

Medium access control, often simply referred to as MAC, is needed when multiple stations share a common physical medium. It is an arbitration mechanism that every station must obey so that the common medium is shared fairly and efficiently. There are plenty of techniques to do so; we summarize them into three categories below.

Contention-based Approach


Multiple stations contend for the use of the medium in this approach. A classical example is ALOHA: stations transmit at will, and if two or more stations transmit at the same time, called a collision, their frames are garbled, making the throughput low. A refinement is slotted ALOHA, in which a station may begin transmitting only at the beginning of a time slot. Further refinements include carrier sense and collision detection. Carrier sense means a station senses whether there is a transmission (a signal called the carrier) on the shared medium; the transmitter politely waits until the shared medium is free before it transmits. Collision detection saves the transmitter from completing a garbled frame by stopping the transmission as soon as a collision is detected.

Round-robin Approach

The most typical examples are Token Ring, Token Bus, and FDDI. Their mechanisms are similar although their topologies differ. A token circulates from station to station to allow a fair share of the medium; a station that holds the token has the right to transmit its frames.

Reservation-based Approach

A contention-based approach is inefficient if a collision cannot be detected in time: a frame is completely garbled before the transmitter is aware of the tragedy. Another approach is to reserve before transmitting. The channel is somehow reserved before the transmitter actually transmits its frame. A well-known example is the RTS/CTS mechanism in IEEE 802.11 wireless LANs. We will say more about this mechanism in Section 2.4.

There is a tradeoff in the use of reservation. The reservation process itself is an overhead; if the cost of a frame loss is small, e.g., for a short frame, a contention-based approach may be more efficient.

If only two stations are on a point-to-point link, medium access control may not be necessary, depending on the medium characteristics. In such a situation, both stations can transmit at the same time, which we call full-duplex operation. We will say more about full-duplex operation in Section 2.3.

2.2 Point-to-point protocol

Starting from this section, we look into real-world protocols to see how the principles introduced in Section 2.1 work in them. This section focuses on the Point-to-Point Protocol (PPP), a widely used protocol that we often encounter when we dial


up a modem or use ADSL to reach the Internet. In explaining the protocol operation, we emphasize the data-link-layer characteristics: framing, addressing, error control, and flow control. PPP was derived from an old but widely used protocol, High-level Data Link Control (HDLC). Its operation involves two protocols: the Link Control Protocol (LCP) and the Network Control Protocol (NCP). As Ethernet extends to homes and organizations, with a bridging device such as an ADSL modem connected to the Internet Service Provider (ISP), there is a need for PPP over Ethernet (PPPoE). Fig. 2.4 shows the relationship between these components, which we introduce in this section.

Figure 2.4 Relationship of PPP-related protocols

2.2.1 High-level Data Link Control (HDLC)

The HDLC protocol is old, but it is the basis of many other data-link protocols. Derived from an early IBM protocol, the Synchronous Data Link Control (SDLC) protocol, it was later submitted to the ISO and became an ISO standard. For example, PPP uses HDLC-like framing; IEEE 802.2 Logical Link Control (LLC) is a modification of HDLC; and CCITT modified HDLC as part of the X.25 standard, called Link Access Procedure, Balanced (LAP-B). Among its variations, HDLC supports point-to-point and point-to-multipoint links, and half-duplex and full-duplex operation. To better understand how HDLC works in terms of the data-link functions we have mentioned, we first take a look at the HDLC operation.


HDLC Operation: Medium Access Control

In HDLC, stations are either primary or secondary stations. HDLC supports the following three transfer modes, which determine how stations are controlled to access the medium.

Normal response mode (NRM): The secondary station can only passively transmit data in response to the primary's poll. The response may consist of one or more frames. In a point-to-multipoint scenario, secondary stations must communicate through the primary station.

Asynchronous response mode (ARM): The secondary station can initiate data transfer without the primary's poll, but the primary is still responsible for controlling the connection.

Asynchronous balanced mode (ABM): Both parties can play the roles of both the primary and the secondary; that is, both stations have equal status. This type of station is called a combined station.

NRM is often used on point-to-multipoint links, such as those between a computer and its terminals. ARM is rarely used: it has advantages for a point-to-point link, but ABM is even better. ABM has less overhead, such as the primary's poll, and both parties have control over the link, so it is well suited to a point-to-point link. Having offered an impression of the HDLC operation, we go on to discuss the functional issues.

Data-link functions: Framing, Addressing, and Error Control

We examine the framing, addressing, and error control issues directly from the frame format; we then discuss flow control and medium access control. The HDLC frame format is depicted in Fig. 2.5.

Flag     Address  Control  Information  FCS  Flag
bits: 8  8        8        Any          16   8

Figure 2.5 HDLC frame format

Flag: The value is fixed at 01111110 to delimit the beginning and the end of the frame. As illustrated in Section 2.1.1, bit stuffing is used to avoid ambiguity between actual data and the flag value.
Address: The address indicates the secondary station involved in the transmission, particularly in a point-to-multipoint situation. A secondary station works under the control of the primary station, as we have mentioned in the HDLC operation.
Control: This field indicates the frame type as well as other control information. There are three types of frames in HDLC: Information, Supervisory, and Unnumbered. We will look at them in more detail later.
Information: The information field can be of arbitrary length in units of bits. It carries the payload of the data to be transmitted.
FCS: A 16-bit CRC-CCITT code is used.

HDLC allows both positive and negative acknowledgements, and its error control is complex. Positive acknowledgements can acknowledge a single successful frame or all frames up to a point, while negative acknowledgements can reject all frames from a given point or only a specified frame. We do not go into details about the scenarios in which these acknowledgements are employed; interested readers are encouraged to read further from our list in Section 2.7.

Data link functions: Flow Control

Flow control in HDLC is simple. The transmitter keeps a counter to record the sequence number of the next frame to be sent. On the other side, the receiver keeps a counter to record the expected sequence number and checks whether the sequence number received matches its expectation. If it matches and the frame is not garbled, the receiver increases its counter by one and acknowledges the sender by transmitting a message containing the next expected sequence number. If the received frame is not as expected, or an error is detected, the frame is dropped and a negative acknowledgement is sent back to the sender.

Frame Type

The above functions are achieved through various kinds of frames. An information frame, called an I-frame, carries data from the upper layer along with some control information. The control information includes two 3-bit sequence numbers: the frame's own sequence number and the acknowledged sequence number for the receiver. These sequence numbers serve flow-control and error-control purposes. A poll/final (P/F) bit is also in the control information, indicating a poll from the primary or the last response from the secondary.

A supervisory frame, called an S-frame, carries control information only. As we have seen in the illustration of the HDLC frame, both positive and negative acknowledgements are supported for error control. Once there is an error, the transmitter can either retransmit all outstanding frames or only the erroneous frame, as specified in the control information. The receiver can also ask the transmitter for a temporary stop with an S-frame.
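To make the two retransmission choices concrete, here is a minimal sketch (the function and frame-type names are ours, not HDLC code) of how a sender might react to the acknowledgement-related S-frame types: REJ triggers go-back-N, SREJ a selective retransmission, and RR acknowledges outstanding frames.

```python
def frames_to_resend(outstanding, s_frame, n):
    """outstanding: unacknowledged sequence numbers in send order.
    Returns the list of frames the sender must retransmit."""
    if s_frame == "REJ":            # go-back-N: frame n and everything after it
        i = outstanding.index(n)
        return outstanding[i:]
    if s_frame == "SREJ":           # selective reject: only the erroneous frame
        return [n]
    if s_frame == "RR":             # receive-ready: everything before n is acked
        return []
    raise ValueError("unknown S-frame type")
```

For example, with frames 2..5 outstanding, REJ(3) forces retransmission of 3, 4, and 5, while SREJ(3) retransmits only frame 3.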


An unnumbered frame, called a U-frame, is also used for control purposes, but it carries no sequence number, hence the name. U-frames include miscellaneous commands for mode setting, information transfer, and recovery, but we do not go into details here.
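Returning to framing: the bit stuffing used with the 01111110 flag (Section 2.1.1) can be sketched as follows. This is an illustrative sketch, not code from the text; the transmitter inserts a 0 after any five consecutive 1s, so the flag pattern can never occur inside stuffed data.

```python
def bit_stuff(bits):
    """Insert a 0 after every run of five consecutive 1s."""
    out, run = [], 0
    for b in bits:
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == 5:
            out.append(0)   # stuffed bit, removed again by the receiver
            run = 0
    return out

def bit_destuff(bits):
    """Drop the 0 that follows every run of five consecutive 1s."""
    out, run = [], 0
    it = iter(bits)
    for b in it:
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == 5:
            next(it, None)  # skip the stuffed 0
            run = 0
    return out
```

Since the stuffed stream never contains six 1s in a row, the 01111110 flag is unambiguous as a frame delimiter.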

2.2.2 Point-to-Point Protocol (PPP)

The PPP is a standard protocol defined by the IETF to carry multi-protocol packets over a point-to-point link. It is widely used for dial-up modems and leased lines. To carry multi-protocol packets, it has three main components:
1. An encapsulation method to encapsulate packets from the network layer.
2. A Link Control Protocol (LCP) to handle the cycle of connection setup, configuration, and tear-down.
3. A Network Control Protocol (NCP) to configure different network-layer options.

We first look at the PPP operation and then study its functions.

PPP Operation

The PPP operation works like this: First, PPP sends LCP packets to establish and test the connection. After the connection is set up, the peer may authenticate itself before any network-layer packets are exchanged. Then PPP sends NCP packets to configure one or more network-layer protocols. Once the configuration is done, network-layer packets can be sent over the link. The whole procedure is depicted in the phase diagram shown in Figure 2.6.

Figure 2.6 Phase diagram of PPP connection setup and tear-down

We explain each major transition in the diagram as follows:

Dead to Establish: The transition is invoked by carrier detection or by network administrator configuration to use a physical link.
Establish to Authenticate: The LCP sets up the connection by exchanging configuration packets. All options not negotiated are assumed to take their default values. Note that only options independent of the network layer are negotiated; options concerning network-layer configuration are left to the NCP.
Authenticate to Network: Authentication is optional in PPP. If it is required in the link establishment phase, the transition comes to the authentication phase. If the authentication fails, the connection is terminated; if it succeeds, the proper NCP starts to negotiate each network-layer protocol.
Network to Terminate: Termination happens in many situations, including loss of carrier, authentication failure, expiration of an idle connection, user termination, and so on. The LCP is responsible for exchanging Terminate packets to close the connection, and the PPP then tells the network-layer protocol to close.
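The transitions above can be modeled as a small table-driven state machine. This is an illustrative sketch; the state and event names are ours, loosely following the arrows in the phase diagram.

```python
# Events roughly matching the arrows in the PPP phase diagram (our naming).
TRANSITIONS = {
    ("Dead", "up"): "Establish",             # carrier detected / configured
    ("Establish", "open"): "Authenticate",   # LCP configuration done
    ("Authenticate", "success"): "Network",  # authentication passed
    ("Authenticate", "none"): "Network",     # authentication not required
    ("Authenticate", "fail"): "Terminate",   # authentication failed
    ("Network", "close"): "Terminate",       # idle timeout, user close, ...
    ("Terminate", "down"): "Dead",           # link torn down
}

def run_phases(events, state="Dead"):
    """Feed a sequence of events through the phase table."""
    for ev in events:
        state = TRANSITIONS[(state, ev)]
    return state
```

A full, successful session, run_phases(["up", "open", "none", "close", "down"]), traverses every phase and ends back in the Dead state.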

There are three classes of LCP frames: Configuration, Termination, and Maintenance. A pair of Configure-request and Configure-ack can open a connection. Options, such as the maximum receive unit or the authentication protocol, are negotiable during connection setup. The other functions are summarized in Table 2.2. The LCP frame is a special case of the PPP frame. Therefore, before we look at the LCP frame format, we first introduce the PPP frame format below.

Class          Type               Function
Configuration  Configure-request  Open a connection by giving desired changes to options
               Configure-ack      Acknowledge Configure-request
               Configure-nak      Deny Configure-request because of unacceptable options
               Configure-reject   Deny Configure-request because of unrecognizable options
Termination    Terminate-request  Request to close the connection
               Terminate-ack      Acknowledge Terminate-request
Maintenance    Code-reject        Unknown requests from the peer
               Protocol-reject    Unsupported protocol from the peer
               Echo-request       Echo back the request (for debugging)
               Echo-reply         The echo for Echo-request (for debugging)
               Discard-request    Just discard the request (for debugging)

Table 2.2 The LCP frame types

Data link functions: Framing, Addressing, and Error Control

The PPP frame is encapsulated in an HDLC-like format, as depicted in Fig. 2.7. Note that the flag value is exactly the same as in HDLC; it serves as the delimiter for framing.


Flag      Address   Control   Protocol  Information  FCS       Flag
01111110  11111111  00000011
bits: 8   8         8         8 or 16   Any          16 or 32  8

Figure 2.7 PPP frame format

The differences from an HDLC frame are summarized below:

1. The address is fixed at the value 11111111, which is the all-stations address in the HDLC format. Since there is only one peer in a point-to-point link, there is no need to indicate an individual station address at all.

2. The control code is fixed at the value 00000011, which corresponds to an unnumbered frame in the HDLC format. This implies that no sequence numbers or acknowledgements are used in PPP by default. RFC 1663 defines an extension to make the PPP connection reliable; interested readers are referred to that document.

3. A Protocol field is added to indicate what kind of network-layer protocol, say IP or IPX, the frame is carrying. The field length is 16 bits by default, but it can be reduced to 8 bits through LCP negotiation.

4. The maximum length of the Information field is 1500 bytes by default. This value is called the Maximum Receive Unit (MRU). Other values for MRU are negotiable.

5. A 16-bit FCS is used by default. Through the LCP negotiation, it can be extended to 32 bits. The receiver simply drops the received frame if an error is detected. The responsibility of retransmission falls on the upper-layer protocols.
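The default 16-bit FCS can be computed bit-serially. The sketch below follows the well-known table-free form of the FCS-16 algorithm from RFC 1662 (CRC-CCITT with the reflected polynomial 0x8408); the frame bytes used in the example are invented for illustration.

```python
def pppfcs16(data, fcs=0xFFFF):
    """Bit-serial PPP FCS-16 (RFC 1662 style): init 0xFFFF, poly 0x8408."""
    for byte in data:
        fcs ^= byte
        for _ in range(8):
            fcs = (fcs >> 1) ^ 0x8408 if fcs & 1 else fcs >> 1
    return fcs

# Transmit: complement the FCS and append it least-significant byte first.
frame = bytes([0xFF, 0x03, 0x00, 0x21]) + b"example payload"
fcs = pppfcs16(frame) ^ 0xFFFF
wire = frame + bytes([fcs & 0xFF, (fcs >> 8) & 0xFF])

# Receive: running the FCS over frame plus trailer yields the "good FCS"
# constant 0xF0B8; anything else means the frame is dropped.
assert pppfcs16(wire) == 0xF0B8
```

The same check-value trick means the receiver never has to extract and compare the trailer explicitly; one pass over the whole frame suffices.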

Data link functions: Flow Control and Medium Access Control

PPP is full-duplex, and there are only two stations on a point-to-point link, so no medium access control is necessary. On the other hand, PPP does not provide flow control; it, too, is left to upper-layer protocols.

LCP and NCP negotiation

The LCP frame is a PPP frame with the Protocol field equal to 0xc021, where 0x stands for a hexadecimal number. The negotiation information is embedded in the Information field as four main fields. They are Code to indicate the type of LCP, Identifier to match requests and replies, Length to indicate the total length of the four fields, and Data to carry the negotiation options.
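As a sketch of this layout, the four fields can be packed and parsed as below; the example Configure-request and its identifier value are hypothetical, though the MRU option encoding (type 1, length 4) follows the LCP option format.

```python
import struct

def parse_lcp(info):
    """Split the Information field of an LCP frame (Protocol 0xc021) into
    Code, Identifier, and Data. Length covers all four fields."""
    code, ident, length = struct.unpack("!BBH", info[:4])
    return code, ident, info[4:length]

# A hypothetical Configure-request (Code 1, Identifier 0x42) carrying one
# option: type 1 (Maximum-Receive-Unit), option length 4, MRU 1500.
pkt = struct.pack("!BBH", 1, 0x42, 8) + bytes([1, 4, 0x05, 0xDC])
code, ident, data = parse_lcp(pkt)
```

The peer matches the Identifier when it replies with a Configure-ack, -nak, or -reject for this request.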

Since IP is the dominating network-layer protocol in the Internet, we are particularly interested in IP over PPP. We introduce the NCP for IP, the Internet Protocol Control Protocol (IPCP), in the next subsection.

2.2.3 Internet Protocol Control Protocol (IPCP)

IPCP is a member of NCPs to configure IP over PPP. As mentioned in the last subsection, PPP first establishes a connection with LCP and then uses NCP to configure the network layer protocol it carries. Once these configurations are done, data packets can be transmitted over the link.

IPCP uses a similar frame format as the LCP. Its frame is also a special case of the PPP frame, with the Protocol field equal to 0x8021. The exchange mechanism is the same as that of the LCP. Through IPCP, IP modules on both peers can be enabled, configured, and disabled.

IPCP provides three configuration options: IP-Addresses, IP-Compression-Protocol, and IP-Address. The first is obsolete and has been replaced by the third. The second indicates the use of Van Jacobson's TCP/IP header compression. The third allows the peer to provide an IP address to be used on the local end. Once the IPCP negotiation is done, normal IP packets can be transmitted over the link in PPP frames with the Protocol field equal to 0x0021.

2.2.4 PPP: Open Source Implementation

Introduction

The Linux PPP implementation is primarily composed of two parts: a kernel part (the PPP driver) and a user-level part (the PPP daemon). In the past, PPP packages had to contain updated kernel drivers; this is no longer necessary, as the current 2.2 and 2.4 kernel sources contain up-to-date drivers. The Linux PPP implementation can be used both for initiating PPP connections (as a 'client') and for handling incoming PPP connections (as a 'server'). Note that this is an operational distinction, based on how the connection is created, rather than a distinction made in the PPP protocols themselves.

The PPP protocol consists of two parts: one is a scheme for framing and encapsulating packets; the other is a series of protocols, called LCP, IPCP, PAP, and CHAP, for negotiating link options and for authentication. Similarly, a PPP package consists of two parts: a PPP driver (in the Linux kernel) which handles PPP's low-level framing protocol, and a user-level program called pppd which implements PPP's negotiation protocols.

The PPP driver establishes a network interface and passes packets between


the serial port, the kernel network code, and pppd. It also handles the data-link layer issues (e.g., framing and error detection) described in the previous subsections. The pppd daemon negotiates with the peer to establish the link and sets up the PPP network interface. In addition, pppd includes support for authentication, so it can control which other systems may make a PPP connection and what IP addresses they may use. IP packets go directly to the kernel network code, so once pppd has negotiated the link, it in practice lies completely dormant until you want to take the link down, at which point it negotiates a graceful disconnect.

PPP Driver

A PPP driver is made of the PPP generic layer and the PPP channel driver.

Figure 2.8 presents the PPP architecture. There are asynchronous (/drivers/net/ppp_async.c) and synchronous (/drivers/net/ppp_synctty.c) PPP channel drivers in the Linux kernel. The asynchronous PPP channel driver is used for asynchronous serial ports, while the synchronous one is used for synchronous serial ports. Synchronous communication makes better use of bandwidth than asynchronous communication, and is about 30 percent faster in practice. For this reason, the two drivers differ: the synchronous driver implements no error control, leaving it to the hardware device, whereas the asynchronous driver does. Most PC serial devices, such as mice, keyboards, and modems, are asynchronous, whereas high-speed WAN adaptors are synchronous. Hence, asynchronous PPP enables Linux to route IP datagrams over telephone networks, and synchronous PPP enables Linux to route IP datagrams over dedicated leased lines. In the following, we explain ppp_synctty.c in Linux kernel 2.4.

Component           Function
pppd                handles control-plane packets
kernel              handles data-plane packets
PPP generic layer   handles the PPP network interface, the /dev/ppp device, VJ compression, and multilink
PPP channel driver  handles encapsulation, framing, and error control

(Architecture: pppd sits in user space; in the kernel, the PPP generic layer sits on top of the PPP channel driver, which sits on top of the tty device driver and the serial line.)

Figure 2.8 PPP architecture

There are two important data structures in this PPP driver: one is the PPP channel and the other is the PPP unit. A PPP channel provides a way for generic PPP code to send and receive packets. A PPP unit corresponds to a PPP network interface device and represents a multilink bundle. The table below lists some useful fields of these two data structures:

A flowchart (not reproduced here) shows the call relation among the outgoing flow functions; the table that follows describes them.

PPP channel field  Function
file               stuff for read/write
ops                operations for this channel
ppp                ppp unit we're connected to
clist              link in list of channels per unit

PPP unit field     Function
file               stuff for read/write
channel            list of attached channels
xmit_pending       a packet ready to go out
dev                network interface device

Fields of the two data structures


Function           Description
ppp_start_xmit     put the 2-byte PPP protocol number on the front of the skb
ppp_write          take out the file->private_data
ppp_file_write     allocate an skb, copy data from user space to the PPP channel or PPP unit
ppp_xmit_process   do any work queued up on the transmit side that can be done now
ppp_channel_push   send data out on a channel
ppp_send_frame     VJ compression
ppp_push           handles multilink
start_xmit         the channel's transmit operation; here, ppp_sync_send
ppp_sync_send      send a packet over a tty line
ppp_sync_tx_munge  encapsulation and framing
ppp_sync_push      push as much data as possible
tty->driver.write  write data to the tty device driver

Description of the outgoing flow functions

A flowchart (not reproduced here) shows the call relation among the incoming flow functions; the table below describes them.


Function                 Description
ppp_sync_receive         take out the tty->disc_data
ppp_sync_input           stuff the chars in the skb
process_input_packet     strip the address/control fields
ppp_input                take out the packets that should be in the channel queue
ppp_do_recv              check if the interface has closed down
ppp_receive_frame        decide if the received frame is a multilink frame
ppp_receive_nonmp_frame  VJ decompression if proto==PPP_VJC_COMP, and decide whether it is a control-plane or a data-plane frame
ppp_receive_mp_frame     reconstruction of multilink frames
netif_rx                 push packets into the queue for the kernel
skb_queue_tail           push packets into the queue for pppd

Description of the incoming flow functions

The multiple channels available with ISDN services motivated the development of multilink (bundle) PPP, as documented in RFC 1990. Multilink PPP arranges several independent connections between a fixed pair of endpoints


to function logically as one. For example, if a router has an ISDN BRI interface, it can transfer data at 64 Kb/s on one "B" channel, but in times of higher load it can connect a second "B" channel and so have an aggregate rate of 128 Kb/s. Multilink can also be used where there is a leased-line connection to a remote site: in times of increased load, an ISDN "B" channel can again be connected to temporarily increase throughput. This technique is not limited to ISDN; any number of PPP connections of varying speeds and different link types may be bundled together. However, a bundle must still connect the same two endpoints. The multilink procedure encodes PPP fragments within PPP frames. Each link in a bundle begins as an independent, standalone connection. Later negotiations establish the multilink option and uniquely identify the bundle a physical connection participates in. Once the bundle is active, the multilink procedure fragments, sequences, and reassembles logical PPP frames, as the figure below illustrates. Frames containing fragments have the protocol field value 0x003d. The logical frame size is limited by a negotiated maximum received reconstructed unit (MRRU). This value may be very large, since each fragment may have any size within the MRU established for its individual connection. However, practical upper limits do exist, due to the resources necessary to sort and assemble fragments, as well as to detect their loss.

pppd

2.2.5 PPP over Ethernet (PPPoE)

The Need of PPPoE

As Ethernet technology becomes cheap and dominant, it is not uncommon for users to have their own Ethernet LAN at home or in the office. Meanwhile, broadband access technologies, say ADSL, are developing rapidly as a

(Figure: a logical PPP frame, header plus data, is fragmented into smaller PPP frames, each carried over a different serial connection, here serial connections #1 and #2.)
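A sketch of the fragmentation step (illustrative only; it uses RFC 1990's short, 12-bit sequence-number header, with B and E flags marking the beginning and end fragments):

```python
def mp_fragment(payload, nfrags, seq=0):
    """Split a logical PPP frame into multilink fragments (protocol 0x003d).
    Each fragment gets a 2-byte header: B/E flags plus a 12-bit sequence."""
    size = -(-len(payload) // nfrags)  # ceiling division
    chunks = [payload[i:i + size] for i in range(0, len(payload), size)]
    frags = []
    for i, chunk in enumerate(chunks):
        flags = (0x80 if i == 0 else 0) | (0x40 if i == len(chunks) - 1 else 0)
        hdr = bytes([flags | ((seq >> 8) & 0x0F), seq & 0xFF])
        frags.append(hdr + chunk)
        seq = (seq + 1) % 4096
    return frags
```

Reassembly on the far end sorts fragments by sequence number and concatenates from a B-flagged fragment through the next E-flagged one, which is why lost fragments are detectable as sequence gaps.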


method of accessing the Internet from home or office. Users on an Ethernet LAN are likely to access the Internet through the same broadband bridging device at the same time. Service providers desire a method of access control and billing on a per-user basis, similar to conventional dial-up services.

PPP has conventionally been a solution for building a point-to-point relationship between peers. However, an Ethernet network consists of multiple stations by nature. The PPP over Ethernet (PPPoE) protocol is designed to reconcile these two conflicting philosophies. It creates a virtual interface on an Ethernet interface so that each individual station on a LAN can establish a PPP session, through common bridging devices, with a remote PPPoE server, known as an Access Concentrator (AC), located in the ISP. Each user on the LAN sees a PPP interface just like that in a dial-up service, but the PPP frames are actually encapsulated in Ethernet frames. Through PPPoE, the user's computer obtains an IP address, and the ISP has an easy way to tie the IP address to a specific user name and password.

PPPoE Operation

PPPoE runs in two stages: the Discovery stage and the PPP Session stage. In the Discovery stage, the MAC address of the access concentrator is discovered, and a unique PPPoE session id is assigned to the session. Once a PPP session is established, both peers enter the PPP Session stage and do exactly what a PPP session does, say LCP negotiation.

The Discovery stage proceeds in the following four steps:
1. The station that would like to access the Internet broadcasts an Initiation frame to ask remote access concentrators to return their MAC addresses.
2. The remote access concentrators respond with their MAC addresses.
3. The original station selects one access concentrator and sends a Session-Request frame to it.
4. The access concentrator generates a PPPoE session id and returns a Confirm frame carrying the id.

The PPP Session stage runs in the same way as a normal PPP session, as explained in Section 2.2.2, only carried in Ethernet frames. When the LCP terminates a PPP session, the PPPoE session is torn down as well. A new PPP session requires a new PPPoE session, starting from the Discovery stage.
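A sketch of the frames exchanged in the Discovery steps above. In RFC 2516 terms, the Initiation, Offer, Session-Request, Confirm, and Terminate frames are called PADI, PADO, PADR, PADS, and PADT; the header layout below follows that RFC, while the example session id is invented.

```python
import struct

# Discovery codes from RFC 2516: Initiation, Offer, Session-Request,
# Confirm, and Terminate, respectively.
PADI, PADO, PADR, PADS, PADT = 0x09, 0x07, 0x19, 0x65, 0xA7

def pppoe_header(code, session_id=0, payload=b""):
    """Version 1, type 1 (one byte, 0x11), code, session id, payload length."""
    return struct.pack("!BBHH", 0x11, code, session_id, len(payload)) + payload

# Step 1: the station broadcasts an Initiation; no session id exists yet.
padi = pppoe_header(PADI)
# Step 4: the Confirm carries the session id chosen by the concentrator.
pads = pppoe_header(PADS, session_id=0x1234)
```

In the PPP Session stage, the same header is reused with the negotiated session id, and the payload is simply the PPP frame.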

To terminate a PPP session, the normal PPP termination process is followed. PPPoE also allows an explicit Terminate frame, sent by either the initiating station or the access concentrator, to close a session. Once the Terminate frame is sent or received, no further frame transmission is allowed, even for normal PPP termination frames.

PPPoE: Open Source Implementation

2.3 Ethernet (IEEE 802.3)

Originally proposed by Bob Metcalfe in 1973, Ethernet was once merely one of several competing LAN technologies, and is now the winner. Over more than 20 years, Ethernet has been reinvented many times to accommodate up-to-date needs, resulting in the 1552-page IEEE 802.3 Standard. Even so, the story keeps rolling into the future, and new standards come up as time goes by. In this section, we invite you to appreciate the picture and philosophy of Ethernet, and we also cover the hot topics in current development. Enjoy it!

2.3.1 Ethernet development: A big picture

As the title of the standard, "Carrier sense multiple access with collision detection (CSMA/CD) access method and physical layer specifications," suggests, Ethernet is most distinguished from other LAN technologies, such as Token Bus and Token Ring, by its medium access method. A lab at Xerox gave birth to the method, which was later standardized by DEC, Intel, and Xerox in 1981 and known as the DIX Ethernet. Although this standard bore little resemblance to the original design at Xerox, the essence of CSMA/CD was preserved. In 1983, the IEEE 802.3 Working Group approved a standard based on the DIX Ethernet with only insignificant changes; it became the well-known IEEE 802.3 Standard. Since Xerox relinquished the trademark name "Ethernet," there is no distinction nowadays when we refer to Ethernet and the IEEE 802.3 Standard. In fact, the IEEE 802.3 Working Group has been leading the Ethernet development since the first version of the standard. The milestones in the Ethernet standards are illustrated in Fig. 2.8.


Figure 2.8 Milestones in the Ethernet Standards

Ethernet has experienced several significant revisions during the past 20

years. We list the major trends below.

From low to high speed: Starting from a prototype running at 3 Mb/s, Ethernet reached 10 Gb/s in 2002, a boost of more than 3000 times in speed. Astonishing as this development is, the technology has stayed cheap, making it widely accepted around the world. A gigabit Ethernet adapter broke the cost barrier of $100 in 2001. We can be almost sure that Ethernet will be ubiquitous.
From shared to dedicated media: The original Ethernet runs on a bus topology of coaxial cables, where multiple stations share the bus with the CSMA/CD MAC algorithm. Since the development of 10BASE-T, dedicated media between two devices have become the majority. Although not sufficient by themselves, dedicated media are necessary for the later development of full-duplex Ethernet. Full duplex allows both stations to transmit over the dedicated media simultaneously, which in effect doubles the bandwidth!
From LAN to MAN and WAN: Ethernet was well known as a LAN technology. Two factors help the technology move toward the MAN and WAN market. The first is cost: Ethernet is cheap to implement because of its simplicity, and interoperability takes less pain and money if the MAN and WAN are also Ethernet. The second comes from full duplex: full duplex eliminates the need for CSMA/CD and hence lifts the distance restriction imposed by that method, so data can be transmitted as far as a physical link can reach. We will talk more about full


duplex in the next subsections.
The medium is getting richer: The term "ether" comes from physics, where it was once thought to be the medium that propagates electromagnetic waves through space. Although Ethernet never uses ether to transmit data, it does carry messages over a variety of media: coaxial cables, twisted pairs, and optical fibers. "Ethernet is Multimedia!" The amusing words by Rich Seifert in his book Gigabit Ethernet best depict the scenario. We list all the 802.3 family members in terms of speed and media in Table 2.3.

Speed     Coaxial cable      Twisted pairs        Fiber
1 Mb/s    -                  1BASE5 (1987)        -
10 Mb/s   10BASE5 (1983),    10BASE-T (1990)      10BASE-FL (1993),
          10BASE2 (1985),                         10BASE-FP (1993),
          10BROAD36 (1985)                        10BASE-FB (1993)
100 Mb/s  -                  100BASE-TX (1995),   100BASE-FX (1995)
                             100BASE-T4 (1995),
                             100BASE-T2 (1997)
1 Gb/s    -                  1000BASE-CX (1998),  1000BASE-SX (1998),
                             1000BASE-T (1999)    1000BASE-LX (1998)
10 Gb/s   -                  -                    10GBASE-R (2002),
                                                  10GBASE-W (2002),
                                                  10GBASE-X (2002)

Table 2.3 The 802.3 family

Note that not all members are commercially successful. For example, 100BASE-T2 has never been a commercial product. In contrast, some are so successful that almost everybody can find a 10BASE-T or 100BASE-TX Network Interface Card (NIC) behind a computer on a LAN. The number in parentheses indicates the year the specification was or will be approved by the IEEE.

The Ethernet nomenclature

Ethernet is rich in its physical specifications, as we have seen in Table 2.3. The notation follows the format {1/10/100/1000/10G}{BASE/BROAD}[-]phy. The first item is the speed. The second item depends on whether the signaling is baseband or broadband; almost all Ethernet signaling is baseband, except the very unpopular 10BROAD36. Originally, the third item was the maximum segment length in units of 100 m, with no dash between the second and the third items. The convention was later changed to indicate the physical specifications, such as medium type and signal encoding, with a dash between the second and the third items.

2.3.2 The Ethernet MAC

Ethernet Framing, Addressing, and Error Control

The 802.3 MAC sublayer is the medium-independent part of Ethernet. Together with the Logical Link Control (LLC) sublayer specified in IEEE 802.2, they compose the data-link layer in the OSI model. The functions associated with the MAC sublayer include data encapsulation and medium access control. Let us first take a look at the untagged4 Ethernet frame in Fig. 2.9. Through the frame format, we introduce framing, addressing, and error control, and leave the issues of medium access control and flow control for later.

Preamble  SFD  DA  SA  T/L  Data     FCS
Bytes: 7  1    6   6   2    46-1500  4

SFD: Start of Frame Delimiter  DA: Destination Address  SA: Source Address  T/L: Type/Length  FCS: Frame Check Sequence

Figure 2.9 Ethernet frame format

Preamble: This field is used to synchronize the physical signal timing on the receiver side. Its value is fixed at 1010...10 in transmission order5, 56 bits in total. Note that this field is not used to mark the frame boundary; the boundary is marked by special physical encoding, or by the presence (absence) of signal, depending on the PHY. For example, 100BASE-X Ethernet converts the first byte of the Preamble, /1010/1010/, into the two special code groups /J/K/ of the value /11000/10001/ using 4B/5B encoding. (For normal data, the 4B/5B encoding converts 1010, in transmission order, to 01011.) No bit- or byte-stuffing is needed because there is no ambiguity. Similarly, 100BASE-X appends the two special code groups /T/R/ of the value /01101/10001/ after a frame to mark the end.
SFD: This field indicates the start of the frame with the value 10101011 in transmission order. Historically, the DIX Ethernet Standard specified an 8-byte preamble with exactly the same value as the first two fields of an 802.3 frame; they differ only in nomenclature.

4 An Ethernet frame can carry a VLAN tag. We will see that frame format when we cover VLAN in Section 2.3.4.
5 Ethernet transmission is in Little-Endian bit ordering. We will talk about transmission ordering in Section 2.6.


DA: This field is the 48-bit destination MAC address in the format we introduced in Section 2.1.2. SA: This field is the 48-bit source MAC address. Type/Length: This field has two meanings for historical reasons. The DIX Standard specified the field to be a code of protocol type, say IP, while the IEEE 802.3 Standard specified the field to be the length of the data field6 and left the protocol type to be processed by the LLC sublayer. The 802.3 Standard later (in 1997) approved the type field, resulting in the dual roles of this field today. The way to distinguish is simple. Because the data field is never larger than 1500 bytes, a value less than or equal to1500 means a length field. A value larger than or equal to 1536 (=0x600) means a type field. The values in between are intentionally not defined. In fact, most frames uses the type field because the dominating network layer protocol, IP, uses the type field. Data: This field carries the data, as the name says it. It varies from 46 to 1500 bytes. FCS: This field carries a 32-bit CRC code as a frame check sequence. If the receiver finds an incorrect frame, it silently discards the frame. The transmitter knows nothing about whether the frame is discarded. The responsibility of a retransmission is left to upper-layer protocols, such as TCP. This approach is quite efficient because the transmitter does not need to wait an acknowledgement for the next transmission. The error is not a big problem because the bit error rate is assumed to be very low in the Ethernet physical layer.

The frame size is variable. We often exclude the first two fields and say that a minimum Ethernet frame has 64 (=6+6+2+46+4) bytes and a maximum Ethernet frame has 1518 (=6+6+2+1500+4) bytes. One may argue that the maximum length is not large enough, so the header overhead is higher than that of Token Ring or FDDI. We will analyze Ethernet efficiency in Section 2.6.

Medium Access Control: Transmission and Reception Flow

We now come to how a frame is transmitted and received. Here you will see in detail how the CSMA/CD mechanism works. Fig. 2.10 shows the role the MAC sublayer plays during frame transmission and reception.

[6] There is a wide misconception that the Length field indicates the frame size. This is not true. The frame end is marked by special physical encoding or the absence of signal, depending on the PHY. The Ethernet MAC can easily count how many bytes it has received in a frame.



Figure 2.10 Frame transmission and reception

The transmission flow is presented in Fig. 2.11. We list the procedure below:

1. The MAC client (IP, LLC, …) asks for a frame transmission.
2. The MAC sublayer prepends and appends MAC information (Preamble, SFD, DA, SA, type, FCS, …) to the data provided by the MAC client.
3. In half-duplex mode, i.e., with the CSMA/CD method, the carrier is sensed to determine whether the transmission channel is busy. If it is, the transmission is deferred until the channel is clear.
4. Wait for a period of time called the inter-frame gap (IFG). Its length is 96 bit times for all flavors of Ethernet. The bit time is the duration of one bit transmission and thus the reciprocal of the bit rate. This unit is so convenient that we do not need to say "in 10 Mb/s systems the IFG is 9.6 µs; in 100 Mb/s systems the IFG is 0.96 µs; …". The IFG allows time for the receiver to do any processing needed for the incoming frame, such as interrupts and pointer adjustment.
5. Start to transmit the frame.
6. In half-duplex mode, the transmitter keeps monitoring for a collision during transmission. The way to detect collisions depends on the attached medium. Multiple transmissions on a coaxial cable result in higher absolute voltage levels than normal. For twisted pairs, a collision is asserted by perceiving a received signal on the receive pair while transmitting.
7. If no collision occurs during transmission, the frame is transmitted until done.
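The bit-time arithmetic in step 4 can be checked directly. A minimal sketch, assuming only the 96-bit-time rule stated above:

```python
IFG_BIT_TIMES = 96  # the same for all flavors of Ethernet

def ifg_seconds(bit_rate_bps: float) -> float:
    """One bit time is 1/bit_rate; the IFG is always 96 bit times."""
    return IFG_BIT_TIMES / bit_rate_bps

print(ifg_seconds(10e6))   # 10 Mb/s  -> 9.6e-06 s (9.6 microseconds)
print(ifg_seconds(100e6))  # 100 Mb/s -> 9.6e-07 s (0.96 microseconds)
```

This is why the standard states timing in bit times: one constant covers every speed.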



Figure 2.11 Frame transmission flow

If a collision is detected in half-duplex mode, the following steps go on:

8. The transmitter transmits a 32-bit jam signal to ensure the collision lasts long enough that all involved stations are aware of it. The pattern of the jam signal is unspecified. Common implementations keep transmitting 32 more bits of data, or transmit alternating 1's and 0's by leveraging the circuit that generates the preamble.
9. Abort the current transmission and attempt to schedule another transmission.
10. The maximum number of attempts to retransmit is 16. If the frame still cannot be transmitted, abort it.
11. On an attempt to retransmit, a back-off time of r slot times is chosen, where r is drawn randomly from the range 0 to 2^k − 1, k = min(n, 10), and n is the number of attempts. Note that the range grows exponentially, so the algorithm is referred to as truncated binary exponential back-off. The slot time is 512 bit times for 10/100 Mb/s Ethernet and 4096 bit times for 1 Gb/s Ethernet. We will explain the reason when we discuss Gigabit Ethernet in Section 2.3.3.
12. Wait for the back-off time and attempt to retransmit.
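Steps 10 and 11 can be sketched as a toy model (an illustration, not driver code; the slot-time constant below is the 10/100 Mb/s value):

```python
import random

SLOT_TIME_BITS = 512   # 10/100 Mb/s Ethernet; 4096 bit times for 1 Gb/s

def backoff_bit_times(n: int) -> int:
    """Truncated binary exponential back-off for retransmission attempt n (1-based):
    pick r uniformly from 0 .. 2^k - 1 with k = min(n, 10), then wait r slot times."""
    if n > 16:
        raise RuntimeError("too many attempts: abort the frame")
    k = min(n, 10)                   # the "truncated" part: k never exceeds 10
    r = random.randint(0, 2**k - 1)  # the "binary exponential" part
    return r * SLOT_TIME_BITS
```

After the first collision a station waits 0 or 1 slot times; after the tenth and later collisions it picks from 0 to 1023 slot times, until the sixteenth failed attempt aborts the frame.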



The reception flow is illustrated in Fig. 2.12. We list the procedure below:

Figure 2.12 Frame reception flow

1. The arrival of a frame is detected by the physical layer of the receiver.
2. The receiver decodes the received signal, passing the data, except the preamble and SFD, up to the MAC sublayer.
3. The receiving process goes on as long as the received signal keeps coming. When the signal ceases, the incoming frame is truncated to an octet boundary.
4. If the frame size is too small (less than 512 bits), it is regarded as a collision fragment and dropped.



5. If the destination address is not for the receiver, the frame is dropped.
6. If the frame is too long, it is dropped and the error is recorded for management statistics.
7. If the frame has an incorrect FCS, it is dropped and the error is recorded.
8. If the frame size is not an integer number of octets, it is dropped and the error is recorded.
9. If everything is OK, the frame is decapsulated and the fields are passed up to the MAC client.

CSMA/CD: Open Source Implementation

Can collision cause bad performance? [7] The term collision sounds terrible! However, collision is part of the normal arbitration mechanism of CSMA/CD; it does not come from a system malfunction. Admittedly, a collision garbles a frame, but it is not as bad as it sounds, thanks to collision detection: a transmission can stop as soon as a collision is detected. Before analyzing the bit times wasted by a collision, we first answer a critical question: where can a collision occur?

We model the frame transmission in Fig. 2.13.

Figure 2.13 Collision detection with propagation delay

Suppose station A transmits a minimum frame of 64 bytes, and the propagation delay before the frame arrives at station B is t. With carrier sense, station B may transmit at any time before t, since it has not yet sensed the carrier. Further suppose station B transmits just at time t, which results in a collision. The collision takes another time t to propagate back to station A. If station A finishes transmitting the minimum frame before the round-trip time 2t has elapsed, it has no way to learn of the collision and schedule a retransmission, and the frame is lost. For CSMA/CD to function normally, the round-trip time must be less than the time to transmit a minimum frame. This means the CSMA/CD mechanism limits the distance between two stations in a collision domain. The limitation causes difficulty in half-duplex Gigabit Ethernet design; we will say more about this issue when we introduce Gigabit Ethernet. Because the minimum size is 64 bytes, it also means a collision must occur during the first 64 bytes of a frame of any size, given the distance limitation. After more than 64 bytes have been transmitted, a collision cannot occur under normal operation because all other stations sense the carrier. If we also take the 32-bit jam into consideration, the number of frame bits transmitted plus the jam cannot exceed 511 bits; otherwise the receiver would treat these bits as a frame rather than a collision fragment (see Step 4 in the reception flow). Therefore, the maximum number of wasted bit times is 511 + 64 (from the preamble and SFD) + 96 (from the IFG) = 671. This is only a small portion of a large frame's transmission time, and it is the worst case. Most collisions occur during the preamble, because the two transmitting stations are usually not that far apart. In that case, the number of wasted bit times is only 64 (from the preamble and SFD) + 32 (from the jam) + 96 (from the IFG) = 192. We will discuss the performance issue further in Section 2.6.

[7] It was a question once asked on the newsgroup comp.dcom.lans.ethernet. I like the hilarious answer from Rich Seifert: "Yes. My old Toyota never quite performed the same after I hit that tree."

Maximum Frame Rate

How many frames can a transmitter (receiver) transmit (receive) in a second? That is an interesting question, especially when you design or analyze a packet processing device, say a switch: you want to know how many frames per second your device may need to process.

Frame transmission begins with a 7-byte Preamble and a 1-byte SFD, as we have seen in the transmission flow. Intuitively, to reach the maximum number of frames per second, all frames should be of minimum size, i.e., 64 bytes. Do not forget that there is an IFG of 12 bytes (= 96 bits) between two successive frame transmissions. In total, a frame transmission occupies (7+1+64+12) × 8 = 672 bit times. In a 10 Mb/s system, the maximum number of frames per second is therefore 10 × 10^6 / 672 = 14,880. This value is referred to as the maximum frame rate.
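Both calculations, the wasted bit times after a collision and the maximum frame rate, are plain arithmetic and can be checked with a short sketch:

```python
PREAMBLE_SFD_BITS = 64   # 7-byte preamble + 1-byte SFD
IFG_BITS = 96            # inter-frame gap
JAM_BITS = 32

# Worst case: 511 frame bits are on the wire before the transmission stops.
worst_wasted = 511 + PREAMBLE_SFD_BITS + IFG_BITS
# Typical case: the collision happens during the preamble.
typical_wasted = PREAMBLE_SFD_BITS + JAM_BITS + IFG_BITS

def max_frame_rate(bit_rate_bps: int) -> int:
    """Frames per second with back-to-back minimum (64-byte) frames."""
    bits_per_frame = (7 + 1 + 64 + 12) * 8  # preamble + SFD + frame + IFG = 672 bits
    return bit_rate_bps // bits_per_frame

print(worst_wasted, typical_wasted)   # 671 192
print(max_frame_rate(10_000_000))     # 14880 at 10 Mb/s
```

The same function gives the maximum frame rate of a 100 Mb/s or 1 Gb/s system simply by changing the bit rate.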

Full-duplex MAC

Early Ethernet used coaxial cables as the transmission medium. Most were later replaced by twisted pairs because of easier management: a twisted-pair cable connects a station to a concentration device, such as a hub or a switch, and this topology has become very popular. In the popular 10BASE-T and 100BASE-TX, a wire pair in a twisted-pair cable is dedicated to either transmitting or receiving [8]. A collision is thus defined by perceiving a received signal on the receive pair while transmitting. However, this is still inefficient: since the medium is dedicated, why should its use need arbitration at all?

In 1997, the IEEE 802.3x Task Force added full-duplex operation to Ethernet. That is, transmission and reception can proceed at the same time. Note that in full-duplex mode there is no carrier sense or collision detection because they are not needed: there is no "multiple access" on a dedicated medium. Therefore CS, MA, and CD all disappear! This is a quite significant change, because Ethernet was best known for its CSMA/CD. We summarize three conditions that must be satisfied to run full-duplex Ethernet:
1. The transmission medium must be capable of transmitting and receiving on both ends without interference.
2. The transmission medium must be dedicated to exactly two stations, forming a point-to-point link.
3. Both stations must be able to operate in, and be configured in, full-duplex mode.

Note that the IEEE 802.3 Standard explicitly rules out the possibility of running full-duplex mode on a repeater hub. The bandwidth in the hub is shared, not dedicated. Three typical scenarios of full-duplex transmission are the station-to-station link, the station-to-switch link, and the switch-to-switch link.

Full-duplex Ethernet has a significant impact. It in effect doubles the bandwidth between two stations, and it lifts the distance limitation imposed by CSMA/CD. This is very important for high-speed Ethernet, as we will discuss in Section 2.3.3. Nowadays, virtually all Ethernet interfaces support full duplex. Two interfaces can perform auto-negotiation to determine whether both parties support full duplex; if so, both will operate in full duplex because of its higher efficiency.

Ethernet flow control

Flow control in Ethernet depends on the duplex mode. In half-duplex mode, if the receiver cannot afford more incoming frames, it can transmit a carrier, say a series of 1010…10, on the shared medium until it can accept more frames. The carrier is sensed by the transmitter, which then defers its subsequent transmissions. This technique is called false carrier. Alternatively, the receiver can force a collision whenever a frame transmission is detected, which makes the transmitter back off and reschedule its transmission. This technique is referred to as force collision. The two techniques are collectively called back pressure.

However, back pressure is void in full-duplex mode because CSMA/CD is ignored there. Instead, IEEE 802.3 specifies a PAUSE frame for flow control in full-duplex mode. The receiver explicitly sends a PAUSE frame to ask for a stop; upon receiving it, the transmitter stops transmitting immediately. The PAUSE frame carries a field, pause_time, telling the transmitter how long it should stop. Still, more often than not, pause_time is set to the maximum, and when the receiver can accept more frames it sends another PAUSE frame with pause_time = 0 to tell the transmitter that it can continue.

[8] In 1000BASE-T, transmission and reception can happen simultaneously on a pair. Arbitration is still unnecessary, at the cost of sophisticated DSP circuits to separate the two signals.
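As an illustration of the PAUSE mechanism, here is a hedged Python sketch of a PAUSE frame builder. The constants (reserved multicast destination 01-80-C2-00-00-01, MAC Control type 0x8808, PAUSE opcode 0x0001) come from IEEE 802.3; the FCS is omitted and the padding is simplified.

```python
import struct

PAUSE_DA = bytes.fromhex("0180c2000001")  # reserved multicast address for PAUSE
MAC_CONTROL_TYPE = 0x8808                 # EtherType of MAC Control frames
PAUSE_OPCODE = 0x0001                     # the only MAC Control opcode defined to date

def build_pause(src_mac: bytes, pause_time: int) -> bytes:
    """Build a PAUSE frame body (FCS omitted); pause_time is in units of 512 bit times."""
    body = PAUSE_DA + src_mac + struct.pack(
        "!HHH", MAC_CONTROL_TYPE, PAUSE_OPCODE, pause_time)
    return body.ljust(60, b"\x00")        # pad the data field up to the minimum size

stop = build_pause(bytes(6), 0xFFFF)      # ask the transmitter to stop for the maximum time
resume = build_pause(bytes(6), 0x0000)    # pause_time = 0: the transmitter may continue
print(len(stop))                          # 60 bytes before the 4-byte FCS
```

The stop/resume pair mirrors the common practice described above: pause with the maximum pause_time, then cancel with pause_time = 0.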

Flow control is optional in Ethernet. It can be enabled by the user or through auto-negotiation. The IEEE 802.3 Standard provides an optional sublayer between MAC and LLC, namely the MAC Control sublayer, which defines MAC Control frames to provide real-time manipulation of MAC sublayer operation. The PAUSE frame is a kind of MAC Control frame; in fact, it is the only kind defined to date.

Flow Control: Open Source Implementation

2.3.3 New blood in the Ethernet

Innovation never stops in cyberspace, and Ethernet is no exception. In recent years, Ethernet has pushed itself into the realm of gigabit networking. As soon as the Gigabit Ethernet standards came out in 1998 and 1999, a new study group started to study 10 Gigabit technology, which turned into the IEEE 802.3ae Task Force. At the time of writing, 10 Gigabit Ethernet is in the Draft stage and is expected to be approved as a Standard in 2002. Another new Task Force, IEEE 802.3ah [9], was organized in July 2001 and started to stipulate a new standard, Ethernet in the First Mile, at its first meeting in October 2001. This new standard pushes Ethernet even into the subscriber line market. In this subsection, we will take you into the future world of Ethernet.

Gigabit Ethernet

The IEEE 802.3 divided the stipulation of Gigabit Ethernet between two Task Forces, 802.3z and 802.3ab. Their physical specifications are listed in Table 2.4.

Task Force            Specification    Description
IEEE 802.3z (1998)    1000BASE-CX      25 m, 2-pair Shielded Twisted Pair (STP), 8B/10B encoding
                      1000BASE-SX      Multi-mode fiber, short-wave laser, 8B/10B encoding
                      1000BASE-LX      Multi- or single-mode fiber, long-wave laser, 8B/10B encoding
IEEE 802.3ab (1999)   1000BASE-T       100 m, 4-pair Category 5 (or better) Unshielded Twisted Pair (UTP), 8B1Q4 encoding

Table 2.4 Physical specifications of Gigabit Ethernet

[9] The IEEE 802.3 names its new Task Forces in alphabetical order. After IEEE 802.3z, the subsequent new Task Forces are named 802.3aa, 802.3ab, and so on.

A difficulty in Gigabit Ethernet design is the distance restriction induced by CSMA/CD, as introduced in Section 2.3.2. For 10 Mb/s and 100 Mb/s Ethernet, this is not a problem: the limitation is about 200 m for copper connections in 100 Mb/s Ethernet, enough for most configurations, and even longer for 10 Mb/s Ethernet. However, Gigabit Ethernet transmits a frame ten times faster than 100 Mb/s Ethernet, making the distance restriction ten times shorter. A restriction of about 20 m is unacceptable for many network deployments.

To address this, the IEEE 802.3 Standard appends a series of extension bits after a short frame. The extension bits can be any non-data symbols in the physical layer. This technique, called carrier extension, in effect extends the length of a frame on the wire without changing the minimum frame size. The length of the extension, as specified in the Standard, is 4096 bits minus the frame size, so that the frame plus the extension occupies a full slot time. The extension bits serve CSMA/CD only and are silently discarded by the receiver.

Although carrier extension addresses the problem, the data throughput can be low because the transmission channel is mostly occupied by extension bits when frames are short. The solution is to allow the transmitter to send the next frame, if any, without extension bits, by filling the IFG with carrier. Because the IFG between two successive frames is filled with carrier, the transmitter does not relinquish the transmission channel, and it can transmit one or more frames following the first frame, as long as it has more to send, up to a limit. This technique is called frame bursting. The scenario is depicted in Fig. 2.14. The maximum burst length is 65,536 bits.
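The two rules, carrier extension and the burst limit, can be sketched in a few lines, assuming only the 4096-bit slot time and the 65,536-bit burst limit stated above:

```python
SLOT_BITS = 4096      # Gigabit Ethernet slot time in bit times
BURST_LIMIT = 65536   # maximum burst length in bits

def extension_bits(frame_bits: int) -> int:
    """Carrier extension: pad the (first) frame up to one slot time."""
    return max(0, SLOT_BITS - frame_bits)

def may_continue_burst(bits_sent_so_far: int) -> bool:
    """Frame bursting: keep the channel (carrier in the IFG) while under the limit."""
    return bits_sent_so_far < BURST_LIMIT

print(extension_bits(64 * 8))   # a minimum 512-bit frame needs 3584 extension bits
print(extension_bits(4096))     # frames of one slot time or more need no extension
```

A minimum frame thus carries seven times as many extension bits as data bits, which is exactly why frame bursting was added.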

first frame + extension bits | IFG | frame 2 | IFG | frame 3 | … | IFG | frame n

Figure 2.14 Frame bursting

Both carrier extension and frame bursting complicate the MAC design. Besides, the throughput is still not good despite these solutions. In contrast, full-duplex Ethernet does not need CSMA/CD at all, making both solutions unnecessary; its implementation is simpler and its throughput much higher. Why bother to implement half-duplex Gigabit Ethernet if it is not necessary? With the advance of ASIC technology, switched networks are no longer much more expensive than shared networks, and for Gigabit Ethernet deployments it is performance rather than cost that matters. The market has confirmed the failure of half-duplex Gigabit Ethernet: only full-duplex products exist on the market.

10 Gigabit Ethernet

Just as Moore's law states that the power of microprocessors doubles every 18 months, the speed of Ethernet has grown exponentially in recent years. Not long after the 100 Mb/s Ethernet Standard was approved in 1995, we will see the 10 Gigabit Ethernet Standard come out in 2002. Fig. 2.15 lists the timetable of this new standard. Note that commercial products emerged in the market in 2001, before the final approval of the standard.

study group → IEEE 802.3ae Task Force → 802.3 ballot → sponsor ballot → standard (1999 to 2002)

Figure 2.15 The timetable of the 10 Gb/s Ethernet Standard

The new 10 Gigabit Ethernet is developed by the IEEE 802.3ae Task Force and bears the following features:

Full duplex only: The IEEE people learned a lesson from the development of Gigabit Ethernet. Only full-duplex mode is in 10 Gigabit Ethernet; half-duplex mode is no longer considered.

Optical fiber only: Unlike Gigabit Ethernet, it is difficult to transmit at 10 Gb/s over copper wires, so only optical fibers are used as the transmission media.

Compatibility with past standards: The frame format and the MAC operations remain unchanged, making interoperability with existing products rather easy.

Move toward the WAN market: As Gigabit Ethernet has moved toward the MAN market, 10 Gigabit Ethernet will go further into the WAN market. On one hand, the longest target distance in the new standard is 40 km. On the other hand, a WAN PHY is defined to interface with the existing SONET infrastructure. We will talk more about the WAN PHY below.

Because SONET is still a widespread WAN technology and OC-192 operates at a rate very close to 10 Gb/s, IEEE 802.3ae comes with an optional WAN PHY besides the LAN PHY. Note that both PHYs use the same transmission media, and hence have the same transmission distance. The difference is that the WAN PHY has a WAN Interface Sublayer (WIS) in the Physical Coding Sublayer (PCS). The WIS is a framer that maps an Ethernet frame into a SONET payload, which makes attaching Ethernet to SONET devices easy. There is no requirement that the WAN PHY be deployed only in the WAN; conversely, in a WAN of pure Ethernet, only the LAN PHY is needed.

The physical specifications of 10 Gigabit Ethernet are listed in Table 2.5.

Physical medium     Fiber type     Target distance (m)
850 nm serial       Multi-mode     65
1310 nm WWDM        Multi-mode     300
1310 nm WWDM        Single-mode    10,000
1310 nm serial      Single-mode    10,000
1550 nm serial      Single-mode    40,000

Table 2.5 Physical specifications of 10 Gigabit Ethernet

Ethernet in the First Mile

We have Ethernet dominant in the LAN, and we expect Ethernet to dominate in the WAN. We enjoy broad bandwidth in both the LAN and the WAN. However, what do you get when you want to access the Internet at home? You have the choices of traditional modems, ADSL, cable modems, and so on. Still, these technologies are slow and expensive. The segment of the subscriber access network, often called the first mile or last mile, becomes the bottleneck. As the population of subscriber access networks grows very rapidly, the potential market becomes highly noticeable.

A new effort in the IEEE 802.3ah Ethernet in the First Mile (EFM) Task Force is starting to define a new standard for this market. The expected timetable is listed in Fig. 2.16.

study group → IEEE 802.3ah Task Force → 802.3 ballot → sponsor ballot → standard (2001 to 2003)

Figure 2.16 The timetable of the Ethernet in the First Mile Standard

Ethernet is a very mature and reliable technology. High volumes of Ethernet devices have been on the market for years, making Ethernet very cheap. If Ethernet could be everywhere, no protocol conversion would be needed, which also helps reduce the total cost. All in all, the standard is expected to provide a cheaper and faster technology for the potentially broad first-mile market. Ethernet is heading toward the goal of being ubiquitous. The development goals of the new standard include the following:


New topologies: The requirements for subscriber access networks include point-to-point on fiber, point-to-multipoint on fiber, and point-to-point on copper. The standard aims at meeting these requirements.

New PHYs: Inevitably, this standard needs to define new PHYs. The current goals are:
- extending the temperature range of the current 1000BASE-X;
- extending the distance limitation of the current 1000BASE-X over single-mode optical fiber to at least 10 km;
- defining a new PHY for a Passive Optical Network (PON) over single-mode fiber at 1 Gb/s or more, to at least 10 km. A PON is a point-to-multipoint optical link. The term "passive" means that no component in a PON needs electrical power except at the ends. A fan-out of at least 16 is expected;
- defining a new PHY for non-loaded voice-grade copper at 10 Mb/s or more for at least 2500 ft. To achieve this goal, several proposals, including VDSL, 100BASE-CU, and 10BASE-T4, are still competing to become the standard. Just watch it!

Far-end Operations, Administration, and Maintenance (OAM): Reliability is very important in subscriber access networks. For easy OAM, the standard will define new methods of remote failure indication, remote loopback, and link monitoring.

A critical point for success is time to market. To speed up the standardization process, a possible way is to leverage existing standards for the PHY, as IEEE 802.3 did for 100BASE-X and 1000BASE-X: 100BASE-X uses a PHY modified from the FDDI Standard, and 1000BASE-X takes its PHY from the Fibre Channel Standard. For Ethernet in the First Mile, some candidates, say VDSL, are under consideration. However, since the standardization process is still at its beginning, we do not know what the final choice will be.

The study group has just closed its task, and the first meeting of IEEE 802.3ah will be held in October 2001. For more information, see the web site at http://www.ieee802.org/3/efm/index.html.

2.3.4 Ethernet switch

Network administrators often need to connect separate LANs into an interconnected network. The reason may be extending the reach of a LAN or administrative purposes. An interconnection device operating in the data-link layer is called a MAC bridge, or simply a bridge. A bridge interconnects LANs as if they were in the same LAN. Its operation has been standardized in the IEEE 802.1D Standard. We will introduce the ins and outs below.

Almost all bridges are transparent bridges. A bridge is transparent because all stations on the interconnected LANs are unaware of its existence. The transmitting station simply tags the destination MAC address and sends the frame out as if the destination were on the same LAN; the bridge forwards the frame automatically. Another category is source-routing bridges, which are mostly found in Token Ring and sometimes in FDDI. There, the station must discover the route and tag forwarding information in the frame to instruct the bridges how to forward it. As Ethernet dominates the LAN market, this category is seldom seen, so we introduce only transparent bridges in this subsection.

The bridge has ports, to each of which a LAN is connected. Each port operates in promiscuous mode, which means it receives every frame on the attached LAN, no matter what the destination address is. If a frame has to be forwarded to other ports, the bridge does so accordingly.

Bridge Operation

The mystery is how the bridge knows whether it should forward an incoming frame, and to which port. We illustrate the bridge operation with Fig. 2.17 below.

Figure 2.17 Bridge operation

The bridge keeps an address table, also called a forwarding table, to store the mapping from MAC address to port number. Initially, the address table is blank; the bridge knows nothing about the location of stations. Suppose Station 1 with MAC address 00-32-12-12-6d-aa transmits a frame to Station 2 with MAC address 00-1c-6f-12-dd-3e. Because Station 1 is connected to Port 3 of the bridge, the bridge receives the frame from Port 3. By checking the source address field of the frame, the bridge learns that the MAC address 00-32-12-12-6d-aa is located on the segment Port 3 is connected to, and it records this fact in the address table. However, it still does not know where the destination address 00-1c-6f-12-dd-3e is located. To make sure the destination can receive the frame, it simply broadcasts the frame to every port other than the one the frame came from. Suppose some time later Station 2 transmits a frame to somewhere. The bridge learns that its address comes from Port 2 and records this fact in the address table as well. Subsequent frames destined to Station 2 are then forwarded to Port 2 only; no broadcast is necessary. This greatly saves the bandwidth of all other segments and reduces the probability of collisions. Of course, if Station 2 always keeps silent, the bridge will never know where it is, and every frame destined to Station 2 will be broadcast. This situation is unlikely to happen. A typical scenario is that Station 2 responds with something after receiving a frame destined to it, and the bridge learns where Station 2 is from the response.

Sometimes a station may be moved to another location or removed, making its entry in the address table stale. To handle this problem, an aging mechanism is applied: if a station has not been heard from for a given period of time, its entry expires. Subsequent frames destined to it are flooded again until its location is relearned.
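The learning, forwarding, flooding, and aging behavior just described can be sketched as a toy transparent bridge in Python. This is an illustration, not the IEEE 802.1D state machines; the 300-second default ageing time is taken from 802.1D.

```python
import time

class LearningBridge:
    def __init__(self, ports, ageing=300.0):   # 802.1D default ageing: 300 s
        self.ports = ports
        self.table = {}                        # MAC address -> (port, last_seen)
        self.ageing = ageing

    def receive(self, src, dst, in_port, now=None):
        """Process one frame; return the list of ports to send it out of."""
        now = time.time() if now is None else now
        self.table[src] = (in_port, now)       # learn the source location
        entry = self.table.get(dst)
        if entry and now - entry[1] <= self.ageing:
            out, _ = entry
            return [] if out == in_port else [out]       # filter or forward
        return [p for p in self.ports if p != in_port]   # unknown/stale: flood

bridge = LearningBridge(ports=[1, 2, 3])
# Station 1 (on port 3) sends to the not-yet-learned Station 2: flood.
print(bridge.receive("00-32-12-12-6d-aa", "00-1c-6f-12-dd-3e", in_port=3))  # [1, 2]
```

Once Station 2 transmits anything through port 2, subsequent frames toward it are forwarded to port 2 alone, exactly as in the scenario above.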

In case the destination address is a multicast or broadcast address, the bridge forwards the frame to all ports except the source port. Flooding every such frame is wasteful, however. To address the problem, the IEEE 802.1D Standard specifies GMRP, short for GARP Multicast Registration Protocol, a subset of the Generic Attribute Registration Protocol (GARP). When this protocol is enabled, the bridge can register the requirements of the intended receivers of multicast addresses. The registration information is propagated among bridges, and thus all intended receivers are identified. If there is no multicast demand on a given path, multicast pruning cuts off that path. Through this mechanism, multicast frames are forwarded only onto paths with intended receivers.

Note that in Fig. 2.17 there is a device called a repeater hub, or often simply a hub. This is a Layer 1 device: it simply restores signal amplitude and timing and propagates the signal to all ports other than the one the frame came from, but it knows nothing about the frame. After all, frames are nothing more than a series of encoded bits to the physical layer.


Cut-through vs. Store-and-Forward

Recall that the destination address (DA) field is the first field in the frame after the Preamble and SFD fields. By looking up the DA in the address table, the bridge can determine where to forward the frame, and it can start forwarding the frame out of the destination port before the frame has been received completely. Such operation is called cut-through. On the contrary, if the bridge forwards a frame only after it has been received completely, its operation is called store-and-forward.
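The latency difference between the two modes is simple transmission-time arithmetic. A hedged sketch, assuming a cut-through bridge begins forwarding after the 14-byte DA/SA/Type header (an illustrative choice; strictly, only the 6-byte DA is needed for the lookup):

```python
def store_and_forward_delay(frame_bytes: int, bit_rate_bps: float) -> float:
    """A store-and-forward bridge must receive the whole frame first."""
    return frame_bytes * 8 / bit_rate_bps

def cut_through_delay(bit_rate_bps: float, header_bytes: int = 14) -> float:
    """A cut-through bridge can start forwarding once the header is in."""
    return header_bytes * 8 / bit_rate_bps

# At 100 Mb/s, a maximum-size 1518-byte frame:
print(store_and_forward_delay(1518, 100e6))  # ~121 microseconds per hop
print(cut_through_delay(100e6))              # ~1.1 microseconds per hop
```

The gap shrinks for minimum-size frames and at higher bit rates, which is part of why the cut-through advantage proved modest in practice.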

Aha! The title of this subsection is "Ethernet switch," but we have been talking about bridges so far. What is going on? There is a historical reason, and it is time to tell it. Before 1991, the device was called a bridge, both in the IEEE Standard and in the market, and early bridges were all implemented in a store-and-forward manner. In 1991, Kalpana Corporation marketed the first cut-through bridge under the name "switch" to differentiate it from store-and-forward bridges, claiming lower latency because of the cut-through operation. Arguments were then raised between proponents of the store-and-forward and cut-through approaches. We summarize the comparison of the two mechanisms in Table 2.6.

                      Store-and-forward                  Cut-through
Transmitting time     Transmits a frame after            May transmit a frame before
                      receiving it completely            receiving it completely (see footnote 10)
Latency               Slightly larger latency            May have slightly smaller latency
Broadcast/Multicast   No problem for broadcast or        Generally not possible for broadcast
                      multicast frames                   or multicast frames
Error checking        Can check FCS in time              May be too late to check FCS
Popularity            Mostly found in the market         Less popular in the market

Table 2.6 Comparisons of store-and-forward and cut-through

Bridge vs. Switch

Following Kalpana’s convention, bridges are marketed under the name “switch,” regardless of whether their operation is store-and-forward or cut-through. On the other hand, the name is still “bridge” in the IEEE Standard, and the IEEE 802.3 Standard explicitly underlines that the two terms are synonyms. Despite the name “switch,” most switches provide only store-and-forward, or both modes that are

10 If the LAN of the outgoing port or the output queue is occupied by other frames, a frame still cannot be forwarded even in a cut-through switch.


configurable today. There is really no significant benefit in the cut-through design, as the comparison in Table 2.6 suggests. We start to use the term “switch” when convenient below. In fact, the term “switch” is now so widely used that it also covers devices making forwarding decisions based on information from upper layers; that is why we see L3 switches, L4 switches, and L7 switches today.

Spanning Tree Protocol

As the topology of a bridged network becomes large and complex, network administrators may inadvertently create a loop in it. This situation is undesirable because frames can circulate around the loop and the address table may become unstable. For example, consider the following disaster. Suppose two 2-port switches form a loop and a station broadcasts a frame onto it. Each switch will forward the broadcast frame to the other upon receiving it, making the frame circulate around the loop indefinitely.

To address the problem, IEEE 802.1D stipulates the Spanning Tree Protocol (STP) to eliminate loops in a bridged network. Because it is simple to implement, almost all switches support this protocol. Even so, its specification takes 51 pages in the standard document. We only explain the principle of STP operation with the example in Fig. 2.18. The example is a little complex, so we list the procedure below. Serious readers who intend to learn the details are encouraged to read the standard.

Figure 2.18 A bridged network with loops
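The election at the heart of the procedure below boils down to an ordered, field-by-field comparison of the information carried in BPDUs, where lower is better at every field. A minimal sketch, with hypothetical identifiers and costs:

```python
# Sketch of STP BPDU comparison: a BPDU "priority vector" of
# (root id, path cost, transmitting bridge id, transmitting port id)
# is compared field by field -- lower root id wins, then lower path
# cost, then lower bridge id, then lower port id. Python tuples
# compare lexicographically, which is exactly this rule.

def better(bpdu_a, bpdu_b):
    """Return the superior of two (root, cost, bridge, port) BPDUs."""
    return min(bpdu_a, bpdu_b)

# Two hypothetical BPDUs heard on the same LAN:
a = (1, 2, 4, 1)  # claims root 1 at cost 2, sent by bridge 4
b = (1, 1, 6, 2)  # claims root 1 at cost 1, sent by bridge 6
print(better(a, b))  # bridge 6 wins: it offers the lower cost to root 1
```

The switch advertising the winning vector becomes the designated bridge for that LAN, which is how the rules in step 6 below resolve ties.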

1. Initially, each switch and port is assigned an identifier. The identifier is composed of a manageable priority value and the switch address (or the port number, for a port identifier). For simplicity, we use 1 to 7 as the identifiers in this
illustration.

2. Each link is assigned a cost. As a rule of thumb, the cost can be inversely proportional to the link speed. For simplicity, we assume all link costs are 1 here.

3. The switch with the least identifier serves as the root. It is elected through an exchange of configuration frames among the switches.

4. In the active topology, each LAN is connected to the rest of the network through one port of some switch. The port through which the LAN receives frames from the direction of the root, and through which it transmits frames toward the root, is called the Designated Port (DP), and that switch is called the Designated Bridge (the standard refers to a switch as a bridge). The port on which a switch receives frames from the direction of the root is called its Root Port (RP).

5. Periodically, configuration information is propagated down from the root in Bridge Protocol Data Units (BPDUs). The destination address of a BPDU is a reserved multicast address for switches, 01-80-C2-00-00-00. The BPDU frame contains information such as the root identifier, the transmitting switch identifier, the transmitting port identifier, and the path cost from the root.

6. Each switch may configure itself by evaluating the information carried in the received BPDUs. The configuration rules are:

If a switch finds that it can provide a lower path cost than that advertised in the BPDUs it receives, it will attempt to become the designated bridge by transmitting BPDUs with the lower path cost.

In case of ambiguity, e.g., equal path costs, the switch (or port) with the least identifier is selected as the designated bridge (or port).

If a switch finds that it has a lower identifier than that of the current root, it will attempt to become the new root by transmitting BPDUs in which the root identifier is its own.

Note that a switch does not forward incoming BPDUs, but it may create new BPDUs to carry new states to others.

7. All ports other than DPs and RPs are blocked. A blocked port is not allowed to forward or receive data frames. However, it keeps listening to BPDUs to see whether it can become active again.

The result is as indicated in Fig. 2.18, and the readers are encouraged to trace the procedure themselves. A further strength of the protocol is that it dynamically updates the spanning tree when the topology changes.

Virtual LAN

Once a device is connected to a LAN, it belongs to that LAN. That is, the deployment of LANs is completely determined by physical connectivity. In some
applications, we need to build logical connectivity on top of the physical deployment. For example, we may need some ports in a switch to belong to one LAN and the other ports to another. Further, we may need ports across multiple switches to belong to the same LAN, with all other ports belonging to another LAN. In general, we need flexibility in network deployment.

Virtual LAN (VLAN) addresses this problem by providing logical grouping of LANs. Administrators can simply work with management tools, without changing physical connectivity. Additionally, VLAN separation increases security and saves bandwidth, because traffic, particularly multicast and broadcast traffic, is confined to the VLAN it belongs to. For example, a broadcast frame, or a frame with an unknown unicast destination address, will be seen on all ports of a switch without VLANs; it consumes bandwidth on unintended ports, and malicious users can monitor it. By dividing the ports of a switch into several VLANs, such frames are confined to one VLAN.

We give a practical example below to help the readers appreciate the usefulness of VLANs. Suppose we have two IP subnets, 140.113.88.0 and 140.113.241.0, each with several stations. If we want to connect these two IP subnets with a router, we may deploy the network in the manner depicted in Fig. 2.19.

Figure 2.19 A router deployment without VLAN

If we configure the switch with two VLANs instead, only one switch is needed. The router is connected to a port that belongs to both VLANs and is configured with two IP addresses, one for each subnet. The router in this situation is called a one-armed router, as illustrated in Fig. 2.20.


Nowadays, many switches can also serve as a normal router: they can forward frames based on Layer 3 information, and some of them also implement routing protocols (see Chapter 3 for routing protocols). With VLAN, administrators can arbitrarily group ports into several IP subnets, which is very convenient for network administration.

Given the importance of VLAN, the IEEE 802.1Q Standard specifies a set of protocols and algorithms to support VLAN operation. This standard describes the architectural framework for VLAN with respect to configuration, distribution of configuration information, and relay. The first is self-explanatory. The second is concerned with methods that distribute VLAN membership information among VLAN-aware switches. The third deals with how to classify and forward incoming frames, and with the procedures that modify frames by adding, changing, or removing tags. We discuss the concept of a tag below.

Figure 2.20 A one-armed router

The IEEE 802.1Q Standard does not enforce how frames are associated with VLANs. The VLAN membership can be based on ports, MAC addresses, IP subnets, protocols, or applications. Each frame can carry a tag that bears the identifier of a VLAN, so that the switch can determine its VLAN association quickly, without complicated field classification. The tag slightly changes the frame format, however. The format of a tagged frame is depicted in Fig. 2.21 (see footnote 11). Note that there are 12 bits in the VLAN identifier. Given that one identifier value is reserved and another indicates a priority-only tag (see below),

11 Note that VLAN is not confined to Ethernet; the standard also applies to other LAN standards, say Token Ring. However, since Ethernet is the most popular, we discuss the Ethernet frame here.


a maximum of 4094 (i.e., 2^12 − 2) VLANs are allowed.

Priority

If the load in a LAN is high, users will perceive larger latency. However, some voice or video applications are time-sensitive, and their quality deteriorates with larger latency. Traditionally, LAN technology solves the problem with over-provisioning, that is, providing more bandwidth than needed. This solution is feasible because high bandwidth is inexpensive in a LAN. But in case of short-term congestion, the traffic may temporarily exceed the available bandwidth. Higher priority can then be assigned to frames of critical applications to guarantee that they receive better service.

Ethernet has no inherent priority mechanism. With IEEE 802.1p, which was later integrated into IEEE 802.1D, a priority value can optionally be assigned to an Ethernet frame. This value is also carried in the tagged frame, as illustrated in Fig. 2.21.

A tagged frame has four more bytes added to it: a type field of two bytes that indicates the VLAN protocol type (the value is 0x8100) and a tag control information field of another two bytes. The latter is further divided into three fields: priority, Canonical Format Indicator (CFI), and VLAN identifier. Note that a tagged frame does not necessarily carry VLAN information; the tag can contain only the priority of the frame, which was defined in IEEE 802.1p. The VLAN identifier lets the switch identify the VLAN to which the frame belongs without further classification. The CFI looks mysterious; it is a one-bit field that indicates whether the MAC addresses possibly carried in the MAC data are in canonical format. We do not go into the detail of the canonical form here; interested readers are referred to Clause 9.3.2 of the IEEE 802.1Q document.
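The four extra bytes described above can be sketched as follows. `build_tag` is a hypothetical helper written for illustration, not an API from any standard library:

```python
import struct

# Sketch of building the four extra bytes of an 802.1Q tag: the 0x8100
# VLAN protocol type, then 16 bits of tag control information packed as
# priority (3 bits) | CFI (1 bit) | VLAN identifier (12 bits).

def build_tag(priority: int, cfi: int, vid: int) -> bytes:
    assert 0 <= priority < 8 and cfi in (0, 1) and 0 <= vid < 4096
    tci = (priority << 13) | (cfi << 12) | vid
    return struct.pack("!HH", 0x8100, tci)  # network byte order

tag = build_tag(priority=5, cfi=0, vid=100)
print(tag.hex())  # '8100a064'
```

These four bytes are inserted between the SA and the T/L fields of the frame shown in Fig. 2.21.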

         Preamble | SFD | DA | SA | VLAN protocol ID | Tag control | T/L |  Data   | FCS
bytes:      7     |  1  |  6 |  6 |        2         |      2      |  2  | 42–1500 |  4

Tag control:  priority | CFI | VLAN identifier
bits:             3    |  1  |       12

Figure 2.21 Format of a tagged frame

Because there are three bits in the priority field, eight priorities are allowed in
the priority mechanism. The mapping of priority values to traffic types suggested in the standard is listed in Table 2.7. By examining the tag values, the switch is able to classify incoming frames and arrange appropriate queue services to meet the users’ demands.

Priority      Traffic type
1             Background
2             Spare
0 (default)   Best effort
3             Excellent effort
4             Controlled load
5             < 100 ms latency and jitter
6             < 10 ms latency and jitter
7             Network control

Table 2.7 Suggested mapping of priority values to traffic types

Link Aggregation

The final issue we would like to introduce in this section is link aggregation. Multiple links can be aggregated as if they were a single pipe of larger capacity. For example, users who desire more capacity can aggregate two gigabit links into one two-gigabit link; they do not have to wait for ten-gigabit Ethernet products, and even when such products come out, it may not be economical to buy them. Link aggregation brings flexibility to network deployment.

Link aggregation was originally a Cisco technique, dubbed EtherChannel and often referred to as port trunking, and was later standardized as IEEE 802.3ad in 2000. The operation is not confined to links between switches: links between a switch and a station, and between two stations, can also be aggregated. The principle of operation is simple: the transmitter distributes frames among the aggregated links, and the receiver collects them. However, some difficulties complicate the design. For example, consider the case in which a long frame is followed by several short frames. If the long frame is distributed to one link and the short frames to another, the receiver may receive the frames out of order. Although an upper-layer protocol such as TCP can deal with out-of-order delivery, it is less efficient to do so, so the ordering of frames within a flow must be maintained in the data-link layer. A flow may also need to move from one link to another for load balancing or because of link failure. To meet these requirements, the Link Aggregation Control Protocol (LACP) is designed. For details, we refer the readers to Clause 43 of the IEEE 802.3 Standard.
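The per-flow ordering requirement explains a common implementation choice: pin each flow to one link by hashing its addresses. The hash-based choice below is only one possible approach, and `pick_link` is a hypothetical helper for illustration:

```python
import zlib

# Sketch of per-flow frame distribution over an aggregate: hashing the
# (source, destination) pair pins every frame of a flow to one link, so
# ordering within the flow is preserved without receiver-side resequencing.

LINKS = 2  # hypothetical two-link aggregate

def pick_link(src: str, dst: str) -> int:
    """Map a flow, identified by its address pair, to a link index."""
    key = f"{src}->{dst}".encode()
    return zlib.crc32(key) % LINKS

# Every frame of the same flow maps to the same link:
flow = [pick_link("00:11:22:33:44:55", "66:77:88:99:aa:bb") for _ in range(5)]
print(flow)  # five identical link indices
```

The price of this design is that a single flow can never exceed the capacity of one member link, which is why aggregation helps aggregate throughput more than any individual flow.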

2.4 Wireless links

Wireless links are appealing to many people. With wireless links, people are
free from the constraints of wires, which may be inconvenient or too expensive to deploy. However, wireless links have different characteristics from wired links, imposing special requirements on protocol design. We list these characteristics below:

Less reliability: Signals propagate over the air without protection, so transmission is easily impaired by outside interference, path loss, multi-path distortion, and so on. Outside interference comes from nearby wireless signal sources; microwave ovens and Bluetooth devices are possible sources because they all operate in the unlicensed ISM (Industrial, Scientific, and Medical) band. Path loss is the attenuation the signal undergoes as it propagates through the air; the attenuation is worse than on a wire because the signal spreads over the air rather than being concentrated on a wired link. Multi-path distortion results from delayed parts of the signal that travel through different paths to the receiver; different paths arise because parts of the signal bounce off physical obstacles on the way.

More mobility: Because no wire limits the mobility of a station, the network topology may vary dynamically. Note that mobility and wireless are different concepts, although they are often mentioned together. Wireless is not necessary for mobility: a mobile station can be carried to a location and then plugged into a wired network. Mobility is also not necessary for wireless: two tall buildings may communicate through wireless relay devices because a wire between them is too expensive, which is not uncommon in network deployment.

Less power: A mobile station is often battery-powered, and stations may sometimes be put into sleep to save power. Transmitters must buffer data until the receiver awakens to receive them.

Less security: Data propagated over the air are easily eavesdropped; all stations within transmission range can listen to them. Optional encryption and authentication mechanisms are provided to keep the data more secure from outside threats.

In this section, we introduce two notable wireless link protocols: IEEE 802.11 and Bluetooth. The former has become the standard for wireless LANs, and the latter is designed for short-range connectivity. We conclude this section with a comparison of the two technologies and a discussion of their coexistence issues.

2.4.1 Basics of IEEE 802.11


Evolution

The IEEE 802.11 Working Group was established in 1990. Its goal was to develop a Medium Access Control (MAC) method and physical layer specifications to meet the requirements of wireless local area networks. The process was so long that the first version of the standard did not appear until 1997. Initially, three kinds of PHYs, infrared, Direct Sequence Spread Spectrum (DSSS), and Frequency-Hopping Spread Spectrum (FHSS), were specified to allow transmission at 1 Mb/s and 2 Mb/s. The spread spectrum techniques are intended to make the signal robust to outside interference. The standard was revised in 1999, and two amendments, 802.11a and 802.11b, were standardized in that year. IEEE 802.11b extends the DSSS system to higher data rates of 5.5 Mb/s and 11 Mb/s. IEEE 802.11a specifies a new Orthogonal Frequency Division Multiplexing (OFDM) PHY operating in the 5 GHz band, as opposed to the 2.4 GHz band of the previous standards, and increases the data rate significantly, up to 54 Mb/s. However, these two standards are not compatible. IEEE 802.11b products operating at 11 Mb/s have been popular in the market, and some vendors, say Intel, have started to market IEEE 802.11a products. At the time of this writing, IEEE 802.11 projects under development have reached 802.11i. Major ones among them are 802.11e for QoS, 802.11g for a higher data rate in the 2.4 GHz band, and 802.11i for security. The development is still very active so far.

Building Blocks

The basic building block of an 802.11 LAN is the Basic Service Set (BSS). A BSS is composed of stations whose MAC and PHY conform to the IEEE 802.11 Standard. A minimum BSS contains only two stations. A standalone BSS is called an Independent BSS (IBSS) or, more often than not, an ad hoc network, because this type is often formed without planning in advance. Multiple BSSs can be connected through a Distribution System (DS). The IEEE 802.11 Standard does not mandate what the DS should be; an Ethernet network is the DS we find most often. A DS and a BSS are connected through an Access Point (AP). Such an extended network structure is called an infrastructure. These building blocks are illustrated in Fig. 2.22.

The layering in the IEEE 802.11 is depicted in Fig. 2.23. As mentioned, the IEEE 802.11 PHYs consist of infrared, DSSS, FHSS, and OFDM. Above them is the MAC sublayer, which we will introduce soon and on which we focus in this section. For issues on the PHY, we encourage interested readers to refer to the resources listed in Section 2.7 or to search the Internet.


Figure 2.22 IEEE 802.11 building blocks

802.2 LLC                  }
802.11 MAC                 }  Data-link layer

FHSS | DSSS | IR | OFDM       Physical layer

FHSS: Frequency-Hopping Spread Spectrum
DSSS: Direct Sequence Spread Spectrum
OFDM: Orthogonal Frequency Division Multiplexing
IR: Infrared

Figure 2.23 Layering in the IEEE 802.11

2.4.2 IEEE 802.11 MAC

An obvious distinction between the IEEE 802.11 MAC and the IEEE 802.3 MAC, a typical representative of wired networks, is that collision detection is difficult to implement: the cost of a full-duplex RF transceiver is high, and potentially hidden stations make collision detection fail. The latter issue is known as the hidden terminal problem, illustrated in Fig. 2.24. Therefore, the receiver should
respond with an acknowledgment if the FCS is correct. This is the positive acknowledgment mechanism mentioned in Section 2.1.3.

In Fig. 2.24, Station A and Station C cannot hear each other because they are located out of each other’s transmission range. However, if they both transmit data to Station B simultaneously, a collision will occur at Station B. Thus, the IEEE 802.11 MAC design must take care of this case.

Figure 2.24 The hidden terminal problem

The IEEE 802.11 MAC allocates channels with two major functions:

Distributed Coordination Function (DCF) and Point Coordination Function (PCF). The DCF is mandatory; all IEEE 802.11-conformant stations must follow it. The PCF is performed only in an infrastructure network. Both coordination functions can coexist within the same BSS.

The philosophy behind the DCF is known as Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA). Although the most noticeable difference from the Ethernet MAC is collision avoidance, the CSMA/CA mechanism differs in more ways than that.

As in CSMA/CD, a station must listen before transmitting. If some station is transmitting, the transmission is deferred until the channel is free. Once the channel is clear, the station waits for a short period of time, known as an interframe space (IFS), before transmitting. Note that during the last transmission, chances are that multiple stations were waiting to transmit. If they were all allowed to transmit right after the IFS, a collision would be very likely. To avoid such collisions, the stations have to wait a further random backoff time before
transmission. The period is determined by the formula:

Backoff time = Random value × Slot time

In the above formula, the random value is drawn uniformly from the range 0 to CW. CW stands for Contention Window and ranges from CWmin to CWmax. CWmin, CWmax, and the slot time depend on the PHY characteristics. Initially, CW is set to CWmin. The backoff time is decreased by one slot time whenever the channel has been free for an IFS period; otherwise, the countdown is suspended. When the backoff time finally reaches zero, the station starts to transmit. Through this procedure, collisions are reduced significantly. We summarize the CSMA/CA procedure in Fig. 2.25.

Figure 2.25 CSMA/CA flow chart
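The backoff computation above can be sketched as follows. The CWmin, CWmax, and slot-time values below are the ones defined for the DSSS PHY; the doubling of CW after a failed transmission is part of the standard's behavior, though not spelled out in the text above:

```python
import random

# Sketch of the DCF backoff computation. CWmin, CWmax, and the slot time
# are PHY-dependent; these are the DSSS values.
CW_MIN, CW_MAX = 31, 1023
SLOT_TIME = 20e-6  # seconds

def backoff_time(cw: int) -> float:
    """Backoff time = random value in [0, CW] x slot time."""
    return random.randint(0, cw) * SLOT_TIME

def next_cw(cw: int) -> int:
    """After a failed transmission, CW roughly doubles, capped at CWmax."""
    return min(2 * cw + 1, CW_MAX)

cw = CW_MIN
for attempt in range(3):
    print(f"attempt {attempt}: CW={cw}, backoff={backoff_time(cw) * 1e6:.0f} us")
    cw = next_cw(cw)
```

Growing CW after each failure spreads retransmissions over a wider window, so the collision probability drops as contention rises.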

Unlike collision detection, which stops a transmission immediately once a collision is detected, a station has no way to learn that the frame it transmitted was impaired until no acknowledgement is received. The cost of a collision is therefore significant when a long frame is transmitted. An optional refinement to reduce this cost is the explicit RTS/CTS mechanism. Before transmitting a frame, the transmitter notifies all stations within its transmission range with a Request to Send (RTS) frame. The receiver responds with a Clear to Send (CTS) frame, which is likewise heard by all stations within its transmission range. Both the RTS and CTS frames carry duration fields telling the other stations how long to wait for the data frame transmission and its acknowledgement. This procedure is illustrated in Fig. 2.26. During the reserved period, the other stations inhibit their own transmissions and do not need to perform carrier sense physically. Therefore, this mechanism is also called
virtual carrier sense. The mechanism has another advantage. In Fig. 2.26, C and D cannot sense transmission from each other, so if both intend to transmit simultaneously, a collision will occur; the RTS/CTS exchange avoids this situation. Note that the mechanism is only applicable to unicast frames. For multicast and broadcast, multiple CTSs from the receivers would collide, and similarly no acknowledgement frame is returned for multicast or broadcast frames.

Figure 2.26 RTS/CTS mechanism
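Virtual carrier sense, discussed above, amounts to each station maintaining a timer, called the Network Allocation Vector (NAV) in the standard. The `Station` class and the timings below are hypothetical, for illustration only:

```python
# Sketch of virtual carrier sense: each station keeps a Network Allocation
# Vector (NAV) and treats the medium as reserved until the time announced
# in an overheard RTS or CTS duration field has elapsed.

class Station:
    def __init__(self) -> None:
        self.nav = 0.0  # medium reserved until this absolute time (seconds)

    def hear(self, now: float, duration: float) -> None:
        """Update the NAV from the duration field of an overheard RTS/CTS."""
        self.nav = max(self.nav, now + duration)

    def medium_free(self, now: float) -> bool:
        """True once the reservation has elapsed; no physical sensing needed."""
        return now >= self.nav

s = Station()
s.hear(now=0.0, duration=0.003)  # overhears an RTS reserving 3 ms
print(s.medium_free(0.001))      # False: stays silent during the reservation
print(s.medium_free(0.004))      # True: the reservation has elapsed
```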

The PCF is exercised by a Point Coordinator (PC) that resides in the AP within each BSS. Periodically, the PC transmits a beacon frame to announce a Contention-Free Period (CFP). Every station within the BSS is aware of the beacon frame and keeps silent during the CFP. A station is allowed to transmit only when it is polled by the PC; hence, the PC has the authority to determine who can transmit. The polling sequence is left unspecified in the standard.

The DCF and PCF coexist in the manner illustrated in Fig. 2.27: the DCF immediately follows the CFP, in a period called the Contention Period (CP). Normally, the PC transmits a beacon frame once every CFP repetition period, but the beacon may be delayed if the channel happens to be busy at the end of the CP.

Figure 2.27 DCF and PCF coexistence

The IEEE 802.11 defines the MAC frame format depicted in Fig. 2.28.


Figure 2.28 IEEE 802.11 frame format

The frame format is general; a certain frame type may contain only a subset of these fields. We categorize the frames into three types:

1. Control frames: RTS, CTS, ACK, etc.
2. Data frames: carrying normal data
3. Management frames: Beacon, etc.

Covering these types fully requires a deep understanding of every IEEE 802.11 operation; the readers can refer to the standard itself for details.

2.4.3 Bluetooth technology

Look at the cables behind your computer. There are plenty of them. Besides those connecting computer peripherals, we also have cables connecting different kinds of devices. These cables are so cumbersome that it would be better to get rid of them.

Bluetooth, named after a Danish king of the tenth century, is the very technology designed to replace the cables connecting electronic devices. Between the devices are short-range radio links, usually within 10 m. To ensure the proliferation of this new technology, the development goal is to integrate many functions into a single chip and eventually reduce the price of a chip below five dollars. Bluetooth is a rather new technology: in 1998, five major companies, Ericsson, Nokia, IBM, Toshiba, and Intel, cooperated to create it, and a Bluetooth Special Interest Group (Bluetooth SIG), composed of many companies, was formed later to promote and define the new standard.

Bluetooth devices operate in the 2.4 GHz ISM band, the same band used by most IEEE 802.11 devices, and use frequency hopping. The frequency band ranges from 2.400 GHz to 2.4835 GHz, within which there are 79 channels of 1 MHz for frequency hopping. Below and above these channels are guard bands of 2 MHz and 3.5 MHz, respectively. An observant reader may immediately notice a possible interference problem when IEEE 802.11 and Bluetooth devices are close to each other. This coexistence problem is a big issue, and we will say more about it at the end of this
subsection.

The basic Bluetooth topology is illustrated in Fig. 2.29. Like a BSS in the IEEE

802.11, two or more devices sharing the same channel form a piconet. But unlike an IBSS, in which all stations are created equal, a piconet has one master and several slaves. The master has the authority, say deciding the hopping sequence, to control channel access in the piconet. The slaves can be either active or parked. A master controls up to seven active slaves at the same time. Parked slaves do not communicate, but they keep synchronized with the master and can become active as the master demands. If a master desires to communicate with more than seven slaves, it tells one or more active slaves to enter park mode and then invites the desired parked slaves to become active. For more devices to communicate simultaneously, multiple piconets can overlap to form a larger scatternet. In Fig. 2.29, two piconets form a scatternet with a bridge node. The bridge node can be a slave in both piconets, or the master in one of them. It participates in both piconets in a time-division manner: sometimes it is part of one piconet, and sometimes it belongs to the other.


Figure 2.29 The Bluetooth topology

For Bluetooth devices to communicate, they must first become aware of each other. An inquiry procedure is designed for a device to discover the other devices, followed by a page procedure to build up a connection. Initially, all Bluetooth devices are by default in standby mode. A device that intends to communicate broadcasts an inquiry within its coverage area. The devices around it may respond to the inquiry with information about themselves, such as their addresses, if they are willing to. Upon receiving these responses, the inquirer knows the surrounding devices and becomes the master of the piconet; the other devices become slaves. After the inquiry, the master sends a unicast message to the destination device, which responds with an acknowledgement, and thus a connection is established. This is called the page procedure. Some time later, a slave can run the same page procedure, and the roles of the master and slave will
be exchanged. The process is illustrated in Fig. 2.30. Note that multiple responses to an inquiry may result in a collision; the responding devices should therefore defer their responses by a random backoff time.

Figure 2.30 Inquiry and Page procedure

A piconet channel is divided into time slots of 625 µs, each occupied by a different hopping frequency. The slot time is the reciprocal of the hop rate, which is 1600 hops/s. These slots are time-multiplexed with the same hopping sequence by the communicating master and slave. At the data rate of 1 Mb/s, each slot could ideally carry 625 bits. However, some intervals within a slot are reserved for frequency hopping and stabilization, so up to 366 bits can be carried in a slot. Normally, each slot carries one Bluetooth frame. A frame has an access code of 72 bits, header information of 54 bits, and a payload of variable length. With a payload of only 366 − 72 − 54 = 240 bits (30 bytes) carried in a time slot that ideally carries 625 bits, the efficiency is not good. To improve efficiency, a frame can occupy up to five consecutive slots.
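The slot arithmetic above can be checked directly:

```python
# Slot arithmetic from the text: a 625 us slot at 1 Mb/s would ideally
# carry 625 bits, but only 366 bits are usable after the hop/stabilization
# intervals, and the 72-bit access code plus 54-bit header leave 240
# payload bits.

BITS_PER_SLOT_IDEAL = 625  # 625 us at 1 Mb/s
BITS_USABLE = 366
ACCESS_CODE, HEADER = 72, 54

payload_bits = BITS_USABLE - ACCESS_CODE - HEADER
print(payload_bits)                                   # 240 bits = 30 bytes
print(f"{payload_bits / BITS_PER_SLOT_IDEAL:.1%}")    # 38.4% efficiency
```

The 38.4% single-slot efficiency is what motivates multi-slot frames: amortizing one access code and header over up to five slots.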

A Bluetooth connection has two options for employing the time slots. The first is the Synchronous Connection-Oriented link (SCO link), which reserves time slots at regular intervals for time-bounded information, such as voice. For example, telephone-grade voice has a sample rate of 8 kHz, with each sample generating one byte; in other words, a byte is generated every 0.125 ms. Because a frame can carry 30 bytes in each slot, one slot should be reserved to carry voice every 3.75 ms. Each time slot has a length of 625 µs, meaning one out of every six slots is reserved.
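The SCO reservation arithmetic works out as follows:

```python
# SCO reservation arithmetic from the text: 8 kHz voice at one byte per
# sample fills a 30-byte single-slot payload every 3.75 ms; with 625 us
# slots, that is one slot reserved out of every six.

SAMPLE_RATE = 8000   # samples per second, one byte each
PAYLOAD_BYTES = 30   # single-slot payload
SLOT_US = 625

fill_time_us = PAYLOAD_BYTES / SAMPLE_RATE * 1e6  # time to fill one payload
print(fill_time_us)                   # 3750.0 us = 3.75 ms
print(int(fill_time_us // SLOT_US))   # 6 -> one slot in six is reserved
```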

The second option is the Asynchronous Connection-Less link (ACL link). Time slots are allocated on demand rather than being reserved. The master is in


charge of the allocation to one or multiple slaves. In this way, collisions from slaves are avoided and the master can control the Quality of Service (QoS) requirement in the link.

The protocol stack in the Bluetooth specification is depicted in Fig. 2.31. We describe the function of each module briefly on the right of the figure and leave the details to the specification. Readers can download it from the URL given in Section 2.7.

Figure 2.31 The Bluetooth protocol stack

Bluetooth and the IEEE 802.11 are designed for different purposes. The IEEE

802.11 intends to be a wireless LAN standard, while Bluetooth is designed for the wireless personal area network (wireless PAN, or WPAN). A comparison is listed in Table 2.8 below.

Currently, the IEEE 802.15 WPAN Working Group and the Bluetooth SIG are cooperating to improve the Bluetooth standard. Moreover, Task Group 2 in the IEEE 802.15 focuses on the coexistence problem caused by possible interference. Although there are arguments over how successful Bluetooth will be, many people optimistically expect the two standards to coexist.

                   IEEE 802.11                    Bluetooth
Frequency          2.4 GHz (802.11, 802.11b);     2.4 GHz
                   5 GHz (802.11a)
Data rate          1, 2 Mb/s (802.11);            1 Mb/s
                   5.5, 11 Mb/s (802.11b);
                   54 Mb/s (802.11a)
Range              around 100 m                   within 10 m
Power consumption  higher (up to 1 W,             lower (1 mW - 100 mW,
                   usually 30 - 100 mW)           usually about 1 mW)
PHY specification  Infrared, OFDM, FHSS, DSSS     FHSS
MAC                DCF, PCF                       Slot allocation
Price              higher                         lower
Major application  Wireless LAN                   Short-range connection

Table 2.8 A comparison of Bluetooth and IEEE 802.11

2.5 Device drivers

2.5.1 An introduction to device drivers

One of the main functions of an operating system is to control I/O devices.

The I/O part of the operating system can be structured into four layers, as follows. Note that the interrupt handler can also be thought of as part of the driver.

Figure 2. Structure of I/O software

All the device-dependent code is embedded in the device drivers. The device driver issues commands to the device registers and checks whether they are carried out properly. Thus, the network device driver is the only part of the operating system that knows how many registers the network adaptor has and what they are used for.

In general terms, the job of a device driver is to accept abstract requests from the device-independent software above it, and to handle these requests by issuing commands to device registers. After commands have been issued, one of two situations will happen. One is that the device driver blocks itself until the interrupt comes in to unblock it. The other is that the operation finishes immediately, so the driver does not need to block.

User processes                  (I/O calls, spooling)
Device-independent OS software  (naming, protection, allocation)
Device driver                   (set up device registers, check status)
Interrupt handlers              (wake up driver when I/O completed)
Device                          (perform I/O operations)

I/O requests flow down through these layers, and I/O replies flow back up.


2.5.2 How to write a device driver in Linux

Before a device driver can communicate with a device, it must initialize the environment so that everything is ready. Initialization includes probing I/O ports for communicating with device registers, and probing IRQs so that the interrupt handler can be installed correctly.

Probe Hardware

The method of probing hardware in a driver differs with the type of bus architecture. PCI devices are automatically configured at boot time. The device driver, then, must be able to access configuration information in the device in order to complete the initialization. This happens without any need to perform probing. The device drivers for ISA devices, however, have to do the probing themselves.

Let's look at the PCI devices first. In Linux kernel version 2.4, the I/O ports of PCI devices have been integrated into the generic resource management. We can use the following functions to get the I/O ports of a device in the device driver:

unsigned long pci_resource_start(struct pci_dev *dev, int bar);
struct resource *request_region(unsigned long start, unsigned long len, char *name);
void release_region(unsigned long start, unsigned long len);

First, we use pci_resource_start() to get the base address. Then, we use request_region() to reserve the I/O ports. Finally, the driver should call release_region() to release the ports when it finishes. As far as interrupts are concerned, PCI is easy to handle. By the time Linux boots, the firmware has already assigned a unique interrupt number to the device; it is kept in the configuration register named PCI_INTERRUPT_LINE, which is one byte wide. We can use the following function to get the IRQ number of a device in the device driver:

int pci_read_config_byte(struct pci_dev *dev, int where, u8 *ptr);

The where argument should be PCI_INTERRUPT_LINE, and the ptr argument is a pointer to the IRQ number.
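Putting these calls together, a driver's PCI setup might look roughly like the following sketch. The function my_dev_init(), the constant MY_IO_EXTENT, and the name "mydrv" are hypothetical placeholders; only the kernel calls themselves come from the text above.

```c
/* Sketch of 2.4-era PCI setup.  my_dev_init(), MY_IO_EXTENT and
 * "mydrv" are hypothetical; error handling is minimal. */
static int my_dev_init(struct pci_dev *pdev)
{
    unsigned long ioaddr = pci_resource_start(pdev, 0); /* base of BAR 0 */
    u8 irq;

    if (!request_region(ioaddr, MY_IO_EXTENT, "mydrv"))
        return -EBUSY;                  /* ports already in use */

    pci_read_config_byte(pdev, PCI_INTERRUPT_LINE, &irq);
    /* save ioaddr and irq in the driver's private data ... */
    return 0;
}
/* On shutdown: release_region(ioaddr, MY_IO_EXTENT); */
```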

Now, let's look at the ISA devices. If we want to get the I/O ports of an ISA device in a device driver, the following procedure must be followed:

1. int check_region(unsigned long start, unsigned long len);
   This function is used to see whether a range of ports is available for allocation.
2. The probe routine probe_hardware() makes sure the device exists. It is not provided by the kernel; it must be implemented by the driver writer.


3. Use request_region() to actually allocate the ports.
4. Use release_region() to release the ports when the driver finishes.

If we want to get the IRQ number of an ISA device in the device driver, we can use the following functions:

unsigned long probe_irq_on(void);
int probe_irq_off(unsigned long);

The function probe_irq_on() returns a bit mask of unassigned interrupts. The driver must preserve the returned bit mask and pass it to probe_irq_off() later. After calling probe_irq_on(), the driver should arrange for its device to generate at least one interrupt. Once the device has raised an interrupt, the driver calls probe_irq_off(), passing as the argument the bit mask previously returned by probe_irq_on(). The function probe_irq_off() returns the number of the interrupt that was issued after probe_irq_on().

Interrupt Handling

Data transferred to or from a hardware device might experience delay for some reason. Therefore, the device driver should buffer the data for a while. A good buffering mechanism is interrupt-driven I/O, in which the input buffer is filled at interrupt time by an interrupt handler and is consumed by the process later. Similarly, the output buffer is filled by the process and is consumed at interrupt time by an interrupt handler later. For the most part, a device driver only needs to register an interrupt handler for its device and handle the interrupts properly when they arrive. We use the following functions to register (install) and free (uninstall) an interrupt handler:

#include <linux/sched.h>
int request_irq(unsigned int irq, void (*handler)(int, void *, struct pt_regs *), unsigned long flags, const char *dev_name, void *dev_id);
void free_irq(unsigned int irq, void *dev_id);
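As a sketch of how registration typically looks in a 2.4-era driver (the handler body, the name "mydrv", and the dev structure are hypothetical; SA_SHIRQ is the flag for sharing an IRQ line):

```c
/* Sketch: installing and removing an interrupt handler (2.4-era API).
 * my_interrupt() and "mydrv" are hypothetical. */
static void my_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    /* acknowledge the device, move data, perhaps schedule a bottom half */
}

/* at initialization time: */
if (request_irq(dev->irq, my_interrupt, SA_SHIRQ, "mydrv", dev))
    return -EBUSY;              /* the IRQ line is unavailable */

/* at shutdown time: */
free_irq(dev->irq, dev);
```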

When an interrupt happens, the following series of events may occur in the system:

1. Hardware stacks the program counter, etc.
2. Hardware loads a new program counter from the interrupt vector.
3. Assembly language procedure saves the registers.
4. Assembly language procedure sets up a new stack and calls a C procedure to do the actual work of processing the interrupt.
5. C language procedure handles the actual interrupt routine, wakes up the process, may call schedule(), and finally returns to the assembly language.


6. Assembly language procedure starts up the current process.

Items 3 to 6 belong to the ISR process, and Item 5 is the interrupt handler. Old versions of the Linux kernel took great pains to distinguish between "fast" and "slow" interrupts. Fast interrupts are those that can be handled very quickly, whereas slow ones take a much longer time. In the function request_irq(), the flags argument can be set to SA_INTERRUPT to install a fast handler. However, in modern kernels, fast and slow interrupts are almost the same. Below is a comparison between fast and slow interrupts:

Function                                               Fast interrupt  Slow interrupt
Disable interrupt reporting in the microprocessor
when the handler runs                                  Yes             No
Disable the interrupt being serviced in the interrupt
controller when the handler runs                       Yes             Yes
Call ret_from_sys_call() after the ISR finishes        No              Yes

Figure 2. Comparison between fast and slow interrupts

An interrupt handler does the following important things:

- Consider the meaning of the interrupt.
- Wake up the process waiting for the interrupt to be completed.
- If part of the service routine takes time, use the "bottom half" mechanism to handle it, which will be discussed later.

There are some restrictions on what an interrupt handler can do because it runs at interrupt time. An interrupt handler cannot transfer data to or from user space, because it does not execute in the context of a process. Also, it cannot do anything that would make itself sleep, such as calling sleep_on(). There are three arguments passed to an interrupt handler: irq, dev_id, and regs. The interrupt number, int irq, can be useful in a log message. The second argument, void *dev_id, is a pointer to the device. When interrupts are shared (e.g., two interrupt handlers share an IRQ number), the shared handler can use dev_id to recognize its own interrupts. The last argument, struct pt_regs *regs, is rarely used. It holds the processor context from before the processor entered the interrupt handler, so it can be used for monitoring and debugging.

One of the main problems with interrupt handling is how to perform long tasks


within an interrupt handler. There is often much work to do in response to a device interrupt, but interrupt handlers need to complete quickly and must not keep interrupts blocked for long. Obviously, these two requirements conflict with each other. Linux resolves the problem by splitting the interrupt handler into two halves. One is the top half, which is the routine that actually responds to the interrupt; it is the one we register with request_irq(). The other is the bottom half. It handles the time-consuming part of a task and is scheduled by the top half to be executed at a safer time, when the execution-time requirement is not so critical. The Linux kernel has two different mechanisms that may be used to implement bottom-half processing: BH (also called bottom half) and tasklets. The BH implementation is the older one, and it is implemented with tasklets in kernel 2.4. Tasklets were introduced in the 2.3 development series, and they are now the preferred way to do bottom-half processing. However, tasklets are not portable to earlier kernels, so if portability is a concern, BH is preferable.

The following functions are useful for working with tasklets:

DECLARE_TASKLET(name, function, data);
tasklet_schedule(struct tasklet_struct *t);

For example, suppose you write a function func() to be used as a bottom-half routine. The first step is to declare the tasklet with DECLARE_TASKLET(task, func, 0), where task is the name given to the tasklet. Then you schedule the tasklet with tasklet_schedule(&task). The actual tasklet routine, task, will be executed shortly, at the system's convenience. As mentioned earlier, this routine performs the bulk of the work of handling the interrupt.
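The pattern just described can be condensed into a short sketch (the body of func() is hypothetical):

```c
/* Sketch: deferring work to a tasklet (2.4-era API). */
static void func(unsigned long data)
{
    /* the longish work deferred from the interrupt handler */
}
DECLARE_TASKLET(task, func, 0);

/* in the top-half interrupt handler: */
tasklet_schedule(&task);        /* func() runs later, at a safer time */
```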

In the BH implementation, if you want to schedule a bottom half for running, you can use the function below:

void mark_bh(int nr);

Here, nr is the number of the BH to be activated. In the older BH implementation, mark_bh() would set a bit in a bit mask, allowing the corresponding bottom-half handler to be found quickly at runtime. In modern kernels, it just calls tasklet_hi_schedule(), similar to tasklet_schedule(), to schedule the bottom-half routine for execution.

Now, the last issue we want to discuss in interrupt handling is race conditions. Interrupt-driven I/O introduces the problem of synchronizing concurrent access to shared data items, and all the issues related to race conditions. Since an interrupt can happen at any time, it can cause the interrupt handler to be executed in the middle of an arbitrary piece of driver code. Therefore, a device driver that works with interrupts (in fact, the most common case) must be very careful about race conditions. In Linux, there are many techniques to prevent data corruption,


but we only introduce the most common one: using spinlocks to enforce mutual exclusion.

Spinlocks are represented by the type spinlock_t. There are a number of functions (actually macros) working with spinlocks:

void spin_lock(spinlock_t *lock);
void spin_lock_irqsave(spinlock_t *lock, unsigned long flags);
void spin_lock_irq(spinlock_t *lock);
void spin_lock_bh(spinlock_t *lock);
void spin_unlock(spinlock_t *lock);
void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags);
void spin_unlock_irq(spinlock_t *lock);
void spin_unlock_bh(spinlock_t *lock);
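As a sketch of how these primitives are typically paired when driver code shares data with its interrupt handler (my_lock and shared_count are hypothetical):

```c
/* Sketch: guarding data shared with an interrupt handler (2.4-era
 * API).  my_lock and shared_count are hypothetical. */
static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;
static int shared_count;

/* in process context: also disable local interrupts, or the handler
 * could interrupt us and spin forever on a lock we already hold */
unsigned long flags;
spin_lock_irqsave(&my_lock, flags);
shared_count++;
spin_unlock_irqrestore(&my_lock, flags);

/* in the interrupt handler: interrupts are already disabled here */
spin_lock(&my_lock);
shared_count--;
spin_unlock(&my_lock);
```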

The spin_lock() spins (busy-waits) to acquire the given lock. Upon returning from spin_lock(), the caller owns the lock. The spin_lock_irqsave() also acquires the lock; in addition, it disables interrupts on the local processor and stores the current interrupt state in the argument flags. The spin_lock_irq() acts like spin_lock_irqsave(), except that it does not save the current interrupt state. The spin_lock_bh() obtains the given lock and prevents the execution of bottom halves. The unlock functions are the counterparts of the various locking primitives. The spin_unlock() unlocks the given lock. The spin_unlock_irqrestore() unlocks the given lock and enables interrupts depending on the flags value, which should come from spin_lock_irqsave(). The spin_unlock_irq() unlocks the given lock and enables interrupts unconditionally. The spin_unlock_bh() unlocks the given lock and re-enables bottom-half processing. In each case, you should make sure that each lock function executes before its unlock function and that they are properly paired; otherwise, serious disorder may happen.

Communicate with Hardware through I/O Ports

After probing hardware, the device driver can obtain the I/O ports and use them in its activities. Most hardware differentiates between 8-bit, 16-bit, and 32-bit ports. Therefore, a C program must call different functions to access ports of different sizes. The Linux kernel defines the following functions to access I/O ports.

unsigned inb(unsigned port);
void outb(unsigned char byte, unsigned port);

The inb() reads a byte (8-bit) port, while outb() writes a byte port.


unsigned inw(unsigned port);
void outw(unsigned short word, unsigned port);

The inw() reads a 16-bit port, while outw() writes a 16-bit port.

unsigned inl(unsigned port);
void outl(unsigned long word, unsigned port);

The inl() reads a 32-bit port, while outl() writes a 32-bit port.

In addition to the single-shot in and out operations, string operations are supported in Linux:

void insb(unsigned port, void *addr, unsigned long count);
void outsb(unsigned port, void *addr, unsigned long count);

The insb() reads count bytes from a byte port and stores them in memory starting at the address addr. The outsb() writes count bytes located at memory address addr to a byte port.

void insw(unsigned port, void *addr, unsigned long count);
void outsw(unsigned port, void *addr, unsigned long count);

These operations are similar to the above functions, except that the port is a 16-bit port.

void insl(unsigned port, void *addr, unsigned long count);
void outsl(unsigned port, void *addr, unsigned long count);

These operations are similar to the above functions, except that the port is a 32-bit port.

2.5.3 Linux Open Source Implementation: A Network Device Driver

In this section, we use a real-world network device driver in Linux, ne2k-pci, as an example. A network device driver serves as a bridge between the network interface card (NIC) and the protocol driver (e.g., the TCP/IP protocol stack). Interrupt-driven I/O is applied here as well. When a NIC receives a packet, it notifies the OS by interrupting the CPU. Then, the interrupt handler transfers the incoming packet from NIC memory to system memory, processes it, and finally pushes it into a kernel queue to be handled by the bottom-half routine (e.g., the TCP/IP protocol stack). When the kernel has a packet to send out, it first passes the packet to the NIC driver. The driver then processes the packet, for example by filling in the MAC address. Finally, the driver transfers the packet from system memory to NIC memory. After the packet is transmitted completely, the NIC interrupts the CPU to notify the OS. Every time the interrupt handler finishes an interrupt, it acknowledges the NIC by writing messages to the NIC registers.

In Linux 2.4, there are two important data structures which are sk_buff and


net_device, used in a NIC driver. The sk_buff structure represents a packet, while net_device stands for a network device. Fig. 2 shows where these two data structures are located in Linux.

As shown in Fig. 2, when the NIC driver gets a frame from the NIC, it allocates an sk_buff and holds the frame in the "data" field of the sk_buff. Afterward, the frame "lives" in the kernel in the form of an sk_buff. The sk_buff structure is defined in the header file <linux/skbuff.h>, and the following Table 2. explains the major fields of sk_buff.

Field     Meaning
head      pointer to the start of the sk_buff
data      pointer to the start of the actual data (packet)
tail      pointer to the end of the actual data (packet)
end       pointer to the end of the sk_buff
dev       device that the packet arrives on or leaves by
len       length of the actual data (packet)
pkt_type  packet class
h         transport-layer header
nh        network-layer header
mac       link-layer header
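To see how these fields are filled in practice, here is a hedged sketch of a typical 2.4-era receive path; pkt_len, nic_buffer, and my_dev are hypothetical, while dev_alloc_skb(), skb_put(), eth_type_trans(), and netif_rx() are the kernel helpers involved:

```c
/* Sketch: building an sk_buff for a received frame (2.4-era API).
 * pkt_len, nic_buffer and my_dev are hypothetical. */
struct sk_buff *skb = dev_alloc_skb(pkt_len + 2);
if (skb) {
    skb_reserve(skb, 2);                    /* align the IP header */
    memcpy(skb_put(skb, pkt_len),           /* advances tail by pkt_len */
           nic_buffer, pkt_len);
    skb->dev = my_dev;                      /* net_device it arrived on */
    skb->protocol = eth_type_trans(skb, my_dev);
    netif_rx(skb);                          /* enqueue for the upper layers */
}
```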

Regarding the relation between the net_device structure and "local" in the figure (which is not a data structure), it means that most fields of net_device get their values in the NIC driver, and these values are generated locally in the driver. The net_device structure is defined in the header file <linux/netdevice.h>, and Table 2. lists its main fields.

Figure 2. Location of sk_buff and net_device

Table 2. sk_buff structure


Before a driver can transmit and receive packets, it must do initialization work, which includes registering an interrupt handler and probing the hardware. We can use request_irq() to register an interrupt handler. However, this driver serves PCI network devices, so it does not have to do real probing. The fundamental job of a NIC driver is to deliver packets between the kernel and a network device. Hence, we illustrate packet transmission and reception with the ne2k-pci NIC driver in Fig. 2 and Fig. 2.

Field            Meaning
name             device name
base_addr        device I/O address
irq              device IRQ number
dev_addr         hardware address
mtu              interface MTU value
hard_start_xmit  transmission service routine

Table 2. net_device structure

Figure 2. Packet transmission (the kernel calls dev->hard_start_xmit, i.e., ei_start_xmit, which calls ne2k_pci_block_output and NS8390_trigger_send; the NIC then interrupts, and ei_interrupt calls ei_tx_intr, which calls NS8390_trigger_send for the next frame and netif_wake_queue)


In the packet transmission phase, when the kernel wants to send a packet, it calls the transmission service routine dev->hard_start_xmit(), which is actually implemented by ne2k-pci and named ei_start_xmit(). ei_start_xmit() first uses ne2k_pci_block_output(), which moves the packet from system memory to NIC memory, and then calls NS8390_trigger_send(), which triggers the NIC to push the packet out. When the packet transmission is completed, the NIC issues an interrupt to get the kernel's attention. Consequently, the kernel calls the corresponding interrupt handler, which was registered using request_irq() in the initialization step; here that means ei_interrupt(). ei_interrupt() examines what the interrupt means; because this is a transmission-complete interrupt, it calls ei_tx_intr() to check for errors and then trigger the next packet to be sent. Finally, ei_tx_intr() calls netif_wake_queue() to tell the kernel that it can go on transmitting packets.

In the packet reception phase, when the NIC receives a packet, it raises an interrupt to tell the kernel that a packet has arrived. The kernel then calls ei_interrupt() to handle this interrupt. ei_interrupt() finds out that this interrupt is due to a packet reception, so it calls ei_receive() to get the packet out of the NIC's buffer. ei_receive() calls ne2k_pci_block_input() to move the packet from NIC memory to system memory, and then calls netif_rx() to enqueue the packet in a kernel queue, to be processed later by the networking subsystem, which acts like a bottom-half routine doing the longish tasks.
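The transmission path just described can be sketched as follows. This is heavily condensed from the real ne2k-pci/8390 code, so the argument lists should be read as approximate:

```c
/* Condensed sketch of the ne2k-pci transmit path (arguments
 * approximate; error handling and locking omitted). */
static int ei_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    ne2k_pci_block_output(dev, skb->len, skb->data, tx_start_page);
    NS8390_trigger_send(dev, skb->len, tx_start_page);
    dev_kfree_skb(skb);         /* the frame now lives in NIC memory */
    return 0;
}

/* later, on the transmission-complete interrupt:
 *   ei_interrupt() -> ei_tx_intr() -> netif_wake_queue(dev)      */
```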

Figure 2. Packet reception (the NIC interrupts; ei_interrupt calls ei_receive, which calls ne2k_pci_block_input and netif_rx)


2.6 Pitfalls and fallacies

Ethernet performance (utilization in half-duplex and full-duplex mode)

Researchers are interested in the maximum channel utilization of Ethernet under extremely heavy load, even though the situation is unlikely to happen in practice. Computer simulation, mathematical analysis, and real-world measurement are possible approaches to obtaining the value. Unlike simple mechanisms such as ALOHA and slotted ALOHA, the full CSMA/CD mechanism is very difficult to analyze mathematically. As early as the invention of the experimental Ethernet at the Xerox lab, Bob Metcalfe and David Boggs published a paper reporting that, with their simplified model, Ethernet could reach a maximum channel utilization of about 37 percent. Unfortunately, the value has been cited over the years, even though Ethernet technology has differed substantially from the experimental one ever since the DIX standard: different FCS, different preamble, different address format, different PHY, and so on; only the spirit of CSMA/CD was preserved. This is not to mention that the simplified model differs from real-world situations. Besides, 256 stations were assumed in the same collision domain, which is unlikely to happen in the real world.

A later paper published by David Boggs et al. in 1988 tried to clarify the pitfalls. They performed real-world testing on a 10 Mb/s Ethernet system with 24 stations by flooding frames constantly. It showed that under stress testing the utilization is more than 95% with the maximum frame and about 90% with the minimum frame12. Ethernet performance is thus rather satisfactory.

As switches become more popular, multi-segment networks are divided into many individual collision domains, and the situation of many stations sharing the same collision domain occurs even less. Since the advent of full-duplex operation, there is no restriction imposed by CSMA/CD at all: both sides of a link can transmit as fast as they can. A switch that can afford the maximum frame rate and data capacity is called a wire-speed or non-blocking switch.

Another interesting problem that might be of concern is that the data field in the Ethernet frame is not "long" enough. Compared with other technologies, say Token Ring, which has a data field of 4528 bytes at 4 Mb/s and 18173 bytes at 16 or 100 Mb/s, the data field is only 1500 bytes out of the 1518 bytes of a maximum

12 Boggs's paper counts the overheads in the header, trailer, and IFG as utilization. Hence, in his paper, one hundred percent utilization is assumed if there are no collisions, despite those overheads.


untagged frame. People may suspect that the percentage of non-data overhead, including header information, trailer, and IFG, is larger than in other technologies.

There is a historical reason why the Ethernet frame is not so long. Ethernet was invented more than 20 years ago, when memory was expensive, and the buffer memory for frames was quite limited in size in those days. It made sense to design a frame that is not too long, and neither is the data field.

Things are not as bad as they look! For large data transfers such as FTP traffic, which tend to use long frames, the data field can occupy as much as 1500 / (1518 + 8 + 12) = 97.5% of the channel bandwidth. The overhead is quite low! It is hard to improve this value significantly by increasing the maximum frame size.

Collision domain, broadcast domain, and VLAN

The first two terms are often confused by students first learning Ethernet. A collision domain is the range of a network in which more than one simultaneous transmission results in a collision. For example, a repeater hub and the stations attached to it form a collision domain. In contrast, a switch explicitly separates the collision domains of its ports. In other words, a transmission from a shared LAN attached to one port will not collide with a transmission from the LAN belonging to another port.

However, when a frame has a broadcast address as its destination, a switch will still forward it to all ports but the source. The range of the network that broadcast traffic can reach is a broadcast domain. Sometimes we need to confine broadcast traffic for security reasons or bandwidth savings. The VLAN approach separates the broadcast domains of different VLANs. It is a logical separation independent of physical connectivity. A device providing higher-layer connectivity, such as a router, is needed to connect two or more separate VLANs.

5-4-3 rule and multi-segment networks

It is said that Ethernet follows the 5-4-3 rule, which sounds easy to remember. However, the rule is not as simple as it sounds. Moreover, the rule is actually one of the conservative rules that validate the correctness of 10 Mb/s multi-segment Ethernet networks; it is not a law that every Ethernet deployment must follow. Let's go into the details.

As we mentioned, the round-trip propagation time in a collision domain


should not be too long for proper operation. Different transmission media and different numbers of repeater hubs introduce different delays, however. As a quick guide for network administrators, the IEEE 802.3 Standard offers two Transmission System Models. Transmission System Model 1 is a set of configurations that meet the above requirement; in other words, if you follow these configurations, your network will work properly. Sometimes you may need to deploy your network in ways other than the configurations in Transmission System Model 1. Then you have to calculate for yourself whether your network meets the requirements. Transmission System Model 2 offers a set of calculation aids for doing so. For example, it tells you the delay value of a segment of a certain medium type.

In Clause 13 “System considerations for multi-segment 10 Mb/s baseband networks,” the Standard has the following rule in the Transmission System Model 1:

“When a transmission path consists of four repeater sets and five segments, up to three of the segments may be mixing and the remainder must be link segments.” – cited from the Standard.

This is the face of the well-known 5-4-3 rule. Note the definitions of mixing segments and link segments: a mixing segment is a medium on which there are more than two physical interfaces, while a link segment is a full-duplex-capable medium between exactly two physical interfaces. People often refer to a link segment as a segment without PCs, but that is not a precise description. The rule means that if you configure your network this way, it will work.

As more and more segments operate in full-duplex mode, the significance of this rule is becoming minor. However, it is often overemphasized by material left over from history.

Big-Endian and Little-Endian

Those who are familiar with network programming may be confused about Big-Endian and Little-Endian. They know that network byte order, such as that of the Internet Protocol (IP), uses Big-Endian byte ordering. However, the text in this chapter says that Ethernet transmits data in Little-Endian order. Is there a contradiction?

Consider a four-byte word and let us denote its bytes by b3 b2 b1 b0 in decreasing order of significance. There are two options for storing it in memory:

1. Store b3 in the lowest byte address, b2 in the second lowest byte address, and so on.
2. Store b3 in the highest byte address, b2 in the second highest byte address, and so on.


The former is known as the Big-Endian byte order, and the latter is known as the Little-Endian byte order. The ordering varies with the CPU and OS of a host. This results in inconsistency when transmitting multi-byte data, say integers, over the network. To keep consistency, a network byte order is enforced. The most popular network layer protocol, the Internet Protocol, uses Big-Endian ordering. Whatever the host byte ordering is, the data should be converted into network byte order before transmission and then turned back into host byte order upon receipt, wherever there might be an inconsistency.

That is the business of the Internet Protocol. The data-link layer protocol receives the data to be transmitted from the upper-layer protocols byte by byte. What byte ordering the upper-layer protocols use is of no consequence to the data-link layer protocol, which is concerned with bit ordering in transmission, not byte ordering.

Ethernet uses Little-Endian bit ordering: it transmits the least significant bit of each byte first and the most significant bit last. Conversely, Token Ring and FDDI transmit the most significant bit first and the least significant bit last; they are said to use Big-Endian bit ordering. Bit ordering should not be confused with byte ordering.

2.7 Further readings

General issues

Andrew S. Tanenbaum, “Computer Networks,” Third Edition, Prentice Hall, 1996. This textbook introduces general computer networking concepts in a bottom-up approach, from the physical layer to the application layer.

William Stallings, “Data and Computer Communications,” Sixth Edition, Prentice Hall, 2000. Besides computer networks, this book places somewhat more emphasis on data communications.

Larry L. Peterson and Bruce S. Davie, “Computer Networks: A Systems Approach,” Second Edition, Morgan Kaufmann, 2000. It is a newer textbook on computer networks and covers newer topics such as wireless LANs and VPNs.

PPP

W. Simpson, “The Point-to-Point Protocol (PPP),” RFC 1661, July 1994. This RFC defines PPP.

L. Mamakos, K. Lidl, J. Evarts, D. Carrel, D. Simone, R. Wheeler, “A Method for Transmitting PPP Over Ethernet (PPPoE),” RFC 2516, February 1999.


This RFC defines PPPoE.

G. McGregor, “The PPP Internet Protocol Control Protocol (IPCP),” RFC 1332, May 1992. This RFC defines IPCP.

Andrew Sun, “Using and Managing PPP,” O’Reilly, 1999. This hands-on book introduces practical PPP operation on Unix.

Ethernet

Rich Seifert, “Gigabit Ethernet,” Addison-Wesley, 1998.

Rich Seifert is a coauthor of the IEEE 802.1 and 802.3 Standards. His book is characterized by technical accuracy and market insight. It is a must-read if you hope to get into the technical details of Gigabit Ethernet without being worn down by the detailed but dry wording of the Standard.

Rich Seifert, “The Switch Book,” John Wiley &amp; Sons, 2000. This book offers a full discussion of switches. You will find great detail on STP, VLANs, link aggregation, and more.

Charles E. Spurgeon, “Ethernet: The Definitive Guide,” O’Reilly, 2000. Mr. Spurgeon is an experienced network architect. This book introduces Ethernet from an administrative point of view.

ISO/IEC Standard 8802-3, “Carrier sense multiple access with collision detection (CSMA/CD) access method and physical layer specifications,” 2000 Edition. This is the Standard document. As of April 15, 2001, all of the IEEE 802 Standards have been freely available at http://standards.ieee.org/getieee802/.

10 Gigabit Ethernet Alliance, “10 Gigabit Ethernet Technology Overview: White paper,” http://www.10gea.org, September 2001. This white paper is published by the 10 Gigabit Ethernet Alliance, a technical consortium intent on pushing the next-generation 10 Gigabit Ethernet.

Howard Frazier, “Ethernet takes on the first mile,” IT Professional, vol. 3, issue 4, July-Aug. 2001. Mr. Frazier is the chair of IEEE 802.3ah. In this article he describes the future perspective of Ethernet in the first mile.

Howard Frazier, “Ethernet in the first mile tutorial,” IEEE 802.3 EFM study group, http://www.ieee802.org/3/efm/public/jul01/tutorial/index.html, July 2001. This is a tutorial provided by the IEEE 802.3ah Task Force.

ISO/IEC Standard 15802-3, “Media Access Control (MAC) Bridges,” 1998 Edition.


It is the MAC bridge Standard, also available on the web site mentioned above.

IEEE 802.1Q, “Virtual Bridged Local Area Networks,” 1998 Edition. It is the VLAN bridge Standard, also available on the web site mentioned above.

Device Drivers

A. Rubini and J. Corbet, “Linux Device Drivers,” Second Edition, O’Reilly, 2001. This is an excellent book that teaches you how to write Linux device drivers.

Wireless Protocols

ANSI/IEEE Standard 802.11, “Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications,” 1999 Edition. It is the wireless LAN Standard, also available on the web site mentioned above.

P. Brenner, “A Technical Tutorial on the IEEE 802.11 Protocol,” http://www.sss-mag.com/pdf/802_11tut.pdf. It is a good tutorial document on IEEE 802.11.

Bluetooth SIG, “Specification of the Bluetooth System,” Ver. 1.1, http://www.bluetooth.com/developer/specification/specification.asp, February 2001. It is the standard document of the Bluetooth system.

P. Bhagwat, “Bluetooth: Technology for Short-Range Wireless Apps,” IEEE Internet Computing, vol. 5, issue 3, pp. 96-103, May/June 2001. It is a good tutorial paper on Bluetooth.

2.8 Exercises

Hands-on exercises

1. Read the following two documents to see how an IEEE Standard comes out. Write a summary of the standardization process.
[1] 10 Gigabit Ethernet Alliance, “10 Gigabit Ethernet Technology Overview: White paper,” http://www.10gea.org, September 2001.
[2] http://www.ieee802.org/3/efm/public/sep01/agenda_1_0901.pdf.

2. You may download IEEE 802 Standards at http://standards.ieee.org/getieee802/. Write down the development goals of the following projects: 802.1w, 802.3ac, 802.15, 802.16, and 802.17.

3. Find the MAC address of your network interface card. Check http://standards.ieee.org/regauth/oui/oui.txt to compare its OUI with the one that has been registered.

4. Use Sniffer or similar software to find out how many kinds of protocol types appear in the “Type” field of the Ethernet frames you capture. What transport/application layer protocols, if any, do they belong to?

5. Find out whether your network interface card is operating in half-duplex or full-duplex mode.

6. Trace the source code of one of the following protocols: (1) HDLC, (2) PPPoE, (3) wireless LAN, (4) Bluetooth. Explain the purpose of each major function of the protocol implementation you trace and draw a flow chart with the function names to show the execution flow.

7. After making the kernel and choosing some drivers to be modularized, how do we compile, install, and run these modules? Please also write one small module. Show what commands are needed to compile and install it. How do you show that your module has been successfully installed? (Hint: read insmod(8), rmmod(8), and lsmod(8).)

8. A packet’s life: measure how much time a packet spends in the driver, in DMA, and on the CSMA/CD adapter. (You can use rdtscll, defined in <asm/msr.h>, to read the CPU cycle counter.)

Written exercises

1. We know 32-bit IPv4 addresses may not be enough. Are 48-bit MAC addresses enough? Discuss it.
2. Read RFC 1071 and RFC 1624 to see how the IP checksum is computed. Practice by hand with the following block of words: 0x36f7 0xf670 0x2148 0x8912 0x2345 0x7863 0x0076. What if the first word above is changed to 0x36f6?

RFCs can be downloaded from ftp://ftp.csie.nctu.edu.tw/pub/Documents/RFC/.
3. Compute the CRC code given the message 1101010011 and the pattern 10011. Verify that the code is correct.
4. Why is the destination address field usually located at the head of a frame, and the FCS field at the tail of a frame?
5. What are the advantages and disadvantages if we make the minimum Ethernet frame larger?
6. Suppose the data payload is prepended with 40 bytes of IP and TCP headers in a frame. How many bits of data payload can be carried in the 100 Mb/s Ethernet if each frame is a maximum untagged frame?
7. Should a switch recompute a new FCS for an incoming frame before it is forwarded?
8. There is an optional priority tag in the Ethernet frame, but it is not often employed. Why?
9. Why does Ethernet not implement a complicated flow control mechanism such as sliding-window?
10. What happens if your network interface card runs in full-duplex mode in a shared network?
11. Should each port in a switch have its own MAC address? Discuss it.
12. Suppose each entry in the address table of a switch needs to record the MAC address, 8 bits of port number, and 2 bits of aging information. What is the minimum memory size if the table can record 4096 entries?

13. Suppose bit stuffing with 0 is used after 5 consecutive 1’s. Assuming the probabilities of 0’s and 1’s in the bit stream are equal and the occurrences are random, what is the transmission overhead of the bit stuffing scheme? (Hint: Formulate a recursive formula f(n) to find the expected number of overhead bits in an n-bit string first.)

14. Write a simulation program to verify that the numerical answer above is correct.

15. In 1000BASE-X, a frame of 64 bytes is first block coded with 8B/10B before transmission. Suppose the propagation speed is 2x10^8 m/s. What is the frame “length” in meters? (Suppose the cable is 500 m long.)

16. What is the probability of two stations taking 5 more trials to resolve collisions after they have the first collision? (Suppose only two stations are in the collision domain.)

17. What is the maximum number of frames a switch of 16 Fast Ethernet (100 Mb/s) ports may deal with if each port operates in full-duplex mode?

18. A CPU executes instructions at 800 MIPS. Data can be copied 64 bits at a time, with each 64-bit word copy costing six instructions. If an incoming frame has to be copied twice, how much bit rate, at most, of a line can the system handle? (Assume that all instructions run at the full 800-MIPS rate.)

19. A frame of 1500 bytes travels through 5 switches along the path. Each link has a bandwidth of 100 Mb/s, a length of 100 m, and a propagation speed of 2x10^8 m/s. Assuming a queueing and processing delay of 5 ms at each switch, what is the approximate end-to-end delay for this packet?

20. One out of n frames of 1000 bytes suffers from an error on average if the bit error rate is 10^-6. What is n?
21. Come up with a question and answer it yourself.