
CO-SCHEDULING OF TRAFFIC SHAPING AND DMA TRANSFER IN A HIGH PERFORMANCE NETWORK ADAPTER

Haakon Bryhni, Per Gunningberg, Espen Klovning, Øivind Kure and Jochen Schiller.

Author’s Addresses
(Corresponding Author) Haakon Bryhni, Dept. of Informatics, University of Oslo, Box 1080, N-0316 Oslo, Norway. Phone: +47 22 85 24 34 Fax: +47 22 85 24 01 Email: [email protected]

Per Gunningberg, Dept. of Computer Systems, University of Uppsala, Box 325, S-751 05 Uppsala, Sweden. Phone: +46 18 471 3171 Fax: +46 18 550225 Email: [email protected]

Espen Klovning, Telenor Research & Development, Box 83, N-2007 Kjeller, Norway. Phone: +47 51 76 57 45 Fax: +47 51 76 57 30 Email: [email protected]

Øivind Kure, Telenor Research & Development, Box 83, N-2007 Kjeller, Norway. Phone: +47 63 84 88 63 Fax: +47 63 38 48 88 Email: [email protected]

Jochen Schiller, Institute of Telematics, University of Karlsruhe, D-76128 Karlsruhe, Germany. Phone: +49 721 608 6415 Fax: +49 721 38 80 97 Email: [email protected]

Abstract
A high capacity network adapter design is presented, using a CPU-based approach for shaping traffic, scheduling cells for multiplexing and controlling a DMA engine for a large number of ATM connections. By performing a scheduling operation for each cell, traffic shaping and data movement can be combined. Two scheduling algorithms are proposed, implemented and evaluated using detailed instrumentation on current high performance processors. We focus on the outbound direction since this is where most data is moved in a large network server, such as a high capacity WWW server. It is anticipated that the rapid increase in processing power can enable future servers to perform such scheduling in real time, even on very high capacity network connections. We present a general architecture for the adapter, discuss how shaping and scheduling can be combined, and present measurements that show how the proposed scheduling algorithms perform close to the real time requirements of high capacity network connections, even using current general-purpose microprocessors. We have chosen ATM for our evaluation, although the scheduling algorithms are general and may be employed in any network technology with flow reservations, such as the next generation Internet protocols.

Keywords
Network Adapter, Traffic Shaping, ATM, QoS, Traffic Contracts, Large Servers, Scheduling, GCRA, DMA.


1 INTRODUCTION
Future network servers are expected to handle a large number of concurrent connections with varying Quality of Service (QoS) requirements [1]. Examples of likely traffic sources are voice/video conferences integrated with multimedia documents, multimedia document retrieval, WWW, file transfer and real time interactive games. The bandwidth requirements will range from a few Kbit/s to several Mbit/s. Thus, servers using network technologies with link bandwidths of hundreds of Mbit/s have the potential to carry a substantial number of these connections. The problem which arises is how to handle all these connections and their QoS requirements.

As an example, existing network adapters for Asynchronous Transfer Mode (ATM) can only support a limited set of concurrent connections. Most of these adapters use hardware chips. Typically, they can handle 16-128 concurrent connections with different traffic parameters (e.g., [2], [3], [4]), and ensure that the connections behave according to their traffic contracts. If the traffic contracts are not respected, policing functions at the edge of the network may discard cells to ensure isolation between users and avoid overload in the network. The policing entities in the public network can already police thousands of concurrent connections according to different traffic parameters (e.g., [5], [6]). New capabilities such as scalable shaping and traffic scheduling are needed to support thousands of connections also in the end system where traffic is generated.

In our view, a more advanced network adapter is needed to perform traffic shaping. Our hypothesis is that a CPU-based network adapter can handle a large number of connections, which is not possible with hardware solutions. With a CPU-based approach we can shape individual connections in the outbound direction. Using state of the art microprocessors to handle connections with different traffic characteristics has several benefits. The algorithms can easily be changed to suit the needs of different traffic types. In addition, a software implementation makes it easier to use new on-board processors as they become available. A software based solution will scale with the increase in processing power provided the memory system keeps up [7]. An additional issue investigated in this paper is if and how traffic shaping and the scheduling of memory transactions can be combined. The contribution of this work is a feasibility study of two different software schedulers running on an on-board network adapter CPU, combining the traffic shaping functionality with the scheduling of memory transactions.

Our scenario is a large network server (e.g. a WWW server) based on a set of multiprocessors connected to a high speed memory interconnect as shown in Figure 1. The focus of our work has been the outbound traffic, which is the most important for a network server. We assume that interconnect speed is not a bottleneck when data is moved from memory to the network adapter. This is a reasonable assumption, since current multiprocessor interconnects, and even I/O adapter based interconnects based on SCI [8], Memory Channel [9] or ServerNet [10], provide capacity at least an order of magnitude higher than the link speed at the network adapter. The network technology we have envisioned is ATM since it supports QoS parameters and traffic contracts for every connection. However, the proposed traffic shaping algorithms used by the network adapter are general and may be employed in any network technology with flow reservations, such as the next generation Internet protocols. In this work, we focus on how to schedule connections which have been accepted by the system. The connection acceptance phase [11] and the QoS architecture for the applications and higher layer protocols [1][12], which are also crucial to the server, are outside the scope of this work.

Figure 1 System architecture (diagram: CPUs and mass storage connected through an interconnect to the network adapter, which attaches to a Wide Area Network)

The most important issues for the advanced network adapter are to handle thousands of connections simultaneously and to integrate traffic shaping and scheduling of memory transactions.

Handling of many concurrent connections
Current WWW servers serve requests using a best-effort method, which is not sufficient when the individual connections have QoS requirements and as the number of connections increases. We assume that a future large WWW server must handle many concurrent connections with potentially highly different QoS requirements. For connections with variable service rate, the cell flow also changes rapidly over time. Thus, decisions on which connection to serve next must be made on the fly without jeopardizing the QoS for other currently active connections. Such systems require advanced solutions for handling the connections and an integration of the network subsystem into the system's global resource management [13]. If the network interface is ATM, the scheduling interval for a single 53 byte ATM cell is in the nanoseconds or microseconds range depending on the link speed. For a 622 Mbit/s ATM network link, the scheduling interval is roughly 680 ns, which is a highly demanding time-scale. The timing requirements will be more relaxed for a similar packet scheduler utilizing variable packet sizes, provided the average packet size is larger than the cell payload. Although our focus has been on cell level scheduling for ATM interfaces, variable packet length scheduling is also discussed in the paper.

Integration of traffic shaping and scheduling of memory transactions
In addition to handling all the connections, the advanced network adapter needs an efficient data transfer from main memory to the network interface that is adapted to the delivery requirements of the connections towards the network. Otherwise, unnecessary buffering is required. The design of the scheduling algorithm should not rely on specific knowledge about existing protocol implementations, networks, or applications, since they will change with time.

An interesting research question is whether traffic shaping and scheduling of DMA transfers can be efficiently combined. These operations are often decoupled, leading to hardware redundancy, double buffering of data, and limited capacity for shaping in the I/O subsystem. If the scheduling and traffic shaping functions can be combined, there is a potential benefit of moving data only when it is needed.

Related work
Different packet and cell level schedulers have been proposed in the literature [14], [15], [16], but primarily applied to fair switch level scheduling, without regard to packet or cell generation at the network adapter or to connection (VC level) specific QoS requirements. For many approaches, detailed simulation and analysis of the shaping and scheduling algorithms has been done, e.g., [17], but integration with DMA transfer scheduling and a real implementation is often missing. Actual implementations of scheduling and shaping mechanisms are mostly hardware based and therefore provide only limited capabilities due to chip space limitations [2], [3], [4], [18], [19].

Research in the network adapter area has concentrated mostly on efficient data transfer from the network to an application without unnecessary copying [20], [21], [22]. Often, those approaches require specialized and, thus, more expensive hardware. In addition, some operating system aspects have been considered to guarantee a share of processing power for the communication [23], [24]. Efficient memory management and architecture of the communication subsystem [25] have also been investigated in order to preserve the link speed to the applications. However, handling of thousands of simultaneous connections and generation of traffic conforming to traffic contracts have not been the subject of these approaches.

The paper is organized as follows. Section 2 presents an overview of the network adapter architecture and some important architectural issues. An outline of the scheduling problem and the two different solutions are presented in section 3. The prototype implementations and evaluation of these solutions are discussed in sections 4 and 5 respectively. In section 6 we present our conclusions regarding the feasibility of the proposed architecture.

2 NETWORK ADAPTER ARCHITECTURE
The network adapter is responsible for several functions in the outbound direction. It should transfer data from server memory, apply low level protocol functions, control the transmission of data according to a traffic contract and multiplex several connections. Our thesis is that a CPU based approach to these functions for a large number of connections is both feasible and efficient.

The overall design is illustrated in Figure 2. The adapter has a buffer memory to hold cells to be transmitted and a DMA engine that reads data from server memory into the buffer memory. Each connection has a separate queue of cells in this memory. The buffer memory is accessed by the ATM chip, which does the actual transmission. The purpose of the CPU is to schedule the DMA engine for a data transfer, schedule the ATM chip to read a cell from the buffer memory according to the traffic shape and multiplexing state, and exchange control messages with the host. The adapter may have other protocol hardware on board, such as for AAL5 checksum calculation, which also must be controlled by the CPU. In addition, there is some control logic that is specific to the interconnect. Note that the CPU does not touch data at all, since this would be too time consuming.

The CPU needs a fairly large memory to hold the state of each connection and the data structure for cell scheduling. The access time to this memory is crucial for performance, and it is expected that a large Level 2 (L2) cache memory is the most appropriate. The actual scheduling code is small enough to fit into the Level 1 (L1) instruction cache.

The design idea is that the CPU should control the DMA and ATM functions. This means that the CPU has enough information to bring in data from the server memory just in time for transmission, since it is deterministic when the next cell in a connection is allowed to be sent according to the shaping algorithm. The CPU will do this shaping by keeping a state for each connection. Furthermore, the CPU is in full control of the multiplexing of several cells by maintaining a ready-list of cells for transmission. With this information it is predictable when data needs to be fetched from memory.

Figure 2 Network adapter architecture (diagram: PDUs in server memory are moved by the DMA engine over the server interconnect into per-connection cell buffers; the CPU, holding contract and traffic shaping state in cache memory, maintains an ordered ready-list and selects cells for the ATM chip)

2.1 Overall scheduling algorithm
The CPU runs the following cycle for each available transmission cell slot.

1. Pick the first cell in the ready-list of cells.
2. Initiate the transfer of this cell to the ATM chip.
3. Calculate the time when the next cell in this connection should be sent by updating the traffic shape state for this connection.
4. Find an empty slot in the future for this next cell and insert it in the ready-list.
5. Schedule a possible DMA transfer for this connection.
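As an illustration only, the five steps can be sketched in Python. All names here (Connection, schedule_slot, the fixed-rate shaping) are hypothetical stand-ins, not the adapter implementation, which drives real DMA and ATM hardware and uses the data structures discussed in section 4:

```python
import heapq

class Connection:
    """Hypothetical per-connection state: a fixed shaping interval and a
    count of cells waiting for transmission."""
    def __init__(self, cid, interval, cells):
        self.cid, self.interval, self.cells = cid, interval, cells

    def update_shaping_state(self, slot):
        # Step 3: next conforming slot (here simply a fixed rate).
        return slot + self.interval

def schedule_slot(ready_list, connections, transmitted, dma_requests):
    """One scheduling cycle for the next available transmission cell slot."""
    if not ready_list:
        return                                     # idle cell slot
    slot, cid = heapq.heappop(ready_list)          # 1. first cell in ready-list
    conn = connections[cid]
    transmitted.append((slot, cid))                # 2. hand the cell to the ATM chip
    conn.cells -= 1
    next_slot = conn.update_shaping_state(slot)    # 3. update traffic-shape state
    if conn.cells > 0:
        heapq.heappush(ready_list, (next_slot, cid))   # 4. insert next cell
    dma_requests.append(cid)                       # 5. possible DMA prefetch

# Two connections: id 0 sends every 2 slots (3 cells), id 1 every 3 slots (2 cells).
conns = {0: Connection(0, 2, 3), 1: Connection(1, 3, 2)}
ready = [(0, 0), (0, 1)]
heapq.heapify(ready)
sent, dma = [], []
while ready:
    schedule_slot(ready, conns, sent, dma)
```

The sketch interleaves the two connections according to their shaping intervals, which is the essential behaviour the real cycle must reproduce within one cell slot time.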

This cycle must finish well within the time it takes to send a cell. For a 622 Mbit/s link this means within 680 ns. Besides running this cycle, the CPU needs to synchronize PDU information with the server and to allocate and deallocate buffer memory. But these tasks are triggered by asynchronous events and can be done in the background.
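The time budget follows directly from the cell size and the link rate; a quick check:

```python
# One ATM cell is 53 bytes; at 622 Mbit/s a cell slot lasts roughly 680 ns.
cell_bits = 53 * 8
link_rate_bit_per_s = 622e6
slot_time_ns = cell_bits / link_rate_bit_per_s * 1e9   # ~681.7 ns
```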

The first step is to pick a cell for transmission. The cells are scheduled to cell slots, i.e. the times when a cell should be sent according to the traffic parameters. The ready-list consists of all cells already scheduled for future cell slots. We keep this ready-list sorted at all times, and therefore the first element is the cell closest in time. Some cell slots will be empty when there are no connections that can use them. An already available cell in a connection cannot use such an empty slot if that would break the traffic parameters. If the slot is used anyway, the cell will most likely be discarded by the UPC (Usage Parameter Control) mechanism as an early cell outside the contract.

The ready-list will have one entry for each active connection. For 64K connections, the size of this list is considerable, and the time it takes to keep the list sorted is critical for performance. The design alternatives and trade-offs of different data structures are discussed in section 4.1 and section 4.2.
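One candidate structure for a slot-indexed ready-list is a calendar of future cell slots holding at most one cell each, where a colliding insert probes forward to the next free slot. This is a sketch under that assumption, not necessarily the structure evaluated in section 4:

```python
class SlotCalendar:
    """Hypothetical calendar queue over future cell slots, at most one cell
    per slot; a colliding insert probes forward to the next free slot."""
    def __init__(self, horizon):
        self.slots = [None] * horizon   # horizon: how far ahead we can schedule
        self.now = 0

    def insert(self, target_slot, conn_id):
        for s in range(target_slot, target_slot + len(self.slots)):
            i = s % len(self.slots)
            if self.slots[i] is None:
                self.slots[i] = conn_id
                return s                # actual slot granted (may be > target)
        raise RuntimeError("calendar full")

    def pop_next(self):
        # Advance over empty slots (these become idle cells on the link).
        for _ in range(len(self.slots)):
            i = self.now % len(self.slots)
            self.now += 1
            if self.slots[i] is not None:
                cid, self.slots[i] = self.slots[i], None
                return cid
        return None

cal = SlotCalendar(8)
cal.insert(0, "vc1")
cal.insert(0, "vc2")                    # collides with vc1, shifted to slot 1
order = [cal.pop_next(), cal.pop_next()]
```

Insertion cost is bounded per cell, which matters at 64K connections, at the price of slot drift when many connections contend for the same slot.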

After initiating the cell transmission, the CPU will retrieve the connection state for the transmitted cell. The state holds all the parameters for the traffic shaping. From this information the CPU will calculate when the next cell in the connection should be scheduled for transmission. The CPU will then try to allocate the slot corresponding to this time by checking the ready-list. If this slot is already occupied by a cell from another connection, the scheduler may try to find an empty slot later in time or to reschedule the conflicting cells using the CDVT (Cell Delay Variation Tolerance) parameter. After the cell has been inserted into the list, the CPU will update the state of the connection. Design issues for the scheduler include efficient utilization of the bandwidth, fairness in overload situations and minimizing the number of CPU cycles needed for the scheduling. These issues are discussed in section 4 and section 5.

The last step in the cycle is to initiate possible DMA transfers. Given the deterministic information from the traffic shaping, it should be possible to keep most of the PDUs in server memory and to move data just in time for transmission to the adapter. There are several potential advantages to using this information: prefetching the right amount of data will avoid delays caused by demand fetching, buffer size requirements can be reduced, and long blocking times for other interconnect transactions can be avoided. Coupling the DMA transfer for a connection with the actual sending of cells for this connection simplifies the design. The amount of cell buffer memory needed can then be decided, and the shared buffer problem with asynchronous readers and writers is avoided, since the transfer is synchronized with the transmission, i.e. the consumption of data.

There are a couple of DMA design issues that have to be addressed. The first is when a transfer should take place. The interconnect has some access time variance that must be compensated for. This variance motivates an earlier transfer than just before the data is needed. The second issue is the size of the data transfer unit. The smallest unit is a cell and the largest is the PDU. A small size will cause more overhead, while a big unit will consume cell memory buffers and may block other transfers. In this trade-off, the optimal transfer size of the interconnect must also be considered for efficiency. More detailed information on these issues can be found in section 4.3.


3 SCHEDULING AND SHAPING
The focus of our research is an architecture that can combine shaping and DMA transfer for servers with several thousands of connections with a given QoS. The underlying assumption is that there exists some sort of traffic contract for each connection. However, the traffic contract only specifies the boundaries the transmission patterns have to be within. As we will discuss in this section, there is substantial freedom in traffic shaping.

We have analyzed two different traffic shaping functions. They both conform to the policing function; i.e. they produce cell or packet streams that are in conformance with a traffic contract specified by the Generic Cell Rate Algorithm (GCRA) [26]. However, the traffic patterns diverge, and the implementation cost is different. For simplicity, the description and the analysis use ATM as a basis. However, the results are not restricted to ATM, but will apply to any communication system with traffic contracts, as discussed later.

3.1 Background
A traffic contract can be defined by the following parameters: a sustainable cell rate (SCR), a peak cell rate (PCR), a burst length or tolerance (Maximum Burst Size, MBS) and a cell delay variation tolerance (CDVT). In a public network there has to be an entity that controls the conformance of the traffic streams against the actual contract for each stream, in order to ensure isolation between different customers. A Generic Cell Rate Algorithm (GCRA) is defined to determine the conformance between a cell stream and the traffic contract. Two equivalent policing algorithms are the Leaky Bucket and Virtual Scheduling algorithms, termed LBA and VSA.

We use the terminology from the VSA that is documented in the ATM Forum UNI specification [26]. For each connection a theoretical arrival time (TAT) at the policing unit is calculated. The basis is the SCR and the PCR. For each arrival time there is also a range defining the maximum allowable time a cell can arrive before the theoretical arrival time (defined as L). A cell is defined as being within the contract if the arrival time t > Tenable = MAX(TATpcr - Lpcr, TATscr - Lscr). If a cell is sent after TATscr = Tdeadline, the SCR cannot be reached temporarily. However, sending cells too late does not violate a traffic contract but increases delay. In this work, we use a granularity of cell slots for Tenable and Tdeadline. However, fractions of cell slots can be represented as discussed in [28]. Figure 3 shows the different VSA parameters and their relationship.
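The VSA bookkeeping can be illustrated as follows. This is a simplified sketch in cell-slot units with a hypothetical contract; the normative pseudocode is in the UNI specification [26]:

```python
def vsa_times(tat_pcr, l_pcr, tat_scr, l_scr):
    """Earliest (Tenable) and latest (Tdeadline) schedule times for the
    next cell, in cell slots."""
    t_enable = max(tat_pcr - l_pcr, tat_scr - l_scr)
    t_deadline = tat_scr
    return t_enable, t_deadline

def advance(tat_pcr, tat_scr, t_send, inc_pcr, inc_scr):
    """Update the theoretical arrival times after sending a cell at t_send;
    inc_pcr = 1/PCR and inc_scr = 1/SCR, both in cell slots."""
    tat_pcr = max(tat_pcr, t_send) + inc_pcr
    tat_scr = max(tat_scr, t_send) + inc_scr
    return tat_pcr, tat_scr

# Hypothetical contract: PCR of one cell per 2 slots, SCR of one cell per
# 10 slots, L_pcr = 0 and L_scr = 8 slots of burst tolerance.
tat_pcr, tat_scr = advance(0, 0, 0, inc_pcr=2, inc_scr=10)
t_enable, t_deadline = vsa_times(tat_pcr, 0, tat_scr, 8)
```

With these numbers the next cell becomes eligible at slot 2 (the PCR spacing, since the SCR tolerance still has burst credit) while its deadline is slot 10 (the SCR spacing).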

The leaky bucket algorithm uses tokens filling up a bucket to express the rate of a cell stream. A token is removed whenever a cell is received. Only cells received while there are tokens in the bucket are in conformance with the contract. The algorithm can also be expressed in reversed terms, i.e., adding tokens when cells are received and emptying the bucket at a given rate.
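In the reversed form, a conformance check can be sketched as follows (hypothetical parameters, in cell-slot units: the drain rate corresponds to the contracted rate and the depth to the burst tolerance):

```python
def conforming(bucket, last_t, t, rate, depth):
    """Reversed leaky bucket: drain since the last cell, then try to add
    one token for the cell arriving at time t."""
    bucket = max(0.0, bucket - (t - last_t) * rate)   # drain at the given rate
    if bucket + 1 > depth:
        return False, bucket, last_t                  # non-conforming cell
    return True, bucket + 1, t

# Hypothetical contract: drain rate 0.5 tokens/slot, bucket depth 2.
ok, b, lt = [], 0.0, 0.0
for t in [0, 1, 2, 3]:
    c, b, lt = conforming(b, lt, t, rate=0.5, depth=2)
    ok.append(c)
```

A back-to-back run of cells at one per slot fills the bucket faster than it drains, so the fourth cell falls outside the contract.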

3.2 Objective function for shaping
The normal operating point of a link is that the sum of the sustainable rates is less than or equal to the bandwidth of the link, while the sum of peak rates is larger than the bandwidth. Only in this fashion can statistical multiplexing gains be achieved. A consequence of full utilization is that it is likely that several connections have data ready for transmission at the same time. It is also likely that each of these connections has several cells available at the same time, since PDUs are generally larger than one cell.

Figure 3 Relationship between the epochs used in the VSA algorithm (timeline: the earliest possible schedule time is Tenable = MAX(TATpcr-Lpcr, TATscr-Lscr), the "latest" schedule time is Tdeadline = TATscr, and cells sent between them are within the boundaries of the traffic contract)

All of these cells could potentially be sent immediately at PCR rate and still be in conformance with the traffic contracts. The traffic contracts themselves, however, do not regulate how the actual cell transmission pattern should look; they only define the boundaries. Determining the "best" pattern is within the domain of the one owning the communication stream, i.e. the one controlling the end system. Many different patterns are feasible; as a minimal functionality, the architecture of a network interface card needs to define and implement a transmission policy defining a conforming pattern. Different network interfaces may use different strategies for how they schedule the different traffic types among the many allowable sequences. The desired functionality would be the flexibility to support different transmission policies depending on the application needs and what the network can handle.

Our aim is to investigate the feasibility of implementing a combined traffic shaper and DMA scheduler in software for interfaces handling several thousands of connections. In principle, this allows individual policies to be implemented in an end system. The policies explored in our work are one aimed at maximizing bursty cell traffic from a source, and one aimed at achieving the sustainable rates. We discuss the implementation of these policies in section 4 and evaluate their feasibility in section 5.

3.3 Description of the algorithms
The two algorithms have the same basis. They are work conserving in the sense that an idle cell will never be transmitted if an eligible candidate exists. Both use the VSA algorithm to determine when a cell is eligible for transmission (Tenable). In both schemes, a stream belongs to a priority class, and whenever there is more than one cell contending for a cell slot, the cell belonging to the stream with the highest priority class will be transmitted. Priority classes are currently used in many ATM systems, so we found it reasonable to include the priority class concept to model a realistic future system.

Scheduler optimized for Burstiness
This scheduler tries to preserve the bursts specified in a traffic contract. Two clearly defined patterns are an on-off pattern, i.e. a burst at PCR rate followed by an idle period (Dual Leaky Bucket, DLB), or a bimodal one with data transmitted as a burst at PCR rate followed by a period at the sustained rate as shown in Figure 5 (Single Leaky Bucket, SLB). The leaky bucket algorithm is used to keep track of the state of these transmission patterns, i.e. whether the connection is transmitting at peak cell rate, at the sustained cell rate, or whether it is inactive. In addition, the VSA is used to determine the first possible transmission time Tenable.
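A minimal sketch of the resulting SLB pattern, assuming a burst credit counter that is replenished while the connection is idle (a deliberate simplification of the leaky bucket state; all names are hypothetical):

```python
def next_slot_slb(last_slot, credit, inc_pcr, inc_scr):
    """Next transmission slot for a single-leaky-bucket stream: peak rate
    while burst credit remains, then the sustained rate. Credit is assumed
    to be replenished elsewhere while the connection is idle."""
    if credit > 0:
        return last_slot + inc_pcr, credit - 1   # inside the burst: PCR
    return last_slot + inc_scr, 0                # tail of the burst: SCR

# Hypothetical contract: PCR one cell per 2 slots, SCR one cell per 10
# slots, burst credit of 3 cells.
slots, credit, s = [], 3, 0
for _ in range(5):
    s, credit = next_slot_slb(s, credit, inc_pcr=2, inc_scr=10)
    slots.append(s)
```

The slot sequence shows the bimodal SLB shape of Figure 5: a burst at peak rate followed by cells at the sustained rate.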

Figure 4 Sequence of actions in the two scheduling approaches (the scheduler optimized for preserving burstiness selects by Tenable, then priority, then proportional sharing, then data structure insertion order; the scheduler optimized for preserving sustained rate selects by Tdeadline, then priority, then data structure insertion order)

If two connections have the same Tenable for their next cell, the scheduler selects among them according to their priority as illustrated in Figure 4. If connections have the same priority, proportional sharing is applied in an overload condition, such that each connection gets its fair share of the capacity. The final decisive factor is the insertion order in the data structure.

Scheduler optimized to preserve Sustained rates
The objective function for this algorithm is to achieve the sustained rate specified in the traffic contract during overload situations. The algorithm uses the VSA to calculate the first possible time a cell can be transmitted (Tenable). This is only used to identify the set of possible candidates. In addition, the theoretical arrival time at the sustainable cell rate is used as a deadline for when the cells should at the latest leave the interface (Tdeadline). An earliest Tdeadline first scheme is used to schedule among the possible candidates. This scheduler will also depend on the insertion order in the data structure when the priority and the Tdeadline are the same, as illustrated in Figure 4.
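The candidate selection can be sketched as a tuple ordering over the eligible set (a hedged illustration with hypothetical connection ids, not the section 4 implementation):

```python
def pick(now, candidates):
    """Among cells eligible at 'now' (Tenable <= now), pick the highest
    priority (0 = highest), then earliest Tdeadline, then insertion order."""
    eligible = [(prio, deadline, order, cid)
                for (prio, deadline, order, cid, t_enable) in candidates
                if t_enable <= now]
    return min(eligible)[3] if eligible else None

cands = [
    # (priority, Tdeadline, insertion order, id, Tenable)
    (1, 50, 0, "vc1", 0),
    (0, 90, 1, "vc2", 0),    # higher priority class than vc1
    (0, 40, 2, "vc3", 10),   # same class as vc2, but not yet eligible at now=5
]
chosen_now = pick(5, cands)      # vc3 ineligible, so vc2 wins on priority
chosen_later = pick(10, cands)   # all eligible, so vc3 wins on earlier Tdeadline
```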

3.4 Consequences of the algorithms
The scheduling algorithm optimized for burstiness tends to preserve the bursty nature of the communication stream. It uses the state of the connection to determine the next time a cell should be transmitted. Once a connection has a burst that is within the contract, it will be transmitted at peak cell rate. Without any other mechanisms this would be an unfair algorithm during shorter overloads; one connection transmitting at a peak rate equal to the link rate would delay others. This situation can only be transient, since with a sum of SCRs less than or equal to the link rate, all connections will eventually be allowed to transmit. A proportional sharing mechanism is implemented to interleave bursts from several connections if a transient overload exists (see [28] for a detailed description). A small example of the scheduling strategy is shown in Figure 6A, where we can observe how the VBR (Variable Bit Rate) connections (1-4) are favored over the CBR (Constant Bit Rate) connection. When the VBR connections are finished, the CBR connection gets the requested cell rate (the cell rate is represented by the connection's slope in the figure).

The scheduling algorithm optimized to preserve sustained rates uses the Tdeadline value to preserve the sustainable cell rate. If possible, no cell will be transmitted later than at the constant rate given by the SCR in the traffic contract. Transmitting eligible cells at peak cell rate is not an objective, and bursts at peak rate will therefore tend to be smoothed out. In the extreme situation where all connections have data to send at peak rate, the shaper would transmit the data as cells interleaved at the sustained rate. The algorithm is illustrated in Figure 6B, where one can see how the CBR connection is favored over VBR1-4, thus preserving the CBR sustained rate.

Figure 5 Typical shapes of traffic generated by single and dual leaky bucket shapers (cell rate over time: a DLB source bursts at PCR1 and then goes idle; an SLB source bursts at PCR2 and then continues at SCR2; a CBR source sends constantly with PCR3 = SCR3)

The two algorithms are best suited for different types of traffic. In a transaction system, the burst size should be larger than the packet size in the transaction, the peak rate should reflect the desired delay, and the sustainable rate should reflect the overall load. If the burst corresponds to a PDU, preserving the sustainable rate will introduce more delay than the other algorithm. WWW request/response transactions are an example of traffic that will benefit from a burst-preserving scheduler.

For traffic where the variation in delay is more important, the algorithm that tends to smooth out the traffic will have the advantage. A mixture of audio and video streams is a good example of this type of traffic. From a network operator viewpoint, the traffic pattern with the most interleaving of connections is preferable.

If there is no contention for cell slots, the output of the two schedulers is the same. The difference shows only in overload conditions, as illustrated in Figure 7. VC1-3 are CBR connections, while VC4-6 are VBR connections creating a temporary overload between cell slots 300 and 500. We can observe that the scheduler optimized for burstiness (Figure 7A) favors the bursty connections (VC4-6) and pushes back the CBR connections (VC1-3). We can also see that the bursty connections finish earlier. The scheduler optimized to preserve sustained rates (Figure 7B) smooths out the VBR connections to preserve the sustained cell rates for all connections. Thus, VC4-6 finish earlier and VC1-3 finish later in Figure 7A compared to Figure 7B.

3.5 Best effort services
Up to now the focus of our traffic contract discussion has been CBR or VBR contracts in an ATM context. But it is also important to support best effort services like UBR (Unspecified Bit Rate) and adaptive services like ABR (Available Bit Rate) [26]. For UBR one can limit the PCR to try to avoid unnecessary cell loss downstream. If there is UBR traffic to send, it will use idle cell slots, but back off as soon as other traffic sources have data to send or its PCR limit is reached. UBR traffic can be handled by assigning the lowest priority to this traffic class. The PCR, which of course can be the link speed, can be controlled by the scheduling algorithms.

ABR does not need new mechanisms since it can be seen as VBR with dynamically adjusted bandwidth. In ABR, a minimum user-requested SCR must be guaranteed. As opposed to UBR, ABR should get a fair share even in transient overload situations. Thus, ABR traffic is assigned to a higher priority class than UBR. The ABR feedback mechanism is independent of this work since on-board ABR chips will dynamically adjust the rate according to the feedback signal [29]. This information must be reflected in the context information of the ABR connections so the cell level scheduler can adjust the scheduling.

Figure 6 Simple scenario with one CBR and four VBR connections. A) Scheduler optimized for burstiness; B) Scheduler optimized to preserve sustained rates. (Axes: cell number vs. cell slot; connections CBR, VBR1-VBR4.)

A very common priority set-up would be to give CBR traffic the highest priority, followed by VBR-rt (real-time), VBR-nrt (non-real-time), ABR and finally UBR with the lowest priority. The algorithms can now guarantee the different classes a minimum share, except for UBR, which does not need one. Within priority classes, proportional sharing is applied; every priority class can get the full bandwidth if no cells from connections at higher priority levels are ready for transmission.

3.6 Scheduling variable sized packets

The basic mechanisms of the two algorithms discussed in this paper are also applicable to variable packet sizes. One example of variable packet sizes is the scheduling of IPv6 packets, where the flow id is used as a connection identifier. One way to schedule variable sized packets is to change the parameters used in the algorithms as follows. Rates are expressed in bytes/s, and the Tenable and Tdeadline values are adjusted according to the size of the packet. The tokens in the LBA now count bytes instead of cells. Again, for the LBA the token rate will ensure that a connection gets bandwidth up to its share; this also includes long term fairness between connections. For all networks there exists a maximum packet size, e.g., 1500 bytes for Ethernet. This limits the delay of packets due to multiplexing and blocking. In the LBA case a packet is eligible for sending if there are enough tokens in the bucket. This again represents the time Tenable. This very simple extension of the algorithms for variable size packets introduces a certain unfairness and jitter due to blocking by larger packets. However, if larger packets are allowed in a system, the increase in jitter cannot be avoided. Fairness problems can increase in transient overload situations, but this is currently under study. Detailed evaluation and performance measurements are part of ongoing work.
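The byte-counting variant of the LBA can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's code; the struct layout, field names, and continuous-time representation are ours.

```c
/* Hypothetical per-flow state for byte-based leaky-bucket shaping;
 * names and layout are illustrative, not taken from the paper. */
typedef struct {
    double tokens;      /* current bucket fill, in bytes */
    double bucket_size; /* maximum fill, in bytes */
    double rate;        /* sustainable rate, in bytes/s */
    double last_update; /* time of the last token refill, in seconds */
} flow_t;

/* Earliest time (T_enable) the next packet of pkt_len bytes may be
 * sent: refill the bucket for the elapsed time, send immediately if
 * enough tokens are present, otherwise wait for the deficit. */
double t_enable(flow_t *f, double now, double pkt_len)
{
    f->tokens += (now - f->last_update) * f->rate;
    if (f->tokens > f->bucket_size)
        f->tokens = f->bucket_size;
    f->last_update = now;

    if (f->tokens >= pkt_len) {
        f->tokens -= pkt_len;          /* packet is eligible now */
        return now;
    }
    return now + (pkt_len - f->tokens) / f->rate;
}
```

For example, a 1500 byte Ethernet frame against an empty bucket refilled at 1 Mbyte/s becomes eligible 1.5 ms later; a full bucket lets it go immediately.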

4 IMPLEMENTATION

The two scheduler implementations both produce cell streams that will be accepted by the GCRA algorithm at the User Network Interface (UNI) and the Network Network Interface (NNI) in the case of ATM networks. However, they differ in the scheduling policy and in the way they handle overload conditions. In the following we present the two implementations, and conclude the section with a discussion of DMA design issues that are common to the two algorithms. Section 4.1 presents the implementation of a scheduler optimized for burstiness and section 4.2 discusses the implementation of a scheduler optimized to preserve sustained rates.

Figure 7 Cell sequence of the two scheduling algorithms (note overload in cell slots 300-500). A) Scheduler optimized for burstiness; B) Scheduler optimized to preserve sustained rates. (Axes: cell number vs. cell slot; connections VC1-VC6.)

4.1 Scheduler optimized for Burstiness

In this implementation, the CPU is running three interacting tasks (Figure 8): the scheduler, the shaper and the DMA transfer task. The purpose of the shaper is to feed the cell level scheduler with timing information for its schedule; in return, the scheduler hands over the next connection to be shaped to the shaper. The timing information that the shaper provides for the scheduler describes the earliest possible time Tenable at which the next cell of this connection can be scheduled. According to this information the scheduler updates the ready list used to store context data for all active connections. The information the scheduler hands over to the shaper is a pointer to all context information of the connection with the closest Tenable (PCR, SCR, bucket state, PDU list etc.).

The shaping information is used to prefetch new data from host memory. Therefore, the shaper has knowledge about the actual amount of data stored for a connection on the adapter and the PDUs stored in host memory. As soon as it is necessary and viable to prefetch new data, the DMA transfer engine is activated. Details about these mechanisms are discussed in the following.

The implementation of the traffic shaper uses a combination of the virtual scheduling algorithm (VSA) and the leaky bucket algorithm (LBA). The LBA is used to determine the actual state of the connection (inactive, PCR, SCR), while the VSA is used to provide timing information for the cell level scheduler and to refill the token bucket.
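The VSA half of this combination is the standard GCRA virtual scheduling algorithm, which tracks a Theoretical Arrival Time (TAT), an increment I and a limit L. A minimal sketch in C — the function names and the continuous-time representation are ours, not the paper's:

```c
/* GCRA virtual scheduling algorithm (VSA) state for one connection.
 * inc is the increment I (one over the rate), lim the limit L
 * (tolerance, e.g. CDVT or burst tolerance). */
typedef struct {
    double tat; /* theoretical arrival time of the next cell */
    double inc; /* increment I */
    double lim; /* limit L */
} vsa_t;

/* Earliest conforming emission time for the next cell: a cell sent at
 * time t conforms if t >= TAT - L, so T_enable = max(now, TAT - L). */
double vsa_t_enable(const vsa_t *v, double now)
{
    double e = v->tat - v->lim;
    return e > now ? e : now;
}

/* Record a (conforming) cell sent at time t: TAT = max(t, TAT) + I. */
void vsa_send(vsa_t *v, double t)
{
    v->tat = (t > v->tat ? t : v->tat) + v->inc;
}
```

A shaper keeps one such state per contract (PCR and SCR), and the larger of the two T_enable values is the one handed to the scheduler.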

The scheduler uses Tenable (i.e., the earliest point in time a cell is allowed to be sent without violating the traffic contract) and a queue with the connection holding the cell with the closest Tenable on top. If two connections have the same Tenable for their next cell, the scheduler selects between them according to their priority. If the connections share the same priority, the sending order is not further determined.

The data structure chosen for this task is an array-based heap. Such a heap can be implemented efficiently without the use of pointers. Compared to other tree-based solutions, no resorting or rearrangement is necessary. Due to the array implementation, shift and compare operations can be used to traverse the tree. If a_i is the father node of a sub-tree, then a_2i and a_2i+1 are the two children. The father node of a_i is always a_(i/2) (integer division).

Three functions have been realized for this data structure (the elements mentioned in the following consist of two 32-bit values as indicated in figure 9):

• Insertion of a new element: a new element is inserted at the bottom of the heap and shifted up until it has reached the right position and the heap is partially ordered again.

• Removal of an element: the element at the top of the heap is removed, the bottom element is inserted at the top and shifted down until it has reached the correct position in the heap.

• Checking of the first element: this is a special insertion function from the top of the heap to speed up the implementation. High-bandwidth connections are scheduled more often than low-bandwidth connections. It can even happen that a connection just scheduled again has the closest Tenable, or a Tenable very close to it. Therefore, it is in general much more efficient to check the new Tenable of a connection and resort the heap from the top.
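The three operations above can be sketched in C as a 1-indexed array heap (children of slot i at 2i and 2i+1, parent at i/2). This is a minimal illustration, not the authors' code: the 32-bit context pointer is represented by a plain id, and bounds/error handling is omitted.

```c
#include <stdint.h>

/* Each element is two 32-bit words: the sort key (next_schedule)
 * and a connection-context reference (an id here for illustration). */
#define HEAP_MAX 65536

typedef struct { uint32_t key; uint32_t ctx; } elem_t;

static elem_t heap[HEAP_MAX + 1]; /* slot 0 unused: 1-indexed heap */
static int heap_size = 0;

/* Insertion: start at the bottom and shift parents down until the
 * new element's position is found. */
void heap_insert(uint32_t key, uint32_t ctx)
{
    int i = ++heap_size;
    while (i > 1 && heap[i / 2].key > key) {
        heap[i] = heap[i / 2];
        i /= 2;
    }
    heap[i].key = key;
    heap[i].ctx = ctx;
}

/* Shift the element at slot i down to restore partial order. */
static void sift_down(int i)
{
    elem_t e = heap[i];
    for (;;) {
        int c = 2 * i;
        if (c > heap_size) break;
        if (c + 1 <= heap_size && heap[c + 1].key < heap[c].key) c++;
        if (heap[c].key >= e.key) break;
        heap[i] = heap[c];
        i = c;
    }
    heap[i] = e;
}

/* Removal: take the top, move the bottom element up, sift it down. */
elem_t heap_remove_first(void)
{
    elem_t top = heap[1];
    heap[1] = heap[heap_size--];
    sift_down(1);
    return top;
}

/* "Checking the first element": give the top its new key and re-sort
 * from the top -- cheap when the rescheduled connection stays near
 * the front, as for high-bandwidth connections. */
void heap_check_first(uint32_t new_key)
{
    heap[1].key = new_key;
    sift_down(1);
}
```

Note that heap_check_first touches only the path from the root downward, which is why it dominates the cycle counts reported in section 5 while staying cheap.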

Figure 8 Interaction of shaper, scheduler, and DMA transfer engine. (The scheduler hands the connection context to the shaper; the shaper returns Tenable and issues prefetch commands to the DMA transfer engine.)

Figure 9 gives an overview of the data structures used by the scheduler and shaper. All information kept in the heap has to be small to ensure that most of the heap fits into the cache of a processor (e.g., the 512 Kbyte L2 cache of a Pentium Pro). Therefore, the heap contains only the sort key (next_schedule_i) and a pointer to the context information (ptr_i). The shaper will only need to fetch the context information into the cache when a connection is scheduled. This structure makes it possible for the scheduler to handle a large number of connections simultaneously. Assuming 32-bit values for next_schedule and the pointer to the context information, the heap needs only 512 Kbyte for 64K active connections.

Priorities are implemented using a separate heap for each priority class. This guarantees that the different priority classes do not interfere with each other, making distribution on multiprocessor systems simpler. For every cell slot, the scheduler starts by checking whether a cell is available for sending in the heap with the highest priority. If not, the scheduler moves on to lower priorities. If none of the heaps has a cell available, an idle cell is generated. This ensures that a cell with a higher priority will always be scheduled before lower priority cells. Without any additional mechanisms this would also mean starvation of lower priority traffic during overload at higher priorities. Therefore, we have included a fair sharing mechanism for overload situations which guarantees every connection of a priority class a share of the bandwidth proportional to its actual sending rate. This kind of proportional sharing and overload behavior has not been implemented in any of the available traffic shaper chips. Either these implementations avoid overload situations through reservation policies [3] or they throttle the total traffic via a leaky bucket regardless of the different needs of the connections [2], [4].
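The per-slot scan over the priority heaps reduces to a short loop. In this sketch the per-priority heaps are stood in for by simple ready-cell counters; the structure of the scan (highest priority first, idle cell if nothing is eligible) is what the paragraph above describes, the names are ours.

```c
#define NUM_PRIO 5   /* e.g. CBR, VBR-rt, VBR-nrt, ABR, UBR */

/* Cells whose T_enable has been reached, per priority level.
 * A stand-in for the per-priority heaps of section 4.1. */
static int ready[NUM_PRIO];

/* One cell slot: scan from the highest priority (0) downward and
 * serve the first level with an eligible cell; return the level
 * served, or -1 when an idle cell must be generated. */
int schedule_slot(void)
{
    for (int p = 0; p < NUM_PRIO; p++) {
        if (ready[p] > 0) {
            ready[p]--;        /* send one cell from this level */
            return p;
        }
    }
    return -1;                 /* no eligible cell: idle cell */
}
```

The worst case adds exactly one comparison per empty higher-priority level, matching the cost argument in the next paragraph.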

This implementation of priorities can even increase performance. The reason lies in the size of the heap the algorithm has to traverse in common cases. If one assumes some high priority, high bandwidth connections and a large number of lower bandwidth, lower priority connections, then most of the time only the smaller heap for the higher priority connections has to be accessed and updated. Compared to the implementation without priorities, the worst case adds only one comparison per priority level and scheduled cell (checking if a cell of a higher priority is available).

The DMA transfer engine is triggered by the shaper and then autonomously fetches data from host memory and puts it into the cell buffer memory in the right connection queue. The shaper decides when to transfer data and how much, taking the current interconnect load and the most efficient data transfer size into consideration. For a predictable system, the adapter should be guaranteed interconnect bandwidth and bounded access time.

Figure 9 Data Structures for the Scheduler and Shaper. (The scheduler heap holds pairs (next_schedule_i, ptr_i); each ptr_i points to the context information block of a connection, which in turn holds its PDU information blocks.)

Interaction between the shaper and scheduler

The most interesting part, and also one novelty of this implementation, is the way the shaper and the scheduler cooperate to fulfill the task of sending the right cell at the right time according to traffic contracts and the actual load.

Although the shaping is based on the LBA, the new Tenable the shaper calculates after sending a cell is based on the VSA. This new Tenable is also the information the shaper returns to the scheduler after shaping. After a cell has been scheduled for a connection, the shaper calculates the earliest time the next cell of this connection can be scheduled without violating the traffic contract. This new Tenable is calculated based on the current state of the connection (PCR, SCR, inactive), the mode (SLB or DLB) and the actual load. If a connection became inactive because no more data is currently available for it, the shaper returns 0 to inform the scheduler that this connection can be removed from the heap of active connections of this priority. If the connection is in PCR or SCR, the shaper returns the new Tenable for the next cell. Thus, it is guaranteed for the SCR state that a new token will be available when this connection is scheduled the next time. If a connection uses the DLB mode and no more tokens are left, the shaper returns a Tenable at which the whole token bucket can be refilled completely with tokens and the next cell scheduled at PCR.

All these Tenable values ensure that the context information of a connection will be accessed if and only if a cell can be scheduled. No other updating is necessary, e.g., filling new tokens into the bucket. This is efficient since an access to context information typically results in a memory access that replaces other data currently needed for scheduling in the processor cache. Many current hardware implementations have to access context information every cell slot to update the schedule, resulting in poor performance and a limited number of concurrent connections (e.g., Fujitsu ALC [2], LSI ATMizer II [3] and SIEMENS SARE [4]).

The shaper assists the scheduler by checking if the connection has more tokens left. If no tokens are left and the connection does not use DLB mode, the earliest point in time the scheduler can schedule this connection again is the next theoretical arrival time of a cell at SCR minus a cell delay variation. These parameters are based on the VSA. This calculation also implicitly includes the transition from the PCR state into the SCR state for a connection when no more tokens are left for the first time.

If a connection uses the DLB mode it will automatically enter the inactive state, but it will not be removed by the scheduler since there is still data left to be sent. The earliest point in time the next cell of such a connection can be scheduled is a(c)*B(c) cell slots ahead, where B(c) denotes the bucket capacity and a(c) the distance between two cells sent at the SCR. Thus, it is guaranteed that at this point in the future the whole bucket can be filled again with tokens; no intermediate access to the context information is necessary for updating.
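The DLB refill rule amounts to a one-line computation. The names a(c) and B(c) follow the text; the function itself is our illustration, not the paper's code.

```c
/* A connection in DLB mode that has emptied its bucket is parked for
 * a(c) * B(c) cell slots: after that long at rate SCR the bucket can
 * be refilled in one step, with no intermediate context accesses.
 * a_c: cell spacing at SCR, in slots.  b_c: bucket capacity, in cells. */
double dlb_next_enable(double now, double a_c, double b_c)
{
    return now + a_c * b_c;
}
```

For instance, a bucket of 25 cells on a connection whose SCR spacing is 4 slots is parked for 100 slots.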

4.2 Scheduler optimized to preserve Sustained rates

The scheduler optimized to preserve sustained rates is based on the VSA algorithm. This means that the shaping of each connection is based on the Theoretical Arrival Time, the increment I and the limit L as described in [26]. The shaper needs these parameters for both the PCR and SCR contracts to transmit compliant traffic. The duality of the two GCRA algorithms ensures that both the CDVT and the MBS parameters used in traffic contracts can be converted to VSA parameters and included in the shaping operation.

The scheduler optimized to preserve sustained rates will always try to send cells at Tenable. Thus, for a single connection, the cell schedule of both schedulers will be exactly the same. There will only be a significant difference when more than one connection competes for a single cell slot corresponding to their Tenable values. This scheduler will then transmit the cell with the earliest Tdeadline in order to preserve the sustained cell rate of the connections. Preserving the sustained rate means that each cell must be sent no earlier than Tenable and no later than the corresponding Tdeadline, i.e., during the interval t_int = [Tenable, Tdeadline].

The most important consequence for the scheduler implementation is that the ready-list must be sorted according to the Tdeadline values of all eligible connections, independent of their Tenable values. However, no connection should be inserted into the ready-list before its Tenable. To solve this problem, this scheduler uses an additional data structure, denoted the future-list, for connections which are not yet eligible.

Overall data structure

An overview of the two main data structures of the prototype implementation, the ready-list and the future-list, is shown in figure 10. As illustrated, the scheduler implements priority classes by using separate data structures for each class.

The context information blocks of connections which are not ready yet are inserted into the future-list based on their Tenable. These context information blocks include all the necessary parameters for the connection, including shaping parameters and PDU information. The future-list is based on a timing-wheel, which makes insertion and removal easy. The drawback is that the wrap-around value will limit acceptable bandwidth values. However, bandwidths in the kbit/s range can be supported with a reasonable timing-wheel.

Connections in the future-list which share a Tenable value are sorted in a hash-tree structure denoted the Dynamic Hash Tree (DHT) based on their Tdeadline, e.g., i-iiii in figure 10. Packet scheduler implementations for bounded-delay services often use calendar queues [16] to reduce the implementational complexity of a similar problem.

A DHT is a dynamic tree structure consisting of hash-blocks (HB). Each HB contains pointers to n sub-trees, splitting the possible range of values in this part of the tree into n sub-ranges. By linking the HBs in a tree structure, a DHT with the topmost HB at level i can contain n^i different values. Insertion can be done by hashing at each level down to level 1, which is the lowest level. In all DHT illustrations in the paper, shaded HBs are at level 2 or higher. Level 1 HBs (white) in the future-list contain pointers to the context information blocks. An important feature of the DHT is that each HB which has only one sub-tree is removed. Thus, the number of HBs in the DHT depends only on the number of context information blocks in the DHT and their clustering, and not on their values. This will also reduce the height of the DHT. Further details of the DHT data structure are described in [28].
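The per-level hashing step can be illustrated as follows. The paper (and [28]) does not give the hash function; one plausible reading, with an assumed fan-out of n = 16, is that a key's base-n digits select the sub-tree at each level, so a three-level DHT covers 16^3 = 4096 distinct values. The collapsing of single-child HBs is omitted here.

```c
/* Assumed fan-out n of each hash-block; the paper leaves n open. */
#define N_SLOTS 16

/* Slot index chosen by `key` in a hash-block at `level`
 * (level 1 = lowest level): the level-th base-n digit of the key. */
int dht_slot(unsigned key, int level)
{
    unsigned div = 1;
    for (int l = 1; l < level; l++)
        div *= N_SLOTS;
    return (key / div) % N_SLOTS;
}
```

With n = 16 the digits are simply the key's hex digits, e.g. key 0x2A7 descends through slots 2, A and 7 from level 3 down to level 1.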

Figure 10 Data structure overview. (Per priority class 0-3: a future-list on a timing-wheel, sorted on both Tenable and Tdeadline, and a ready-list sorted on the earliest Tdeadline. Every cell slot the scheduler moves, for all priority classes, the set of connections that become ready during the next cell slot from the future-list to the ready-list; DHTs i-iiii hold the context information blocks.)

When the connections in a DHT in the future-list become ready, the entire DHT is inserted into the ready-list based on the earliest Tdeadline of these connections. Thus, the cost of the move operation is independent of the number of connections in the DHT.

The ready-list is also a DHT. Whereas the “elements” in the future-list DHTs are context information blocks, the “elements” in the ready-list are DHTs with different Tenable values. All of them are inserted into the ready-list based on the earliest Tdeadline value of their connections. Since all connections in a DHT are sorted on the Tdeadline value, the DHT with the earliest Tdeadline will contain the next connection to schedule. A possible ready-list after the DHTs i-iiii have been moved is shown in the left part of Figure 10. As illustrated, DHTs ii and iiii have different Tenable values (see timing-wheel), but share the earliest Tdeadline (see ready-list).

Larger CDVT and MBS parameters in the traffic contract will tend to increase the length of the interval. This means that the Tdeadline values increase and the maximum height of the DHT increases as well. To keep the height of the ready-list DHT from growing without bound, the scheduler will limit the maximum value (i.e., height) which can be inserted in a DHT. It will instead start building new DHTs and link them together [28]. When the first DHT is empty, the ready-list will move to the next DHT in the link, which always will include connections with larger Tdeadline values.

The scheduler can find the connection with the earliest Tdeadline by traversing the ready-list. The context information block of this connection will be removed from its DHT. If there are more context information blocks in this DHT, but the new earliest Tdeadline is larger, the DHT will be re-inserted into the ready-list based on the new Tdeadline. If the earliest Tdeadline is the same, the DHT will not be moved. Examples of these scenarios are illustrated in Figure 11 a) and b), respectively, after the indicated connection has been scheduled and the new DHT i_new has been handled correctly.

After the necessary operations to transmit the cell have been done, the shaper will recalculate the interval for the connection and insert the context information block into the future-list again. The last operation done by the scheduler is the move operation for all priority classes which have connections that will become ready during the next slot.

4.3 DMA Design Issues

To efficiently utilize the bandwidth of an internal interconnect, the optimal transfer burst size for the interconnect should be used. Typical burst sizes range from 16 to 256 bytes and often coincide with the size of a cache line. SCI-based interconnects, for example, support both 64 and 256 byte transfers [8]. Smaller burst sizes will reduce the achievable bandwidth, while larger bursts will tend to block the interconnect and introduce delay. In our approach, the network adapter maintains per-connection buffers of three times the size of the optimal burst size. These memory requirements

Figure 11 Modified ready-list after cell scheduling. (a) The DHT gets a new earliest Tdeadline and is re-inserted into the ready-list; b) the earliest Tdeadline is unchanged and the DHT is not moved. The new DHT i_new has been inserted in both cases.)

can be traded against host main memory, which makes the adapter less expensive at the expense of system load.

The buffers are used in the following way, as illustrated in figure 12. At the initialization of a connection, the network adapter will try to fill the entire connection buffer as long as the connection has enough data. After the data prefetching, the cell level scheduling starts, resulting in state 1 in figure 12. The variable b_ptr points to the next data to be sent and wraps around automatically after reaching the end of the cell buffer. As soon as b_ptr crosses the boundary between two thirds of the buffer, an appropriate command to the DMA transfer engine is issued (state 2). The DMA transfer engine fetches data from host memory and refills the connection buffer (state 3) independent of the scheduler operation. Due to varying delay on the system interconnect, the time spent refilling the buffer will vary. This delay jitter is outside the control of the adapter and can be caused, e.g., by data transfers between disks and the main memory. Meanwhile the shaper can continue sending data and is therefore decoupled from possible delays on the internal interconnect as long as the buffer contains data. In the worst case for ATM, 2*transfer size - 47 bytes are left in the buffer when the refilling operation starts. If the connection uses an entire 622 Mbit/s network link and the optimal transfer size is 256 bytes, the tolerance will be almost 6 µs.
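The one-third refill rule can be sketched as follows. This is our illustration of the mechanism in figure 12, under the assumptions stated in the comments (256 byte optimal burst, 48 byte cell payloads); the names are ours.

```c
#include <stdbool.h>

#define XFER 256          /* assumed optimal interconnect burst size (bytes) */
#define BUF  (3 * XFER)   /* per-connection cell buffer: three burst sizes */

/* Per-connection buffer state: b_ptr marks the next data to be sent,
 * last_third the third of the buffer it occupied at the last check. */
typedef struct { int b_ptr; int last_third; } cbuf_t;

/* Advance b_ptr by n sent bytes (n is at most one cell payload, so at
 * most one boundary is crossed per call).  Returns true when b_ptr has
 * entered a new third of the buffer, i.e. when a burst-sized DMA refill
 * command for the vacated third should be issued. */
bool consume_and_check(cbuf_t *b, int n)
{
    b->b_ptr = (b->b_ptr + n) % BUF;   /* wraps around automatically */
    int third = b->b_ptr / XFER;
    if (third != b->last_third) {
        b->last_third = third;
        return true;                   /* trigger the DMA engine */
    }
    return false;
}
```

With 48 byte cell payloads a refill fires roughly every fifth or sixth cell; in the worst case 2*XFER - 47 bytes are still buffered when a refill starts, which at 622 Mbit/s is the almost 6 µs of slack computed above.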

This mechanism guarantees smooth and independent operation of the shaper/cell level scheduler and the DMA transfer engine. Most of the current implementations suffer immediately if the transfer of a cell is delayed. This results in non-optimal bandwidth utilization and, depending on the implementation, might also violate the traffic contract due to a violation of the CDVT. If tighter delay bounds on the interconnect can be guaranteed, the implementation will also work with a buffer size of only two optimal transfer sizes (depending also on the transfer size itself, i.e., the relation of the transfer size to the ATM cell size).

The segmentation from PDUs to cells takes place automatically. The DMA transfer engine reads data from PDUs stored in host memory; the scheduler selects cells from the buffer for transmission. To keep track of the PDUs, the context data of a connection contains a list of the current PDUs stored in main memory. If the host has prepared a new PDU for sending, it informs the scheduler, which then appends information about this new PDU to the context information. After a PDU has been sent, the shaper can notify the host unless a polling scheme is used.

Since the shaper has to access the context information of a connection anyway, this is also the right point for the scheduler optimized for burstiness to check whether new user data for active connections should be prefetched by the DMA transfer engine. Therefore, the shaper keeps track of the number of cells currently stored on the adapter for the actual connection, checks if there is space for new data to be transferred from host memory and issues the appropriate command to the DMA transfer engine if necessary.

Figure 12 Cell buffers on the adapter. (States 1-3 of a per-connection buffer of three optimal transfer sizes; b_ptr marks the next data to be sent, the shaded remainder is unsent data.)

5 EVALUATION

The software of the architecture has been implemented and evaluated on state of the art microprocessors. It is written in C and runs in user space. Commands to the DMA transfer engine, to the ATM transmission unit and to the server memory controller are issued by the program. We do not simulate any response from these units since it is not necessary for our evaluation purposes. The software is otherwise complete and we can observe the generated traffic and measure the capacity of the different algorithms on different processors. In these measurements we have used the performance monitoring capabilities of the Alpha, Pentium Pro and UltraSparc processors. These processors can be configured to count internal hardware events including cycles, instructions, cache misses and branch mispredictions. The technical report [28] includes a more in-depth description of the performance evaluation method for the different processors. To validate our results we used a processor simulator [30] for the Alpha processor.

5.1 Scheduler optimized for Burstiness

To evaluate the performance of the implementation we run worst-case scenarios that load the data structure heavily and almost always require the worst case of updating operations for the heap. One such scenario is the setup of 64K simultaneously active connections with identical parameters. This results in a maximum number of comparisons to reinsert a connection after scheduling. The algorithm puts no restriction on the number of connections, amount of data, or link speed. Only the actual performance of a given CPU/memory system limits the performance. To give an impression of the performance of this implementation, typical configurations were also evaluated. Note that this implementation treats every connection separately, i.e., no connections are combined or share common properties, as is the case in all existing hardware solutions.

Figure 13 shows cycle counts on a Pentium Pro for 64167 simultaneously active connections, each sending 25000 bytes in 5 PDUs, running under Windows NT 4.0 in the real-time process class. The bandwidth chosen for the connections does not influence the performance of the algorithm; 9600 bit/s was chosen to result in a reasonable aggregated bandwidth of 616 Mbit/s (e.g., for a 622 Mbit/s adapter). Overloading the network adapter does not result in a performance degradation but in an increase in delay. This is the best one can expect if a network adapter is overloaded on purpose and no cells should be dropped. Most of the cells can be shaped and scheduled within 600 CPU cycles. This includes data transfer commands if necessary. Only at some points in time does the cycle count go up to ca. 650 cycles. The technical report [28] shows via measurements on an UltraSparc processor that the jump in the cycle count at the beginning is a result of the Pentium Pro L2 cache and not the algorithm. 600 CPU cycles represent for the chosen processor a real time of 3 µs (200 MHz CPU speed), which is larger than the target values of 2.7 µs for 155 Mbit/s or 680 ns for 622 Mbit/s, respectively.

To stress the implementation, all connections are started at exactly the same time, i.e., the algorithm tries to prefetch all data for the complete cell buffer at the beginning and then also consequently for every new PDU of a connection. This results in the first 5 peaks seen in the figure. The last peak results from removing the connections from the heap. The cycle counts in the lower stripe result from the actual scheduling and shaping of single cells; the second, higher stripe results from prefetching of cells for one third of the cell buffer as described in section 4.3. The main result of these measurements is not the single number of cycles used, but the fact that the number of cycles has an upper bound even under worst case conditions and shows very stable behavior in most cases.

To compare and evaluate the behavior of the implementation on different architectures we examined the program in detail on a Digital Alpha platform using the Atom toolset [30]. This toolset allows for detailed instrumentation of programs and can be used for, e.g., dynamic instruction counts, counting function calls, instructions per function, and cache misses. Table 1 shows some results from the instrumentation. The total number of cells scheduled in this example was more than 33 million cells from 64K connections. Every connection had 5 PDUs of 5000 bytes each. The function insert_new_element is used to insert a new active connection into the heap, the function remove_first_element to remove a connection with no more data to send from the heap. get_data is the function used to issue data transfer commands to the autonomous DMA transfer engine. The optimal transfer size chosen in this example was 256 bytes. For every PDU this results in nineteen 256 byte transfers and one transfer with the remaining 136 bytes. Altogether 5*64K PDUs were sent in this scenario, resulting in 5*20*64K = 6.5 million memory transfer commands for almost 1.5 Gbyte of data.

The main time consuming functions for the worst case scenario are check_first_element with about 80% and the shaper with about 13%.

To compare these numbers with the figures based on the cycle count for the Pentium Pro, one has to add the instructions needed to schedule and shape the cells and divide by the total number of cells. The functions contributing to the scheduling, shaping, and issuing of data transfer commands

Figure 13 Details of the cycle count on a Pentium Pro/Windows NT sending data for 64K connections. (Axes: CPU cycles, ca. 500-700, vs. cell slot; trace labeled "64K VCs @ 9600 bit/s", showing two "stripes", peaks 1-5 and a final peak.)

Table 1: Instrumentation of the program

Function name          Function calls    Instructions   Percentage
remove_first_element            64167        29216761          0.1
check_first_element          33623508     17273550318         82.0
get_data                      6416700        38500200          0.2
shaper                       33687675      2819690475         13.5
schedule_minimum             33687675       875879550          4.1
insert_new_element              64167         1668336          0.1

are check_first_element, get_data, shaper, and schedule_minimum. The other functions are needed only for setting up the whole data structure and removing connections, respectively. The interesting part is when all connections are active and cells are transmitted. Doing this calculation results in an instruction count of 624 instructions per cell. Comparing the instruction count with the number of cycles needed results in a CPI (cycles per instruction) value of about 1.
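As a check on the arithmetic, the per-cell instruction count can be recomputed from the Table 1 figures (taking the number of scheduled cells to be the number of shaper calls):

```c
/* Recompute the per-cell instruction count from Table 1: sum the
 * instructions of the four per-cell functions and divide by the
 * number of scheduled cells (= the shaper call count). */
double instructions_per_cell(void)
{
    double check_first = 17273550318.0;  /* check_first_element */
    double get_data    = 38500200.0;     /* get_data */
    double shaper      = 2819690475.0;   /* shaper */
    double sched_min   = 875879550.0;    /* schedule_minimum */
    double cells       = 33687675.0;     /* shaper calls = cells */
    return (check_first + get_data + shaper + sched_min) / cells;
}
```

This evaluates to roughly 623.6, i.e., the 624 instructions per cell quoted above; against the measured ~600 cycles per cell this gives the CPI of about 1.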

To see how the algorithm scales for fewer connections, the measurements were also performed with 1, 100 and 1000 connections. Whereas the performance results with 64K connections show that with the Pentium Pro processor and a standard operating system cells cannot be produced fast enough for, e.g., 155 Mbit/s adapters, the following shows that this is possible for a lower number of connections. Figure 14 shows that a pure software implementation of a shaper/scheduler can produce a cell in less than 350 cycles, i.e., 1.75 µs. This is definitely fast enough for a 155 Mbit/s adapter. Again, this configuration is a worst case for 1000 connections; the cycle count per cell will drop if, for example, only some high bandwidth connections together with a substantial number of low bandwidth connections have to be scheduled. With this lower number of connections no jump in the cycle count due to the cache behavior can be seen.

Loading the implementation with only 100 connections results in a cycle count per cell of typically less than 300; if only one active connection is configured, the cycle count drops to 160. The shaping takes less than 85 cycles on average and is independent of the size of the data structure.

5.2 Scheduler optimized to preserve sustained rates

Unlike the scheduler described in the previous subsection, the processing cost of this scheduler does not depend on the number of connections. It depends instead on the state of its data structures, i.e. the dynamic behavior of the connections. Thus, a worst case processing cost is more difficult to evaluate. To find the worst case scheduling cost, we constructed a traffic scenario where all the different data structures had the worst possible state during a single cell slot. To evaluate a more realistic case, we also constructed a traffic scenario consisting of a mix of CBR and VBR sources.

Worst case operations.
The worst case insert and remove operations of a DHT consist of allocating or deallocating two HBs. During a single cell slot, the scheduler might have to do 2 remove operations and (2+p) insert operations, where p is the number of priority levels. The p factor relates to the fact that connections in each priority class in the Future List must be moved to the Ready List when their Tenable is reached. Table 2 includes worst case instruction costs of scheduling 1-4 priority classes.

Figure 14 Pentium Pro Cycle count for 1000 connections

[Plot omitted: CPU cycles (260-400) versus cell slot for 1K VCs @ 9600 bit/s.]

The current scheduler prototype does not include the necessary operations to insert data from a new packet into the Ready List. This operation could be done in the background when the load is sufficiently low. The maximum cost of this operation is a single insert operation into the Future List.

The increased cost of introducing an additional priority level is, according to the table below, approximately 230 instructions per added priority level. This will clearly limit the number of priority levels which can be supported. However, the move operations are completely independent and could be done by another processor. Thus, a multiprocessing solution can remove this limitation.

Some of the bit manipulating operations used by the scheduler are time consuming on general purpose microprocessors. However, they can be off-loaded to simple special purpose ASICs on the network adapter, thereby reducing the processing cost significantly. Predictions of the worst case for a hardware assisted scheduler are presented in Table 2. These approximate numbers were found by measuring the overhead cost of the operations which will be offloaded to ASICs.

Although the instruction cost is currently too high for the cell slot of a 622 Mbps ATM link, the average instruction count will be significantly lower than these numbers. In the following section we will examine a more realistic high load traffic scenario, where the transmission link is fully utilized with a mix of different traffic types.

Scalability under high traffic load with mixed traffic
The previous section examined the worst case operations of the scheduler. To assess the processing cost and the actual memory usage of the scheduler for a more realistic traffic scenario, we ran experiments with a traffic scenario consisting of three different traffic types. The traffic consisted of a mix of 40% Constant Bit Rate sources (CBR) and 60% Variable Bit Rate sources (VBR). The VBR sources were split into two traffic types with PCR/SCR ratios of 10 and 100, respectively. These traffic classes are denoted VBR10 and VBR100. The SCR of each connection was set to 1/n of the link speed, which in this case was OC-12, where n is the number of connections. The MBS of the VBR traffic was set high enough so that these connections could transmit a typical ATM MTU (i.e. 9180 bytes) at PCR. We ran experiments with this traffic scenario for configurations from 50 to 16K connections. The connections were started in consecutive cell slots, starting with the CBR sources and ending with the VBR100 sources. The scheduler optimized to preserve sustained rates will always choose the connection with the earliest Tdeadline first, effectively smoothing out bursts in the VBR connections. The cycle and instruction cost of scheduling the 16K configuration with mixed traffic is shown in Figure 15. The scheduling pattern is indicated by the text at the bottom of the figure. The processing requirements vary depending on the type of traffic because of the state of the schedulable data structure.

Table 3 shows the average cycle and instruction counts for configurations of 50, 1K, 4K, 8K and 16K connections. The table also includes CPI (cycles per instruction) numbers, the Level 2 cache hit rate and the maximum memory usage during the scheduling operation. As expected, the instruction count increases only slightly from 50 to 16K connections. The cycle count increases more, which can

Table 2: Constructed worst case

Priorities supported   Worst case instructions   Worst case for hardware-assisted scheduler
1                      824                       620
2                      1053                      740
3                      1282                      870
4                      1511                      980


be related to the decrease in Level 2 cache hit rate when the number of connections increases. For 16K, the average cycle count is 842 cycles, which is too much for a 622 Mbps ATM adapter. The CPI numbers increase as well. An interesting observation is that the CPI value of the 4-way superscalar processor is higher than one. A simple benchmark discussed in [28] shows similar behavior for memory intensive operations. The reason is of course that the processor cannot do several memory operations at the same time, combined with the gap between memory access time and CPU speed. Another interesting observation is that the number of instructions and cycles are stabilizing. Thus, the numbers will be similar even for larger configurations. The CPI and Level 2 cache hit rate have also stabilized, which means that with increased DRAM, the scheduler will be able to handle larger configurations with the same cycle performance.

For the 16K configuration, the minimum and maximum cycle counts are 371 and 1600 cycles respectively, and 80% of the scheduling operations take less than 600 instructions and less than 950 cycles. Detailed analysis showed that an event denoted "CPU Load Stall" has a very high impact on the cycle count. CPU Load Stall events occur when an instruction in the execute stage depends on the result of an earlier load which is not yet available. Thus, an increase in the Level 2 cache size or an improved memory access time will greatly reduce the occurrence of this event, thus lowering the cycle count. Experiments discussed in [28] show that if the Level 2 cache size is larger than the scheduling data structure, the CPI value will approach 1 when the scheduler is executing.

Figure 15 Instructions and cycles for a section of the configuration with 16K connections.

Table 3: Algorithm cost versus number of connections with traffic mix CBR 40%, VBR10 30% and VBR100 30%

Configuration   Avg Instructions   Avg Cycles   Cycles Per Instruction   Cache Hit Rate   Memory Usage [MB]
50              521                600          1.15                     0.99             0.009
1K              559                698          1.25                     0.97             0.18
4K              577                793          1.37                     0.95             0.73
8K              577                837          1.45                     0.93             1.47
16K             578                838          1.45                     0.93             2.93

[Plot omitted: cycles, instructions and load use stalls (0-2000) versus cell slot (135000-160000), with regions labeled CBR, VBR10 and VBR100.]

The memory usage of the scheduler is also shown in the table. The memory usage increases with the number of connections. For 16K connections, the memory usage is still below 3 Mbyte. Generally, memory usage is a function of the traffic requirements and the number of connections. The worst case memory usage requires that all connections are schedulable and none of them share the same HB. In that case, the maximum number of HBs in use would be 2x-1, where x is the number of connections. Since each schedulable connection needs an additional HB to hold the context information block, the overall HB requirement can be as high as 3x-1. The corresponding memory usage is sizeof(HB)*(3x-1). The HB is 156 bytes large, so for 64K connections this theoretical maximum memory usage is 30.7 MB. Thus, the actual memory usage in our experiment, which represents a heavy load (sum of SCR = link bandwidth), is only 11% of the worst case memory requirement. This is not surprising since the worst case analysis assumes that the connections are not clustered together using the same HBs. It should be noted that some of the fields in the HB data structures will be obsolete with hardware assistance, reducing the memory usage by an additional 8%.

Varying the size of the HB
Each HB contains a portion of a table of pointers. If the number of table entries in each HB is reduced, the DHT would need more levels to hold the same Tdeadline values. However, the memory usage will also be reduced. In the technical report [28] we have shown the effect of reducing the size of the HB from 32 down to 4 elements. In short, the results for this traffic scenario are that the instruction count increases as expected (i.e. 15-20%). The interesting part is that the cycle count is approximately the same, thereby yielding a better CPI. At the same time, the worst case memory requirements are reduced by 67%. The actual memory reduction will of course depend on the specific traffic configuration.

5.3 Discussion
The evaluation of the two algorithms shows that one can trade performance against accuracy. The scheduler optimized for burstiness is based only on the time one can send a cell, i.e. Tenable, and not on intervals as the scheduler optimized to preserve sustained rates is. One consequence is that this algorithm is faster and needs less memory. The critical data structure is 512 Kbyte for 64K connections and fits easily in state of the art Level 2 caches. The algorithm is dependent on the number of connections, and the worst case is when a large number (64K) of connections must be scheduled (giving the deepest heap). However, the execution time is actually short enough to use this algorithm for real shaping and scheduling of 1000 connections on a 155 Mbit/s ATM adapter today. Introducing priority classes can be done at the cost of one comparison per class.

The other algorithm can preserve sustained rates by working on intervals instead of points in time. One result is that the cell schedule will smooth out traffic during overload situations, and provide better service to CBR-type traffic at the expense of bursty traffic. An important feature of this algorithm is that the processing requirement is not dependent on the number of connections as long as they fit inside the data structure, which is currently designed to hold at least 64K connections. The cost of this behaviour is higher memory requirements and execution times. However, the continuous increase in processing speed, Level 2 cache sizes and memory access times will eventually make configurations with 64K connections schedulable in real time with this algorithm as well. A hardware assisted scheduler can reduce the overall processing cost significantly by off-loading costly bit manipulating operations. The evaluation has shown that the instruction count does not increase significantly when most of the data structure is in DRAM (ideally in cache). The worst case for this algorithm depends on the state of the data structures, which makes the worst case behaviour difficult to predict a priori. Introducing priority classes is more costly for this scheduler due to the inherent complexity of its data structures.

It is important to note that both algorithms are fully scalable and can easily be partitioned on a shared memory multiprocessor. The shared state is small enough to give an almost linear speed-up factor for a handful of processors.


6 CONCLUSION
The most important contribution of this paper is the feasibility demonstration of a CPU-based approach for scheduling and traffic shaping on a network adapter. More specifically, we have illustrated that this approach:

• can scale to and shape a large number of individual connections,

• can easily shape traffic according to different algorithms, e.g. LBA and VSA

• can schedule cells from many connections by efficiently using available bandwidth

• has predictable behaviour under overload situations

• can coordinate DMA transfer according to the need of individual connections

• is flexible enough to include UBR and ABR traffic as well as variable packet size traffic

• can exceed 155 Mbit/s with state-of-the-art processors in worst case behaviour, and do considerably better for fewer connections and more likely traffic situations.

Another important part of this work is the evaluation of two different schedulers. For light loads, these schedulers will produce similar cell schedules. However, during overload conditions there are significant differences.

The scheduler which is optimized to preserve burstiness is the most efficient scheduler in terms of memory and processor cycles. We have shown that it can shape 1000 connections simultaneously for a 155 Mbit/s ATM link. It preserves the burstiness of bursty traffic sources, and will be the preferred scheduler for transaction based traffic. The proportional sharing will ensure fairness during overload conditions, but the drawback of this scheduler is that traffic streams with little or no tolerance to cell delay variation will suffer during overload.

The scheduler optimized for sustained rates is less efficient in terms of instructions and processor cycles. In addition, it uses significantly more memory and is thus more vulnerable to cache misses. Currently, general purpose processors are not equipped with large enough Level 2 caches for this scheduler to scale to 64K connections. The instruction cost of the scheduling operation is also higher for the configurations which have been studied in this paper. However, traffic with little or no tolerance to cell delay variation will get better service at the expense of bursty traffic. In overload conditions, cell streams are better interleaved and traffic is spread out to preserve the sustained rates of the connections.

An important benefit of both these schedulers is that they will scale with increases in processing speed and improvements in memory systems, e.g. larger Level 2 caches. This means that future processor generations will make it possible to use a CPU-based approach and either of the schedulers for larger configurations at higher link rates.


REFERENCES

[1] Campbell, A., Aurrecoechea, C., Hauw, L.: A Review of QoS Architectures, Proceedings of the 4th International IFIP Workshop on QoS (IWQoS 96), Paris, March 1996.

[2] Fujitsu, ALC (MB86687A), http://www.fmi.fujitsu.com

[3] LSI Logic, ATMizer II (L64363), http://www.lsilogic.com

[4] SIEMENS, SARE (PBX4110), http://www.siemens.de

[5] ATecoM, ATM_POL3, http://www.atecom.de

[6] National Semiconductor, BNP2010 UPC, http://www.national.com

[7] Wulf, W.A., McKee, S.A.: Hitting the Memory Wall: Implications of the Obvious, Computer Architecture News, vol. 23, no. 1, March 1995.

[8] IEEE: Scalable Coherent Interface, IEEE Standard 1596, 1992.

[9] Gillett, R., Kaufmann, R.: Experience Using the First-Generation Memory Channel for PCI Network, Proceedings of Hot Interconnects IV, August 1996, pp. 205-214.

[10] Horst, R.: TNet: A Reliable System Area Network, IEEE Micro, February 1995, pp. 37-45.

[11] Figueira, N.R., Pasquale, J.: A schedulability condition for deadline-ordered service disciplines, IEEE/ACM Transactions on Networking, vol. 5, no. 2, pp. 232-244, April 1997.

[12] Gopalakrishnan, R., Parulkar, G.: A Framework for QoS Guarantees for Multimedia Applications within an Endsystem, 1st Joint Conference of the Gesellschaft für Informatik and the Schweizer Informatikgesellschaft, Zürich, September 1995.

[13] Druschel, P., Banga, G.: Lazy Receiver Processing (LRP): A Network Subsystem Architecture for Server Systems, Proceedings of the USENIX Association Second Symposium on Operating Systems Design and Implementation (OSDI '96), Seattle, Washington, October 1996.

[14] LaMaire, R.O., Serpanos, D.N.: Two-Dimensional Round-Robin Schedulers for Packet Switches with Multiple Input Queues, IEEE/ACM Transactions on Networking, vol. 2, no. 5, October 1994.

[15] Georgiadis, L., Guerin, R., Peris, V., Sivarajan, K.N.: Efficient network QoS provisioning based on per node traffic shaping, IEEE/ACM Transactions on Networking, vol. 4, pp. 482-501, August 1996.

[16] Wrege, D.E., Liebeherr, J.: A Near-Optimal Packet Scheduler for QoS networks, IEEE Infocom 1997, Kobe, 1997.

[17] Rexford, J., Bonomi, F., Greenberg, A., Wong, A.: A Scalable Architecture for Fair Leaky-Bucket Shaping, IEEE Infocom 1997, pp. 1056-1064.

[18] Toshiba, Meteor (TC35856F), http://www.toshiba.com

[19] TranSwitch, SARA II (TXC-05551), http://www.txc.com

[20] Dalton, C., Watson, G., Banks, D., Calamvokis, C., Edwards, A., Lumley, J.: Afterburner, IEEE Network, July 1993, pp. 36-43.

[21] Druschel, P., Peterson, L.L., Davie, B.S.: Experiences with a high-speed network adapter: a software perspective, Proceedings ACM SIGCOMM 94, London, August 1994.

[22] Traw, C.B.S., Smith, J.M.: Hardware/software organization of a high-performance ATM host interface, IEEE Journal on Selected Areas in Communications, vol. 11, no. 2, February 1993, pp. 240-253.

[23] Coulson, G., Campbell, A., Robin, P., Papathomas, M., Blair, G., Shepherd, D.: The design of a QoS-controlled ATM-based communication system in Chorus, IEEE Journal on Selected Areas in Communications, vol. 13, no. 4, pp. 686-699, May 1995.

[24] Engler, D.R., Kaashoek, M.F., O'Toole, J.: Exokernel: an operating system architecture for application-level resource management, ACM SIGOPS 1995, December 1995.

[25] Meleis, H.E., Serpanos, D.N.: Designing communication subsystems for high-speed networks, IEEE Network, vol. 6, no. 4, 1992, pp. 40-46.

[26] ATM Forum: Traffic management specification, version 4.0, ATM Forum/95-0013R10, February 1996.

[27] Wright, G., Stevens, R.: TCP/IP Illustrated, Volume 2, Second edition, Addison-Wesley, 1995, ISBN 0-201-63354-X.

[28] DMA Scheduling for High Capacity Data Transfer and Traffic Shaping, Telenor Research and Development, Technical Report X/97. Available at ftp://www.ifi.uio.no/pub/bryhni/dmatech.ps

[29] Bonomi, F., Fendick, K.W.: The rate-based flow control framework for the available bit rate ATM service, IEEE Network Magazine, pp. 25-39, March/April 1995.

[30] Digital Equipment Corporation: Program Analysis Using Atom Tools, Maynard, Massachusetts, March 1996.