A 10GHz TCP Offload Accelerator for 10Gb/s Ethernet in 90nm Dual-VT CMOS

ISSCC 2003 / SESSION 14 / MICROPROCESSORS / PAPER 14.7

14.7 A 10GHz TCP Offload Accelerator for 10Gb/s Ethernet in 90nm Dual-VT CMOS

Yatin Hoskote, Vasantha Erraguntla, David Finan, Jason Howard, Dan Klowden, Siva Narendra, Greg Ruhl, James Tschanz, Sriram Vangal, Venkat Veeramachaneni, Howard Wilson, Jianping Xu, Nitin Borkar

Microprocessor Research Labs, Intel, Hillsboro, OR

This transmission control protocol (TCP) offload prototype performs input processing on minimum-size packets at wire speed for 10Gb/s Ethernet. It uses special-purpose hardware that is programmable via a specialized instruction set. The architecture (Fig. 14.7.1) consists of a 10GHz high-speed core fed by 312.5MHz slow-speed memory units that store context information. The transmission control block (TCB) stores the context information for an existing connection at the same index at which the content addressable memory (CAM) in the connection lookup block (CLB) stores the 96b connection identifier [1]. The input sequencer parses the headers, the CLB performs the connection lookup, and the TCB loads the connection state into the working register. The execution core, controlled by instructions from the microcode ROM (µROM), performs the heart of the TCP processing. The results are stored back in the TCB, and the output packet is assembled in the send buffer. The 64-entry TCB can be viewed as a cache, with support for a larger number of connections in off-chip memory. The 33B of context information stored for each connection is sufficient to implement the input processing tasks offloaded in this prototype. The reorder block (ROB) includes two CAMs that are used exclusively to dynamically reorder out-of-order packets without employing a traditional sorting algorithm.
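The shared-index arrangement between the CLB CAM and the TCB can be illustrated with a small software model. This is a minimal sketch, not the chip's implementation: it assumes the 96b identifier is the IPv4 4-tuple (two 32b addresses and two 16b ports), and the names `connection_id` and `ConnectionLookup` are hypothetical.

```python
from typing import Any, List, Optional, Tuple

TCB_ENTRIES = 64  # from the text: the on-chip TCB holds 64 connections


def connection_id(src_ip: int, dst_ip: int, src_port: int, dst_port: int) -> int:
    """Pack an assumed IPv4 4-tuple into one 96b key (32 + 32 + 16 + 16 bits)."""
    return (src_ip << 64) | (dst_ip << 32) | (src_port << 16) | dst_port


class ConnectionLookup:
    """Hypothetical software model of the CLB CAM paired with the TCB."""

    def __init__(self) -> None:
        self.cam: List[Optional[int]] = [None] * TCB_ENTRIES  # index -> 96b key
        self.tcb: List[Any] = [None] * TCB_ENTRIES            # index -> context

    def install(self, key: int, context: Any) -> int:
        """Store key and context at the same free index and return that index."""
        for index, stored in enumerate(self.cam):
            if stored is None:
                self.cam[index] = key
                self.tcb[index] = context
                return index
        raise OverflowError("on-chip TCB is full")

    def lookup(self, key: int) -> Optional[Tuple[int, Any]]:
        """Return (index, context) on a CAM hit, or None on a miss."""
        # A hardware CAM compares all entries in parallel; this loop is sequential.
        for index, stored in enumerate(self.cam):
            if stored == key:
                return index, self.tcb[index]
        return None
```

Because the key and the context share one index, a CAM hit yields the TCB address directly, with no second translation step. A miss in this model would correspond to a connection held in off-chip memory or to a new connection.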

Briefly, the chip performs connection establishment and teardown, checks message validity, computes payload length, processes incoming flags, performs window management, identifies and reorders out-of-order packets, and assembles response packets. A key feature of this architecture is hardware programmability via a specialized instruction set that includes instructions for efficient TCP processing. This enables quick adaptation to evolving protocols. The complete microprogram consists of 306 lines of code. Special instructions enable single-cycle CAM operations, as well as single-cycle 33B-wide TCB reads and writes, allowing 82.5Gb/s data transfer between the TCB and the core. These slow-speed memory operations occur once per packet, keeping the performance impact minimal.
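The 82.5Gb/s figure follows from the stated TCB width and the slow-speed clock, assuming one 33B transfer completes in each 312.5MHz cycle (which is how "single-cycle" is read here):

```latex
\[
  33\,\mathrm{B} \times 8\,\tfrac{\mathrm{b}}{\mathrm{B}} \times 312.5\,\mathrm{MHz}
  = 264\,\mathrm{b} \times 312.5 \times 10^{6}\,\mathrm{s^{-1}}
  = 82.5\,\mathrm{Gb/s}
\]
```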

For a minimum IP packet size of 84B at 10Gb/s, a new packet arrives every 67.2ns. After the context information is read from the TCB, this leaves a total of 60.8ns for protocol processing by the core. At 10GHz, this allows up to 608 instructions to be executed [2]. Simulation traces show that a typical path through the microprogram for in-order packets arriving on an established connection is 116 instructions. After including TCB operations and branch and synchronization penalties, this translates to a total processing time of 32ns, well within the target budget. Processing an out-of-order packet, which involves execution of ROB operations, can add up to 19.2ns. This worst-case execution path still completes in 51.2ns. The core frequency required to achieve wire-speed processing is directly related to the smallest packet size supported. Relaxing the restriction on minimum packet size allows a reduction in the required core frequency, as shown for different Ethernet rates in Fig. 14.7.2.
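The timing budget above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only. It assumes that the 6.4ns gap between 67.2ns and 60.8ns corresponds to two 312.5MHz cycles of TCB access; the text does not break this down. The function names are hypothetical, and `required_core_ghz` is a simple scaling rule that is not meant to reproduce Fig. 14.7.2.

```python
LINE_RATE_BPS = 10e9          # 10Gb/s Ethernet
SLOW_CLOCK_HZ = 312.5e6       # slow-speed memory clock from the text
CORE_CLOCK_HZ = 10e9          # high-speed core clock from the text
TCB_CYCLES_PER_PACKET = 2     # assumption: accounts for the 6.4ns TCB overhead


def packet_interarrival_ns(packet_bytes: int, line_rate_bps: float = LINE_RATE_BPS) -> float:
    """Time between back-to-back packets of the given size, in ns."""
    return packet_bytes * 8 / line_rate_bps * 1e9


def core_time_ns(packet_bytes: int) -> float:
    """Time left for the core after the assumed slow-speed TCB cycles, in ns."""
    tcb_ns = TCB_CYCLES_PER_PACKET / SLOW_CLOCK_HZ * 1e9
    return packet_interarrival_ns(packet_bytes) - tcb_ns


def instruction_budget(packet_bytes: int, core_hz: float = CORE_CLOCK_HZ) -> int:
    """Core cycles (one instruction per cycle) available per packet."""
    return round(core_time_ns(packet_bytes) * core_hz / 1e9)


def required_core_ghz(packet_bytes: int, core_cycles: int) -> float:
    """Core frequency needed to fit core_cycles into the per-packet budget."""
    available = core_time_ns(packet_bytes)
    if available <= 0:
        raise ValueError("packet too small for the assumed TCB overhead")
    return core_cycles / available  # cycles per ns equals GHz


if __name__ == "__main__":
    print(packet_interarrival_ns(84))   # about 67.2 ns
    print(core_time_ns(84))             # about 60.8 ns
    print(instruction_budget(84))       # 608
```

Under the same assumptions, doubling the minimum packet size roughly doubles the arrival interval while the TCB overhead stays fixed, so the required core frequency for a given cycle count falls by slightly more than half.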

The execution unit (Fig. 14.7.3) is a 3-stage, 32b arithmetic logic unit (ALU) operating at 10GHz. It implements add/subtract, compare, and logical operations in parallel. Instructions are stored in fully decoded format in the µROM. The source and destination operands are chosen from among 26 fields of the working register and internal scratch registers. Careful floor planning was required to mitigate the large interconnect penalty and routing congestion. All register fields were split into groups according to bit number. The registers were further split into two halves, which were aligned with the corresponding bits in the ALU to minimize routing distance.

The µROM is a 2-stage, 80b x 320-entry, column-multiplexed array that uses a wave-pipelined design technique to achieve 10GHz performance. The pre-charge and evaluation operations for the local bit line, sense stage and global bit line are staggered with respect to each other (Fig. 14.7.4), and the final flip-flop capture edge is delayed by one clock phase. The pre-charge phase is overlapped with address decode to hide pre-charge latency. The set-dominant latch at the output makes the design robust at low frequencies. Branch instructions incur a 2-cycle penalty.

The ROB contains two 54b x 32-entry CAMs (CAML and CAMR) to support dynamic reordering of out-of-order packets. It is accessed only if a packet has arrived out of order [3]. Each CAM entry includes a sequence number and a payload length. CAML holds the first sequence number of an out-of-order packet; CAMR holds the last+1 sequence number. Out-of-order arrival of a new packet triggers a lookup in both CAMs to check whether the new payload is adjacent to any existing out-of-order payload. If so, the adjacent payloads are merged, thereby reducing the number of CAM entries. If CAML is not empty, in-order arrival of a packet requires only one lookup in CAML, using the last+1 sequence number, to check whether the succeeding payload exists. This method keeps the number of CAM accesses per packet constant: 2 lookups and 1 write for an out-of-order packet, and at most 1 lookup for an in-order packet. This is critical to achieving wire-speed processing.
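The merge-on-arrival scheme can be sketched in software with two dictionaries standing in for CAML and CAMR. This is a simplified model under stated assumptions, not the hardware design: the class and method names are hypothetical, sequence-number wraparound and overlapping or duplicate payloads are ignored, and the model updates both dictionaries explicitly, whereas the text counts a single CAM write per out-of-order packet.

```python
from typing import Dict


class ReorderBlock:
    """Hypothetical model of the ROB: two CAMs keyed by segment boundaries."""

    def __init__(self, entries: int = 32) -> None:
        self.entries = entries            # from the text: 32 entries per CAM
        self.caml: Dict[int, int] = {}    # first sequence number -> payload length
        self.camr: Dict[int, int] = {}    # last+1 sequence number -> payload length

    def out_of_order(self, seq: int, length: int) -> None:
        """Buffer an out-of-order payload, merging with adjacent payloads."""
        start, end = seq, seq + length
        left_len = self.camr.get(start)   # lookup 1: payload ending where we start
        right_len = self.caml.get(end)    # lookup 2: payload starting where we end
        if left_len is None and right_len is None and len(self.caml) >= self.entries:
            raise OverflowError("ROB full")
        if left_len is not None:          # absorb the left neighbour
            del self.camr[start]
            del self.caml[start - left_len]
            start -= left_len
        if right_len is not None:         # absorb the right neighbour
            del self.caml[end]
            del self.camr[end + right_len]
            end += right_len
        # Record the (possibly merged) payload under both of its boundaries.
        self.caml[start] = end - start
        self.camr[end] = end - start

    def in_order(self, seq: int, length: int) -> int:
        """Accept an in-order payload; return the next expected sequence number."""
        next_seq = seq + length
        buffered = self.caml.pop(next_seq, None)  # single CAML lookup
        if buffered is not None:                  # a succeeding payload was waiting
            del self.camr[next_seq + buffered]
            next_seq += buffered
        return next_seq
```

For example, after buffering payloads covering sequence numbers 300 to 399 and 500 to 599, the arrival of 400 to 499 collapses all three into one entry. Because merging happens on every arrival, a later in-order packet needs only one CAML lookup to release the whole contiguous run, and no sorting pass is required.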

The high-speed core uses implicit-pulsed semi-dynamic flip-flops [4] with small clock-to-Q delay and high skew tolerance. In addition, an adaptive body bias methodology [5] is applied to all PMOS devices. More than 85% of the device width in the high-speed core is low-VT. Clocking for the design includes two clock source options: an on-die phase-locked loop (PLL) and a secondary clock source that uses an operational amplifier to convert external differential clock inputs to a single-ended clock. The design of the 10GHz clock generation unit and its distribution is shown in Fig. 14.7.5. The simulated worst-case skew at 10GHz is 4.4ps. A frequency versus Vcc plot characterizing execution of the high-speed core is shown in Fig. 14.7.6. Simulations show that at 25°C and 1.2V, the core functions at 10GHz, with an average chip power consumption of 1.9W. The chip layout and summary are displayed in Fig. 14.7.7.

Acknowledgements

The authors thank K. Ikeda, K. Truong, C. Parsons and H. Nguyen for chip layout, D. Somasekhar and S. Tang for body bias circuitry, the LTD team for PLL design, and S. Borkar and J. Rattner for encouragement and support.

References

[1] Information Sciences Institute, “Transmission Control Protocol,” NIC-RFC 793, DDN Protocol Handbook, vol. 2, pp. 2,179-2,198, Sept. 1981.
[2] D. Clark, et al., “An Analysis of TCP Processing Overhead,” IEEE Comm., vol. 27, pp. 23-29, June 1989.
[3] V. Paxson, “End-to-End Internet Packet Dynamics,” ACM SIGCOMM, Sept. 1997.
[4] J. Tschanz, et al., “Comparative Delay and Energy of Single Edge-Triggered & Dual Edge-Triggered Pulsed Flip-Flops for High-Performance Microprocessors,” ISLPED, pp. 147-151, 2001.
[5] J. Tschanz, et al., “Adaptive Body Bias for Reducing Impacts of Die-to-Die and Within-Die Parameter Variations on Microprocessor Frequency and Leakage,” ISSCC Dig. Tech. Papers, pp. 422-423, 2002.

• 2003 IEEE International Solid-State Circuits Conference 0-7803-7707-9/03/$17.00 ©2003 IEEE

ISSCC 2003 / February 11, 2003 / Salon 8 / 4:45 PM

Figure 14.7.1: Top level block diagram.

Figure 14.7.2: Packet size vs. core frequency.

Figure 14.7.3: Execution unit organization.

Figure 14.7.4: Wave pipelined microcode ROM.

Figure 14.7.5: 10GHz clock generation and distribution.

Figure 14.7.6: Core frequency vs. power supply.


Figure 14.7.7: Chip layout and summary.
