ISSCC 2003 / SESSION 14 / MICROPROCESSORS / PAPER 14.7

14.7 A 10GHz TCP Offload Accelerator for 10Gb/s Ethernet in 90nm Dual-V_T CMOS

Yatin Hoskote, Vasantha Erraguntla, David Finan, Jason Howard, Dan Klowden, Siva Narendra, Greg Ruhl, James Tschanz, Sriram Vangal, Venkat Veeramachaneni, Howard Wilson, Jianping Xu, Nitin Borkar

Microprocessor Research Labs, Intel, Hillsboro, OR

This transmission control protocol (TCP) offload prototype performs input processing on minimum-size packets at wire speed for 10Gb/s Ethernet. It uses special-purpose hardware, programmable via a specialized instruction set. The architecture (Fig. 14.7.1) consists of a 10GHz high-speed core, fed by 312.5MHz slow-speed memory units that store context information. The transmission control block (TCB) stores the context information for an existing connection at the same index location at which the content addressable memory (CAM) in the connection lookup block (CLB) stores the 96b connection identifier [1]. Parsing of headers, connection lookup, and loading of connection state into the working register are done by the input sequencer, CLB, and TCB, respectively. The execution core, controlled by instructions from the microcode ROM (µROM), performs the heart of the TCP processing. The results are stored back in the TCB and the output packet is assembled in the send buffer. The 64-entry TCB can be viewed as a cache, with support for a larger number of connections in off-chip memory. The 33B of stored context information for each connection is sufficient to implement the input processing tasks offloaded in this prototype. The reorder block (ROB) includes two CAMs that are used exclusively to dynamically reorder out-of-order packets, without employing a traditional sorting algorithm.
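The shared-index arrangement between the CLB's CAM and the TCB can be illustrated with a minimal software sketch. Everything below is an assumption made for illustration: the class and function names are hypothetical, the 96b identifier is assumed to pack two 32b IPv4 addresses and two 16b ports (the text gives only the width), and a linear search stands in for the single-cycle CAM match.

```python
from typing import List, Optional

TCB_ENTRIES = 64        # from the text: 64-entry TCB
CONTEXT_BYTES = 33      # from the text: 33B of context per connection


def connection_id(src_ip: int, dst_ip: int, src_port: int, dst_port: int) -> int:
    """Pack a 96-bit identifier. The 32b+32b+16b+16b layout is an assumption."""
    return (src_ip << 64) | (dst_ip << 32) | (src_port << 16) | dst_port


class ConnectionTable:
    """Hypothetical model: CAM entry i and TCB entry i describe the same connection."""

    def __init__(self) -> None:
        self.cam: List[Optional[int]] = [None] * TCB_ENTRIES
        self.tcb: List[Optional[bytes]] = [None] * TCB_ENTRIES

    def lookup(self, conn_id: int) -> Optional[int]:
        """Return the matching index, or None on a miss (software stand-in for a CAM match)."""
        for index, key in enumerate(self.cam):
            if key == conn_id:
                return index
        return None

    def open(self, conn_id: int, context: bytes) -> int:
        """Store identifier and context at the same free index; ValueError if the table is full."""
        if len(context) != CONTEXT_BYTES:
            raise ValueError("context must be 33 bytes")
        index = self.cam.index(None)
        self.cam[index] = conn_id
        self.tcb[index] = context
        return index

    def load_context(self, conn_id: int) -> Optional[bytes]:
        """Look up the connection, then read the TCB at the index the CAM returned."""
        index = self.lookup(conn_id)
        return None if index is None else self.tcb[index]
```

Because the CAM hit index is also the TCB address, no second translation step is needed between lookup and context load, which is the property the paragraph above describes.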

Briefly, the chip performs connection establishment and teardown, checks message validity, computes payload length, processes incoming flags, performs window management, identifies and reorders out-of-order packets, and assembles response packets. A key feature of this architecture is hardware programmability via a specialized instruction set including instructions for efficient TCP processing. This enables quick adaptability to evolving protocols. The complete micro-program consists of 306 lines of code. Special instructions enable single-cycle CAM operations, as well as single-cycle 33B-wide TCB reads and writes, allowing 82.5Gb/s data transfer between TCB and core. These slow-speed memory operations occur once for every packet, keeping performance impact minimal.
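The 82.5Gb/s figure is consistent with one 33B-wide transfer per cycle of the 312.5MHz slow-speed clock. The short check below assumes that reading; the text does not spell out the derivation, and the function name is hypothetical.

```python
TCB_WIDTH_BYTES = 33        # from the text: 33B-wide TCB reads and writes
MEMORY_CLOCK_HZ = 312.5e6   # from the text: 312.5MHz slow-speed memory units


def tcb_bandwidth_gbps(width_bytes: int = TCB_WIDTH_BYTES,
                       clock_hz: float = MEMORY_CLOCK_HZ) -> float:
    """Peak transfer rate in Gb/s, assuming one full-width access per memory clock cycle."""
    bits_per_access = width_bytes * 8
    return bits_per_access * clock_hz / 1e9


if __name__ == "__main__":
    # 33 B * 8 b/B = 264 b per access; 264 b * 312.5 MHz = 82.5 Gb/s
    print(f"{tcb_bandwidth_gbps():.1f} Gb/s")
```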

For a minimum IP packet size of 84B at 10Gb/s, a new packet arrives every 67.2ns. After reading context information from the TCB, this gives a total of 60.8ns for protocol processing by the core. At 10GHz, this would allow us to execute up to 608 instructions [2]. Simulation traces show that a typical path through the micro-program for in-order packets arriving on an established connection is 116 instructions. After including TCB operations and branch and synchronization penalties, this translates to a total processing time of 32ns, well within the target budget. Processing an out-of-order packet, which involves execution of ROB operations, can add at most an additional 19.2ns. This worst-case execution path still completes in 51.2ns. The core frequency required to achieve wire-speed processing is directly related to the smallest packet size supported. Relaxing restrictions on minimum packet size allows reduction in the required core frequency, as shown for different Ethernet rates in Fig. 14.7.2.
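The budget arithmetic in this paragraph can be reproduced with a few lines. This is a sketch under stated assumptions: the function names are hypothetical, the 6.4ns TCB read time is inferred as the difference between the quoted 67.2ns and 60.8ns, and `required_core_ghz` is a simplified model (fixed cycle count, fixed TCB read time) rather than the method behind Fig. 14.7.2.

```python
def packet_interval_ns(packet_bytes: int, line_rate_gbps: float) -> float:
    """Arrival interval of back-to-back packets: bits divided by Gb/s gives ns."""
    return packet_bytes * 8 / line_rate_gbps


def instruction_budget(time_ns: float, core_ghz: float) -> int:
    """Upper bound on single-cycle instructions that fit in time_ns."""
    return round(time_ns * core_ghz)


def required_core_ghz(cycles: int, packet_bytes: int, line_rate_gbps: float,
                      tcb_read_ns: float) -> float:
    """Simplified model (an assumption): core clock needed to fit `cycles`
    into the packet interval left over after a fixed TCB read time."""
    available_ns = packet_interval_ns(packet_bytes, line_rate_gbps) - tcb_read_ns
    return cycles / available_ns


if __name__ == "__main__":
    interval = packet_interval_ns(84, 10.0)   # the text quotes 67.2ns
    tcb_read = interval - 60.8                # inferred, not stated in the text
    print(f"interval      : {interval:.1f} ns")
    print(f"TCB read time : {tcb_read:.1f} ns (inferred)")
    print(f"budget        : {instruction_budget(60.8, 10.0)} instructions")
    print(f"worst case    : {32.0 + 19.2:.1f} ns of {60.8} ns available")
```

Under this model, a larger minimum packet size lengthens the arrival interval and so lowers the clock rate needed for the same cycle count, which is the trend the paragraph attributes to Fig. 14.7.2.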

The execution unit (Fig. 14.7.3) is a 3-stage, 32b arithmetic logic unit (ALU) operating at 10GHz. It implements add/subtract, compare, and logical operations in parallel. Instructions are stored in fully decoded format in the µROM. The source and destination operands are chosen from among 26 fields of the working register and internal scratch registers. Careful floor planning was required to mitigate the large interconnect penalty and routing congestion. All register fields were split into groups according to bit number. The registers were further split into two halves, which were aligned with the corresponding bits in the ALU to minimize routing distance.

The µROM is a 2-stage, 80b x 320-entry, column-multiplexed array that uses a wave-pipelined design technique to achieve 10GHz performance. The pre-charge and evaluation operations for the local bit line, sense stage, and global bit line are staggered with respect to each other (Fig. 14.7.4), and the final flip-flop capture edge is delayed by one clock phase. The pre-charge phase is overlapped with address decode to hide pre-charge latency. The set-dominant latch at the output makes the design robust at low frequencies. Branch instructions incur a 2-cycle penalty.

The ROB contains two 54b x 32-entry CAMs (CAM_L and CAM_R) to support dynamic reordering of out-of-order packets. It is accessed only if a packet has arrived out of order [3]. Each CAM entry includes sequence number and payload length. CAM_L holds the first sequence number of an out-of-order packet; CAM_R holds the last+1 sequence number. Out-of-order arrival of a new packet triggers a lookup in both CAMs to check whether the new payload is adjacent to any existing out-of-order payload. If so, adjacent payloads are merged, thereby reducing CAM entries. If CAM_L is not empty, in-order arrival of a packet requires only one lookup in CAM_L using the last+1 sequence number to check whether the succeeding payload exists. This method holds the number of CAM accesses per packet constant at 2 lookups and 1 write for an out-of-order packet and at most 1 lookup for an in-order packet. This is critical to achieving wire-speed processing.
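The two-CAM merge scheme can be sketched in software as follows. This is an illustrative model, not the chip's implementation: Python dictionaries stand in for the CAMs, the class and method names are hypothetical, and sequence-number wraparound and overlapping payloads are ignored for brevity.

```python
from typing import Dict


class ReorderBlock:
    """Hypothetical model of the ROB: two lookup tables keyed by segment boundaries."""

    def __init__(self, capacity: int = 32) -> None:   # from the text: 32 entries per CAM
        self.capacity = capacity
        self.cam_l: Dict[int, int] = {}   # first sequence number -> payload length
        self.cam_r: Dict[int, int] = {}   # last+1 sequence number -> payload length

    def _remove(self, first: int, length: int) -> None:
        del self.cam_l[first]
        del self.cam_r[first + length]

    def out_of_order(self, seq: int, length: int) -> None:
        """Two lookups and one write: merge with adjacent stored payloads if present."""
        end = seq + length
        succ_len = self.cam_l.get(end)   # lookup 1: a payload that starts where this one ends
        pred_len = self.cam_r.get(seq)   # lookup 2: a payload that ends where this one starts
        first, total = seq, length
        if pred_len is not None:
            self._remove(seq - pred_len, pred_len)
            first -= pred_len
            total += pred_len
        if succ_len is not None:
            self._remove(end, succ_len)
            total += succ_len
        if len(self.cam_l) >= self.capacity:
            raise OverflowError("reorder block is full")
        self.cam_l[first] = total            # single write of the merged entry
        self.cam_r[first + total] = total

    def in_order(self, seq: int, length: int) -> int:
        """One lookup in CAM_L with the last+1 number; returns the next expected sequence number."""
        end = seq + length
        succ_len = self.cam_l.get(end)
        if succ_len is None:
            return end
        self._remove(end, succ_len)
        return end + succ_len
```

As a usage example, storing out-of-order payloads at 1200, 1400, and then 1300 (100 bytes each) leaves a single merged entry covering 1200 to 1500; an in-order packet covering 1000 to 1200 then finds that entry with one CAM_L lookup and advances the expected sequence number to 1500. The work per packet does not grow with the number of stored payloads, which is why no sorting pass is needed.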

The high-speed core uses implicit-pulsed semi-dynamic flip-flops [4] with small clock-to-Q delay and high skew tolerance. In addition, adaptive body bias methodology [5] is applied to all PMOS devices. More than 85% of the device width in the high-speed core is low-V_T. Clocking for the design includes two clock source options: an on-die phase-locked loop (PLL) and a secondary clock source that uses an operational amplifier to convert external differential clock inputs to a single-ended clock. The design of the 10GHz clock generation unit and its distribution is shown in Fig. 14.7.5. The simulated worst-case skew at 10GHz is 4.4ps. A frequency versus V_cc plot characterizing execution of the high-speed core is shown in Fig. 14.7.6. Simulations show that at 25°C and 1.2V, the core functions at 10GHz, with an average chip power consumption of 1.9W. Chip layout and summary are displayed in Fig. 14.7.7.

Acknowledgements

The authors thank K. Ikeda, K. Truong, C. Parsons and H. Nguyen for chip layout, D. Somasekhar and S. Tang for body bias circuitry, the LTD team for PLL design, and S. Borkar and J. Rattner for encouragement and support.

References

[1] Information Sciences Institute, “Transmission Control Protocol,” NIC-RFC 793, DDN Protocol Handbook, vol. 2, pp. 2,179-2,198, Sept. 1981.
[2] D. Clark, et al., “An Analysis of TCP Processing Overhead,” IEEE Comm., vol. 27, pp. 23-29, June 1989.
[3] V. Paxson, “End-to-End Internet Packet Dynamics,” ACM SIGCOMM, Sept. 1997.
[4] J. Tschanz, et al., “Comparative Delay and Energy of Single Edge-Triggered & Dual Edge-Triggered Pulsed Flip-Flops for High-Performance Microprocessors,” ISLPED, pp. 147-151, 2001.
[5] J. Tschanz, et al., “Adaptive Body Bias for Reducing Impacts of Die-to-Die and Within-Die Parameter Variations on Microprocessor Frequency and Leakage,” ISSCC Dig. Tech. Papers, pp. 422-423, 2002.

• 2003 IEEE International Solid-State Circuits Conference 0-7803-7707-9/03/$17.00 ©2003 IEEE

ISSCC 2003 / February 11, 2003 / Salon 8 / 4:45 PM

Figure 14.7.1: Top level block diagram.
Figure 14.7.2: Packet size vs. core frequency.
Figure 14.7.3: Execution unit organization.
Figure 14.7.4: Wave pipelined microcode ROM.
Figure 14.7.5: 10GHz clock generation and distribution.
Figure 14.7.6: Core frequency vs. power supply.

Figure 14.7.7: Chip layout and summary.
