On-Chip Interconnect Trend and Design Optimization Chung-Kuan Cheng UC San Diego, La Jolla, CA


Outline
• Global Interconnect Technologies
  – RC Trees and Transmission Lines
• Prefix Adder Synthesis
  – Modeling
• FPGA Interconnect Architecture
  – Modeling
• Interconnect Architecture
  – Non-Manhattan Wire Arrangement

Interconnect Technologies
• Introduction
• On-Chip Global Interconnection
• Global Wire Modeling
• Performance Comparison

Introduction – Performance Impact
• Interconnect delay determines the system performance [ITRS08]
  – 542ps for 1mm minimum-pitch Cu global wire w/o repeater @ 45nm
  – ~150ps for 10 levels of FO4 delay @ 45nm

[Ho2001] “Future of Wire”

Introduction – Power Dissipation
• Interconnects consume a significant portion of power
  – 1-2 orders of magnitude larger than gates
    • Half of the dynamic power is dissipated in repeaters inserted to minimize latency [Zhang07]
  – Wires consume 50% of total dynamic power for a 0.13um microprocessor [Magen04]
    • About 1/3 is burned on the global wires.

Introduction – Technology Trend
• On-Chip Interconnect Scaling
  – Dimension shrinks
    • Wire resistance increases -> RC delay
    • Increasing capacitive coupling -> delay, power, noise, etc.
  – Performance of global wires decreases w/ technology scaling.

Scaling trend of PUL (per-unit-length) wire resistance and capacitance

Wire Category   Parameter       90nm     45nm     22nm
M1 Wire         Rw (kohm/mm)    1.914    8.860    34.827
M1 Wire         Cw (pF/mm)      0.183    0.157    0.129
Global Wire     Rw (kohm/mm)    0.532    2.970    11.000
Global Wire     Cw (pF/mm)      0.205    0.179    0.151

[Figure: copper resistivity versus wire width]
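To make the trend in the table concrete, the sketch below plugs the global-wire values into the textbook 50% delay estimate for an unrepeated distributed RC line, roughly 0.38·R·C·L². This is only an illustration: driver and load are ignored, the function name is hypothetical, and the result is not expected to reproduce the ITRS minimum-pitch figure quoted earlier.

```python
# Per-unit-length global-wire values copied from the table:
# node -> (Rw in kohm/mm, Cw in pF/mm).
GLOBAL_WIRE = {
    "90nm": (0.532, 0.205),
    "45nm": (2.970, 0.179),
    "22nm": (11.000, 0.151),
}


def unrepeated_delay_ps(r_kohm_per_mm, c_pf_per_mm, length_mm):
    """Approximate 50% delay of a distributed RC line (driver and load ignored).

    Uses the common estimate 0.38 * R * C * L^2. kohm * pF gives ns,
    so the result is multiplied by 1000 to report picoseconds.
    """
    return 0.38 * r_kohm_per_mm * c_pf_per_mm * length_mm ** 2 * 1000.0


for node, (r, c) in GLOBAL_WIRE.items():
    print(node, round(unrepeated_delay_ps(r, c, 1.0), 1), "ps for a 1 mm wire")
```

Because the delay grows with L², the same function also shows why long global wires are broken into repeated segments.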

Organization of On-Chip Global Interconnections

Multi-Dimensional Design Consideration

Preliminary analysis results assuming 65nm CMOS process.

Application-oriented choice
• Low Latency: T-TL or UT-TL -> Single-Ended T-lines
• High Throughput: R-RC
• Low Power: PE-TL or UE-TL -> Differential T-lines
• Low Noise: PE-TL or UE-TL -> Differential T-lines
• Low Area/Cost: R-RC

For each architecture, the more area the pentagon covers, the better the overall performance.

On-Chip Global Interconnect Schemes (1)

Repeated RC wires (R-RC); Un-Terminated and Terminated T-Line (UT-TL and T-TL)

• R-RC structure
  – Repeater size / length of segments
  – Adopt previous design methodology [Zhang07]
• UT-TL structure
  – Full swing at wire-end
  – Tapered inverter chain as TX
• T-TL structure
  – Optimize eye-height at wire-end
  – Non-tapered inverter chain as TX

On-Chip Global Interconnect Schemes (2)

Un-Equalized and Passive-Equalized T-Line (UE-TL and PE-TL)

• Driver side: tapered differential driver
• Receiver side: termination resistance, Sense-Amplifier (SA) + inverter chain
• Passive equalizer: parallel RC network
• Design constraint: enough eye-opening (50mV) needed at the wire-end

Effects of driver impedance and termination resistance on step response
• Larger driver impedance leads to a slower rising edge and lower saturation voltage.
• Larger termination resistance causes a sharper rising edge but larger reflection.

Optimal Rload
• Bit-rate: 50Gbps
• Rs=11.06ohm, Rd=350ohm, Cd=0.38pF, RL=107.69ohm

Global Wire Modeling – Single-Ended & Differential On-Chip T-lines
• Determine the bit rate
• Smallest wire dimensions that satisfy the eye constraint
• Notice PE-TL needs narrower wire -> Equalization helps to increase density.
• Orthogonal layers replaced by ground planes -> 2D cap extraction, accurate when loading density is high.
• Top-layer thick wires used -> dimensions are maintained as technology scales.
• LC-mode behavior dominant

Global Wire Modeling – RC wires and T-lines
• RC wire modeling
  – Distributed Π model composed of wire resistance and capacitance
  – Closed-form equations [Sim03] to calculate 2D wire capacitance
• T-line 2D-R(f)L(f)C parameter extraction
  – 2D-C extraction template
  – 2D-R(f)L(f) extraction template
• T-line modeling
  – R(f)L(f)C tabular model -> Transient simulation to estimate eye-height.
  – Synthesized compact circuit model [Kopcsay02] -> Study signal integrity issue.

Performance Analysis – Definitions
• Normalized delay (unit: ps/mm)
  – Propagation delay includes wire delay and gate delay.
• Normalized energy per bit (unit: pJ/m)
  – Bit rate is assumed to be the inverse of propagation delay for RC wires
• Normalized throughput (unit: Gbps/um)
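The three metrics can be written as small helper functions. The slide's own formulas did not survive extraction, so the definitions below are assumptions inferred from the stated units (delay per length, energy per bit per length, bit rate per wire pitch); all function and parameter names are hypothetical.

```python
def normalized_delay_ps_per_mm(delay_ps, length_mm):
    """Propagation delay (wire + gate) divided by wire length."""
    return delay_ps / length_mm


def rc_bit_rate_gbps(delay_ps):
    """RC wires: bit rate taken as the inverse of propagation delay.

    1 / ps = 1 THz, i.e. 1000 Gbps.
    """
    return 1000.0 / delay_ps


def normalized_energy_pj_per_m(power_mw, bit_rate_gbps, length_mm):
    """Energy per bit divided by wire length.

    mW / Gbps = pJ per bit; the length is converted from mm to m.
    """
    energy_per_bit_pj = power_mw / bit_rate_gbps
    return energy_per_bit_pj / (length_mm * 1e-3)


def normalized_throughput_gbps_per_um(bit_rate_gbps, wire_pitch_um):
    """Bit rate divided by the wire pitch it occupies (assumed definition)."""
    return bit_rate_gbps / wire_pitch_um


# Example: a 5 mm wire with 275 ps total delay.
print(normalized_delay_ps_per_mm(275.0, 5.0))
```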

Performance Analysis – Latency

Variables: technology-defined parameters
• Supply voltage: Vdd (unit: V)
• Dielectric constant: ε_r
• Min-sized inverter FO4 delay (unit: ps)

R-RC structure (min-d)
• Scaling relations: c_nmos ∝ S, r_w ∝ 1/S^2, c_w ∝ ε_r
• ε_r is roughly constant
• FO4 delay scales w/ scaling factor S
• Increasing w/ technology scaling!

T-line structures
• Sum of wire delay and TX delay
• Wire delay (depends on ε_r)
• TX delay improved w/ FO4 delay
• Decreasing w/ technology scaling!

Performance Analysis – Energy per Bit

Same variables defined before
• Vdd reduces as technology scales; ε_r reduces as technology scales
• Energy decreases w/ technology scaling!

T-line structures
• Sum of power consumed on wire and TX
  – Power of T-line
  – Power of TX circuit
• FO4 delay reduces exponentially
• Energy decreases w/ larger slope!!

[Equation terms on the slide: ε_r, Vdd^2, f·C·Vdd^2, "Constant!"]

Performance Analysis – Throughput

Same variables defined before
• Assuming wire pitch
• Throughput increases by 20% per generation!

T-line structures
• TX bandwidth
• Neglect the minor change of wire pitch
• K1 = 0, for UT-TL
• Throughput increases by 43% per generation!!
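The two per-generation growth rates can be reproduced with a quick calculation. This is a sketch under two assumptions that are not stated explicitly on the slide: the linear scaling factor is S = 0.7 per generation, and throughput scales as 1/sqrt(S) for the first case (20%) and as 1/S for the T-line case (43%).

```python
import math

S = 0.7  # assumed linear scaling factor per technology generation


def growth_percent(scaling_ratio):
    """Convert a throughput ratio between generations into percent growth."""
    return (scaling_ratio - 1.0) * 100.0


growth_inv_sqrt_s = growth_percent(1.0 / math.sqrt(S))  # assumed 1/sqrt(S) trend
growth_inv_s = growth_percent(1.0 / S)                  # assumed 1/S trend

print(round(growth_inv_sqrt_s), round(growth_inv_s))
```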

Design Framework for On-Chip T-line Schemes

The proposed framework can be applied to design UT-TL/T-TL/UE-TL/PE-TL by changing the wire configuration and circuit structure.

Different optimization routines (LP/ILP/SQP, etc.) can be adopted according to the problem formulation.

Experimental Settings
• Design objective: min-d
• Technology nodes: 90nm-22nm
• Five different global interconnection structures
• Wire length: 5mm
• Parameter extraction
  – 2D field solver CZ2D from EIP tool suite of IBM
  – Tabular model or synthesized model
• Transistor models
  – Predictive transistor model from [Uemura06]
  – Synopsys level 3 MOSFET model tuned according to ITRS roadmap
• Simulation
  – HSPICE 2005
• Modeling and Optimization
  – Linear or non-linear regression/SQP routine
  – MATLAB 2007

Performance Metric: Normalized Delay – Results and Comparison

• Technology trends
  – R-RC ↑
  – T-line schemes ↓
• T-line structures
  – Outperform R-RC beyond 90nm
  – Single-ended: lowest delay
• At 22nm node
  – R-RC: 55ps/mm
  – T-lines: 8ps/mm (85% reduction)
  – Speed of light: 5ps/mm
• Linear model
  – < 6% average percent error
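The 85% figure follows directly from the two 22nm delays. A minimal check, with a hypothetical helper name:

```python
def reduction_percent(baseline, improved):
    """Percent reduction of `improved` relative to `baseline`."""
    return 100.0 * (baseline - improved) / baseline


RC_DELAY_PS_PER_MM = 55.0        # R-RC at 22nm, from the slide
TLINE_DELAY_PS_PER_MM = 8.0      # T-lines at 22nm, from the slide
LIGHT_DELAY_PS_PER_MM = 5.0      # speed-of-light bound, from the slide

print(round(reduction_percent(RC_DELAY_PS_PER_MM, TLINE_DELAY_PS_PER_MM)))
# How far the T-line delay still is from the speed-of-light bound.
print(TLINE_DELAY_PS_PER_MM / LIGHT_DELAY_PS_PER_MM)
```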

Performance Metric: Normalized Energy per Bit – Results and Comparison

• Technology trends
  – R-RC and T-lines ↓
  – T-lines reduce more quickly
• T-line structures
  – Outperform R-RC beyond 45nm
  – Differential: lowest energy
  – Single-ended: similar to R-RC; T-TL > UT-TL
• At 22nm node
  – R-RC: 100pJ/m
  – Single-ended: 60% reduction
  – Differential: 96% reduction
• Linear model
  – < 12% average percent error
  – Error for T-TL and PE-TL: RL and passive equalizers

Performance Metric: Normalized Throughput – Results and Comparison

• Technology trends
  – R-RC and T-lines ↑
  – T-lines increase more quickly
• T-line structures
  – Outperform R-RC beyond 32nm
  – Differential better than single-ended
• At 22nm node
  – R-RC: 12Gbps/um
  – T-TL: 30% improvement
  – UE-TL: 75% improvement
  – PE-TL: ~2X of R-RC
• Linear model
  – < 7% average percent error

Signal Integrity – single-ended T-lines

Worst-case switching pattern for peak noise simulation

• UT-TL structure
  – 380mV peak noise at 1V supply voltage w/ 7ps rise time
  – SI could be a big issue as supply voltage drops
• T-TL: less sensitive to noise
  – At the same rise time, ~50% reduction of peak noise
  – Peak noise ↓ as technology scales

[Figure panels: using the worst-case pattern; using single or multiple PRBS patterns]

Signal Integrity – differential T-lines

Worst-case switching pattern for peak noise simulation

• More reliable
  – Termination resistance
  – Common-mode noise reduction
• Peak noise
  – Within ~10mV range
• Eye-heights
  – UE-TL: eye reduces as bit rate ↑; harder to meet constraint.
  – PE-TL: > 70mV eye even at 22nm node. Equalization does help!

Summary (cont'd)

Item in the table: score/value. Score: the higher, the better in terms of the given metric; max. score is 5. The best structure in each column is marked in red on the original slide.

Low-Latency Application (ps/mm)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     3/35    1/42    1/46    1/55    1/55
UT-TL    5/15    5/13    5/10    5/9     5/8
T-TL     5/15    5/13    5/10    5/9     5/8
UE-TL    1/37    3/25    3/16    3/12    5/8
PE-TL    1/37    3/25    3/16    3/12    5/8

High-Throughput Application (Gbps/um)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     5/5     5/6     3/8     3/10    2/12
UT-TL    2/3.3   1/3.3   1/3.3   1/3.3   1/3.3
T-TL     1/3     2/3.4   2/6     2/9     3/16
UE-TL    3/3     3/5     4/9     4/13    4/21
PE-TL    4/4     4/5.3   5/9     5/15    5/24

Low-Energy Application (pJ/m)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     2/150   2/140   1/130   1/100   1/100
UT-TL    3/140   3/110   3/70    3/50    2/40
T-TL     1/260   1/200   2/100   2/60    3/40
UE-TL    4/60    4/36    4/20    4/10    5/4
PE-TL    5/26    5/16    5/8     5/5     5/2

Low-Noise Application (score only)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     1       1       1       1       1
UT-TL    1       1       1       1       1
T-TL     3       3       3       3       3
UE-TL    5       5       4       4       4
PE-TL    4       4       5       5       5

Summary of Global Interconnect

• Compared five different global interconnections in terms of latency, energy per bit, throughput and signal integrity from 90nm to 22nm.
• A simple linear model is provided to link
  – Architecture-level performance metrics
  – Technology-defined parameters
• Some observations from experimental results
  – T-line structures have potential to replace R-RC at future nodes
  – Differential T-lines are better than single-ended
    • Low-power/High-throughput/Low-noise
  – Equalization could be utilized for on-chip global interconnection
    • Higher throughput density, improved signal integrity
    • Even w/ lower energy dissipation (passive equalization)

Prefix Adder Synthesis

• Motivation
• Prefix Adder Formulation
  – Area/Timing/Power Models
  – Mixed-Radix (2,3,4) Adders
  – ILP Formulation
• Experimental Results

Motivation: Prefix Adder
• Increasing impact of physical design and concern of power.

[Figure labels: Logical levels, Wire tracks, Fanouts, Area, Physical placement, Detail routing, Timing, Gate cap, Wire cap, Gate sizing, Buffer insertion, Signal slope, Input arrival time, Output required time, Power, Static power, Dynamic power, Power gating, Activity probability]

Prefix Adder Formulation
• Input: two n-bit binary numbers a_{n-1} ... a_1 a_0 and b_{n-1} ... b_1 b_0, one bit carry-in c_0
• Output: n-bit sum s_{n-1} ... s_1 s_0 and one bit carry-out c_n
• Prefix Addition: Carry generation & propagation

  Generate:   g_i = a_i · b_i
  Propagate:  p_i = a_i ⊕ b_i
  Carry:      c_{i+1} = g_i + p_i · c_i
  Sum:        s_i = c_i ⊕ (a_i ⊕ b_i)

Prefix Addition – Formulation

Pre-processing:
  g_i = a_i · b_i,  p_i = a_i ⊕ b_i

Prefix Computation:
  G[i:k] = G[i:j] + P[i:j] · G[j-1:k]
  P[i:k] = P[i:j] · P[j-1:k]

Post-processing:
  c_{i+1} = G[i:0] + P[i:0] · c_0
  s_i = p_i ⊕ c_i
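The three stages above can be exercised with a small bit-level model. This is a minimal sketch, not the synthesized adder from the talk: it uses a serial (ripple-like) prefix chain for clarity, whereas the ILP in the following slides searches over prefix structures. The function names are hypothetical.

```python
def gp_combine(left, right):
    """Combine (G, P)[i:j] with (G, P)[j-1:k] into (G, P)[i:k]."""
    g_left, p_left = left
    g_right, p_right = right
    return (g_left | (p_left & g_right), p_left & p_right)


def prefix_add(a, b, n, c0=0):
    """Add two n-bit numbers; returns (n-bit sum, carry-out)."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]

    # Pre-processing: per-bit generate and propagate.
    gp = [(x & y, x ^ y) for x, y in zip(a_bits, b_bits)]

    # Prefix computation: (G, P)[i:0] for every bit i, built serially.
    prefix = []
    acc = None
    for i in range(n):
        acc = gp[i] if acc is None else gp_combine(gp[i], acc)
        prefix.append(acc)

    # Post-processing: carries c_{i+1} = G[i:0] + P[i:0]*c0, sums s_i = p_i xor c_i.
    carries = [c0] + [g | (p & c0) for g, p in prefix]
    total = 0
    for i in range(n):
        total |= (gp[i][1] ^ carries[i]) << i
    return total, carries[n]


print(prefix_add(11, 6, 4))
```

Any valid prefix structure (different groupings of `gp_combine`) yields the same carries because the operator is associative; only area, delay, and power differ.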

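Prefix Adder – Prefix Structure Graph

[Figure: prefix structure graph for bit positions 4..1 with intervals 4:1, 3:1, 2:1, 1. Pre-processing: gp generators take a_i, b_i and produce (g, p)_i. Prefix computation: each GP cell combines GP[i, j] and GP[j-1, k] into GP[i, k]. Post-processing: sum generators take p_i and G[i:0] and produce s_i.]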

Area Model

• Distinguish physical placement from logical structure, but keep the bit-slice structure.

[Figure: logical view (bit position 8..1 vs. logical level) and physical view (bit position 8..1 vs. physical level) after compact placement]

Timing Model

• Cell delay calculation: d = f + p
  – f: effort delay, f = g · h
    • g: logical effort
    • h: electrical effort = Cout/Cin = (fanouts + wirelength) / size
  – p: intrinsic delay
  – g and p are intrinsic properties of the cell
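The cell delay model can be evaluated directly. In this sketch, fanout load and wire length are assumed to be expressed in the same normalized capacitance units as the cell size; the parameter names are hypothetical.

```python
def cell_delay(logical_effort, intrinsic_delay, fanouts, wirelength, size):
    """Logical-effort delay d = g*h + p.

    h = Cout / Cin is approximated as (fanouts + wirelength) / size,
    with all three quantities in normalized capacitance units (assumption).
    """
    electrical_effort = (fanouts + wirelength) / size
    effort_delay = logical_effort * electrical_effort
    return effort_delay + intrinsic_delay


# A unit-size cell with g = 1, p = 1 driving four unit loads and no wire.
print(cell_delay(1.0, 1.0, fanouts=4, wirelength=0, size=1))
# Doubling the cell size halves the effort delay for the same load.
print(cell_delay(1.0, 1.0, fanouts=4, wirelength=0, size=2))
```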

Power Model

• Total power consumption: dynamic power + static power
  – P_total = P_dyn + P_sta
• Static power: leakage current of the devices; P_sta is proportional to #cells
• Dynamic power: current switching capacitance; P_dyn is proportional to C_load, weighted by the switching probability
• The switching probability depends on j (j is the logical level*)

* Vanichayobon S, et al., “Power-speed Trade-off in Parallel Prefix Circuits”

Interval Adjacency Constraint

[Figure: 8-bit prefix graph with columns 8..1 and outputs H8..H1. Nodes are labeled (column id, logic level).]

• (7,3): Interval [7,1]
• (3,2): Interval [3,1]
• (7,2): Interval [7,4]
• The two input intervals must be adjacent, i.e. 4 = 3 + 1
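The adjacency rule can be stated as a small check on interval pairs. This is an illustrative sketch with a hypothetical function name; intervals are written as (left bound, right bound) with left >= right, matching the [7,4] and [3,1] notation.

```python
def combine_intervals(left, right):
    """Merge two prefix intervals, enforcing the adjacency constraint.

    left  = (i, j) covers bits i..j
    right = (k, m) covers bits k..m
    The merge is legal only if j == k + 1, and then covers bits i..m.
    """
    left_hi, left_lo = left
    right_hi, right_lo = right
    if left_lo != right_hi + 1:
        raise ValueError(f"intervals {left} and {right} are not adjacent")
    return (left_hi, right_lo)


# Node (7,3) combines node (7,2) = [7,4] with node (3,2) = [3,1].
print(combine_intervals((7, 4), (3, 1)))
```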

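Linearization for Interval Adjacency Constraint

[Figure: node (i, j) with left operand (i, h), selected by wl, and candidate right operands (k1, l1), (k2, l2), selected by wr1, wr2. Each node carries an interval [y^L, y^R]: [y^L_(i,j), y^R_(i,j)], [y^L_(i,h), y^R_(i,h)], [y^L_(k1,l1), y^R_(k1,l1)], [y^L_(k2,l2), y^R_(k2,l2)].]

Constraint:
  y^R_(i,h) = y^L_(k,l) + 1   if wl(i,j,h) = 1 and wr(i,j,k,l) = 1

Left interval bound equal to column index:
  y^L_(i,j) = i

Pseudo linear:
  y^R_(i,h) = Σ_(k,l) wr(i,j,k,l) · (k + 1)   if wl(i,j,h) = 1

Linearize (n is the bit width):
  y^R_(i,h) ≤ Σ_(k,l) wr(i,j,k,l) · (k + 1) + n · (1 − wl(i,j,h))
  y^R_(i,h) ≥ Σ_(k,l) wr(i,j,k,l) · (k + 1) − n · (1 − wl(i,j,h))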

ILP Formulation Overview

• Structure variables: GP cells, connections (wires), physical positions
• Capacitance variables: gate cap, vertical wire cap, horizontal wire cap
• Timing variables: input arrival time, output arrival time
• Power objective

ILP -> ILOG CPLEX -> Optimal Solution

Experiments – 16-bit Uniform Timing

[Figures: 16-bit uniform-timing results]

Min-Power Radix-2 Adder (delay = 22, power = 45.5 FO4)

[Figure: prefix structure graph over bit positions 1..16]

Min-Power Radix-2&4 Adder (delay = 18, power = 29.75 FO4)

[Figure: prefix structure graph over bit positions 1..16. Legend: Radix-2 cell, Radix-4 cell]

Min-Power Mixed-Radix Adder (delay = 20, power = 28.0 FO4)

[Figure: prefix structure graph over bit positions 1..16. Legend: Radix-2 cell, Radix-3 cell, Radix-4 cell]

Experiments – 64-bit Hierarchical Structure (Mixed-Radix)

• Handle high bit-width applications
• 16x4 and 8x8

[Figure: two-level hierarchy. Level 1: four ILP blocks over inputs a1 b1 ... a16 b16, a17 b17 ... a32 b32, a33 b33 ... a48 b48, a49 b49 ... a64 b64, producing GP*[16:2], GP*[32:18], GP*[48:34], GP*[64:50] and GP*[1:1], GP*[17:17], GP*[33:33], GP*[49:49]. Level 2: one ILP block. Outputs H64 ... H49, H48 ... H33, H32 ... H17, H16 ... H1.]

FPGA Global Routing Architecture

• Synthesis Flow
• Formulation
• Experimental Results

Synthesis Flow

Formulation

Architecture Design Tradeoffs

[Figure labels: Latency, Power, Area, Cost]

FPGA Global Routing Architecture

Energy Model: Wires
• 0.18um tech node, grid length = 0.5mm
• 4 types of wires: RC wires with spacing (RC 1x, RC 2x, RC 4x) and transmission lines (T-line 10x)

[Figure: Pw, per-bit wire energy. x-axis: wire length (x grid length), 1 to 8. y-axis: energy (pJ/switch), 0 to 6. Curves: RC 1x, RC 2x, RC 4x, T-line 10x]

Energy and Area Model: Switch Box

Switch Area Model
• Fs: number of switches connected to each wire entering a switch box
• f: total flow entering a switch box
• Ns: per-bit number of switches inside a switch box
• N = Ns · f = (1/2) · Fs · f

Energy Model
• Pu: energy of a single switch
• Ps: per-bit switch energy
• P = Ps · f = Pu · Ns · f = (1/2) · Pu · Fs · f
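A small numeric sketch of the switch-box model follows. It assumes the reading Ns = Fs/2 (each switch is shared by two wires), so that the totals are N = Ns·f and P = Pu·Ns·f; the slide equations are damaged, so treat this reading and all names below as assumptions.

```python
def switch_box_model(flow, switches_per_wire, energy_per_switch):
    """Return per-bit and total switch count and energy for one switch box.

    flow               f  : total flow entering the switch box
    switches_per_wire  Fs : switches connected to each entering wire
    energy_per_switch  Pu : energy of a single switch
    """
    per_bit_switches = 0.5 * switches_per_wire           # Ns (assumed Fs / 2)
    total_switches = per_bit_switches * flow             # N  = Ns * f
    per_bit_energy = energy_per_switch * per_bit_switches  # Ps = Pu * Ns
    total_energy = per_bit_energy * flow                 # P  = Ps * f
    return {
        "Ns": per_bit_switches,
        "N": total_switches,
        "Ps": per_bit_energy,
        "P": total_energy,
    }


print(switch_box_model(flow=10, switches_per_wire=3, energy_per_switch=0.2))
```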

Topology Generation
• Candidate topologies are required for MCF interconnection synthesis
  – MCF optimizes flow distribution, but not topology
• A huge number of different topologies exist
  – A row of 10 cells has 2^C(10, 2) = 2^45 different connections
  – A 10x10 FPGA has (2^45)^20 = 2^900 different topologies!
• Our assumptions
  – Each row and column has the same connection
  – Wire lengths are given (e.g. wire length = 1, 2, 4, 8, ...)
  – A certain wire length repeats itself till the end of the chip
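The topology count can be verified in a few lines: each of the C(10, 2) cell pairs in a row is either connected or not, and a 10x10 array has 10 rows plus 10 columns that are chosen independently. Variable names are illustrative.

```python
import math

CELLS_PER_LINE = 10
LINES = 10 + 10  # 10 rows and 10 columns

# Each unordered pair of cells in a row/column may or may not be wired.
pairs_per_line = math.comb(CELLS_PER_LINE, 2)
log2_connections_per_line = pairs_per_line
log2_topologies = pairs_per_line * LINES

print(pairs_per_line, log2_topologies)
```

The three assumptions listed above collapse this space to a small candidate set (93 topologies in the experiments that follow).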

Representative Netlist Generation
• Properties of Representative Netlist
  – Matches the size of the benchmark netlists
• Geometry Distribution Function
  – The probability of the distance between two pins decreases exponentially when distance increases
  – k: distance between pins
  – p: probability of distance-1 links
  – P(k): probability of distance-k links

  P(k) = p · (1 − p)^(k−1),  k = 1, 2, ...
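The distribution P(k) = p(1 − p)^(k−1) is the geometric distribution, so pin distances for a representative netlist can be drawn with a simple loop. This is a sketch with hypothetical function names; how the talk's tool actually samples nets is not stated.

```python
import random


def link_probability(k, p):
    """P(k) = p * (1 - p)**(k - 1): probability of a distance-k link."""
    return p * (1.0 - p) ** (k - 1)


def sample_distance(p, rng):
    """Draw one pin-to-pin distance k >= 1 from the geometric distribution."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k


rng = random.Random(0)
distances = [sample_distance(0.5, rng) for _ in range(10)]
print(distances)
```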

MCF Interconnection Synthesis
• Integrate multiple wire styles into the MCF formulation
• Notations
  – Wire style parameter: (Pe, Ae), Pe = Pw + Ps
  – Area Ar: routing area on vertical and horizontal dimension
  – dj: communication demand for net j, dj = 1
  – Flow f(t): flow amount on a Steiner tree t

MCF Formulation: Energy Optimization

• Objective: min energy
• Subject to: routability constraint, routing area constraint

Experiment Settings
• Seven MCNC benchmark circuits
  – Technology mapped to 4-LUTs, each logic block contains 16 4-LUTs
  – Size of 10x10 to 11x11 switch boxes, 500 ~ 1000 nets
• Candidate topologies
  – Available segment length = 1, 2, 4, 8
  – Total number of candidate topologies: 93

            alu4    apex4   diffeq  dsip    ex5p    misex3  tseng
size        11x11   10x10   11x11   11x11   10x10   11x11   10x10
# of nets   621     798     945     593     745     771     788

Energy Optimization: Optimized FPGA Routing Architectures

Routing Area (um)   Energy (x10^3 pJ)   Energy Impv
1500                6.46                -
2500                5.24                19%
3500                4.74                27%
4500                4.63                28%

[Figure: optimized routing architectures for each routing area. Legend: RC 1x, RC 2x, RC 4x, T-Line 10x]
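The quoted improvements are consistent with measuring each energy against the smallest routing area (1500). That pairing is an assumption based on the order of the values on the slide; the sketch below checks it.

```python
# Routing area -> total energy in units of 10^3 pJ, as listed on the slide.
ENERGY_BY_AREA = {1500: 6.46, 2500: 5.24, 3500: 4.74, 4500: 4.63}
BASELINE_AREA = 1500  # assumed reference point for "Energy Impv"


def improvement_percent(area):
    """Energy reduction relative to the baseline routing area, in percent."""
    base = ENERGY_BY_AREA[BASELINE_AREA]
    return 100.0 * (base - ENERGY_BY_AREA[area]) / base


for area in (2500, 3500, 4500):
    print(area, round(improvement_percent(area)))
```

The diminishing step from 3500 to 4500 matches the flattening energy curves in the routing-area sweep.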

Energy Optimization: Impact of Routing Area

• Total energy of the 7 benchmarks with optimized FPGA routing architectures

[Figure: energy (x10^3 pJ, 1.2 to 4.7) versus routing area (um, 1500 to 4500) for alu4, apex4, diffeq, dsip, ex5p, misex3, tseng]

Interconnect Architecture
1. Wire Directions (M, Y, X, E)
2. Layout Region (M, D, Y, X)
3. Power Ground and Clock Distributions
4. Layer Assignment
5. Via Arrangement

Comparison
1. Wire Length
2. Throughput
3. Grid vs No-grid

1. Wire Directions and Models

[Figure: 7 by 7 meshes with different interconnect architectures. (a) A 7 by 7 mesh with Y-architecture. (b) A 7 by 7 mesh with Manhattan-architecture. (c) A 7 by 7 mesh with X-architecture.]

2. Layout Regions and Models

[Figure: meshes with symmetrical structures. (a) A level 2 hexagonal mesh. (b) A level 2 octagonal mesh. (c) A level 2 diamond mesh.]

Length of 2-pin nets to extend an area

Shape \ Length   Man.    Y-Arch  X-Arch  Euclidean
M: Diamond       1.250   1.118   1.066   1.016
Y: Hexagon               1.101
X: Octagon                       1.055
E: Circle        1.273   1.103   1.055   1.000
E (worst)        1.414   1.155   1.082   1.000
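The last two rows of the table can be reproduced from geometry. The sketch below assumes the net direction is uniformly distributed and that a net at angle θ between two adjacent routing directions (separated by angle α) is routed with length (sin(α − θ) + sin θ)/sin α per unit Euclidean length; averaging over θ gives 2(1 − cos α)/(α sin α), and the worst case at θ = α/2 gives 1/cos(α/2). This derivation is an assumption that happens to match the slide values, not something stated in the talk.

```python
import math

# Angle between adjacent routing directions for each architecture.
ARCHITECTURES = {
    "Manhattan": math.pi / 2,  # 0 and 90 degrees
    "Y-Arch": math.pi / 3,     # 0, 60 and 120 degrees
    "X-Arch": math.pi / 4,     # 0, 45, 90 and 135 degrees
}


def average_length_ratio(alpha):
    """Mean routed/Euclidean length over uniformly distributed directions."""
    return 2.0 * (1.0 - math.cos(alpha)) / (alpha * math.sin(alpha))


def worst_length_ratio(alpha):
    """Routed/Euclidean length for the worst direction (midway between two)."""
    return 1.0 / math.cos(alpha / 2.0)


for name, alpha in ARCHITECTURES.items():
    print(name, round(average_length_ratio(alpha), 3), round(worst_length_ratio(alpha), 3))
```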

Throughput: concurrent flow demand

Shape \ Throughput   Manhattan   Y-Arch   X-Arch*
M: Square            1.000       1.225    1.346
M (Bound)                        1.241    1.356
M: Diamond           1.195
Y: Hexagon                       1.315
X: Octagon                                1.420

* ratio of 0-90 planes and 45-135 planes is not fixed

Flow congestion map for uniform 90-degree meshes
[Figure: 12 by 12 and 13 by 13 meshes]

Congestion map of square chip using X-architecture
[Figure: 12 by 12 and 13 by 13 meshes]

Congestion map of square chip using Y-architecture

Explanation for Throughput Increase

[Figure: (a) 90-degree routing, (b) 45-degree routing, on a d by d region]

Number of lines across the vertical center cut-line:
• d/D for 90-degree routing
• √2·d/D for 45-degree routing
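One way to see the √2 factor is to count, per layer pair, how many wires cross a vertical cut. The sketch below assumes d is the cut length, D is the wire pitch measured perpendicular to the wires, and each architecture uses two routing layers; these interpretations and the function name are assumptions, not taken from the slide.

```python
import math


def lines_across_vertical_cut(d, pitch, layer_angles_deg):
    """Count wires crossing a vertical cut of length d.

    A layer whose wires run at angle a from the horizontal, with
    perpendicular pitch `pitch`, crosses the cut d * |cos(a)| / pitch times.
    """
    return sum(
        d * abs(math.cos(math.radians(angle))) / pitch
        for angle in layer_angles_deg
    )


# 90-degree routing: only the horizontal layer crosses the cut.
manhattan = lines_across_vertical_cut(100.0, 1.0, (0, 90))
# 45-degree routing: both diagonal layers cross the cut.
diagonal = lines_across_vertical_cut(100.0, 1.0, (45, 135))

print(round(manhattan, 2), round(diagonal, 2), round(diagonal / manhattan, 3))
```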

Global Grids (Power/Ground Mesh)

(http://www.xinitiative.org/img/062102forum.pdf)

X-Architecture Y-Architecture

3. Clock Tree on Square Mesh
• N-level clock tree:
  – path distance: 21% less than H-tree
  – total wire length: 9% less than H-tree, 3% less than X-tree
• No self-overlapping between parallel wire segments

4. Layer Assignment

[Figure: different routing direction assignments I, II, III, IV for Layer 1 through Layer 4]

Normalized throughput of mixed 45-degree and 90-degree mesh with different routing layer assignments

N    z(I)   z(II)  z(III)  z(IV)
5    1.02   0.83   0.83    1.01
6    0.97   0.73   0.74    0.97
7    0.94   0.71   0.71    0.93
8    0.90   0.69   0.69    0.90

Why Does Interleaving Manhattan and Diagonal Layers Improve Throughput?

The shortest path between two points on the plane is always a concatenation of a Manhattan line and a diagonal line.

Example: from (2,0) to (0,3)
• Manhattan-only wirelength = 5.0
• Manhattan + diagonal wirelength = 3.82
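The example can be checked numerically. With 0, 45, 90 and 135 degree wires, the shortest route covers the smaller coordinate difference diagonally and the remainder along one axis, giving max(|dx|, |dy|) + (√2 − 1)·min(|dx|, |dy|). The function names below are illustrative.

```python
import math


def manhattan_length(p, q):
    """Shortest wirelength using only horizontal and vertical segments."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def octilinear_length(p, q):
    """Shortest wirelength using one diagonal and one Manhattan segment."""
    dx, dy = abs(p[0] - q[0]), abs(p[1] - q[1])
    diagonal = math.sqrt(2.0) * min(dx, dy)  # 45-degree segment
    straight = max(dx, dy) - min(dx, dy)     # remaining axis-parallel segment
    return diagonal + straight


a, b = (2, 0), (0, 3)
print(manhattan_length(a, b), round(octilinear_length(a, b), 3))
```

Since every such route needs a layer change between the diagonal and the Manhattan segment, placing those layers next to each other keeps the connecting vias short.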

Observations

• Routing direction assignment strategies can affect the communication throughput.
• Interleaving the Manhattan routing layers and diagonal routing layers can produce better throughput.

5. Via Arrangement: Banks and Tunnels
• Use tunnels to detour around vias
• Use banks of tunnels to maximize the throughput
• Use bottom k layers to perform intra-cell routing
• Use top n-k layers to distribute signals to the banks

Via-Oriented Interconnect Planning

[Figure: tunnel]

Bank of tunnels
• Full bandwidth
• #vias = kL
• Overhead = k+2 vertical tracks
• L: dimension of the bank

Tunnel of Y Arch.
• Blocking 5 tracks on the layer of 60-degree direction

Tunnels of Y Arch.

3.2 Via-Oriented Interconnect Planning

Bank of tunnels
• #vias = c1·k·L
• Overhead = k + c2 tracks

Conclusion
• Global Interconnect Technologies
  – EM waves + Devices
• Prefix Adder Synthesis
  – Formulation + ILP
• FPGA Interconnect Architecture
  – Formulation + LP
• Interconnect Architecture
  – Lambda Geometry + Vias

Thank you! Q & A
