On-Chip Interconnect Trend and Design Optimization
Chung-Kuan Cheng, UC San Diego, La Jolla, CA
Outline
• Global Interconnect Technologies
– RC Trees and Transmission Lines
• Prefix Adder Synthesis
– Modeling
• FPGA Interconnect Architecture
– Modeling
• Interconnect Architecture
– Non-Manhattan Wire Arrangement
Interconnect Technologies
• Introduction
• On-Chip Global Interconnection
• Global Wire Modeling
• Performance Comparison
Introduction – Performance Impact
• Interconnect delay determines the system performance [ITRS08]
– 542ps for 1mm minimum pitch Cu global wire w/o repeater @ 45nm
– ~150ps for 10 level FO4 delay @ 45nm
• [Ho2001] “Future of Wire”
Introduction – Power Dissipation
• Interconnects consume a significant portion of power
– 1-2 orders of magnitude larger than gates
– Half of the dynamic power is dissipated in repeaters inserted to minimize latency [Zhang07]
– Wires consume 50% of total dynamic power for a 0.13um microprocessor [Magen04]
– About 1/3 is burned on the global wires.
Introduction – Technology Trend
• On-Chip Interconnect Scaling
– Dimension shrinks
– Wire resistance increases -> RC delay
– Increasing capacitive coupling -> delay, power, noise, etc.
– Performance of global wires decreases w/ technology scaling.
Wire Category   Parameter      90nm     45nm     22nm
M1 Wire         Rw (kohm/mm)   1.914    8.860    34.827
M1 Wire         Cw (pF/mm)     0.183    0.157    0.129
Global Wire     Rw (kohm/mm)   0.532    2.970    11.000
Global Wire     Cw (pF/mm)     0.205    0.179    0.151
Table: Scaling trend of per-unit-length (PUL) wire resistance and capacitance
Figure: Copper resistivity versus wire width
Organization of On-Chip Global Interconnections
Multi-Dimensional Design Consideration
• Preliminary analysis results assuming 65nm CMOS process.
• Application-oriented choice
– Low Latency: T-TL or UT-TL -> Single-Ended T-lines
– High Throughput: R-RC
– Low Power: PE-TL or UE-TL -> Differential T-lines
– Low Noise: PE-TL or UE-TL -> Differential T-lines
– Low Area/Cost: R-RC
• For each architecture, the more area the pentagon covers, the better the overall performance.
On-Chip Global Interconnect Schemes (1)
Repeated RC wires (R-RC); Un-Terminated and Terminated T-Line (UT-TL and T-TL)
• R-RC structure
– Repeater size/Length of segments
– Adopt previous design methodology [Zhang07]
• UT-TL structure
– Full swing at wire-end
– Tapered inverter chain as TX
• T-TL structure
– Optimize eye-height at wire-end
– Non-Tapered inverter chain as TX
On-Chip Global Interconnect Schemes (2)
Un-Equalized and Passive-Equalized T-Line (UE-TL and PE-TL)
• Driver side: Tapered differential driver
• Receiver side: Termination resistance, Sense-Amplifier (SA) + inverter chain
• Passive equalizer: parallel RC network
• Design Constraint: enough eye-opening (50mV) needed at the wire-end
Effects of driver impedance and termination resistance on step response
• Larger driver impedance leads to a slower rising edge and a lower saturation voltage
• Larger termination resistance causes a sharper rising edge but with larger reflection
• Optimal Rload
– Bit-rate: 50Gbps
– Rs=11.06ohm, Rd=350ohm, Cd=0.38pF, RL=107.69ohm
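The two trends on this slide can be illustrated with first-order resistive relations. The following is a minimal sketch, assuming an ideal line with characteristic impedance `Z0` and a DC wire resistance `R_WIRE_DC`; both values are hypothetical and do not come from the slides. Only `RS` and `RL` are the slide's values.

```python
def reflection_coefficient(r_load, z0):
    """Voltage reflection coefficient seen at the far-end termination."""
    return (r_load - z0) / (r_load + z0)


def saturation_voltage(vdd, r_driver, r_wire_dc, r_load):
    """Final (DC) voltage at the termination: a plain resistive divider."""
    return vdd * r_load / (r_driver + r_wire_dc + r_load)


Z0 = 100.0         # assumed characteristic impedance in ohm (hypothetical)
R_WIRE_DC = 50.0   # assumed DC wire resistance in ohm (hypothetical)
RS, RL = 11.06, 107.69  # driver and termination resistance from the slide

if __name__ == "__main__":
    for r_load in (RL, 200.0, 400.0):
        gamma = reflection_coefficient(r_load, Z0)
        print(f"RL = {r_load:7.2f} ohm -> reflection coefficient {gamma:+.3f}")
    for r_driver in (RS, 50.0, 100.0):
        v_sat = saturation_voltage(1.0, r_driver, R_WIRE_DC, RL)
        print(f"Rs = {r_driver:7.2f} ohm -> saturation voltage {v_sat:.3f} V")
```

Once the termination exceeds the line impedance, raising it further increases the reflection, while a larger driver resistance lowers the divider output. This is consistent with the qualitative statements above, but it is not the simulation used in the talk.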
Global Wire Modeling – Single-Ended & Differential On-Chip T-lines
• Determine the bit rate
• Find the smallest wire dimensions that satisfy the eye constraint
• Notice PE-TL needs narrower wire -> Equalization helps to increase density.
• Orthogonal layers replaced by ground planes -> 2D cap extraction, accurate when loading density is high.
• Top-layer thick wires used -> dimensions are maintained as technology scales.
• LC-mode behavior dominant
Global Wire Modeling – RC wires and T-lines
• RC wire modeling
– Distributed Π model composed of wire resistance and capacitance
– Closed-form equations [Sim03] to calculate 2D wire capacitance
• T-line 2D-R(f)L(f)C parameter extraction
– 2D-C Extraction Template
– 2D-R(f)L(f) Extraction Template
• T-line Modeling
– R(f)L(f)C Tabular model -> Transient simulation to estimate eye-height.
– Synthesized compact circuit model [Kopcsay02] -> Study signal integrity issue.
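To make the distributed Π model concrete, here is a minimal sketch that computes the Elmore delay of an undriven, unloaded RC wire split into Π sections. The Elmore metric and the function name are illustrative choices, not the talk's method, which relies on circuit simulation. The per-millimetre values are the 45nm global-wire entries from the scaling table.

```python
def elmore_delay_pi_ladder(r_total, c_total, n_segments):
    """Elmore delay (seconds) to the far end of a wire modeled as n Pi sections.

    Each section has series resistance r_total/n and places half of its
    capacitance c_total/n at each of its two ends.
    """
    r_seg = r_total / n_segments
    c_seg = c_total / n_segments
    delay = 0.0
    for i in range(1, n_segments + 1):
        # Capacitance downstream of the i-th resistor: (n - i) interior nodes
        # carrying c_seg each, plus the half capacitor at the far end.
        downstream_c = (n_segments - i) * c_seg + c_seg / 2.0
        delay += r_seg * downstream_c
    return delay


if __name__ == "__main__":
    R_PER_MM = 2970.0      # ohm/mm, 45nm global wire (from the table)
    C_PER_MM = 0.179e-12   # F/mm, 45nm global wire (from the table)
    for length_mm in (1, 2, 5):
        d = elmore_delay_pi_ladder(R_PER_MM * length_mm, C_PER_MM * length_mm, 20)
        print(f"{length_mm} mm: {d * 1e12:.1f} ps")
```

For this model the result equals R·C/2 for any number of sections, so the delay grows quadratically with wire length. That quadratic growth is the reason repeaters are inserted in the R-RC scheme.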
Performance Analysis – Definitions
• Normalized delay (unit: ps/mm)
– Propagation delay includes wire delay and gate delay.
• Normalized energy per bit (unit: pJ/m)
– Bit rate is assumed to be the inverse of propagation delay for RC wires
• Normalized throughput (unit: Gbps/um)
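A sketch of the three definitions, inferred from the stated units rather than taken from the slides. The symbols $t_p$ (propagation delay), $L$ (wire length), $P$ (power), $f_{bit}$ (bit rate) and $w_{pitch}$ (wire pitch) are assumptions introduced here for illustration.

```latex
\begin{aligned}
d_{norm} &= \frac{t_p}{L} && \text{(ps/mm)} \\
E_{norm} &= \frac{P}{f_{bit}\, L} && \text{(pJ/m)} \\
T_{norm} &= \frac{f_{bit}}{w_{pitch}} && \text{(Gbps/um)} \\
f_{bit} &= \frac{1}{t_p} && \text{(RC wires only, as assumed above)}
\end{aligned}
```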
Performance Analysis – Latency
Variables: technology-defined parameters
• Supply voltage: Vdd (unit: V)
• Dielectric constant: $\varepsilon_r$
• Min-sized inverter FO4 delay (unit: ps)
R-RC structure (min-d)
• Scaling assumptions: $c_{nmos} \propto 1/S$, $r_w \propto S^2$, $c_w \propto \varepsilon_r$
• … is roughly constant
• FO4 delay scales w/ scaling factor S ($\propto 1/S$)
• Increasing w/ technology scaling!
T-line structures
• Sum of wire delay and TX delay
– Wire delay
– TX delay improved w/ FO4 delay
• Decreasing w/ technology scaling!
Performance Analysis – Energy per Bit
Same variables as defined before
R-RC structure (min-d)
• Vdd reduces as technology scales
• $\varepsilon_r$ reduces as technology scales
• Energy decreases w/ technology scaling!
T-line structures
• Sum of power consumed on wire and TX
– Power of T-line: scales with $V_{DD}^2$ (… constant!)
– Power of TX circuit: $f C V_{DD}^2$
• FO4 delay reduces exponentially
• Energy decreases w/ larger slope!!
Performance Analysis – Throughput
Same variables as defined before
R-RC structure (min-d)
• Assuming wire pitch scales as $1/S$
• FO4 delay reduces exponentially
• Throughput increases by 20% per generation!
T-line structures
• TX bandwidth
• Neglect the minor change of wire pitch
• K1 = 0, for UT-TL
• FO4 delay reduces exponentially
• Throughput increases by 43% per generation!!
Design Framework for On-Chip T-line Schemes
• Proposed framework can be applied to design UT-TL/T-TL/UE-TL/PE-TL by changing wire configuration and circuit structure.
• Different optimization routines (LP/ILP/SQP, etc.) can be adopted according to the problem formulation.
Experimental Settings
• Design objective: min-d
• Technology nodes: 90nm-22nm
• Five different global interconnection structures
• Wire length: 5mm
• Parameter extraction
– 2D field solver CZ2D from EIP tool suite of IBM
– Tabular model or synthesized model
• Transistor models
– Predictive transistor model from [Uemura06]
– Synopsys level 3 MOSFET model tuned according to ITRS roadmap
• Simulation
– HSPICE 2005
• Modeling and Optimization
– Linear or non-linear regression/SQP routine
– MATLAB 2007
Performance Metric: Normalized Delay – Results and Comparison
• Technology trends
– R-RC ↑
– T-line schemes ↓
• T-line structures
– Outperform R-RC beyond 90nm
– Single-ended: lowest delay
• At 22nm node
– R-RC: 55ps/mm
– T-lines: 8ps/mm (85% reduction)
– Speed of light: 5ps/mm
• Linear model: < 6% average percent error
Performance Metric: Normalized Energy per Bit – Results and Comparison
• Technology trends
– R-RC and T-lines ↓
– T-lines reduce more quickly
• T-line structures
– Outperform R-RC beyond 45nm
– Differential: lowest energy
– Single-ended similar to R-RC; T-TL > UT-TL
• At 22nm node
– R-RC: 100pJ/m
– Single-ended: 60% reduction
– Differential: 96% reduction
• Linear model: < 12% average percent error
– Error for T-TL and PE-TL comes from RL and passive equalizers.
Performance Metric: Normalized Throughput – Results and Comparison
• Technology trends
– R-RC and T-lines ↑
– T-lines increase more quickly
• T-line structures
– Outperform R-RC beyond 32nm
– Differential better than single-ended
• At 22nm node
– R-RC: 12Gbps/um
– T-TL: 30% improvement
– UE-TL: 75% improvement
– PE-TL: ~2X of R-RC
• Linear model: < 7% average percent error
Signal Integrity – single-ended T-lines
• Worst-case switching pattern for peak noise simulation
• UT-TL structure
– 380mV peak noise at 1V supply voltage w/ 7ps rise time
– SI could be a big issue as supply voltage drops
• T-TL less sensitive to noise
– At the same rise time, ~50% reduction of peak noise
– Peak noise ↓ as technology scales
• Figure panels: using w.c. pattern; using single or multiple PRBS patterns
Signal Integrity – differential T-lines
• Worst-case switching pattern for peak noise simulation
• More reliable
– Termination resistance
– Common-mode noise reduction
• Peak noise
– Within ~10mV range
• Eye-Heights
– UE-TL: eye reduces as bit rate ↑; harder to meet constraint.
– PE-TL: > 70mV eye even at 22nm node; equalization does help!
Summary (cont'd)
Item in the table: score/value. Score: the higher, the better in terms of given metric, max. score is 5. The best structure in each column marked using red color.

Low-Latency Application (ps/mm)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     3/35    1/42    1/46    1/55    1/55
UT-TL    5/15    5/13    5/10    5/9     5/8
T-TL     5/15    5/13    5/10    5/9     5/8
UE-TL    1/37    3/25    3/16    3/12    5/8
PE-TL    1/37    3/25    3/16    3/12    5/8

High-Throughput Application (Gbps/um)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     5/5     5/6     3/8     3/10    2/12
UT-TL    2/3.3   1/3.3   1/3.3   1/3.3   1/3.3
T-TL     1/3     2/3.4   2/6     2/9     3/16
UE-TL    3/3     3/5     4/9     4/13    4/21
PE-TL    4/4     4/5.3   5/9     5/15    5/24

Low-Energy Application (pJ/m)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     2/150   2/140   1/130   1/100   1/100
UT-TL    3/140   3/110   3/70    3/50    2/40
T-TL     1/260   1/200   2/100   2/60    3/40
UE-TL    4/60    4/36    4/20    4/10    5/4
PE-TL    5/26    5/16    5/8     5/5     5/2

Low-Noise Application (score only)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     1       1       1       1       1
UT-TL    1       1       1       1       1
T-TL     3       3       3       3       3
UE-TL    5       5       4       4       4
PE-TL    4       4       5       5       5
Summary of Global Interconnect
• Compare five different global interconnections in terms of latency, energy per bit, throughput and signal integrity from 90nm to 22nm.
• A simple linear model is provided to link
– Architecture-level performance metrics
– Technology-defined parameters
• Some observations from experimental results
– T-line structures have potential to replace R-RC at future nodes
– Differential T-lines are better than single-ended: Low-power/High-throughput/Low-noise
– Equalization could be utilized for on-chip global interconnection: higher throughput density, improved signal integrity, even w/ lower energy dissipation (passive equalization)
Prefix Adder Synthesis
• Motivation
• Prefix Adder Formulation
– Area/Timing/Power Models
– Mixed-Radix (2,3,4) Adders
– ILP Formulation
• Experimental Results
Motivation: Prefix Adder
• Increasing impact of physical design and concern of power.
• Logical Levels, Wire Tracks, Fanouts
• Area: Physical placement, Detail routing
• Timing: Gate Cap, Wire Cap, Gate sizing, Buffer insertion, Signal slope, Input arrival time, Output required time
• Power: Static power, Dynamic power, Power gating, Activity Probability
Prefix Adder Formulation
• Input: two n-bit binary numbers $a_{n-1} \ldots a_1 a_0$ and $b_{n-1} \ldots b_1 b_0$, one bit carry-in $c_0$
• Output: n-bit sum $s_{n-1} \ldots s_1 s_0$ and one bit carry out $c_n$
• Prefix Addition: Carry generation & propagation
– Generate: $g_i = a_i b_i$
– Propagate: $p_i = a_i \oplus b_i$
– $c_{i+1} = g_i + p_i c_i$
– $s_i = a_i \oplus b_i \oplus c_i$
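The generate/propagate recurrence can be checked bit by bit. The sketch below evaluates it serially (a ripple evaluation); the function name and the integer encoding of the operands are illustrative choices, not part of the slides.

```python
def add_with_generate_propagate(a, b, c0, n):
    """Add two n-bit integers using g_i = a_i b_i, p_i = a_i xor b_i.

    Returns (sum mod 2**n, carry out c_n).
    """
    carry = c0
    total = 0
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        g_i = a_i & b_i                 # generate
        p_i = a_i ^ b_i                 # propagate
        total |= (p_i ^ carry) << i     # s_i = a_i xor b_i xor c_i
        carry = g_i | (p_i & carry)     # c_{i+1} = g_i + p_i c_i
    return total, carry


if __name__ == "__main__":
    print(add_with_generate_propagate(11, 6, 0, 4))
```

Because each carry depends on the previous one, this evaluation takes n sequential steps. The prefix formulation exists to shorten that dependency chain.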
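Prefix Addition – Formulation
• Pre-processing: $g_i = a_i b_i$, $p_i = a_i \oplus b_i$
• Prefix Computation:
– $G_{[i:k]} = G_{[i:j]} + P_{[i:j]} G_{[j-1:k]}$
– $P_{[i:k]} = P_{[i:j]} P_{[j-1:k]}$
• Post-processing:
– $c_{i+1} = G_{[i:0]} + P_{[i:0]} c_0$
– $s_i = p_i \oplus c_i$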
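The three stages can be sketched end to end as follows. The prefix schedule used here is Kogge-Stone, picked only because it is short to write; the talk instead searches over prefix structures with an ILP. Function names are illustrative.

```python
def gp_combine(left, right):
    """Combine (G, P)[i:j] (left) with (G, P)[j-1:k] (right) into (G, P)[i:k]."""
    g_l, p_l = left
    g_r, p_r = right
    return (g_l | (p_l & g_r), p_l & p_r)


def prefix_add(a, b, c0, n):
    """Add two n-bit integers with a Kogge-Stone prefix network."""
    # Pre-processing: per-bit generate and propagate.
    bits = [((a >> i) & 1, (b >> i) & 1) for i in range(n)]
    gp = [(a_i & b_i, a_i ^ b_i) for a_i, b_i in bits]
    p_bits = [p for _, p in gp]

    # Prefix computation: after the stage with a given span, entry i covers
    # the interval [i : max(0, i - 2*span + 1)].
    span = 1
    while span < n:
        gp = [gp_combine(gp[i], gp[i - span]) if i >= span else gp[i]
              for i in range(n)]
        span *= 2

    # Post-processing: c_{i+1} = G[i:0] + P[i:0] c0, s_i = p_i xor c_i.
    carries = [c0] + [g | (p & c0) for g, p in gp]
    total = 0
    for i in range(n):
        total |= (p_bits[i] ^ carries[i]) << i
    return total, carries[n]


if __name__ == "__main__":
    print(prefix_add(40000, 30000, 1, 16))
```

The combine step only works when the two intervals are adjacent, which is the property the later ILP constraints have to enforce.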
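Prefix Adder – Prefix Structure Graph
Figure: prefix structure graph over bit positions 4, 3, 2, 1 producing the intervals 1, 2:1, 3:1, 4:1.
• gp generator (Pre-processing): inputs $a_i$, $b_i$; outputs $g_i$, $p_i$
• GP cell (Prefix Computation): combines GP[i, j] and GP[j-1, k] into GP[i, k]
• sum generator (Post-processing): inputs $p_i$, G[i:0]; output $s_i$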
Area Model
• Distinguish physical placement from logical structure, but keep the bit-slice structure.
Figure: logical view (bit position 1 to 8 vs. logical level) and physical view (bit position 1 to 8 vs. physical level) of the same prefix graph, showing compact placement.
Timing Model
• Cell delay calculation: $d = f + p$
– Effort delay: $f = g h$
– Intrinsic delay: $p$
– Logical effort: $g$
– Electrical effort: $h$ = Cout/Cin = (fanouts + wirelength) / size
• Logical effort and intrinsic delay are intrinsic properties of the cell
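A minimal sketch of the cell delay calculation. The function names are illustrative, and expressing fanout and wire length in the same capacitance unit as the cell size is an assumption made here so the ratio is dimensionless.

```python
def electrical_effort(fanouts, wirelength, size):
    """h = Cout / Cin = (fanouts + wirelength) / size.

    Assumes fanout load and wire load are already expressed in the same
    capacitance unit as the cell size (hypothetical normalization).
    """
    return (fanouts + wirelength) / size


def cell_delay(logical_effort, h, intrinsic_delay):
    """d = f + p with effort delay f = g * h."""
    return logical_effort * h + intrinsic_delay


if __name__ == "__main__":
    # Unit-size cell with g = 1, p = 1 driving four unit loads and no wire.
    h = electrical_effort(fanouts=4.0, wirelength=0.0, size=1.0)
    print(cell_delay(1.0, h, 1.0))
```

Because wire length enters the electrical effort, the physical placement of a GP cell changes its delay even when the logical structure is unchanged.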
Power Model
• Total power consumption: Dynamic power + Static power
• Static power: leakage current of device
– Psta = (leakage per cell) * #cells
• Dynamic power: current switching capacitance
– Pdyn = (switching probability) * Cload
• The switching probability depends on j (j is the logical level*)
• Ptotal = Pdyn + Psta = (switching probability at level j) * Cload + (leakage per cell) * #cells
* Vanichayobon S. et al., “Power-speed Trade-off in Parallel Prefix Circuits”
Interval Adjacency Constraint
Figure: 8-bit prefix graph over columns 1 to 8 with outputs H1 to H8; nodes are labeled (column id, logic level).
• (7,3): Interval [7,1]
• (3,2): Interval [3,1]
• (7,2): Interval [7,4]
• The two merged intervals must be adjacent, i.e. 4 = 3 + 1
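The adjacency rule can be stated as a small check. This is an illustrative sketch (the function name and tuple encoding are not from the slides) that merges a left interval [i, j] with a right interval [j-1, k] and rejects non-adjacent pairs.

```python
def merge_intervals(left, right):
    """Merge left = (i, j) and right = (j - 1, k) into (i, k).

    Intervals are written (left bound, right bound) with left >= right,
    matching the slide's [7,4] and [3,1] notation.
    """
    i, j = left
    right_high, k = right
    if j != right_high + 1:
        raise ValueError(f"intervals {left} and {right} are not adjacent")
    return (i, k)


if __name__ == "__main__":
    # Node (7,3) merges node (7,2) = [7,4] with node (3,2) = [3,1].
    print(merge_intervals((7, 4), (3, 1)))
```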
Linearization for Interval Adjacency Constraint
Figure: node (i, j) with left input (i, h) over wire wl and right-input candidates (k1, l1), (k2, l2) over wires wr1, wr2.
• Intervals: $[y^L_{(i,j)}, y^R_{(i,j)}]$, $[y^L_{(i,h)}, y^R_{(i,h)}]$, $[y^L_{(k1,l1)}, y^R_{(k1,l1)}]$, $[y^L_{(k2,l2)}, y^R_{(k2,l2)}]$
• Constraint: $y^R_{(i,h)} = y^L_{(k,l)} + 1$ if $wl(i,j,h) = 1$ and $wr(i,j,k,l) = 1$
• Left interval bound equal to column index: $y^L_{(i,j)} = i$
• Pseudo linear: $y^R_{(i,h)} = k + 1$ if $wl(i,j,h) = 1$ and $wr(i,j,k,l) = 1$
• Linearize: two inequalities in $y^R_{(i,h)}$, $k$, $wl(i,j,h)$, $wr(i,j,k,l)$ and $n$
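One standard way to linearize such a conditional equality is a big-M pair, sketched below with the bit width $n$ as the constant, since every interval bound lies between 1 and $n$. This is an assumed form for illustration and may differ from the exact inequalities on the slide.

```latex
\begin{aligned}
y^R_{(i,h)} &\le k + 1 + n\,\bigl(2 - wl(i,j,h) - wr(i,j,k,l)\bigr) \\
y^R_{(i,h)} &\ge k + 1 - n\,\bigl(2 - wl(i,j,h) - wr(i,j,k,l)\bigr)
\end{aligned}
```

When both wire variables equal 1, the two inequalities collapse to $y^R_{(i,h)} = k + 1$; when either is 0, the slack of at least $n$ makes both inequalities inactive.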
ILP Formulation Overview
• Structure variables: GP cells, Connections (wires), Physical positions
• Capacitance variables: Gate cap, Vertical wire cap, Horizontal wire cap
• Timing variables: Input arrival time, Output arrival time
• Power Objective
• ILP solved with ILOG CPLEX -> Optimal Solution
Experiments – 16-bit Uniform Timing
Min-Power Radix-2 Adder (delay = 22, power = 45.5 FO4)
Figure: 16-bit prefix graph over bit positions 1 to 16.
Min-Power Radix-2&4 Adder (delay = 18, power = 29.75 FO4)
Figure: 16-bit prefix graph over bit positions 1 to 16, built from Radix-2 and Radix-4 cells.
Min-Power Mixed-Radix Adder (delay = 20, power = 28.0 FO4)
Figure: 16-bit prefix graph over bit positions 1 to 16, built from Radix-2, Radix-3 and Radix-4 cells.
Experiments – 64-bit Hierarchical Structure (Mixed-Radix)
• Handle high bit-width applications
• 16x4 and 8x8
Figure: two-level hierarchy over inputs a1 b1 … a64 b64.
• Level 1: four ILP blocks, one per 16-bit group (bits 1-16, 17-32, 33-48, 49-64), producing GP*[16:2] and GP*[1:1], GP*[32:18] and GP*[17:17], GP*[48:34] and GP*[33:33], GP*[64:50] and GP*[49:49]
• Level 2: one ILP block producing outputs H64 … H1
FPGA Global Routing Architecture
• Synthesis Flow
• Formulation
• Experimental Results
Synthesis Flow
Formulation
• Architecture Design Tradeoffs: Latency, Power, Area cost
FPGA Global Routing Architecture
Energy Model: Wires
• 0.18um tech node, grid length = 0.5mm
• 4 types of wires: RC wires with 1x, 2x and 4x spacing, and a transmission line (T-line 10x)
• Pw: Per-Bit Wire Energy
Figure: Energy (pJ/switch, 0 to 6) versus Wire Length (x Grid Length, 1 to 8) for RC 1x, RC 2x, RC 4x and T-line 10x.
Energy and Area Model: Switch Box
• Switch Area Model
– Fs: Number of switches connected to each wire entering a switch box
– f: Total flow entering a switch box
– Ns: Per-bit number of switches inside a switch box
– $N = N_s f = \tfrac{1}{2} F_s f$
• Energy Model
– Pu: energy of a single switch
– Ps: Per-bit switch energy
– $P = P_s f = P_u N_s f = \tfrac{1}{2} P_u F_s f$
Topology Generation
• Candidate topologies are required for MCF interconnection synthesis
– MCF optimizes flow distribution, but not topology
• A huge number of different topologies exists
– A row of 10 cells has 2^C(10, 2) = 2^45 different connections
– A 10x10 FPGA has (2^45)^20 = 2^900 different topologies!
• Our assumptions
– Each row and column has the same connection
– Wire lengths are given (e.g. wire length = 1, 2, 4, 8…)
– A certain wire length repeats itself till the end of the chip
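The counting argument can be reproduced in a few lines. This is an illustrative sketch; the function name is not from the slides, and the count is returned as a base-2 exponent to keep the numbers readable.

```python
from math import comb


def log2_topology_count(cells_per_line, rows, cols):
    """Base-2 exponent of the number of topologies.

    Every pair of cells on a row or column is either connected or not, and
    all rows and columns are chosen independently.
    """
    pairs_per_line = comb(cells_per_line, 2)
    return pairs_per_line * (rows + cols)


if __name__ == "__main__":
    print(comb(10, 2))                      # pairs in one row of 10 cells
    print(log2_topology_count(10, 10, 10))  # exponent for a 10x10 FPGA
```

The exponent of 900 is what motivates the restricting assumptions listed above.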
Representative Netlist Generation
• Properties of Representative Netlist
– Matches the size of the benchmark netlists
• Geometric Distribution Function
– The probability of the distance between two pins decreases exponentially as the distance increases
– k: distance between pins
– p: probability of distance-1 links
– P(k): probability of distance-k links
– $P(k) = p (1 - p)^{k-1}, \quad k = 1, 2, \ldots$
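A short sketch of the distance distribution and of drawing pin distances from it. The function names and the value p = 0.3 are illustrative; a real netlist generator on a finite grid would also have to truncate or reject distances larger than the chip, which is not modeled here.

```python
import random


def distance_probability(k, p):
    """P(k) = p * (1 - p)**(k - 1) for k = 1, 2, ..."""
    return p * (1.0 - p) ** (k - 1)


def sample_distance(p, rng):
    """Draw a pin-to-pin distance by repeated Bernoulli trials."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k


if __name__ == "__main__":
    rng = random.Random(0)
    p = 0.3  # hypothetical probability of distance-1 links
    samples = [sample_distance(p, rng) for _ in range(10000)]
    for k in range(1, 6):
        observed = samples.count(k) / len(samples)
        print(k, round(distance_probability(k, p), 4), round(observed, 4))
```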
MCF Interconnection Synthesis
• Integrate multiple wire styles into the MCF formulation
• Notations
– Wire style parameter: (Pe, Ae), Pe=Pw+Ps
– Area Ar: Routing area on vertical and horizontal dimension
– dj: Communication demand for net j, dj=1
– Flow f(t): flow amount on a Steiner tree t
MCF Formulation: Energy Optimization
• Obj: Min Energy
• Routability constr.
• Routing Area constr.
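One plausible way to write the three labeled parts with the notation of the previous slide is sketched below. The set $T_j$ of candidate Steiner trees for net $j$ and the per-edge track count $u_e$ are assumed symbols introduced here, so this should be read as an illustration rather than the talk's exact program.

```latex
\begin{aligned}
\min \quad & \sum_{t} f(t) \sum_{e \in t} P_e
  && \text{(energy)} \\
\text{s.t.} \quad & \sum_{t \in T_j} f(t) \ge d_j \quad \forall j
  && \text{(routability)} \\
& \sum_{t \ni e} f(t) \le u_e \quad \forall e, \qquad \sum_{e} A_e\, u_e \le A_r
  && \text{(routing area)} \\
& f(t) \ge 0 \quad \forall t
\end{aligned}
```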
Experiment Settings
• Seven MCNC benchmark circuits
– Technology mapped to 4-LUTs, each logic block contains 16 4-LUTs
– Size of 10x10 to 11x11 switch boxes, 500 ~ 1000 nets
• Candidate topologies
– Available segment length = 1, 2, 4, 8
– Total number of candidate topologies: 93

Circuit     alu4    apex4   diffeq  dsip    ex5p    misex3  tseng
size        11x11   10x10   11x11   11x11   10x10   11x11   10x10
# of nets   621     798     945     593     745     771     788
Energy Optimization: Optimized FPGA Routing Architectures
Routing Area (um)    1500    2500    3500    4500
Energy (x10^3 pJ)    6.46    5.24    4.74    4.63
Energy Impv          -       19%     27%     28%
Figure legend: RC 1x, RC 2x, RC 4x, T-Line 10x
Energy Optimization: Impact of Routing Area
• Total energy of the 7 benchmarks with optimized FPGA routing architectures
Figure: Energy (x10^3 pJ, 1.2 to 4.7) versus Routing Area (um, 1500 to 4500) for alu4, apex4, diffeq, dsip, ex5p, misex3 and tseng.
Interconnect Architecture
1. Wire Directions (M, Y, X, E)
2. Layout Region (M, D, Y, X)
3. Power Ground and Clock Distributions
4. Layer Assignment
5. Via Arrangement
Comparison
1. Wire Length
2. Throughput
3. Grid vs No-grid
1. Wire Directions and Models
Figure: 7 by 7 meshes with different interconnect architectures
(a) A 7 by 7 mesh with Y-architecture
(b) A 7 by 7 mesh with Manhattan-architecture
(c) A 7 by 7 mesh with X-architecture
2. Layout Regions and Models
Fig. 10 Meshes with symmetrical structures
(a) A level 2 hexagonal mesh
(b) A level 2 octagonal mesh
(c) A level 2 Diamond mesh
Length of 2-pin nets to extend an area
Shape        Man.    Y-Arch  X-Arch  Euclidean
M: Diamond   1.250   1.118   1.066   1.016
Y: Hexagon   -       1.101   -       -
X: Octagon   -       -       1.055   -
E: Circle    1.273   1.103   1.055   1.000
E (worst)    1.414   1.155   1.082   1.000
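The two "E" rows can be reproduced from geometry. The sketch below assumes routing directions spaced evenly by an angle of pi / n_directions (2 for Manhattan, 3 for Y, 4 for X) and decomposes a unit Euclidean segment onto the two nearest directions with the law of sines. The function names and the numerical averaging are illustrative choices.

```python
import math


def length_ratio(theta, n_directions):
    """Routed length / Euclidean length for a segment at angle theta."""
    alpha = math.pi / n_directions          # angle between adjacent directions
    t = theta % alpha                       # offset from the nearest direction
    return (math.sin(alpha - t) + math.sin(t)) / math.sin(alpha)


def average_ratio(n_directions, samples=100000):
    """Mean ratio over uniformly distributed angles (midpoint rule)."""
    alpha = math.pi / n_directions
    step = alpha / samples
    total = sum(length_ratio((s + 0.5) * step, n_directions)
                for s in range(samples))
    return total / samples


def worst_ratio(n_directions):
    """Ratio for a segment halfway between two adjacent directions."""
    return 1.0 / math.cos(math.pi / (2 * n_directions))


if __name__ == "__main__":
    for name, n in (("Manhattan", 2), ("Y-Arch", 3), ("X-Arch", 4)):
        print(f"{name}: average {average_ratio(n):.3f}, worst {worst_ratio(n):.3f}")
```

The averages correspond to the "E: Circle" row and the worst cases to the "E (worst)" row. The rows for diamond, hexagon and octagon regions depend on the region shape and are not covered by this sketch.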
Throughput: concurrent flow demand
Shape        Manhattan   Y-Arch   X-Arch*
M: Square    1.000       1.225    1.346
M (Bound)    -           1.241    1.356
M: Diamond   1.195       -        -
Y: Hexagon   -           1.315    -
X: Octagon   -           -        1.420
*ratio of 0-90 planes and 45-135 planes is not fixed
Flow congestion map for uniform 90 Degree meshes (panels: 12 by 12, 13 by 13)
Congestion map of square chip using X-architecture (panels: 12 by 12, 13 by 13)
Congestion map of square chip using Y-architecture
Explanation For Throughput Increasing
Figure: (a) 90-degree routing, (b) 45-degree routing, each on a region of side d.
Number of lines across the vertical center cut-line:
• $d/D$ for 90 degree routing
• $\sqrt{2}\,d/D$ for 45 degree routing
Global Grids (Power/Ground Mesh)
(http://www.xinitiative.org/img/062102forum.pdf)
Figure panels: X-Architecture, Y-Architecture
3. Clock Tree on Square Mesh
• N-level clock tree:
– path distance: 21% less than H-tree
– total wire length: 9% less than H-tree, 3% less than X-tree
• No self-overlapping between parallel wire segments
4. Layer Assignment
Figure: different routing direction assignments I, II, III and IV over Layer 1 to Layer 4.

N   z(I)   z(II)   z(III)   z(IV)
5   1.02   0.83    0.83     1.01
6   0.97   0.73    0.74     0.97
7   0.94   0.71    0.71     0.93
8   0.90   0.69    0.69     0.90
Table: Normalized throughput of mixed 45-degree and 90-degree mesh with different routing layer assignments
Why Does Interleaving Manhattan and Diagonal Layers Improve Throughput?
• The shortest path between two points on the plane is always a concatenation of a Manhattan line and a Diagonal line.
• Example: from (2,0) to (0,3)
– Manhattan routing: Wirelength = 5.0
– With a diagonal segment: Wirelength = 3.82
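The example can be recomputed directly. In this sketch the diagonal route covers the shared part of the two offsets at 45 degrees and finishes with a Manhattan segment; the function names are illustrative.

```python
import math


def manhattan_length(p, q):
    """Wirelength with horizontal and vertical segments only."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def manhattan_plus_diagonal_length(p, q):
    """Wirelength with one 45-degree segment followed by a Manhattan segment."""
    dx = abs(p[0] - q[0])
    dy = abs(p[1] - q[1])
    return math.sqrt(2.0) * min(dx, dy) + abs(dx - dy)


if __name__ == "__main__":
    a, b = (2, 0), (0, 3)
    print(manhattan_length(a, b))
    print(round(manhattan_plus_diagonal_length(a, b), 3))
```

A path like this changes direction between a diagonal layer and a Manhattan layer, so placing those layers next to each other keeps the connecting via short.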
Observations
• Routing Direction Assignment Strategies Can Affect the Communication Throughput.
• Interleaving the Manhattan Routing Layers and Diagonal Routing Layers can produce better Throughput
5. Via Arrangement: Banks and Tunnels
• Use tunnels to detour around vias
• Use banks of tunnels to maximize the throughput
• Use bottom k layers to perform intra-cell routing
• Use top n-k layers to distribute signals to the banks
Via-Oriented Interconnect Planning
Figure: tunnel.
Bank of tunnels
• Full bandwidth
• k+2 overhead
• #vias = kL
• Overhead = k+2 vertical tracks
• L: dimension of the bank
Tunnel of Y Arch.
• Blocking 5 tracks on the layer of 60-degree direction
Tunnels of Y Arch.
3.2 Via-Oriented Interconnect Planning
Bank of tunnels
• #vias = c1kL
• Overhead = k+c2 tracks
Conclusion
• Global Interconnect Technologies
– EM waves + Devices
• Prefix Adder Synthesis
– Formulation + ILP
• FPGA Interconnect Architecture
– Formulation + LP
• Interconnect Architecture
– Lambda Geometry + Vias
Thank you!
Q & A