Retiming & Pipelining over Global Interconnects
Jason Cong
Computer Science Department
University of California, Los Angeles
[email protected]
http://cadlab.cs.ucla.edu/~cong
Joint work with C. C. Chang, D. Pan*, and X. Yuan
* IBM Research
Motivation: How Far Can We Go in Each Clock Cycle
[Figure: regions of the die reachable in 1 to 7 clock cycles; distance axis marked at 0, 7.52, 15.04, 22.56, and 24.9 mm]
NTRS'97 0.07um technology
5 GHz across-chip clock
620 mm2 (24.9mm x 24.9mm)
IPEM BIWS estimations
Buffer size: 100x
Driver/receiver size: 100x
From corner to corner: 7 clock cycles
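The corner-to-corner figure can be checked with a short calculation. This is a minimal sketch, assuming Manhattan (horizontal plus vertical) routing and a reach of 7.52 mm per clock cycle, the first tick on the distance axis; the function name `cycles_to_cross` is hypothetical.

```python
import math

REACH_PER_CYCLE_MM = 7.52  # distance covered in one clock cycle (from the slide)
DIE_SIDE_MM = 24.9         # die is 24.9mm x 24.9mm (from the slide)


def cycles_to_cross(dx_mm, dy_mm, reach_mm=REACH_PER_CYCLE_MM):
    """Clock cycles needed to cover a Manhattan distance of dx + dy.

    Assumption: the wire is routed rectilinearly and every cycle covers
    at most reach_mm of wire.
    """
    return math.ceil((dx_mm + dy_mm) / reach_mm)


corner_to_corner = cycles_to_cross(DIE_SIDE_MM, DIE_SIDE_MM)
print(corner_to_corner)  # 7
```

Under these assumptions the result agrees with the 7 clock cycles quoted above.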
Solutions
Fully asynchronous designs
GALS (globally asynchronous, locally synchronous designs)
Latency-insensitive designs
Synchronous designs, with multi-cycle communications
• Much better understood
• Supported by the current tool set
• More energy efficient?
Interconnect-Centric IC Design Flow Under Development at UCLA
[Flow diagram from Design Specification to Final Layout]
Design Specification
Architecture/Conceptual-level Design
Interconnect Performance Estimation Models (IPEM)
• OWS, SDWS, BISWS
HDM abstraction: Structure view, Functional view, Physical view, Timing view
Synthesis and Placement under Physical Hierarchy
Interconnect Planning
• Physical Hierarchy Generation for Multi-Cycle Comm.
• Interconnect Architecture Planning
Interconnect Synthesis
• Topology generation & wire sizing for delay
• Wire ordering & spacing for noise control
Interconnect Optimization (TRIO)
• Topology Optimization with Buffer Insertion
• Wire sizing and spacing
• Simultaneous Buffer Insertion and Wire Sizing
• Simultaneous Topology Construction with Buffer Insertion and Wire Sizing
Interconnect Layout
• Route Planning
• Point-to-Point Gridless Routing
Final Layout
Physical Hierarchy Generation
[Figure: logical hierarchy mapped to a physical hierarchy; hard IP and soft modules, with the same color for modules of the same logic hierarchy]
Physical Hierarchy Generation Problem Formulation
Assign modules to physical hierarchy
Physical Hierarchy = Placement bins + module locations
Defines global interconnects
Optimization objectives:
• wire length minimization
• routing congestion minimization
• clock period, latency, performance (with consideration of multi-cycle comm.)
Need of Considering Retiming during Placement - Retiming/pipelining on global interconnects
Multiple clock cycles are needed to cross the chip
Proper placement allows retiming to hide global interconnect delays.
[Figure: two placements of nodes a, b, c, d; in both, d(v)=1, WL=6, d(e) ∝ WL]
Placement 1: before retiming, φ = 5.0; after retiming, φ = 3.0
Placement 2 (the better initial placement!!): before retiming, φ = 4.0; after retiming, φ = 4.0
Difficulties
How to consider retiming/pipelining over global interconnects
• Flip-flop boundaries are not fixed during placement, which makes static timing analysis difficult
• Answer: Use the concepts of c-retiming and sequential timing analysis (Seq-TA)
How to handle the high complexity of the combined problem
• Answer: Use the multi-level optimization technique
Simultaneous Coarse Placement with Retiming on Interconnects
Our solution
• Compute the labels of all nodes under c-retiming for a given placement solution and perform sequential timing analysis (Seq-TA)
• Minimize the longest sequential path by improving the placement solution
Alternative solution [Brayton, et al]
• Enforcing all loop constraints during placement
Static Timing Analysis (STA)
[Figure: sequential circuit with nodes a to g, and the DAG derived from it]
Sequential circuit example: PI: a, b. PO: g.
Transform the circuit into a DAG for static timing analysis
Topological order: a, b, g, f, c, d, e
The arrival time (AT) and required time (RT) of each node are computed in linear time.
Suppose d(v)=1, d(e)=2, and clock cycle φ = 11:
      a   b   g   f   c   d   e
AT:   1   1   3   3   3   6   9
RT:   9   9   11  9   3   6   9
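The AT/RT computation can be sketched as two passes over a topological order. The circuit figure is not recoverable here, so the edge list below is an assumption: it was chosen only because it reproduces the AT and RT values in the table. The code also assumes that a register output launches at time 0, that the wire delay d(e) is charged after a launching register and before a capturing one, and that the function name `static_timing` is hypothetical.

```python
from graphlib import TopologicalSorter

# Assumed edges (u, v, w): w is the number of flip-flops on the edge.
EDGES = [
    ("a", "c", 1), ("b", "c", 1), ("f", "c", 1),
    ("c", "d", 0), ("c", "e", 0), ("d", "e", 0),
    ("d", "f", 1), ("d", "g", 1), ("e", "g", 2),
]
PIS, POS = {"a", "b"}, {"g"}
D_V, D_E = 1, 2  # d(v) = 1, d(e) = 2 as on the slide


def static_timing(phi):
    """Return (AT, RT) dictionaries for clock period phi."""
    nodes = sorted({n for u, v, _ in EDGES for n in (u, v)})
    # Only register-free edges stay in the DAG; registered edges are cut.
    comb_preds = {n: set() for n in nodes}
    for u, v, w in EDGES:
        if w == 0:
            comb_preds[v].add(u)
    order = list(TopologicalSorter(comb_preds).static_order())

    at = {n: (D_V if n in PIS else float("-inf")) for n in nodes}
    for v in order:  # forward pass: arrival times
        for u, x, w in EDGES:
            if x == v:
                start = at[u] if w == 0 else 0  # register output launches at 0
                at[v] = max(at[v], start + D_E + D_V)

    rt = {n: (phi if n in POS else float("inf")) for n in nodes}
    for u in reversed(order):  # backward pass: required times
        for x, v, w in EDGES:
            if x == u:
                limit = rt[v] - D_V - D_E if w == 0 else phi - D_E
                rt[u] = min(rt[u], limit)
    return at, rt


at, rt = static_timing(11)
print(at)
print(rt)
```

Each pass here rescans the whole edge list for every node, which keeps the sketch short; the linear-time bound on the slide needs per-node adjacency lists instead.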
Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT)
Definition [Pan et al, TCAD98]
Given a clock period φ, transform circuit C into an edge-weighted, vertex-weighted graph G.
Label vertex v as l(v) = the weight of the longest path from PIs to v = max{ l(u) - φ·w(u,v) + d(u,v) + d(v) }; l(v) is also called SAT(v).
Edge weights in G: wl(a,c) = d(e(a,c)) - φ·w(a,c), wl(b,c) = d(e(b,c)) - φ·w(b,c)
Theorem: C can be retimed to φ + max{d(v)} iff l(POs) ≤ φ
Relation to retiming: r(v) = ⌈l(v) / φ⌉ - 1
Complexity is O(VE)
[Figure: nodes a and b driving node c, with vertex delays d(a), d(b), d(c) and register counts w(a,c)=1, w(b,c)=0]
Example: d(a) = d(b) = 1, d(a,c) = d(b,c) = 2, φ = 5, l(a) = 7, l(b) = 3
l(c) = max{7+2-5·1+1, 3+2+1} = 6
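The worked value of l(c) follows directly from the labeling rule. A minimal sketch: the function name `sat_label` and the tuple layout are hypothetical, and d(c) = 1 is read off the "+1" term in the worked line rather than stated on the slide.

```python
def sat_label(fanins, d_v, phi):
    """l(v) = max over fanins u of l(u) - phi*w(u,v) + d(u,v), plus d(v).

    fanins: list of (l_u, w_uv, d_uv) tuples, one per fanin edge.
    """
    return max(l_u - phi * w_uv + d_uv for l_u, w_uv, d_uv in fanins) + d_v


# Edge (a,c): l(a)=7, w=1, d=2.  Edge (b,c): l(b)=3, w=0, d=2.
l_c = sat_label([(7, 1, 2), (3, 0, 2)], d_v=1, phi=5)
print(l_c)  # 6
```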
Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT)
[Figure: sequential circuit with nodes a to g, its retiming graph (not a DAG) with edge weights 2, -2.5, and -7, and the retimed circuit]
d(v)=1, d(e)=2. Is φ = 4.5 possible?
Iter#   a   b   c      d     e     f    g
0       0   0   -∞     -∞    -∞    -∞   -∞
1       0   0   -1.5   -∞    -∞    -∞   -∞
2       0   0   -1.5   1.5   1.5   -∞   -∞
3       0   0   -1.5   1.5   4.5   0    0
4       0   0   -1.5   1.5   4.5   0    0
5       0   0   -1.5   1.5   4.5   0    0
Cycle time 4.5 is possible because l(g) ≤ 4.5
Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT) (cont'd)
[Figure: sequential circuit with nodes a to g and its retiming graph (not a DAG)]
d(v)=1, d(e)=2. Is φ = 2.5 feasible?
Iter#   a   b   c     d     e     f    g
0       0   0   -∞    -∞    -∞    -∞   -∞
1       0   0   0.5   -∞    -∞    -∞   -∞
2       0   0   0.5   3.5   3.5   -∞   -∞
3       0   0   0.5   3.5   6.5   4    4
Cycle time 2.5 is not feasible because l(g) > 2.5
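The label iteration behind the two tables can be sketched as a Bellman-Ford style relaxation. The circuit figure is not recoverable here, so the edge list below is an assumption, picked only because it reproduces both tables row by row; `sat_labels` is a hypothetical name. Each round uses the labels of the previous round, and the loop stops early once a PO label exceeds φ, since labels never decrease.

```python
import math

# Assumed edges (u, v, w): w is the number of flip-flops on the edge.
EDGES = [
    ("a", "c", 1), ("b", "c", 1), ("f", "c", 1),
    ("c", "d", 0), ("c", "e", 0), ("d", "e", 0),
    ("d", "f", 1), ("d", "g", 1), ("e", "g", 2),
]
NODES = "abcdefg"
PIS, POS = ("a", "b"), ("g",)


def sat_labels(phi, d_v=1.0, d_e=2.0):
    """Return (labels, feasible) for clock period phi."""
    label = {v: -math.inf for v in NODES}
    for pi in PIS:
        label[pi] = 0.0
    for _ in range(len(NODES)):  # |V| rounds are enough to converge
        new = dict(label)
        for u, v, w in EDGES:
            new[v] = max(new[v], label[u] - phi * w + d_e + d_v)
        if new == label:  # fixed point reached
            return label, all(label[po] <= phi for po in POS)
        label = new
        if any(label[po] > phi for po in POS):
            return label, False  # labels only grow, so phi is infeasible
    return label, False  # no fixed point: a positive cycle exists


labels_45, ok_45 = sat_labels(4.5)
labels_25, ok_25 = sat_labels(2.5)
print(ok_45, labels_45)
print(ok_25, labels_25)
```

With this assumed circuit, φ = 4.5 converges to the last row of the first table, and φ = 2.5 stops at the third row of the second table with l(g) = 4.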
Multi-Level Optimization Framework
[Figure: coarsening, followed by uncoarsening & refinement (optimization); problem sizes plotted against levels]
• Multi-level coarsening generates smaller problem sizes for top levels, so optimization on top levels is faster
• May explore different aspects of the solution space at different levels
• Gradual refinement on good solutions from coarser levels is very efficient
• Successful in many applications
  • Originally developed for PDE
  • Recent success in VLSI CAD: partitioning, placement, routing
Challenges
Previous Seq-TA can only handle single-output gates
In reality multi-output modules exist
• IP block, MUX, adders
• Clusters in the multi-level optimization process
How to integrate Seq-TA into multi-level coarse placement efficiently
Need to consider congestion and routability
Generalize c-retiming for Complex Combinational Modules
[Figure: complex module (combinational logic) with multiple outputs and non-uniform propagation delay; inputs vI0, vI1, vI2, outputs vO0, vO1, internal pin-to-pin delays 4, 11, 9, 3]
Reduce the non-uniform gate delay to a uniform gate delay by taking the max. internal delay as the gate delay: d'(v) = max { d(v(i, j)) }, here d'(v) = 11
l1-value labeling for each vertex
• l1(v) = weight of the longest path from PIs to v using d'(v) as uniform gate delay
• Each vertex has an l1-value label.
• Upper bound of the labeling
Flatten/decompose the complex module by treating each pin of the module as a vertex with zero delay.
l2-value labeling for each output of a vertex
• l2(v_ot) = weight of the longest path from PIs to output o_t of v
• Each output of a vertex has an l2-value label.
• Lower bound of the labeling
Properties of Generalized c-retiming for Complex Combinational Modules
Theorem: If ∃ a PO_t with l2(PO_t) > Φ, then the circuit cannot be retimed to a clock period of Φ.
Theorem: If for every PO_i, l1(PO_i) ≤ Φ, then the circuit can be retimed to a clock period less than Φ+k, where k is the max. input-output delay of all gates.
Theorem: For any module v and its out-pin v_ot, l2(v_ot) ≤ l1(v).
Theorem: Given a circuit C, let Φ be the min. clock period achieved by retiming on C. If C_c is derived from C by performing clustering, and the min. clock period achieved by retiming on C_c is Φ_c, then Φ ≤ Φ_c.
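The relation between the two labelings can be illustrated on a single module. This is a sketch under loud assumptions: the slide gives the four internal delays 4, 11, 9, 3 but not which pin pair each belongs to, so the mapping in `PIN_DELAY` is hypothetical, as are the input labels in `INPUT_LABEL`; wire delays and register weights are taken as already folded into those input labels.

```python
# Hypothetical pin-to-pin delays (input pin, output pin) -> delay.
PIN_DELAY = {
    ("vI0", "vO0"): 4,
    ("vI1", "vO0"): 11,
    ("vI1", "vO1"): 9,
    ("vI2", "vO1"): 3,
}
# Hypothetical labels already computed at the module inputs.
INPUT_LABEL = {"vI0": 5, "vI1": 0, "vI2": 2}


def l1_label():
    """Upper bound: one label per module, using d'(v) = max internal delay."""
    d_prime = max(PIN_DELAY.values())
    return max(INPUT_LABEL.values()) + d_prime


def l2_label(out_pin):
    """Lower bound: one label per output pin, using exact pin-to-pin delays."""
    return max(
        INPUT_LABEL[i] + d for (i, o), d in PIN_DELAY.items() if o == out_pin
    )


print(l1_label())                          # 16
print(l2_label("vO0"), l2_label("vO1"))    # 11 9
```

In this instance both l2 values stay below the single l1 value, which is the inequality l2(v_ot) ≤ l1(v).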
Integrate Seq-TA with a Multi-level SA-based Coarse Placement
In the coarsening phase, FFs can only be clustered after a certain level k
From level Ln to Lk+1, perform static timing analysis (where FFs are clustered)
From level Lk to L0, perform Seq-TA (where FFs are not clustered)
[Figure: levels L0 ... Lk ... Ln; initial placement, then refinement by timing-driven SA-based coarse placement]
Area Density Problems in Multi-level Coarse Placement
Traditional area density control:
• Cell area in each bin < bin area utilization, with a small percentage of overflow
• Does not work when cluster sizes vary significantly and may be bigger than a bin
How about using different grid sizes for different levels of clustering?
• Hard to find fixed percentages that work
• Significant placement cost jump when switching grid sizes
Hierarchical Area Density Control
Use the same grid structure for placement for all clustering levels
Impose hierarchy on bin structure for area density control
Each cluster move must satisfy the area constraints on each level in the bin hierarchy
Area constraint for moving a cell of size A:
• Allowed overflow on each level in the bin hierarchy = kA, where k is a small constant (usually 1 or 2)
Works well in the multi-level framework:
• Area constraints are gradually tightened during optimization
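One way to picture the constraint is a quad hierarchy of bins in which a move is accepted only if every enclosing bin stays within capacity plus kA. The sketch below is an illustration under assumptions, not the authors' implementation: the class `BinHierarchy`, the 4x capacity growth per level, and the coordinate-shift grouping are all hypothetical choices.

```python
from collections import defaultdict


class BinHierarchy:
    """Quad hierarchy over a bin grid: level h groups 2^h x 2^h base bins."""

    def __init__(self, levels, bin_capacity, k=1):
        self.levels = levels
        self.bin_capacity = bin_capacity  # usable area of one base bin
        self.k = k                        # allowed overflow is k * A
        self.used = [defaultdict(float) for _ in range(levels)]

    def _bins(self, x, y):
        # The enclosing bin of base bin (x, y) at every level.
        for h in range(self.levels):
            yield h, (x >> h, y >> h)

    def can_place(self, x, y, area):
        for h, key in self._bins(x, y):
            capacity = self.bin_capacity * 4 ** h
            if self.used[h].get(key, 0.0) + area > capacity + self.k * area:
                return False
        return True

    def place(self, x, y, area):
        if not self.can_place(x, y, area):
            return False
        for h, key in self._bins(x, y):
            self.used[h][key] += area
        return True


grid = BinHierarchy(levels=2, bin_capacity=10.0, k=1)
print(grid.place(0, 0, 8), grid.place(0, 0, 8), grid.place(0, 0, 8))
```

Because the allowed overflow scales with the moved area A, a large cluster (even one bigger than a base bin) can still be placed at coarse levels, while small cells at fine levels face a tight bound.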
Fast Incremental A-tree Routing for Multi-pin Nets
Simple incremental A-tree
• Recursively quad-partition grids
• Each pin recursively connects to the lower-left corner of each level of partition
For a net with bounding box length B, at most 2·log B edge updates for each pin move, except the root.
Each edge is routed by the LZ-router
[Figure: first quadrant with the root (source pin)]
Fast LZ-routing for Two-pin Connections
Decide HVH or VHV: select the less congested layer
Binary search on V-stem (or H-stem)
• Initial left region and right region to cover the bounding box
• Repeat
  • Query wire usage on both regions
  • Select the region with less congestion
Wire usage query can be done in O(log grid_size)
[Figure: left region and right region of the bounding box; HVH and VHV routes]
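The binary search on the V-stem can be sketched in one dimension. This simplification is an assumption: congestion is reduced to a per-column usage array, queries use prefix sums rather than the O(log grid_size) structure mentioned above, "less congestion" is taken to mean lower average usage, and `pick_v_stem` is a hypothetical name.

```python
from itertools import accumulate


def pick_v_stem(col_usage, x0, x1):
    """Choose a column in [x0, x1] for the vertical stem of an HVH route.

    col_usage[x] is the assumed vertical wire usage of column x inside the
    net's bounding box. The search halves the range, keeping the half with
    the lower average usage (greedy, not guaranteed globally optimal).
    """
    prefix = [0, *accumulate(col_usage)]

    def avg_usage(lo, hi):
        return (prefix[hi + 1] - prefix[lo]) / (hi - lo + 1)

    lo, hi = x0, x1
    while lo < hi:
        mid = (lo + hi) // 2
        if avg_usage(lo, mid) <= avg_usage(mid + 1, hi):
            hi = mid       # left region is less congested
        else:
            lo = mid + 1   # right region is less congested
    return lo


stem = pick_v_stem([5, 5, 1, 0, 4, 4, 4, 4], 0, 7)
print(stem)  # 3
```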
Placement Cost Functions
Wire length driven: summation of net bounding boxes of all nets
Congestion driven:
• Wire usages estimated from the fast global router
• Cost = summation of the square of wire usages in all bins
• For fixed wire width, the cost is equivalent to a summation of weighted wire length, where the weight on a bin = wire usage of the bin
For a congestion driven run: only turns on the congestion driven cost at the finest placement level
Congestion cost = W1² + W2² + … + W9² for a 3x3 grid of bins with wire usages:
W1 W2 W3
W4 W5 W6
W7 W8 W9
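Both cost terms are simple sums. A minimal sketch, assuming that "net bounding box" means the half-perimeter of the box around a net's pins and that bin usages come from some external router estimate; the function names and sample data are hypothetical.

```python
def bbox_wirelength(nets):
    """Sum of half-perimeter bounding boxes; each net is a list of (x, y) pins."""
    total = 0
    for pins in nets:
        xs = [x for x, _ in pins]
        ys = [y for _, y in pins]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def congestion_cost(bin_usage):
    """Sum of squared wire usage over all bins (rows of a 2-D grid)."""
    return sum(w * w for row in bin_usage for w in row)


nets = [[(0, 0), (2, 1)], [(1, 1), (1, 3), (4, 2)]]
usage = [[1, 2, 0], [0, 3, 1], [0, 0, 1]]
print(bbox_wirelength(nets))   # 8
print(congestion_cost(usage))  # 16
```

Squaring the usage penalizes a few heavily used bins more than the same total usage spread evenly.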
Experimental Results on Wire Length Minimization
Multi-level simulated annealing coarse placement
Wire length comparison with GORDIAN-L:
• Our engine only turns on wire length optimization
• Legalized by DOMINO for wire length comparison
Our multi-level engine performs well for big circuits
• 20k-50k test cases: avqlarge, avqsmall, ibm04, ibm07
• 50k-100k test cases: ibm09, ibm10
• 100k-210k test cases: ibm14, ibm15, ibm16, ibm17, ibm18
mPG+DOM / GOR+DOM comparison (bar charts):
Circuit size   Wire length   CPU time
20k-50k        97%           81%
50k-100k       100%          43%
100k-210k      96%           22%
Experimental Results on Congestion Control
            BBOX WL   Routed WL   Max boundary congestion   Total overflow   CPU
mPG         1         1           1                         1                1
mPG-cg.rd   1.05      0.97        0.93                      0.47             6.1
mPG-cg      1.05      0.94        0.87                      0.21             18.9
Test cases: ibm01, ibm04, ibm07, ibm11, ibm13, ibm15
mPG: wire length driven mode
mPG-cg: congestion driven at finest clustering level
mPG-cg.rd: alternative congestion driven + wire length driven at finest clustering level
Initial Experimental Result on Impact of Simultaneous Retiming and Placement
                              WL-driven placement                          Simultaneous retiming and placement
circuit   #gates   Grid size  dly (before retiming)  dly (after retiming)  dly
S38584    13209    8x8        41                     38                    32
Ind1      29780    16x16      349                    349                   325
Ind2      26060    16x16      51                     39                    35
Ind3      52197    16x16      582                    577                   497
Ind4      101531   16x16      121                    115                   84
Avg.                          1                      0.93                  0.79
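The averages in the last row can be reproduced from the per-circuit delays. The sketch assumes that "Avg." is the arithmetic mean of each circuit's delay normalized to its WL-driven, before-retiming delay; the slide does not state the averaging rule, but this reading matches the reported 0.93 and 0.79.

```python
# (before retiming, after retiming, simultaneous retiming and placement)
DELAYS = {
    "S38584": (41, 38, 32),
    "Ind1": (349, 349, 325),
    "Ind2": (51, 39, 35),
    "Ind3": (582, 577, 497),
    "Ind4": (121, 115, 84),
}


def normalized_average(column):
    """Mean over circuits of delay[column] / delay before retiming."""
    ratios = [row[column] / row[0] for row in DELAYS.values()]
    return sum(ratios) / len(ratios)


avg_after = round(normalized_average(1), 2)
avg_simultaneous = round(normalized_average(2), 2)
print(avg_after, avg_simultaneous)  # 0.93 0.79
```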
Limitation of Exploring Multi-cycle Interconnect Communication during Logic Synthesis
The minimum clock period that can be achieved by logic optimization is bounded by the max. delay-to-register (DR) ratio of the loops in the circuit
Requires consideration of multi-cycle communication during architecture & behavior synthesis
Example:
• In a loop, 4 logic cells, 2 registers
• Cell delay = 1ns
• Interconnect delay = 1ns
• DR ratio = (Dlogic+Dint)/#Registers = (4+4)/2 = 4ns
• Clock cycle >= 4ns
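The bound for the example loop can be computed directly. A minimal sketch: the function name `dr_ratio` is hypothetical, and the count of 4 interconnect segments is inferred from the (4+4) term in the example.

```python
def dr_ratio(n_cells, cell_delay_ns, n_wires, wire_delay_ns, n_registers):
    """Delay-to-register ratio of one loop: total loop delay per register."""
    total_delay = n_cells * cell_delay_ns + n_wires * wire_delay_ns
    return total_delay / n_registers


# Example loop: 4 cells at 1ns, 4 wires at 1ns, 2 registers.
min_clock_ns = dr_ratio(4, 1, 4, 1, 2)
print(min_clock_ns)  # 4.0
```

For a whole circuit the bound is the maximum of this ratio over all loops, so a single long loop with few registers limits the clock period no matter how the registers are retimed.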
Regular Distributed Register Architecture (1)
[Figure: array of islands connected by global interconnect; each island has width Wi and height Hi and holds a register file plus a function unit cluster (FUC), e.g. ADD, MUX, DIV, as a cluster with area constraint]
Distribute registers to each "island"
Local computation and communication in each island can be done in a single clock cycle:
$D_{intra-island} = D_{logic} + D_{opt-int} \le D_{logic} + D_{opt-int}(2W_i + 2H_i) \le T$
But registers may need to be inserted along global interconnects for multi-cycle communication (less regular)
Regular Distributed Register Architecture (2)
[Figure: array of islands connected by global interconnect; each island has width Wi and height Hi and holds a register file plus a function unit cluster (FUC), e.g. ADD, MUX, DIV, as a cluster with area constraint; interconnects are marked 1 cycle, 2 cycle, ..., k cycle]
$D_{intra-island} = D_{logic} + D_{opt-int} \le D_{logic} + D_{opt-int}(2W_i + 2H_i) \le T$
Registers in each island are partitioned to k banks for 1 cycle, 2 cycle, … k cycle interconnect communication in each island
Highly regular
Example: Regular Distributed Register Architecture for 70nm Technology
NTRS'97 70nm technology
Chip dimension: 620 mm2 (24.9mm x 24.9mm)
5 GHz across-chip clock
• Wire can travel up to 7.52mm within 1 clock cycle under interconnect optimization
• Need 7 clock cycles to cross the chip
Each island base dimension
• Wi = Hi = 2.08mm
• = critical length (longest length that a wire can run without buffer insertion), estimated by IPEM BIWS estimations assuming buffer size: 2x, driver/receiver size: 2x
• ≈ 1/3 of the distance a wire can travel in 1 clock cycle
• Logic volume: 6.76M min-size 2-NAND gates
12x12 island-base array
Local registers are partitioned to 7 banks
Example: Impact of Interconnect on Scheduling
[Figure: data flow graph with 12 operations: 1 (-), 2 (+), 3 (*), 4 (*), 5 (-), 6 (-), 7 (*), 8 (*), 9 (-), 10 (-), 11 (*), 12 (*)]
Data flow graph extracted from discrete cosine transformation (DCT)
The delay of a * operation is 2ns; the delay of a + or - operation is 1ns.
The resources available are 2 multipliers and 2 ALUs.
The nodes with the same color are assigned to the same functional unit.
Wirelength-driven Placement
[Figure: islands, each with a register file and a functional unit]
• Alu1: operations 1, 5, 10
• Alu2: operations 2, 6, 9
• Mul1: operations 4, 8, 11
• Mul2: operations 3, 7, 12
Legend: long interconnect delay is 2ns; short interconnect delay is 1ns.
Single-cycle vs. Multi-cycle Interconnect Communication
Single-cycle interconnect communication
• Scheduled in 6 clock cycles
• Clock period is 4ns
• Total latency is 24ns
[Figure: schedule of operations 1 to 12 over Cycle 1 to Cycle 6]
Multi-cycle interconnect communication
• Scheduled in 9 clock cycles
• Clock period is 2ns
• Total latency is 18ns
[Figure: schedule of operations 1 to 12 over Cycle 1 to Cycle 9; registers are marked in the schedule]
Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization
Simultaneous Placement and Scheduling
With placement integrated with scheduling, the critical path is reduced.
The DFG can be scheduled in 8 clock cycles, with a clock period of 2ns.
The total latency is 16ns.
[Figure: islands, each with a register file and a functional unit]
• Alu1: operations 1, 5, 10
• Alu2: operations 2, 6, 9
• Mul1: operations 4, 8, 11
• Mul2: operations 3, 7, 12
[Figure: schedule of operations 1 to 12 over Cycle 1 to Cycle 8]
Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization
Simultaneous Placement, Scheduling and Binding
With placement integrated with scheduling and binding, the critical path is further reduced.
The DFG can be scheduled in 7 clock cycles, with a clock period of 2ns.
The total latency is 14ns.
[Figure: islands, each with a register file and a functional unit]
• Alu1: operations 1, 5, 10
• Alu2: operations 2, 6, 9
• Mul1: operations 4, 8, 12
• Mul2: operations 3, 7, 11
[Figure: schedule of operations 1 to 12 over Cycle 1 to Cycle 7]
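The four schedules of the DCT example differ only in cycle count and clock period, so their latencies can be compared with one multiplication each. The numbers are taken from the slides; the dictionary layout is only illustrative.

```python
# name -> (clock cycles, clock period in ns), as quoted on the slides
SCHEDULES = {
    "single-cycle communication": (6, 4),
    "multi-cycle communication": (9, 2),
    "simultaneous placement and scheduling": (8, 2),
    "simultaneous placement, scheduling and binding": (7, 2),
}

# Total latency = number of cycles * clock period.
latency_ns = {name: cycles * period for name, (cycles, period) in SCHEDULES.items()}

for name, ns in latency_ns.items():
    print(f"{name}: {ns} ns")
```

Multi-cycle communication uses more cycles but halves the clock period, and each further integration step removes one cycle at the same 2ns period.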
Example: Multicluster Architectures of DEC Alpha 21264
Source: The Multicluster Architecture: Reducing Cycle Time Through Partitioning by Keith I. Farkas, et al
Conclusions
Multi-cycle communication is needed for gigahertz designs
Sequential timing analysis + multilevel optimization enables efficient retiming/pipelining over global interconnects
Regular distributed register (RDR) fabric provides regularity to support
• Multicycle communication
• Integrated resource binding, scheduling, and physical planning