Retiming & Pipelining over Global Interconnects

Slides: cadlab.cs.ucla.edu/~cong/slides/ibm_jun02.pdf

Jason Cong
Computer Science Department
University of California, Los Angeles
[email protected]
http://cadlab.cs.ucla.edu/~cong

Joint work with C. C. Chang, D. Pan*, and X. Yuan
* IBM Research

Motivation: How Far Can We Go in Each Clock Cycle

[Figure: 24.9 mm x 24.9 mm chip with distance ticks at 0, 7.52, 15.04, 22.56, and 24.9 mm, marking the reach of 1 clock through 7 clocks]

NTRS’97 0.07um Tech
5 GHz across-chip clock
620 mm2 (24.9mm x 24.9mm)
IPEM BIWS estimations
Buffer size: 100x
Driver/receiver size: 100x
From corner to corner: 7 clock cycles
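The 7-cycle figure can be checked with simple arithmetic. The sketch below assumes the corner-to-corner wire follows a rectilinear (Manhattan) route, so its length is twice the chip edge; that routing assumption and the variable names are not stated on the slide.

```python
import math

chip_edge_mm = 24.9          # chip is 24.9 mm x 24.9 mm
reach_per_cycle_mm = 7.52    # distance a wire covers in one clock cycle

# Assumption: a corner-to-corner wire follows a Manhattan route.
corner_to_corner_mm = 2 * chip_edge_mm
cycles = math.ceil(corner_to_corner_mm / reach_per_cycle_mm)
print(cycles)  # 7
```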

Solutions

Fully asynchronous designs
GALS (globally asynchronous, locally synchronous designs)
Latency-insensitive designs
Synchronous designs, with multi-cycle communications
• Much better understood
• Supported by the current tool set
• More energy efficient?

Interconnect-Centric IC Design Flow Under Development at UCLA

[Figure: design flow from Design Specification through Architecture/Conceptual-level Design to Final Layout, built on the HDM abstraction (structure view, functional view, physical view, timing view). The step "Physical Hierarchy Generation for Multi-Cycle Comm." is highlighted.]

Synthesis and Placement under Physical Hierarchy

Interconnect Planning
• Physical Hierarchy Generation for Multi-Cycle Comm.
• Interconnect Architecture Planning

Interconnect Synthesis
• Topology generation & wire sizing for delay
• Wire ordering & spacing for noise control

Interconnect Layout
• Route Planning
• Point-to-Point Gridless Routing

Interconnect Optimization (TRIO)
• Topology Optimization with Buffer Insertion
• Wire sizing and spacing
• Simultaneous Buffer Insertion and Wire Sizing
• Simultaneous Topology Construction with Buffer Insertion and Wire Sizing

Interconnect Performance Estimation Models (IPEM)
• OWS, SDWS, BISWS

Physical Hierarchy Generation

[Figure: logical hierarchy mapped onto a physical hierarchy; hard IP and soft modules, with the same color for modules of the same logic hierarchy]

Assign modules to physical hierarchy
Defines global interconnects
• Optimization objectives:
  • wire length minimization
  • routing congestion minimization
  • clock period, latency, performance (with consideration of multi-cycle comm.)
Physical Hierarchy = Placement bins + module locations

Physical Hierarchy Generation Problem Formulation

Need of Considering Retiming/Pipelining during Placement: Retiming/pipelining on global interconnects

Multiple clock cycles are needed to cross the chip
Proper placement allows retiming to hide global interconnect delays.

[Figure: two placements of nodes a, b, c, d with d(v)=1, WL=6, d(e) ∝ WL. Placement 1 orders the nodes a, b, c, d; Placement 2 orders them a, c, b, d. Labels: before retiming, φ = 5.0; after retiming, φ = 3.0.]

Better Initial Placement !!


Difficulties

How to consider retiming/pipelining over global interconnects
• Flip-flop boundaries are not fixed during placement, so static timing analysis is difficult
• Answer: Use the concepts of c-retiming and sequential timing analysis (Seq-TA)
How to handle the high complexity of the combined problem
• Answer: Use the multi-level optimization technique

Simultaneous Coarse Placement with Retiming on Interconnects

Our solution
• Compute the labels of all nodes under c-retiming for a given placement solution and perform sequential timing analysis (Seq-TA)
• Minimize the longest sequential path by improving the placement solution
Alternative solution [Brayton, et al]
• Enforcing all loop constraints during placement

Static Timing Analysis (STA)

[Figure: sequential circuit example with nodes a, b, c, d, e, f, g (PI: a, b; PO: g), and the same circuit transformed into a DAG]

Transform the circuit into a DAG for static timing analysis
Topological order: a, b, g, f, c, d, e
Arrival time (AT) and required time (RT) of each node are computed in linear time.

Suppose d(v)=1, d(e)=2, and clock cycle φ = 11:

Node   a   b   g    f   c   d   e
AT     1   1   3    3   3   6   9
RT     9   9   11   9   3   6   9
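Arrival-time propagation in topological order can be sketched as follows. The circuit topology lives in the figure and is not recoverable from the text, so the fragment below is a hypothetical chain c → d → e driven by a flip-flop output `ff` with arrival time 0; it was chosen only because it reproduces the AT values 3, 6, 9 listed for c, d, e. The function name and data layout are illustrative.

```python
from graphlib import TopologicalSorter

def arrival_times(fanins, start, d_v=1.0, d_e=2.0):
    """AT(v) = max over fanins u of (AT(u) + d_e) + d_v, in topological order."""
    at = dict(start)
    # TopologicalSorter takes a node -> predecessors mapping, so fanins fit directly.
    for v in TopologicalSorter(fanins).static_order():
        if v not in at:
            at[v] = max(at[u] + d_e for u in fanins[v]) + d_v
    return at

# Hypothetical fragment: flip-flop output "ff" feeds the chain c -> d -> e.
fanins = {"c": ["ff"], "d": ["c"], "e": ["d"]}
at = arrival_times(fanins, start={"ff": 0.0})
print(at["c"], at["d"], at["e"])  # 3.0 6.0 9.0
```

Each node is visited once and each edge is examined once, which is where the linear running time comes from.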

Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT)

Definition [Pan et al, TCAD98]
Given a clock period φ, transform circuit C into an edge-weighted, vertex-weighted graph G.
Label vertex v as l(v) = the weight of the longest path from PIs to v = max{l(u) - φ·w(u,v) + d(u,v) + d(v)}; l(v) is also called SAT(v).

Theorem: C can be retimed to φ + max{d(v)} iff l(POs) ≤ φ
Relation to retiming: r(v) = ⌈l(v)/φ⌉ - 1
Complexity is O(VE)

Example: d(a) = d(b) = 1, d(a,c) = d(b,c) = 2, φ = 5, l(a) = 7, l(b) = 3, w(a,c) = 1, w(b,c) = 0
l(c) = max{7+2-5·1+1, 3+2+1} = 6

[Figure: vertices a and b with delays d(a), d(b) feed vertex c with delay d(c). In the weighted graph, the edge weights are wl(a,c) = d(e(a,c)) - φ·w(a,c) and wl(b,c) = d(e(b,c)) - φ·w(b,c).]
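The labeling rule can be checked on the small example with a few lines of Python. This is a sketch of the formula only; the function name and the tuple layout `(l(u), w(u,v), d(u,v))` are illustrative, and d(c) = 1 is taken from the slide's arithmetic.

```python
def label(phi, d_v, fanins):
    """l(v) = max over fanins of (l(u) - phi * w(u,v) + d(u,v)) + d(v)."""
    return max(l_u - phi * w + d_e for (l_u, w, d_e) in fanins) + d_v

# Fanins of c: from a (l=7, w=1, d=2) and from b (l=3, w=0, d=2).
l_c = label(phi=5, d_v=1, fanins=[(7, 1, 2), (3, 0, 2)])
print(l_c)  # 6
```

The register on edge (a, c) subtracts one full clock period from the path through a, so the path through b determines the label.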

Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT)

[Figure: sequential circuit with nodes a to g, its retiming graph (not a DAG) with edge weights -2.5 (five edges), -7 (one edge), and 2 (three edges), and the retimed circuit]

d(v)=1, d(e)=2
Is φ = 4.5 possible?

Iter#   a   b   c      d     e     f    g
0       0   0   -∞     -∞    -∞    -∞   -∞
1       0   0   -1.5   -∞    -∞    -∞   -∞
2       0   0   -1.5   1.5   1.5   -∞   -∞
3       0   0   -1.5   1.5   4.5   0    0
4       0   0   -1.5   1.5   4.5   0    0
5       0   0   -1.5   1.5   4.5   0    0

Cycle time 4.5 is possible because l(g) ≤ 4.5

Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT) (cont’d)

[Figure: the same sequential circuit with nodes a to g and its retiming graph (not a DAG)]

d(v)=1, d(e)=2
Is φ = 2.5 feasible?

Iter#   a   b   c     d     e     f    g
0       0   0   -∞    -∞    -∞    -∞   -∞
1       0   0   0.5   -∞    -∞    -∞   -∞
2       0   0   0.5   3.5   3.5   -∞   -∞
3       0   0   0.5   3.5   6.5   4    4

Cycle time 2.5 is not feasible because l(g) > 2.5
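The iterative label computation behind these two examples can be sketched as a Bellman-Ford style relaxation. The edge list below is an assumption: the circuit is only shown in the figure, so the edges and register counts were picked so that the converged labels at φ = 4.5 agree with the final row of the φ = 4.5 table. It is not claimed to be the circuit on the slide, and the function names are illustrative.

```python
import math

def c_retiming_labels(phi, edges, pis, nodes, d_v=1.0, d_e=2.0):
    """Relax l(v) = max(l(u) - phi*w + d_e + d_v) until the labels are stable.

    Returns (labels, converged). If labels still change in the |V|-th sweep,
    the graph has a positive-weight loop and phi is infeasible.
    """
    l = {v: (0.0 if v in pis else -math.inf) for v in nodes}
    for _ in range(len(nodes)):
        changed = False
        for u, v, w in edges:
            cand = l[u] - phi * w + d_e + d_v
            if cand > l[v]:
                l[v] = cand
                changed = True
        if not changed:
            return l, True
    return l, False

def feasible(phi, edges, pis, pos, nodes):
    labels, converged = c_retiming_labels(phi, edges, pis, nodes)
    return converged and all(labels[p] <= phi for p in pos)

# Hypothetical graph: (source, sink, number of registers on the edge).
NODES = list("abcdefg")
EDGES = [("a", "c", 1), ("b", "c", 1), ("c", "d", 0), ("c", "e", 0),
         ("d", "e", 0), ("d", "f", 1), ("d", "g", 1), ("e", "c", 2)]
PIS, POS = {"a", "b"}, {"g"}

labels, ok = c_retiming_labels(4.5, EDGES, PIS, NODES)
print(ok, labels["c"], labels["d"], labels["e"], labels["g"])  # True -1.5 1.5 4.5 0.0
print(feasible(4.5, EDGES, PIS, POS, NODES))  # True
print(feasible(2.5, EDGES, PIS, POS, NODES))  # False
```

With at most |V| sweeps over |E| edges, the sketch matches the O(VE) complexity quoted for c-retiming. In a placement loop, the same routine would be rerun with d(e) derived from the current wire lengths.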

Multi-Level Optimization Framework

[Figure: V-cycle of coarsening followed by uncoarsening and refinement (optimization); axes: problem sizes and levels]

• Multi-level coarsening generates smaller problem sizes for top levels, which gives faster optimization on top levels
• May explore different aspects of the solution space at different levels
• Gradual refinement on good solutions from coarser levels is very efficient
• Successful in many applications
  • Originally developed for PDE
  • Recent success in VLSI CAD: partitioning, placement, routing

Challenges

Previous Seq-TA can only handle single-output gates
In reality multi-output modules exist
• IP block, MUX, adders
• Clusters in the multi-level optimization process
How to integrate Seq-TA into multi-level coarse placement efficiently
Need to consider congestion and routability

Generalize c-retiming for Complex Combinational Modules

[Figure: complex module (combinational logic) with multiple outputs and non-uniform propagation delay; input pins vI0, vI1, vI2, output pins vO0, vO1, and pin-to-pin delays 4, 11, 9, 3, so that d’(v)=11]

Reduce the non-uniform gate delay to a uniform gate delay by taking the max. internal delay as the gate delay: d’(v) = max { d(v(i, j)) }

l1-value labeling for each vertex
• l1(v) = weight of the longest path from PIs to v using d’(v) as uniform gate delay
• Each vertex has an l1-value label.
• Upper bound of the labeling

Flatten/decompose the complex module by treating each pin of the module as a vertex with zero delay.

l2-value labeling for each output of a vertex
• l2(v_ot) = weight of the longest path from PIs to output o_t of v
• Each output of a vertex has an l2-value label.
• Lower bound of the labeling

Properties of Generalized c-retiming for Complex Combinational Modules

Theorem: If there exists a PO_t with l2(PO_t) > Φ, then the circuit cannot be retimed to a clock period of Φ.

Theorem: If for every PO_i, l1(PO_i) ≤ Φ, then the circuit can be retimed to a clock period less than Φ+k, where k is the max. input-output delay of all gates.

Theorem: For any module v and its out-pin v_ot, l2(v_ot) ≤ l1(v).

Theorem: Given a circuit C, let Φ be the min. clock period achieved by retiming on C. If C_c is derived from C by performing clustering, and the min. clock period achieved by retiming on C_c is Φ_c, then Φ ≤ Φ_c.
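The relation l2(v_ot) ≤ l1(v) can be illustrated on a single module. In the sketch below the four delay values 4, 11, 9, 3 come from the module figure, but their assignment to pin pairs is an assumption, as are the input labels. Wire delays and register terms are taken to be already folded into the input labels.

```python
# Hypothetical pin-to-pin delays d(v(i, j)); the pairing of values to pins is assumed.
pin_delay = {("I0", "O0"): 4, ("I1", "O0"): 11, ("I1", "O1"): 9, ("I2", "O1"): 3}
# Hypothetical labels already accumulated at the input pins.
in_label = {"I0": 5, "I1": 2, "I2": 7}

# l1: one label per vertex, using the uniform delay d'(v) = max pin-to-pin delay.
d_prime = max(pin_delay.values())
l1 = max(in_label.values()) + d_prime

# l2: one label per output pin, using the actual pin-to-pin delays.
l2 = {}
for (i, o), d in pin_delay.items():
    l2[o] = max(l2.get(o, float("-inf")), in_label[i] + d)

print(d_prime, l1, l2)  # 11 18 {'O0': 13, 'O1': 11}
```

Every term of l2 is dominated by the corresponding term of l1, which is why l1 serves as the upper bound and l2 as the lower bound of the labeling.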

Integrate Seq-TA with a Multi-level SA-based Coarse Placement

In the coarsening phase, FFs can only be clustered after a certain level k
From level Ln to Lk+1, perform static timing analysis (where FFs are clustered)
From level Lk to L0, perform Seq-TA (where FFs are not clustered)

[Figure: multi-level hierarchy with Level L0, ..., Level Lk, ..., Level Ln; labels: Initial Placement, Refinement by timing-driven SA-based coarse placement]

Area Density Problems in Multi-level Coarse Placement

Traditional area density control:
• Cell area in each bin < bin area utilization with a small percentage of overflow
• Does not work when cluster sizes may have significant variations and may be bigger than a bin
How about using different grid sizes for different levels of clustering?
• Hard to find fixed percentages that work
• Significant placement cost jump when switching grid sizes

Hierarchical Area Density Control

Use the same grid structure for placement for all clustering levels
Impose hierarchy on bin structure for area density control
Each cluster move must satisfy the area constraints on each level in the bin hierarchy
Area constraint for moving a cell of size A
• Allowed overflow on each level in the bin hierarchy = kA, where k is a small constant (usually 1 or 2)
Works well in the multi-level framework:
• Area constraints are gradually tightened during optimization
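One way to read the move check is sketched below. The slide does not specify the data structures, so everything here is an assumption: a quadtree-style bin hierarchy in which the level-h bin containing grid cell (x, y) is (x >> h, y >> h), a level-h capacity of 4^h times the base bin capacity, and a class name `BinHierarchy` that is purely illustrative.

```python
from collections import defaultdict

class BinHierarchy:
    """Hypothetical bin hierarchy for hierarchical area density control."""

    def __init__(self, levels, bin_capacity, k=1):
        self.levels = levels
        self.bin_capacity = bin_capacity
        self.k = k
        self.used = [defaultdict(float) for _ in range(levels)]

    def _bins(self, x, y):
        # The enclosing bin at level h is found by dropping h low-order bits.
        return [(h, (x >> h, y >> h)) for h in range(self.levels)]

    def can_place(self, x, y, area):
        """A move is legal only if every level stays within capacity + k*A."""
        for h, key in self._bins(x, y):
            capacity = self.bin_capacity * 4 ** h
            if self.used[h][key] + area > capacity + self.k * area:
                return False
        return True

    def place(self, x, y, area):
        for h, key in self._bins(x, y):
            self.used[h][key] += area

grid = BinHierarchy(levels=2, bin_capacity=10, k=1)
first = grid.can_place(0, 0, 25)    # cluster larger than one bin is still allowed
grid.place(0, 0, 25)
second = grid.can_place(0, 0, 25)   # same bin is now too full
third = grid.can_place(1, 1, 25)    # neighbouring bin under the same parent
print(first, second, third)  # True False True
```

Because the allowed overflow scales with the moved area A, the constraint tightens on its own as uncoarsening replaces large clusters with smaller ones.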

Fast Incremental A-tree Routing for Multi-pin Nets

Simple incremental A-tree
• Recursively quad-partition grids
• Each pin recursively connects to the lower left corner of each level of partition
For a net with bounding box length B, at most 2 log B edge updates for each pin move, except the root.
Each edge is routed by the LZ-router

[Figure: first quadrant; root (source pin)]

Fast LZ-routing for Two-pin Connections

Decide HVH or VHV: select the less congested layer
Binary search on V-stem (or H-stem)
• Initial left region and right region to cover bounding box
• Repeat
  • Query wire usage on both regions
  • Select region with less congestion
Wire usage query can be done in O(log grid_size)

[Figure: HVH and VHV route shapes; left region and right region of the bounding box]

Placement Cost Functions

Wire length driven: summation of net bounding boxes of all nets
Congestion driven:
• Wire usages estimated from the fast global router
• Cost = summation of square of wire usages in all bins
• For fixed wire width, cost is equivalent to summation of weighted wire length, where weight on a bin = wire usage of the bin
For congestion driven run: only turns on congestion driven cost at the finest placement level

[Figure: 3 x 3 grid of bins with wire usages W1 to W9]

Congestion cost = W1^2 + W2^2 + … + W9^2
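The congestion cost is straightforward to compute once per-bin wire usage is available. The sketch below uses a hypothetical 3 x 3 usage grid standing in for W1 to W9; in the flow described here those numbers would come from the fast global router.

```python
def congestion_cost(wire_usage):
    """Sum of squared wire usage over all bins."""
    return sum(w * w for row in wire_usage for w in row)

# Hypothetical wire usages W1..W9 on a 3 x 3 grid of bins.
usage = [[1, 2, 0],
         [3, 1, 0],
         [0, 2, 1]]
print(congestion_cost(usage))  # 20
```

Squaring penalizes concentrated usage: a bin with usage 3 contributes 9, while three bins with usage 1 contribute 3 in total for the same wire length.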

Experimental Results on Wire Length Minimization

Multi-level simulated annealing coarse placement
Wire length comparison with GORDIAN-L:
• Our engine only turns on wire length optimization
• Legalized by DOMINO for wire length comparison
Our multi-level engine performs well for big circuits
• 20k-50k test cases: avqlarge, avqsmall, ibm04, ibm07
• 50k-100k test cases: ibm09, ibm10
• 100k-210k test cases: ibm14, ibm15, ibm16, ibm17, ibm18

mPG+DOM/GOR+DOM comparison (bar chart values):

Circuit size   Wire length   CPU time
20k-50k        97%           81%
50k-100k       100%          43%
100k-210k      96%           22%

Experimental Results on Congestion Control

Mode        BBOX WL   Routed WL   Max boundary congestion   Total overflow   CPU
mPG         1         1           1                         1                1
mPG-cg.rd   1.05      0.97        0.93                      0.47             6.1
mPG-cg      1.05      0.94        0.87                      0.21             18.9

Test cases: ibm01, ibm04, ibm07, ibm11, ibm13, ibm15

mPG: wire length driven mode
mPG-cg: congestion driven at finest clustering level
mPG-cg.rd: alternative congestion driven + wire length driven at finest clustering level

Initial Experimental Result on Impact of Simultaneous Retiming and Placement

                                 WL-driven placement                      Simultaneous retiming and placement
circuit   #gates   Grid size     dly (before retiming)   dly (after retiming)   dly
S38584    13209    8x8           41                      38                     32
Ind1      29780    16x16         349                     349                    325
Ind2      26060    16x16         51                      39                     35
Ind3      52197    16x16         582                     577                    497
Ind4      101531   16x16         121                     115                    84
Avg.                             1                       0.93                   0.79
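The Avg. row can be reproduced from the per-circuit delays, assuming it is the arithmetic mean of each circuit's delay normalized to the WL-driven, before-retiming delay. That reading of the columns is an assumption, as are the names used in the sketch.

```python
# (before retiming, after retiming, simultaneous retiming and placement)
delays = {
    "S38584": (41, 38, 32),
    "Ind1": (349, 349, 325),
    "Ind2": (51, 39, 35),
    "Ind3": (582, 577, 497),
    "Ind4": (121, 115, 84),
}

def mean_ratio(column):
    """Average of delay[column] / delay[before retiming] over all circuits."""
    ratios = [row[column] / row[0] for row in delays.values()]
    return sum(ratios) / len(ratios)

avg_after = round(mean_ratio(1), 2)
avg_simultaneous = round(mean_ratio(2), 2)
print(avg_after, avg_simultaneous)  # 0.93 0.79
```

Under this reading, retiming after WL-driven placement recovers about 7% on average, while simultaneous retiming and placement recovers about 21%.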

Limitation of Exploring Multi-cycle Interconnect Communication during Logic Synthesis

The minimum clock period achievable by logic optimization is bounded by the max. delay-to-register (DR) ratio of the loops in the circuit
This requires consideration of multi-cycle communication during architecture & behavior synthesis

Example:
• In a loop, 4 logic cells, 2 registers
• Cell delay = 1ns
• Interconnect delay = 1ns
• DR ratio = (Dlogic+Dint)/#Registers = (4+4)/2 = 4ns
• Clock cycle >= 4ns
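The loop bound from this example can be written out directly. The sketch assumes the loop has four interconnect segments of 1ns each, which is what the total Dint = 4 in the slide's arithmetic implies; the function name is illustrative.

```python
def dr_ratio(logic_delays_ns, interconnect_delays_ns, num_registers):
    """Delay-to-register ratio of one loop: total loop delay per register."""
    total = sum(logic_delays_ns) + sum(interconnect_delays_ns)
    return total / num_registers

# 4 logic cells and 4 wire segments at 1ns each, 2 registers in the loop.
bound_ns = dr_ratio([1] * 4, [1] * 4, 2)
print(bound_ns)  # 4.0
```

Retiming can move the two registers around the loop but cannot change their number, so no retiming brings the clock period below this ratio.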

Regular Distributed Register Architecture (1)

[Figure: array of islands connected by global interconnect. Each island, of width Wi and height Hi, contains a function unit cluster (FUC) and a register file.]

Distribute registers to each “island”
Local computation and communication in each island can be done in a single clock cycle
But registers may need to be inserted along global interconnects for multi-cycle communication (less regular)

D_intra-island = D_logic + D_int ≤ D_logic + D_opt-int(2Wi + 2Hi) ≤ T

[Figure detail: function unit cluster with ADD, MUX, and DIV units; cluster with area constraint]

Regular Distributed Register Architecture (2)

[Figure: array of islands connected by global interconnect. Each island, of width Wi and height Hi, contains a function unit cluster (FUC) and a register file whose banks are labeled 1 cycle, 2 cycle, ..., k cycle.]

D_intra-island = D_logic + D_int ≤ D_logic + D_opt-int(2Wi + 2Hi) ≤ T

[Figure detail: function unit cluster with ADD, MUX, and DIV units; cluster with area constraint]

Registers in each island are partitioned into k banks for 1-cycle, 2-cycle, …, k-cycle interconnect communication in each island
Highly regular

Example: Regular Distributed Register Architecture for 70nm Technology

NTRS’97 70nm Tech
Chip dimension: 620 mm2 (24.9mm x 24.9mm)
5 GHz across-chip clock
• Wire can travel up to 7.52mm within 1 clock cycle under interconnect optimization
• Need 7 clock cycles to cross the chip
Each island base dimension
• Wi = Hi = 2.08mm
• = critical length (longest length that a wire can run without buffer insertion) estimated by IPEM BIWS estimations assuming buffer size: 2x, driver/receiver size: 2x
• ≈ 1/3 of distance a wire can travel in 1 clock cycle
• Logic volume: 6.76M min-size 2-NAND gates
12x12 island-base array
Local registers are partitioned to 7 banks
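The 12x12 array size follows from the chip and island dimensions. The sketch below assumes a partial island at the chip edge counts as a full one (rounding up); the variable names are illustrative.

```python
import math

chip_edge_mm = 24.9      # chip is 24.9 mm x 24.9 mm
island_edge_mm = 2.08    # Wi = Hi = 2.08 mm

# Assumption: round up so the array covers the whole chip edge.
islands_per_side = math.ceil(chip_edge_mm / island_edge_mm)
total_islands = islands_per_side ** 2
print(islands_per_side, total_islands)  # 12 144
```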

[Figure: data flow graph extracted from the discrete cosine transform (DCT), with operations - 1, + 2, * 3, * 4, - 5, - 6, * 7, * 8, - 9, - 10, * 11, * 12]

The delay of the * operation is 2ns; the delay of the + and - operations is 1ns.
The resources available are 2 multipliers and 2 ALUs.
The nodes with the same color are assigned to the same functional unit.

Example: Impact of Interconnect on Scheduling

Wirelength-driven Placement

[Figure: islands, each with a register file, holding Alu1 (operations 1, 5, 10), Alu2 (2, 6, 9), Mul1 (4, 8, 11), Mul2 (3, 7, 12), and an FUC, shown next to the DCT data flow graph with long and short interconnects marked]

The long interconnect delay is 2ns.
The short interconnect delay is 1ns.

Single-cycle vs. Multi-cycle Interconnect Communication

Single-cycle interconnect communication
• Scheduled in 6 clock cycles
• Clock period is 4ns
• Total latency is 24ns

Multi-cycle interconnect communication
• Scheduled in 9 clock cycles
• Clock period is 2ns
• Total latency is 18ns

[Figure: the DCT data flow graph scheduled over cycles 1 to 6 (single-cycle) and cycles 1 to 9 (multi-cycle), with registers marked]

Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization

Simultaneous Placement and Scheduling
With placement integrated with scheduling, the critical path is reduced.
The DFG can be scheduled in 8 clock cycles, with a clock period of 2ns.
The total latency is 16ns.

[Figure: islands with register files holding Alu1 (operations 1, 5, 10), Alu2 (2, 6, 9), and Mul1 (4, 8, 11), and the DCT data flow graph scheduled over cycles 1 to 8]

Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization

Simultaneous Placement, Scheduling and Binding
With placement integrated with scheduling and binding, the critical path is further reduced.
The DFG can be scheduled in 7 clock cycles, with a clock period of 2ns.
The total latency is 14ns.

[Figure: islands with register files holding Alu1 (operations 1, 5, 10), Alu2 (2, 6, 9), and Mul1 (4, 8, 12), and the DCT data flow graph scheduled over cycles 1 to 7]
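The four DCT schedules can be compared side by side, since total latency is the cycle count times the clock period. The cycle counts and clock periods are the ones quoted on the slides; the dictionary keys are just labels chosen here.

```python
# name -> (clock cycles, clock period in ns), as quoted on the slides
schedules = {
    "single-cycle communication": (6, 4),
    "multi-cycle communication": (9, 2),
    "placement + scheduling": (8, 2),
    "placement + scheduling + binding": (7, 2),
}

latency_ns = {name: cycles * period for name, (cycles, period) in schedules.items()}
for name, total in latency_ns.items():
    print(f"{name}: {total}ns")
```

Multi-cycle communication uses more cycles but halves the clock period, and each level of integration with placement then removes one more cycle from the schedule.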

Example: Multicluster Architectures of DEC Alpha 21264

Source: The Multicluster Architecture: Reducing Cycle Time Through Partitioning, by Keith I. Farkas, et al.

Conclusions

Multi-cycle communication is needed for gigahertz designs
Sequential timing analysis + multilevel optimization enables efficient retiming/pipelining over global interconnects
Regular distributed register (RDR) fabric provides regularity to support
• Multicycle communication
• Integrated resource binding, scheduling, and physical planning
