Constraint-Driven Large Scale Circuit Placement Algorithms

Advisor: Prof. Jason Cong

Student: Min Xie

September, 2006

UCLA VLSICAD LAB

Outline

Chapter 1. Introduction
Chapter 2. Optimality and scalability study of existing placement algorithms
Chapter 3. Routability driven multilevel global placement and white space allocation
Chapter 4. A robust legalization scheme for mixed-size placement
Chapter 5. Applications of mixed-size placement legalization
Chapter 6. “Global” localized preprocessing for detailed placement
Chapter 7. Heterogeneous placement for FPGAs
Chapter 8. Conclusions and future work

Publication List

Cong J., Xie M., and Zhang Y., “An Enhanced Multilevel Routing System,” Proceedings of the ICCAD, pp. 51-58, 2002.

Chang C., Cong J., and Xie M., “Optimality and Scalability of Existing Placement Algorithms,” Proceedings of ASPDAC, pp. 621-627, 2003.

Cong J., Romesis M., and Xie M., “Optimality, Scalability and Stability Study of Existing Partitioning and Placement Algorithms,” Proceedings of ISPD, pp. 88-94, 2003.

Cong J., Romesis M., and Xie M., “Optimality and Stability Study of Timing-driven Placement Algorithms,” Proceedings of ICCAD, pp. 472-478, 2003.

Cong J., Kong T., Shinnerl J., Xie M., and Yuan X., “Large-Scale Circuit Placement: Gap and Promise,” Proceedings of ICCAD, pp. 883-890, 2003.

Chang C., Cong J., Romesis M., and Xie M., “Optimality and Scalability of Existing Placement Algorithms,” IEEE TCAD, vol. 23, no. 4, pp. 537-549, 2004.

Publication List (continued)

Li C., Xie M., Koh C.K., Cong J., and Madden P., “Routability-driven Placement and White Space Allocation,” Proceedings of ICCAD, pp. 883-890, 2004.

J. Cong, J. Fang, M. Xie, and Y. Zhang, "MARS - A Multilevel Full-Chip Gridless Routing System," IEEE TCAD, Vol. 24, No. 3, pp. 382-394, March 2005.

J. Cong, T. Kong, J. Shinnerl, M. Xie, and X. Yuan, "Large Scale Circuit Placement," ACM TODAES, Vol. 10, No. 2, pp. 389-430, April 2005.

Li C., Xie M., Koh C.K., Cong J., and Madden P., “Routability-driven Placement and White Space Allocation,” IEEE TCAD, to appear.

T. Chan, J. Cong, M. Romesis, J. Shinnerl, K. Sze, M. Xie, “mPL6: A Robust Multilevel Mixed-size Placement Engine,” Proceedings of ISPD, pp. 227-229, April 2005.

Cong J. and Xie M., “A Robust Detailed Placement Algorithm for Mixed-size IC Designs,” Proceedings of ASPDAC, pp. 188-194, 2006.

J. Cong, T. Chan, J. Shinnerl, K. Sze and M. Xie, "mPL6: Enhanced Multilevel Mixed-size Placement," Proceedings of the ISPD, pp. 212-214, April 2006.

http://cadlab.cs.ucla.edu/~cong/papers/pc3.pdf

A Brief History of mPL

[Figure: relative wirelength of successive mPL releases plotted against year (2000 to 2004), with the timeline split into a UNIFORM CELL SIZE period and a NON-UNIFORM CELL SIZE period.]

mPL 1.0 [ICCAD00]
• Recursive ESC clustering
• NLP at coarsest level
• Goto discrete relaxation
• Slot Assignment legalization
• Domino detailed placement

mPL 1.1
• FC-Clustering
• Added partitioning to legalization

mPL 2.0
• RDFL relaxation
• Primal-dual netlist pruning

mPL 3.0 [ICCAD 03]
• QRS relaxation
• AMG interpolation
• Multiple V-cycles
• Cell-area fragmentation

mPL 4.0
• Improved DP
• Better coarsening
• Backtracking V-cycle

mPL5, mPL6
• Multilevel Force-Directed

Multiscale Optimization Framework

[Figure: V-cycle starting from the given problem. Coarsening (clustering) runs down one side as the problem size decreases; interpolation and relaxation (optimization) run back up the other side.]

• Explores different scales of the solution space at different levels
• Supports VERY FAST and SCALABLE methods
• Supports inclusion of complicated objectives and constraints
• Successful across MANY DIVERSE applications

mPL6 – Generalized Force Directed Refinement

Objective: logsum wirelength

Equality constraint: average bin density = utilization ratio

Average bin density:

$$D_{ij} = \frac{\sum_{k=1}^{7} a_{ij}(v_k)}{\text{bin area}}$$

where $a_{ij}(v_k)$ is the fractional area of cell $v_k$ in bin $B_{ij}$; for example, $a_{13}(v_7)$ = fractional area of cell v7 in bin B13.

[Figure: grid of bins overlapped by cells v1 to v7.]
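The bin density above can be computed directly from cell and bin rectangles. The following is a minimal sketch, not the mPL6 implementation: the `Rect` type, the uniform bin grid, and the function names are assumptions made for illustration. It sums the overlap area of every cell with every bin and divides by the bin area.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_area(a: Rect, b: Rect) -> float:
    """Area shared by two axis-aligned rectangles (0 if they are disjoint)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return max(w, 0.0) * max(h, 0.0)


def bin_densities(cells, bin_w, bin_h, nx, ny):
    """D[i][j] = (sum over cells of their area inside bin (i, j)) / bin area.

    Hypothetical layout: row i, column j, uniform bins anchored at the origin.
    """
    bin_area = bin_w * bin_h
    D = [[0.0] * nx for _ in range(ny)]
    for i in range(ny):
        for j in range(nx):
            b = Rect(j * bin_w, i * bin_h, (j + 1) * bin_w, (i + 1) * bin_h)
            D[i][j] = sum(overlap_area(c, b) for c in cells) / bin_area
    return D


# One cell of area 0.5 that straddles two unit bins in the same row.
cells = [Rect(0.5, 0.5, 1.5, 1.0)]
D = bin_densities(cells, 1.0, 1.0, nx=2, ny=1)
```

Each of the two bins receives half of the cell's area, so a cell that straddles a bin boundary contributes fractionally to both, which is the role of the fractional-area term a_ij(v_k).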

mPL6 – Iterative Flow

[Figure: multiple V-cycles across Level 1, Level 2, and Level 3, with steps labeled C, I, and C+I.]

Bestchoice clustering [Alpert et al, ISPD05]

AMG declustering [Chen et al, DAC03, Chan et al ICCAD03]

Multiple V cycle with distance based reclustering [Chan et al, ICCAD03]


Chapter 2. Optimality and scalability study of existing placement algorithms

Chapter 3. Routability driven multilevel global placement and white space allocation
• Motivation and previous work
• Routability-driven multilevel placement
• Experiment results
• Conclusions and future work

Motivation

mPL does not consider routing congestion
• Aggressive HPWL minimization != routability

Routability-driven placement
• Routability modeling
• Routability optimization

Previous Work -- Routability Modeling

Topology-free methods
• Dragon [Yang et al., TCAD03]
• Sparse [Hu et al., ICCAD02]
• BonnPlace [Brenner & Rohe, ISPD02]

Topology-based methods
• [Mayrhofer & Lauther, ICCAD90]
• mPG [Chang et al., ISPD02]

Previous Work -- Routability Optimization

Cell weighting
• Cell inflation based on congestion
• Constructive and iterative methods
  • Dragon [Yang et al, TCAD03]
  • BonnPlace [Brenner & Rohe, ISPD02]

Net weighting
• Translate into bin weights and optimize weighted wirelength
• Iterative methods
  • Sparse [Hu & Sadowska, ICCAD02]
  • mPG [Chang et al, ISPD02]

Routability-Driven Multilevel Placement

Global placement
• Congestion estimation by a fast LZ router
• Congestion-driven cell re-placement based on weighted wirelength

Hierarchical top-down white space allocation
• Geometric-based slicing tree
• Congestion estimation on tree
• Cutline adjustment

mPL-R Congestion Estimation with LZ Router

Use LZ-Router [Chang et al., ISPD02] for fast congestion analysis on each level

Binary search on V-stem (or H-stem)
• Initialize left region and right region to cover bounding box
• Repeat
  • Query wire usage on both regions
  • Select region with less congestion

[Figure: HVH and VHV route patterns inside a net bounding box that is split into a left region and a right region, one marked less congested and the other more congested.]
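The binary search on the stem position can be sketched as follows. This is an illustration under assumptions, not the LZ-Router code: wire usage is modeled as one number per column of the bounding box, region queries are answered with prefix sums, and regions are compared by average usage so that halves of unequal width are treated fairly. The function name `pick_v_stem` is hypothetical.

```python
from itertools import accumulate


def pick_v_stem(col_usage, lo, hi):
    """Narrow [lo, hi] to a single column by repeatedly keeping the less congested half.

    col_usage[x] is an assumed per-column wire-usage estimate inside the
    net bounding box. This is a greedy heuristic, not an exact minimum search.
    """
    prefix = [0, *accumulate(col_usage)]

    def avg_usage(a, b):
        # Average usage of columns a..b inclusive, answered in O(1).
        return (prefix[b + 1] - prefix[a]) / (b - a + 1)

    while lo < hi:
        mid = (lo + hi) // 2
        if avg_usage(lo, mid) <= avg_usage(mid + 1, hi):
            hi = mid          # left region is less congested
        else:
            lo = mid + 1      # right region is less congested
    return lo


usage = [5, 4, 1, 0, 6, 7, 8, 9]
stem = pick_v_stem(usage, 0, len(usage) - 1)
```

With prefix sums each query costs constant time, so choosing a stem takes a number of steps logarithmic in the bounding-box width, which is what makes the estimate cheap enough to run at every level.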

mPL-R Congestion-Driven Re-Placement

Pick cells whose incident nets cross congested regions to move

Start from the optimal location for HPWL

Search adjacent bins within a certain window

Choose the bin based on weighted WL

[Figure: bins with congestion weights 0.5, 1.2, and 2.0; two candidate locations are annotated WLc = 15.5 and WLc = 9.2.]
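The window search above can be sketched as follows. The slides do not define how the weighted wirelength is computed, so this example assumes a simple stand-in: the cost of a net is the sum of the congestion weights of all bins covered by its bounding box. The function names and the grid layout are likewise hypothetical.

```python
def weighted_wl(bbox, weight):
    """Assumed cost model: sum of bin weights covered by a net bounding box."""
    (x0, y0), (x1, y1) = bbox
    return sum(weight[y][x] for y in range(y0, y1 + 1) for x in range(x0, x1 + 1))


def best_bin(start, window, other_pins, weight):
    """Try every bin within `window` of `start`; return (cost, (x, y)) of the cheapest.

    other_pins holds, for each incident net, the bin coordinates of its other pins.
    """
    ny, nx = len(weight), len(weight[0])
    sx, sy = start
    best = None
    for y in range(max(0, sy - window), min(ny, sy + window + 1)):
        for x in range(max(0, sx - window), min(nx, sx + window + 1)):
            cost = 0.0
            for pins in other_pins:
                xs = [x] + [p[0] for p in pins]
                ys = [y] + [p[1] for p in pins]
                cost += weighted_wl(((min(xs), min(ys)), (max(xs), max(ys))), weight)
            if best is None or cost < best[0]:
                best = (cost, (x, y))
    return best


# 3 x 2 grid of bin weights; the middle bin of row 0 is congested.
weight = [[0.5, 2.0, 0.5],
          [0.5, 0.5, 0.5]]
# The cell has two incident nets, with other pins at (0, 0) and (2, 0).
result = best_bin(start=(1, 0), window=1, other_pins=[[(0, 0)], [(2, 0)]], weight=weight)
```

Starting from the HPWL-optimal bin (1, 0), which sits in the congested bin and costs 5.0 under this model, the search moves the cell to a neighboring bin with cost 3.5.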

White Space Allocation -- Slicing Tree Construction

Recursively bipartition chip region from top to bottom.

Group cells into children nodes according to location relative to cutline.

Estimate congestion on leaf nodes. Congestion on other nodes can be computed from bottom to top.

Node attributes shown in the figure:
• Cut direction
• Cut location
• Node area
• Congestion

[Figure: chip region divided into leaf regions A to H, and the corresponding slicing tree with a root and leaves A B C D E F G H.]

White Space Allocation – Cutline Adjustment

Adjust cut location from top to bottom such that white spaces for children nodes are proportional to their congestion.

[Figure: slicing tree over leaf regions A to H, each node labeled cell area/congestion. Root: 240/88; children: 116/28 (left) and 124/60 (right).]

Assuming chip area of root = 300

Total WS area = 300 – 240 = 60

WS area for left child = 60*28/(28+60) = 19.1

WS area for right child = 60*60/(28+60) = 40.9

Chip area for left child = 116+19.1 = 135.1

Chip area for right child = 124+40.9 = 164.9
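The arithmetic on this slide can be reproduced with a few lines of code. This is a minimal sketch of a single cutline adjustment, with a hypothetical function name; each child is given as a (cell area, congestion) pair.

```python
def allocate_white_space(chip_area, left, right):
    """Split a node's white space between its children in proportion to congestion.

    left and right are (cell_area, congestion) pairs; returns the chip area
    assigned to the left and right child.
    """
    (left_cells, left_cong), (right_cells, right_cong) = left, right
    white_space = chip_area - (left_cells + right_cells)
    left_ws = white_space * left_cong / (left_cong + right_cong)
    right_ws = white_space - left_ws
    return left_cells + left_ws, right_cells + right_ws


left_area, right_area = allocate_white_space(300, (116, 28), (124, 60))
print(round(left_area, 1), round(right_area, 1))  # prints 135.1 164.9
```

Applying the same function to each child with its newly assigned chip area continues the adjustment from top to bottom.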

Adjust cut location from top to bottom such that white spaces for children nodes are proportional to their congestion.

[Figure: the same slicing tree over leaf regions A to H with one more level labeled cell area/congestion. Root: 240/88; level 1: 116/28 and 124/60; level 2: 54/9, 62/19, 58/34, and 66/26.]


Experiment Setup

16 IBM version 2 examples
• 5% to 15% white space

Three state-of-the-art routability-driven placers

Dragon-fd 3.01 [Yang et al, TCAD03]
• Simulated annealing with bin swapping
• Two-step white space allocation

Capo 10.0 [Roy et al, ISPD06]
• Fast Steiner tree approximation
• Congestion based cutline shifting

Fengshui 5.1 [Agnihotri et al, ISPD05]
• Recursive bi-section
• Similar white space allocation method incorporated

Magma router for evaluation

Routability-Driven Placement Tools Comparison

[Figure: two bar charts comparing Dragon 3.01, Capo 10.0, Fengshui 5.1, and mPL-R+WSA. Left: route WL (axis 0.900 to 1.150). Right: #violation (axis -200 to 1600).]

mPL-R+WSA is the only flow that routes every example successfully

mPL-R+WSA produces the shortest wirelength

Routability Optimization Techniques Comparison

mPL
• Latest pure WL-driven version
• No consideration of routing congestion

mPL-R

mPL-I
• Cell inflation + dummy density assignment
• Highest quality in ISPD06 contest [Nam ISPD06]
• Density target set as utilization

mPL+WSA

mPL-R+WSA

Routability Optimization Techniques Comparison

[Figure: two bar charts over mPL, mPL-R, mPL-I, mPL+WSA, and mPL-R+WSA. Left: route WL (axis 0.96 to 1.04). Right: #success (axis 0 to 18).]

mPL-I with heuristic penalty term does not perform very well

Both mPL-R and WSA improve routability significantly

Combined workflow gives the highest completion rate



Chapter 3. Routability driven multilevel global placement and white space allocation

Chapter 5. Applications of mixed-size placement legalization
• Enhancement for macro legalization algorithm
• Additional experiment results

Enhancement for Macro Legalization

Constraint graph reduction

Original constraint graph
• One edge for each pair of macros
• O(n^2) in total

Reduced constraint graph
• Edge inserted only when the constraint is not already implied by transitive closure
• Significant reduction of memory consumption

[Figure: macros A, B, and C with one constraint edge marked "?".]
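The effect of the reduction can be illustrated with a small transitive-reduction routine. This is a sketch under assumptions, not the legalizer's code: it takes the full edge list of an acyclic constraint graph and drops every edge that another path already implies, whereas the slide describes never inserting such edges in the first place (which is where the memory saving comes from). The function name is hypothetical.

```python
def reduce_constraints(n, edges):
    """Return the edges of a DAG on nodes 0..n-1 that are not implied by a longer path."""
    succ = {u: set() for u in range(n)}
    for u, v in edges:
        succ[u].add(v)

    def reachable_without(src, dst, skip):
        # Depth-first search from src to dst that ignores the direct edge `skip`.
        stack, seen = [src], {src}
        while stack:
            node = stack.pop()
            for nxt in succ[node]:
                if (node, nxt) == skip or nxt in seen:
                    continue
                if nxt == dst:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return [(u, v) for u, v in edges if not reachable_without(u, v, (u, v))]


# Macros A=0, B=1, C=2: A before B, B before C, and the implied A before C.
reduced = reduce_constraints(3, [(0, 1), (1, 2), (0, 2)])
```

The edge from A to C is dropped because A before B and B before C already enforce it, so the solution space is unchanged while the graph shrinks.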

Experiment Result with ICCAD04-MS

circuit   #edge (old)   #edge (new)   WL (old)    WL (new)
ibm01          30184          6600    2.16E+06    2.15E+06
ibm02          36631          7459    4.78E+06    4.64E+06
ibm03          41947          8414    6.47E+06    6.68E+06
ibm04          43400          7819    7.15E+06    7.23E+06
ibm05                                 9.32E+06    9.32E+06
ibm06          15799          4141    5.67E+06    5.66E+06
ibm07          42249          8100    9.80E+06    9.64E+06
ibm08          45214          8077    1.13E+07    1.12E+07
ibm09          31940          6330    1.21E+07    1.21E+07
ibm10         308569         29383    2.94E+07    2.90E+07
ibm11          69439         10256    1.76E+07    1.76E+07
ibm12         211641         23926    3.14E+07    3.13E+07
ibm13          89730         12769    2.27E+07    2.24E+07
ibm14         188299         20803    3.58E+07    3.57E+07
ibm15          77081         11549    4.60E+07    4.61E+07
ibm16         104719         14981    5.38E+07    5.35E+07
ibm17         288492         30100    6.34E+07    6.35E+07
ibm18          40517          7346    4.29E+07    4.29E+07
Avg                1          0.16        1.00        1.00

84% reduction of constraint edges

No degradation of solution quality

Enhancement for Macro Legalization

[Equation: optimization problem of the form "min ... s.t. ..." over macro positions x, y. The recoverable symbols are displacement terms dx, dy with weights wx, wy, terms fx and fy with coefficients H_ij and V_ij, macro widths w and heights h, and constraints ranging over edges e_ij of the constraint graphs G_h and G_v.]

Used in ISPD 2006 placement contest

ISPD05 Examples

circuit      #cell   #macro   #pad      #net   #row   utilization
adaptec1    210967       63    480    221142    890       76%
adaptec2    254616      159    407    266009   1170       79%
adaptec3    451746      723      0    466758   1944       75%
adaptec4    496141     1329      0    515951   1944       63%
bigblue1    277636       32    528    284479    890       55%
bigblue2    557962    23084      0    577235   1566       62%
bigblue3   1096908     3778      0   1123170   2316       86%
bigblue4   2177449     8170      0   2229886   2694       66%

Bigger problem size

Suitable to test scalability

Scalability Comparison on ISPD05 -- Global Placements by APlace

circuit    APlace FWL   APlace RT(s)   XDP FWL    XDP RT(s)
Adaptec1   7.83E+07         1846       7.91E+07       417
Adaptec2   9.57E+07         3616       9.66E+07       406
Adaptec3   2.19E+08        10142       2.22E+08       708
Adaptec4   2.09E+08        13115       2.13E+08       955
Bigblue1   1.00E+08         2767       1.05E+08       506
Bigblue2   1.54E+08        13848       1.56E+08      1846
Bigblue3   4.12E+08        27186       3.84E+08      3133
Bigblue4   8.71E+08       103002       8.84E+08      5732
Avg.       1.00             1.00       1.01          0.12

XDP produces 1% longer WL, but is 10X faster

Scalability Comparison on ISPD05 -- Global Placements by mPL

circuit    APlace FWL   APlace RT(s)   XDP FWL    XDP RT(s)
Adaptec1   7.84E+07         1665       7.85E+07       430
Adaptec2   9.28E+07         2906       9.28E+07       518
Adaptec3   2.15E+08         8050       2.17E+08       944
Adaptec4   1.94E+08         9784       1.97E+08       836
Bigblue1   9.86E+07         2031       9.74E+07       483
Bigblue2   1.53E+08        11713       1.54E+08      2200
Bigblue3   3.50E+08        31414       3.49E+08      2371
Bigblue4   8.50E+08        60249       8.37E+08      5363
Avg.       1.00             1.00       1.00          0.15

XDP can be 10x faster with comparable quality

Impact of Gradual Macro Legalization – ISPD05

circuit    fixed WL   fixed runtime(s)   movable WL   movable runtime(s)
adaptec1   7.79E+07         2895         7.22E+07           3279
adaptec2   9.20E+07         3002         8.81E+07           3500
adaptec3   2.14E+08         9410         1.63E+08          12492
adaptec4   1.94E+08         8844         1.60E+08          12674
bigblue1   9.68E+07         3637         9.91E+07           3704
bigblue2   1.52E+08        10326         1.12E+08          15475
bigblue3   3.44E+08        13565         3.14E+08          19948
bigblue4   8.29E+08        30664         7.60E+08          35766
Avg.       1.00             1.00         0.88               1.28

12% WL reduction possible with macros movable



Chapter 3. Routability driven multilevel global placement and white space allocation

Chapter 7. Heterogeneous placement for FPGAs
• Motivation and previous works
• Multilevel heterogeneous placement – mPL-H
• Experiment results
• Conclusions and future work

Motivation

Popularity of FPGAs
• Ease of use
• Low cost for small to medium production

Modern FPGA placement imposes heterogeneous constraints
• Memory blocks of different capacity, DSP blocks
• Each block should only be placed on sites of the same type

Example FPGA Chip

Figure taken from Altera Stratix Handbook

Previous Works -- Academia

Simulated annealing
• VPR [Betz & Rose, FPL97, Marquardt et al, FPGA00]
• PATH [Kong, ICCAD02]
• SPCD [Chen & Cong, FPL04, FPGA05]

Partitioning
• PPFF [Maidee et al, DAC03]

Graph embedding
• CAPRI [Gopalakrishnan et al, DAC06]

Multilevel
• Ultrafast-VPR [Sankar & Rose, FPGA99]
• mPG-ms [Cong & Yuan, ASPDAC03]

None of them handles heterogeneous constraints

Previous Works -- Industry

Quartus II by Altera Corporation
• Stratix, Stratix II, etc.

ISE by Xilinx Corporation
• Virtex II, Virtex II Pro, etc.

Do have heterogeneous capability
• Only for proprietary chip architecture
• Algorithms and techniques not publicly documented

Multilevel Heterogeneous Placement – mPL-H

Based on multilevel generalized force directed placement

Multi-layered placement to handle heterogeneous placement

Filler cells to enhance quality and stability

Gradual carry chain legalization

Limitations of mPL for Heterogeneous Placement

Does not consider heterogeneous constraints
• Any block can be placed anywhere

Requires density to be uniform everywhere
• Penalize wirelength for low utilization

mPL-H -- Global Placement (I)

Multiple layers, one layer per resource type
• DSP layer
• M-RAM layer
• LAB layer
• M4K layer
• M512 layer

Forbidden regions blocked by obstacles

Uniform wirelength computation

[Figure: stacked placement layers labeled DSP, M-RAM, and LAB.]

mPL-H -- Global Placement (II)

Filler cell
• Occupy the residual capacity
• Transform inequality into equality

$$D_{ij} \le C_{ij} \quad\longrightarrow\quad D_{ij} + d_{ij} = C_{ij}$$

Density computed independently on each layer

Granularity may not be fine enough
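One way to picture the filler cells is as the number of dummy cells needed to use up the residual capacity of each layer, so that the density constraint can be written as an equality. The sketch below is illustrative only: the per-layer totals, the unit filler area, and the function name are assumptions, not values or code from mPL-H.

```python
import math


def filler_cells_per_layer(capacity, demand, filler_area=1.0):
    """Number of filler cells that occupy each layer's residual capacity.

    capacity and demand map a layer name to its total site area and the
    total area of real blocks on that layer (hypothetical units).
    """
    fillers = {}
    for layer, cap in capacity.items():
        residual = cap - demand.get(layer, 0.0)
        if residual < 0:
            raise ValueError(f"layer {layer} is over capacity")
        fillers[layer] = math.floor(residual / filler_area)
    return fillers


fillers = filler_cells_per_layer(
    capacity={"LAB": 100.0, "DSP": 8.0},
    demand={"LAB": 76.0, "DSP": 8.0},
)
```

Because each layer is handled on its own, a sparsely used layer receives many fillers while a full layer receives none. When the filler size does not divide the residual evenly, the leftover is lost to rounding, which is one way the granularity concern can show up.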

mPL-H -- Legalization (I)

DSP and memory blocks

Domains do not overlap
• Legalized independently

Uniform size for the same type
• Linear assignment, O(n^3)
• Cost as distance

[Figure: matching between cells and sites.]
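The assignment of same-type blocks to sites with distance as cost can be illustrated on a tiny instance. Note that this sketch deliberately swaps in exhaustive search over permutations, which is only practical for a handful of blocks, in place of an O(n^3) linear assignment solver; Manhattan distance and the function name are assumptions.

```python
from itertools import permutations


def assign_blocks(cells, sites):
    """Match each cell to a distinct site, minimizing total Manhattan distance.

    Brute force over all injective mappings; a stand-in for a polynomial-time
    linear assignment algorithm on realistic sizes.
    """
    def dist(c, s):
        return abs(c[0] - s[0]) + abs(c[1] - s[1])

    best_cost, best_perm = None, None
    for perm in permutations(range(len(sites)), len(cells)):
        cost = sum(dist(c, sites[j]) for c, j in zip(cells, perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, list(best_perm)


# Two blocks from global placement and three legal sites of the same type.
cost, match = assign_blocks([(1, 1), (2, 1)], [(0, 0), (2, 0), (4, 0)])
```

Here `match[k]` is the index of the site given to block k. Since the domains of different block types do not overlap, each type can be solved as a separate instance of this problem.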

mPL-H -- Legalization (II)

Carry chains
• Vary in length
• Legalized in descending order of length
• Partition each column into segments of the same size
• Assign chains of the same length using linear assignment

mPL-H -- Legalization (III)

Column-wise rearrangement of carry chains

P(n,m) is the minimum perturbation of assigning (v_1,…,v_n) to sites (s_1,s_2,…,s_m)

P(1,j) = d(1,j), where d(1,j) is the perturbation of assigning v_1 to site s_j

P(i,j) = min{P(i-1,j-h_i), P(i,j-1)}

Can be solved more efficiently for some special cases
• Quadratic distance
• No site constraint
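The recurrence can be turned into a small dynamic program. The sketch below makes assumptions that the slide leaves implicit: h_i is taken to be the number of sites chain v_i occupies, d(i, j) is the cost of placing chain i so that it ends at site j, that cost is added to the first branch of the recurrence, and the base case is an empty prefix of chains with zero cost. The function name and the example cost function are hypothetical.

```python
import math


def min_perturbation(heights, m, d):
    """Minimum total perturbation of placing chains 1..n, in order, on sites 1..m.

    heights[i-1] is h_i; d(i, j) is the assumed cost of chain i ending at site j.
    """
    n = len(heights)
    # P[i][j]: best cost for the first i chains using only sites 1..j.
    P = [[0.0] * (m + 1)] + [[math.inf] * (m + 1) for _ in range(n)]
    for i in range(1, n + 1):
        h = heights[i - 1]
        for j in range(1, m + 1):
            skip = P[i][j - 1]  # site j is not the end of chain i
            place = P[i - 1][j - h] + d(i, j) if j >= h else math.inf
            P[i][j] = min(place, skip)
    return P[n][m]


# Two chains of heights 2 and 1 that both want to end at site 3 of a 5-site column.
targets = [3, 3]
cost = min_perturbation([2, 1], 5, lambda i, j: abs(j - targets[i - 1]))
```

The two chains cannot both end at site 3, so one of them is displaced by a single site, giving a total perturbation of 1. The table has n x m entries, each filled in constant time.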

Experiment Setting

[Figure: experimental flow connecting the Verilog netlist, Quartus_map, the clustered .vqm netlist, Quartus_fitter and mPL-H, the .qsf placement, and Quartus_router; the chip type and an XML architecture description are additional inputs.]

QUIP Suite

circuit                               LUT     LAB   I/O   Mem bits   DSP
fip_risc8                            1791     219   113        384     0
mux64_16bit                          1188     150    87          0     0
mux8_128bit                          1155     141   140          0     0
oc_cordic_p2r                        1016     111    82          0     0
oc_cordic_r2p                        1424     157    74          0     0
oc_aes_core                          1474     181   388      32768     0
oc_aes_core_inv                      1614     183   389      34176     0
oc_aquarius                          5530     646    35     131072     8
oc_cfft_1024x12                      1601     191    68      24576     0
oc_des_des3area                      1072     120   304          0     0
oc_des_des3perf                     13791    1569   298       2744     0
oc_des_perf_opt                      4566     550   185          0     0
oc_fpu                               6693     793   110          0     8
oc_mem_ctrl                          3222     387   267          0     0
oc_mips                              3294     387   201       1152     0
oc_oc8051                            2742     331   164       4608     0
oc_video_compression_systems_dct    36410    4440    25      18688     0
oc_video_compression_systems_jpeg   32178    3924    47      16640     0
oc_wb_dma                            3087     396   444          0     0
os_blowfish                          1445     168   585      67168     0

Wirelength Comparison

                                            Quartus 5.0          mPL-H
circuit                             #LAB    PWL       RWL        PWL      ratio   RWL      ratio
fip_risc8                            219    6322      15828      5872     0.93    15304    0.97
mux64_16bit                          150    4560      9464       4582     1.00    9728     1.03
mux8_128bit                          141    3608      7328       3541     0.98    6556     0.89
oc_cordic_p2r                        111    2981.5    6852       2786     0.93    6264     0.91
oc_cordic_r2p                        157    4239      8848       3889     0.92    8260     0.93
oc_aes_core                          181    16362     27100      15537    0.95    25944    0.96
oc_aes_core_inv                      183    18639     30288      17962    0.96    30680    1.01
oc_aquarius                          646    32683.5   78616      31703    0.97    78280    1.00
oc_cfft_1024x12                      191    5914.5    12256      5757     0.97    11988    0.98
oc_des_des3area                      120    8273      15588      8139     0.98    15472    0.99
oc_des_des3perf                     1569    62982     116312     59611    0.95    118028   1.01
oc_des_perf_opt                      550    17443     31548      17096    0.98    32476    1.03
oc_fpu                               793    26570.1   64324      24583    0.93    63628    0.99
oc_mem_ctrl                          387    16501.5   38428      16808    1.02    39132    1.02
oc_mips                              387    18550     43916      18639    1.00    43260    0.99
oc_oc8051                            331    13640     31172      13332    0.98    30520    0.98
oc_video_compression_systems_dct    4440    165835    423292     172902   1.04    436708   1.03
oc_video_compression_systems_jpeg   3924    139076    370064     142234   1.02    368528   1.00
oc_wb_dma                            396    25614     57128      24999    0.98    57704    1.01
os_blowfish                          168    27023     43832      22569    0.84    39804    0.91
oc_ethernet                          265    12013     24812      11935    0.99    25128    1.01
Avg.                                                                      0.97             0.98

mPL-H is 3% better in HPWL, and 2% better in routed WL than Quartus II v5.0

Runtime Comparison

[Figure: runtime(s) (axis 0 to 1800) versus #LAB (111, 141, 157, 181, 191, 265, 387, 396, 646, 1569, 4440) for Quartus II 5.0 and mPL-H.]

mPL-H can be 2X faster than Quartus II v5.0 when the circuit becomes sufficiently large

Optimality Study of mPL-H

PEKO-H construction
• Populate all sites with corresponding resource type
• Generate each net with optimal wirelength
• Extract the netlist in the end

Experiment Results with PEKO-H

circuit     LAB    M4K   M512   DSP   M-RAM   OWL     PWL     runtime(s)
PEKO-H01     974    58     90     2       1    3151    3877      109
PEKO-H02     849    43     75     4       1    1480    2302      206
PEKO-H03     926    50     81     5       1    2042    3122      301
PEKO-H04    1618    66    180     8       2    3895    6711     1227
PEKO-H05    1690    72    186    10       2    4807    7085      363
PEKO-H06     988    58     90     6       1   12099   14699      219
PEKO-H07     957    53     85     6       1    2698    3454      101
PEKO-H08     943    46     87     5       1    2779    3551      116
PEKO-H09    1754    80    188    10       2   36593   42456      578
PEKO-H10     988    58     90     6       1   11288   13126      163
PEKO-H11     988    58     90     6       1   12410   14850      204
PEKO-H12     988    58     90     6       1    7560    8745      184
PEKO-H13     988    58     90     6       1    7894    8992      141
PEKO-H14     986    56     90     6       1    6521    7836      162
PEKO-H15    5550   288    564    18       6   49120   85280     1788
PEKO-H16    5549   288    564    18       6   42123   66923     1656
PEKO-H17    1730    78    188    10       2    7329    8598      380
PEKO-H18    2241   109    210     5       2    4698    6788     1202
PEKO-H19     986    56     90     6       1    5356    6029      154
Avg.                                           1.00    1.34

mPL-H produces HPWL 34% longer than the optimum

Displacement of PEKO-H13

Displacement of PEKO-H16

Swirls are difficult for local refinement to recover

Conclusions

First analytical work for heterogeneous placement

Compared to leading edge Quartus II v5.0 for Stratix
• 3% shorter HPWL, 2% shorter routed WL
• Can be 2X faster when example becomes sufficiently large

Optimality study with PEKO-H
• Displacement observed from the optima
• 34% longer HPWL than the optima

Future Work

Accurate timing analysis
• Only point-to-point delay table released
  • OK for overlap-free intermediate results
  • Not accurate enough for analytical placer
• Guide timing-driven placement

Routing congestion
• Proprietary routing resource information not publicly available

The End

Thank You!
