xPilot: A Platform-Based Behavioral Synthesis System
Prof. Jason Cong
Students: Deming Chen, Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang
August 2005
Supported by NSF, GSRC, Altera, Xilinx.
2
Outline
• Motivation
• xPilot system framework
• Overview of the synthesis engine
  • Scheduling
  • Resource binding
• Experimental results
3
Motivation (1)
• Design complexity is outgrowing the traditional RTL method
  • It is feasible to build an SoC device with 500M transistors; billion-transistor chips are on the horizon
• Behavioral synthesis: a critical technology for enabling the move to a higher level of abstraction
  • Reasons for previous failures
    • Lack of a compelling reason: design complexity was still manageable a decade ago
    • Lack of a solid RTL foundation
    • Lack of consideration of physical reality
4
Motivation (2)
Behavioral synthesis provides combined advantages
• Better complexity management
  • Code size: RTL design ~300KL vs. behavioral design 40KL [NEC, ASPDAC04]
• Shorter verification/simulation cycle
  • Simulation speed 100X faster than the RTL-based method
• Rapid system exploration
  • Quick evaluation of different hardware/software boundaries
  • Fast exploration of multiple micro-architecture alternatives
• Higher quality of results
  • Full consideration of physical reality
5
xPilot: Platform-Based Behavioral to RTL Synthesis Flow
[Flow diagram]
Inputs
• Behavioral spec. in C/SystemC
• Platform description
• Frontend compiler (appears twice in the diagram)
SSDM
• Presynthesis optimizations
  • Loop unrolling/shifting
  • Strength reduction / tree height reduction
  • Bitwidth analysis
  • Memory analysis
  • …
• Core synthesis optimizations
  • Scheduling
  • Resource binding, e.g., functional unit binding, register/port binding
• Arch-generation & RTL/constraints generation
  • Verilog/VHDL/SystemC
  • FPGAs: Altera, Xilinx
  • ASICs: Magma, Synopsys, …
Outputs
• RTL
• FPGAs/ASICs
6
System-level Synthesis Data Model
SSDM (System-level Synthesis Data Model)
• Hierarchical netlist of concurrent processes and communication channels
• Each leaf process contains a sequential program, represented by an extended LLVM IR with hardware-specific semantics
  • Port / IO interfaces, bit-vector manipulations, cycle-level notations
7
Platform Modeling & Characterization
Target platform specification
• High-level resource library with delay/latency/area/power curves for various input/bitwidth configurations
  • Functional units: adders, ALUs, multipliers, comparators, etc.
  • Connectors: mux, demux, etc.
  • Memories: registers, synchronous memories, etc.
• Chip layout description
  • On-chip resource distributions
  • On-chip interconnect delay/power estimation
8
Scheduling: Goals
A highly versatile scheduling engine
• Applicable to a wide range of application domains
  • Computation-intensive, data/memory-intensive, control-intensive, etc.
  • Mixed behavioral & RTL
• Amenable to a rich set of scheduling constraints
  • Data dependency constraints
  • Resource constraints: IO port constraints, memory port constraints, functional unit constraints, etc.
  • Timing constraints: frequency constraints, latency constraints, etc.
  • Relative IO timing constraints: cycle-fixed mode, superstate-fixed mode, free-floating mode, etc.
• Retargetable to a variety of design objectives
  • High performance, small area, low power, etc.
9
Scheduling: Optimization Capabilities
Offers a variety of optimization techniques in a unified framework
• Combinational / sequential non-pipelined / pipelined multi-cycle operations
• Unconditional/conditional operation chaining
• Relative scheduling
• Consideration of branching probabilities and repetitions
• Multi-cycle communication (under development)
• Code motion & speculation (under development)
• Functional / loop pipelining (under development)
• Physical layout integration (to be supported)
10
Scheduling: Current Status
Design objective
• Focus on high-performance designs
Overall approach
• Use a system of pairwise difference constraints to express all kinds of scheduling constraints
• Represent the design objective as a linear function
• The system is immediately solvable via any linear programming solver, with integral solutions
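Read together with the data-dependency constraint stated later in the deck, the overall approach might be written as the linear program below. This is only a sketch of the general shape: the objective weights $c_v$ and the integer bounds $d_{uv}$ are notation introduced here for illustration and do not come from the slides.

```latex
\begin{aligned}
\min_{s} \quad & \sum_{v \in V} c_v \, s(v) \\
\text{subject to} \quad & s(u) - s(v) \le d_{uv} \quad \text{for every constrained pair } (u, v)
\end{aligned}
```

Each row of the constraint matrix then holds exactly one $+1$ and one $-1$, which is why, for integer $d_{uv}$, an LP solver can return an integral schedule without a separate integer-programming step.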
11
Scheduling: Design Framework
[Flow diagram: xPilot scheduler]
Inputs
• CDFG
• User-specified design constraints & assignments
• Target platform modeling (resource library & chip layout)
xPilot scheduler
• Constraint equations generation
• Objective function generation
• System of pairwise difference constraints
  • Relative timing constraints
  • Dependency constraints
  • Frequency constraints
  • Resource constraints
  • …
• Linear programming solver
• LP solution interpretation
Output
• STG (State Transition Graph)
12
Example: Greatest Common Divisor
GCD C description

  x = inport1;
  y = inport2;
  while (x != y) {
    if (x > y) x = x - y;
    else y = y - x;
  }
  *outport = x;

Basic blocks of the corresponding control flow graph (branch edges in the figure are labeled T)

  BB1: x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0);
  BB2: x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2); cond2 = (x_1 > y_1);
  BB3: x_2 = x_1 - y_1; cond3 = (x_2 != y_1);
  BB4: y_2 = y_1 - x_1; cond4 = (x_1 != y_2);
  BB5: x_3 = φ(x_0, x_1, x_2); *outport = x_3;
13
Constraints Generation
Data dependency constraint
• Operation v is data dependent on operation u, i.e., $(u, v) \in E$:
  $s(v) - s(u) \ge 0$
  where the schedule variable $s(v)$ represents the relative schedule of node v
• Other constraints can be represented in a similar way …
The constraint equations form a system of pairwise difference constraints
• The constraint matrix A is totally unimodular
• Feasibility check can be formulated as a single-source shortest path problem
• Optimizations can be performed via any LP solver; the dual problem is equivalent to a min-cost network flow problem
Example from the GCD design:
  u: x_1 = φ(x_0, x_1, x_2);
  v: cond2 = (x_1 > y_1);
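The shortest-path view of the feasibility check can be sketched in a few lines of Python. This is a generic Bellman-Ford treatment of difference constraints, not xPilot's implementation; the function name, the tuple encoding `(a, b, c)` for `s(a) - s(b) <= c`, and the extra operation `w` are assumptions made for illustration.

```python
def solve_difference_constraints(variables, constraints):
    """Return a feasible schedule for constraints s[a] - s[b] <= c, or None.

    Each constraint (a, b, c) is treated as an edge b -> a with weight c.
    A virtual source with zero-weight edges to every variable is modeled
    by initializing all distances to 0 (hypothetical encoding).
    """
    dist = {v: 0 for v in variables}
    for _ in range(len(variables)):
        changed = False
        for a, b, c in constraints:
            if dist[b] + c < dist[a]:
                dist[a] = dist[b] + c
                changed = True
        if not changed:
            # Shift so the earliest operation sits at step 0.
            lowest = min(dist.values())
            return {v: d - lowest for v, d in dist.items()}
    # Still relaxing after |V| passes: a negative cycle, so infeasible.
    return None


# Dependency (u, v): s(v) - s(u) >= 0, rewritten as s(u) - s(v) <= 0.
# The second constraint is made up: w must start at least one step after v.
feasible = solve_difference_constraints(
    ["u", "v", "w"],
    [("u", "v", 0), ("v", "w", -1)],
)

# Adding "w must precede u by one step" closes a negative cycle.
infeasible = solve_difference_constraints(
    ["u", "v", "w"],
    [("u", "v", 0), ("v", "w", -1), ("w", "u", -1)],
)
```

The returned schedule is just one feasible point; choosing among feasible schedules is where the linear objective and the LP solver come in.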
14
Solution by LP Solver

  BB1: x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0);
  BB2: x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2); cond2 = (x_1 > y_1);
  BB3: x_2 = x_1 - y_1; cond3 = (x_2 != y_1);
  BB4: y_2 = y_1 - x_1; cond4 = (x_1 != y_2);
  BB5: x_3 = φ(x_0, x_1, x_2); *outport = x_3;

Control-step labels in the figure: 0 and 1
Scheduling is performed across basic block boundaries
15
Schedule Interpretation
Operations per control step, as scheduled:

  Step 0:
    x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0);
  Step 1:
    x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2); cond2 = (x_1 > y_1);
    x_2 = x_1 - y_1; cond3 = (x_2 != y_1);
    y_2 = y_1 - x_1; cond4 = (x_1 != y_2);
    x_3 = φ(x_0, x_1, x_2); *outport = x_3;

After interpretation, with control conditions restored:

  Step 0:
    x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0);
  Step 1:
    if (cond1) {
      x_1 = φ(x_0, x_1, x_2);
      y_1 = φ(y_0, y_1, y_2);
      cond2 = (x_1 > y_1);
      if (cond2) { x_2 = x_1 - y_1; cond3 = (x_2 != y_1); }
      else { y_2 = y_1 - x_1; cond4 = (x_1 != y_2); }
    }
    if (!cond1 || !cond3 && !cond4) {
      x_3 = φ(x_0, x_1, x_2);
      *outport = x_3;
    }
16
Deriving State Transition Graph
Final STG for GCD

  First state:
    x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0);
  Second state:
    if (cond1) {
      x_1 = φ(x_0, x_1, x_2);
      y_1 = φ(y_0, y_1, y_2);
      cond2 = (x_1 > y_1);
      if (cond2) { x_2 = x_1 - y_1; cond3 = (x_2 != y_1); }
      else { y_2 = y_1 - x_1; cond4 = (x_1 != y_2); }
    }
    if (!cond1 || !cond3 && !cond4) {
      x_3 = φ(x_0, x_1, x_2);
      *outport = x_3;
    }
  Transition label in the figure: cond3 || cond4
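To see how the two-state STG behaves cycle by cycle, here is a small behavioral model in Python. It is a sketch written for this walkthrough, not output of xPilot: the function name `gcd_stg`, the cycle counter, and the reading of the `cond3 || cond4` label as a self-loop on the second state are assumptions.

```python
def gcd_stg(inport1, inport2):
    """Simulate the two-state GCD STG; returns (output, cycles).

    Inputs are assumed to be positive integers; otherwise the
    subtraction loop does not terminate.
    """
    # First state: read the inputs and evaluate cond1 once.
    x, y = inport1, inport2
    cond1 = x != y
    cycles = 1

    # Second state: repeated while cond3 || cond4 holds.
    while True:
        cycles += 1
        cond3 = cond4 = False
        if cond1:
            if x > y:                # cond2
                x = x - y
                cond3 = x != y
            else:
                y = y - x
                cond4 = x != y
        if not cond1 or (not cond3 and not cond4):
            return x, cycles         # *outport = x_3


result, cycles = gcd_stg(12, 18)
```

For the inputs 12 and 18 the model spends one cycle in the first state and two in the second, so one loop iteration of the C description maps to one visit of the second state.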
17
Unified Resource Binding
Provides a unified resource sharing framework to optimize for various design objectives
• Simultaneous functional unit binding, register binding, and port binding
• Equipped with advanced techniques to optimize the interconnect and steering logic networks
• Guided by a flexible cost evaluation engine to achieve different objectives, e.g., performance, area, power, etc.
• Extendable to exploit physical layout information
18
An FU/Register Binding Example
[Figure: two binding alternatives, (a) and (b), each shown for Case 1 and Case 2, with registers R1-R5, functional units F1 and F2, and MUXes]
Observations:
• Binding has a large impact on the resulting performance and cost
• Functional unit and register binding are highly correlated
Note: Assume all operations and variables are compatible for sharing
19
Drawbacks of Previous Work
• Many existing algorithms focus on functional-unit or register "number" minimization
  • Technology advances: interconnect effect increasing
    • 51% of the total dynamic power of a microprocessor in 0.13um tech.
    • Up to 80% of the dynamic power in future technologies
  • May generate a larger number of multiplexers and interconnects
  • Unfavorable performance and cost results
• Optimization for unrealistic goals
  • Minimize "number" of FUs, registers, or multiplexers
    • Should have detailed datapath models to guide the optimization
  • No technology-specific consideration
    • Should have platform-specific characterizations
20
Resource Binding in xPilot
[Flow diagram: xPilot architecture exploration]
Inputs
• STG (State Transition Graph)
• User-specified design constraints
• Target platform (resource library & chip layout)
xPilot architecture exploration
• Baseline Register Binding
• Iteration
  • FU Allocation/Binding
  • Register Allocation/Binding
  • Decision: Improved? (Yes / No)
• Datapath model for performance-cost estimation
Output
• STG + Best Datapath Models
21
Design Space Exploration
A State Transition Graph (STG)
[Figure: STG with operations 1*, 2*, 3*, 4*, 5*, 6+ and comparison nodes > and <]
Exploration phases:
• Exploring Node 2:
  • (1) (2): two mul
  • (1, 2): one mul
• Exploring Node 3:
  • (1) (2) (3): three mul
  • (1, 2) (3): two mul
  • (1, 3) (2): two mul
• Exploring Node 4:
  • (1) (2) (3) (4)
  • (1, 2, 4) (3)
  • (1, 2) (3, 4)
  • (1, 2) (3) (4)
  • (1, 3, 4) (2)
  • (1, 3) (2, 4)
  • (1, 3) (2) (4)
  • ….
Compatible Graphs
[Figure: compatible graphs over the operations 1*, 2*, 3*, 4*, 5*, 6+, > and <]
Datapath Model
[Figure: datapath for solution (1, 2, 4) (3), with two MUL units]
Curve for Design Space Pruning
[Figure: power vs. delay curve with points C1, C1', C2, C2' and a pruned region]
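The node-by-node exploration can be mimicked with a short enumeration in Python. This is a sketch only: it assumes, from the groupings listed on the slide, that multiplications 2 and 3 cannot share a unit, and it leaves out the power/delay pruning entirely. The function name and the list-of-lists encoding are invented for the example.

```python
def extend_partitions(partitions, node, incompatible):
    """Add `node` to every partition, either into a compatible
    existing group (shared multiplier) or as a new group."""
    extended = []
    for groups in partitions:
        for i, group in enumerate(groups):
            if all(frozenset((member, node)) not in incompatible
                   for member in group):
                extended.append(groups[:i] + [group + [node]] + groups[i + 1:])
        extended.append(groups + [[node]])
    return extended


# Assumption inferred from the slide's lists: 2 and 3 never share a unit.
incompatible = {frozenset((2, 3))}

partitions = [[[1]]]          # start with multiplication 1 on its own unit
history = {}
for node in (2, 3, 4):
    partitions = extend_partitions(partitions, node, incompatible)
    history[node] = partitions

# Number of multipliers needed by each binding after exploring node 4.
mul_counts = sorted(len(groups) for groups in history[4])
```

Under that assumption the enumeration reproduces the two bindings listed for node 2 and the three listed for node 3. For node 4 it yields ten candidates, which include the seven shown before the slide's ellipsis.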
22
Experimental Results: Benchmark Suite
Benchmark suite
• PR, MCM:
  • DSP kernels: pure additions/subtractions and multiplications
• CACHE:
  • Cache controller: control-intensive design with cycle-accurate I/O operations
• MOTION:
  • Motion compensation algorithm for MPEG-1 decoder: control-intensive with a modest amount of computation
• IDCT:
  • JPEG inverse discrete cosine transform: computation-intensive
• DWT:
  • JPEG2000 discrete wavelet transform: computation-intensive with modest control flow
• EDGELOOP:
  • Extracted from an H.264 decoder: a very complex design that features a mix of computation, control, and memory accesses
23
Experimental Results: Code Size Reduction
24
Experimental Results: Comparison with SPARK on Scheduling
SPARK [UCI/UCSD, 2004], a state-of-the-art academic high-level synthesis tool

                        Synthesis report   Altera Quartus II report
Designs  Tool/Flow      state#   reg#      fmax (MHz)   LE       register   mem     dsp
MOTION   spark          13       18        170.8        666      367        0       4
MOTION   xpilot         24       11        161.2        888      266        0       4
PR       spark          13       36        130.6        508      491        0       32
PR       xpilot         13       40        178.7        1,349    783        0       0
IDCT     spark          176      ~400      72.01        10,847   4,547      0       138
IDCT     xpilot         141      413       105.53       11,481   5,627      0       64
IDCT     xpilot-mem     334      451       162.9        9,351    6,098      1,024   64
CACHE    spark          Memory unsupported
CACHE    xpilot-mem     47       16        161.6        371      265        3072    0
25
Experimental Results: Comparison with SPARK on Binding
On average, xPilot resource binding achieves designs with similar area and 2.48x higher frequency over SPARK

           SPARK                                                  xPilot                                                 Fmax ratio
Designs    LE      COMB    Lonely-Reg  Comb-Reg  DSP  Fmax (MHz)  LE      COMB   Lonely-Reg  Comb-Reg  DSP  Fmax (MHz)  xPilot/SPARK
PR         1108    815     0           293       0    123.53      1349    713    84          552       0    178.7       1.45
WANG       1217    942     0           275       0    118.89      1105    527    62          516       8    166.11      1.40
LEE        1367    1052    0           315       0    119.32      1585    691    207         687       4    166.61      1.40
MCM        2808    2248    0           560       0    74.87       2402    981    73          1348      0    152.56      2.04
DIR        2425    2034    0           391       6    69.38       3489    1752   110         1627      4    146.8       2.12
FEIG       16170   13136   0           3034      6    37.17       10539   2295   240         8004      4    173.49      4.67
Total      25095   20227   0           4868      12   543.16      20469   6959   776         12734     20   984.27      1.81
Ave Ratio  1       1       1           1         1    1           1.17    0.65   n/a*        2.96      n/a* 2.48        2.48
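The per-design entries in the Fmax ratio column follow directly from the two Fmax columns, as the short Python check below illustrates. The dictionary layout and function name are made up for the example; the numbers are copied from the table. The "Ave Ratio" row is left as reported and is not recomputed here.

```python
# (SPARK Fmax, xPilot Fmax) in MHz, copied from the binding comparison table.
fmax_mhz = {
    "PR":   (123.53, 178.7),
    "WANG": (118.89, 166.11),
    "LEE":  (119.32, 166.61),
    "MCM":  (74.87, 152.56),
    "DIR":  (69.38, 146.8),
    "FEIG": (37.17, 173.49),
}


def fmax_ratio(spark, xpilot):
    """xPilot/SPARK frequency ratio, rounded to two decimals as in the table."""
    return round(xpilot / spark, 2)


ratios = {design: fmax_ratio(*pair) for design, pair in fmax_mhz.items()}

# Ratio of the summed frequencies, matching the "Total" row.
total_spark = sum(spark for spark, _ in fmax_mhz.values())
total_xpilot = sum(xpilot for _, xpilot in fmax_mhz.values())
total_ratio = round(total_xpilot / total_spark, 2)
```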
26
Synthesis Results for DWT (JPEG2000)
Settings
• Target platform: Altera Stratix
• RTL synthesis & place-and-route: Altera Quartus II v5.0
• Simulation: Mentor ModelSim SE 6.0
Design alternatives

Target cycle time   State#   fmax (MHz)   Cycle#   Latency (µs)   LE#    DSP#
9ns                 34       123.56       4830     39.1           1777   128
7ns                 36       147.28       5211     35.4           1862   128
5.5ns               51       183.62       6926     37.8           1926   128
27
Experimental Results: ASIC Flow
• Magma RTL to GDSII flow
• Technology library: Cadence Generic Standard Cell Library, 0.18 µm
• Tradeoff study:
  • 1st column: delay constraint enforced in xPilot
  • 2nd column: control step count of xPilot-generated RTL
  • 3rd-5th columns: data reported after mapping by the Magma tool

DIR     State#   Cell count   Area (µm²)   Delay (ps)   Fmax (MHz)   Latency (ps)
5ns     5        17555        1256584      2111         473.71       10555
10ns    3        23077        1332203      2139         467.51       6417
15ns    2        28381        1486487      2181         458.51       4362
20ns    2        27189        1394451      2514         397.77       5028
30ns    1        27797        1401642      2725         366.97       2725
28
Experimental Results: ASIC Flow (cont.)

LEE     State#   Cell count   Area (µm²)   Delay (ps)   Fmax (MHz)   Latency (ps)
5ns     8        8242         509807       2066         484.03       16528
10ns    4        15989        708870       2254         443.66       9016
15ns    2        16698        703381       3423         292.14       6846
20ns    2        15256        656147       4226         236.63       8452
30ns    1        16085        697363       5070         197.24       5070

Motion  State#   Cell count   Area (µm²)   Delay (ps)   Fmax (MHz)   Latency (ps)
10ns    35       16474        909721       2107         474.61       73745
15ns    30       15695        847262       2358         424.09       70740
20ns    28       16463        867898       2498         400.32       69944
30ns    28       15807        852573       2563         390.17       71764
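In the ASIC tables, the Latency column equals State# times Delay, and Fmax is the reciprocal of Delay. The Python sketch below replays that arithmetic on the DIR rows as an illustration; the helper names are invented here, and the values are copied from the DIR table.

```python
# (delay constraint, State#, Delay in ps) for the DIR design.
dir_rows = [
    ("5ns", 5, 2111),
    ("10ns", 3, 2139),
    ("15ns", 2, 2181),
    ("20ns", 2, 2514),
    ("30ns", 1, 2725),
]


def latency_ps(states, delay_ps):
    """Total latency: number of control steps times the clock period."""
    return states * delay_ps


def fmax_mhz(delay_ps):
    """Clock frequency in MHz for a period given in picoseconds."""
    return round(1e6 / delay_ps, 2)


dir_latency = {name: latency_ps(states, delay) for name, states, delay in dir_rows}
dir_fmax = {name: fmax_mhz(delay) for name, _, delay in dir_rows}
```

Seen this way, the tradeoff in the table is easier to read: relaxing the delay constraint from 5ns to 30ns cuts DIR from five control steps to one, and the shorter schedule outweighs the slower clock, so total latency drops from 10555 ps to 2725 ps.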