ESLT: The Next Generation of Design Automation Tools
Agenda:
◦ Goal of ESL tools
◦ History
◦ Motivation for ESLT in these times
◦ The USU-ESLT
◦ On-going research
◦ Conclusions
Goal of ESL tools: to automate the generation of SoC solutions from high-level programming languages (HLPLs such as C/C++/Java)
To reduce design time of digital circuits from months to weeks
◦ Initial VHDL generation should complete in minutes
◦ Functional verification/testing may take a few weeks
Tools
◦ Cones (1988)
◦ HardwareC (Stanford)
◦ Transmogrifier C
◦ SystemC
◦ C2Verilog (1998)
◦ Handel-C
◦ Bach C
◦ SpecC
◦ Trident C (LANL)
◦ SPARK (UCI)
◦ CASH (CMU)
◦ Mitrion-C
◦ Impulse C (2004)
◦ Catapult C (2006, Mentor Graphics)
Challenges
◦ C is a sequential programming language
◦ What does a pointer or dynamic memory allocation mean in hardware?
◦ Recursion
◦ Floating-point arithmetic
◦ How is I/O represented?
◦ How are hardware design parameters introduced?

Solutions
◦ Support only a subset of C
◦ User-specified parallelism
◦ User-specified I/O
◦ Extensive use of macros to guide circuit generation
2 Primary Trends
◦ Panic in the microprocessor industry
  Next-generation chips from Intel, AMD, and Apple are all multi-core with integrated heterogeneous components
  The Hennessy/Patterson guideline is not good enough anymore; renewed rigor in computer architecture research
  Systems on a Chip are way too complicated to explore architecture options at RTL
◦ Emergence of FPGAs as a viable computing entity
  Industry-accepted platforms for architecture prototyping and research
  Extremely complicated to explore VLSI architecture options at RTL
The USU-ESLT
◦ Restrict the ESL tool to a small set of algorithms that need acceleration beyond what microprocessors can provide
◦ Take advantage of user expertise in describing a template for the architecture
◦ Let the tool explore low-level architecture optimizations
◦ Take advantage of gcc optimizations
◦ Ability to integrate 3rd-party IP cores
Our Approach: It’s a workbench
void anneal(int *current)
{
    float temperature = 10000.0f;   /* initial temperature; matches the 1.0e+4 in the GIMPLE dump */
    int   current_val, next_val;
    int   next[MAX_EVENTS];

    current_val = RAND_MAX;
    while (temperature > STOP_THRESHOLD) {
        copy(current, next);
        alter(next);
        next_val = evaluate(next);
        accept(&current_val, next_val, current, next, temperature);
        temperature = adjustTemperature();
    }
}
;; Function anneal (anneal)

anneal (current)
{
  int next[10];
  int next_val;
  int current_val;
  float temperature;
  double D.3292;
  float D.3291;
  int D.3290;

  # BLOCK 0
  # PRED: ENTRY (fallthru)
  temperature = 1.0e+4;
  current_val = 2147483647;
  goto <bb 2> (<L1>);
  # SUCC: 2 (fallthru)

  # BLOCK 1
  # PRED: 2 (true)
<L0>:;
  copy (current, &next);
  alter (&next);
  D.3290 = evaluate (&next);
  next_val = D.3290;
  accept (&current_val, next_val, current, &next, temperature);
  D.3291 = adjustTemperature ();
  temperature = D.3291;
  # SUCC: 2 (fallthru)

  # BLOCK 2
  # PRED: 0 (fallthru) 1 (fallthru)
<L1>:;
  D.3292 = (double) temperature;
  if (D.3292 > 1.00000000000000004792173602385929598312941379845e-4) goto <L0>; else goto <L2>;
  # SUCC: 1 (true) 3 (false)

  # BLOCK 3
  # PRED: 2 (false)
<L2>:;
  return;
  # SUCC: EXIT
}
Problem: Given a circuit specification consisting of a set of components (adders / multipliers / etc.), estimate the FPGA resources (slices / BRAMs / DSP48s) used
Solution: Create a fifth-order equation for each (component, resource type) pair, representing usage as a function of data width
• Done using discrete measurement values and MATLAB curve fitting
• A fifth-order equation is necessary for adequate estimation:
  y = C5·n^5 + C4·n^4 + C3·n^3 + C2·n^2 + C1·n + C0
List Scheduling
Also known as “Critical Path Scheduling”
◦ Assign a static priority to each node in the graph
◦ Schedule the nodes according to priority
◦ Static priorities are assigned by measuring the “distance” from the node in question to a sink node

Given a set of resources, it determines the time needed to complete a set of operations represented as a dependency graph.
List Scheduling Example
Schedule a DFG on one multiplier, one adder, and one divider.
Multiplication and division take two cycles each (non-pipelined); addition takes one.
Schedule A (one adder, one multiplier, one divider; 8 time steps):
  Step 0: a
  Step 1: b
  Step 2: c, e1
  Step 3: d, e2
  Step 4: f1
  Step 5: f2
  Step 6: g1
  Step 7: g2
(e, f, and g are two-cycle operations, shown occupying their unit for two consecutive steps)
Schedule B (a different priority order for the same DFG; 9 time steps):
  Step 0: a
  Step 1: c
  Step 2: b
  Step 3: d, e1
  Step 4: e2
  Step 5: f1
  Step 6: f2
  Step 7: g1
  Step 8: g2
List Scheduling
◦ Heuristic method – does not guarantee an optimal schedule
◦ Computational complexity of only O(Tn), where T is the number of time slots and n is the number of nodes to be scheduled

Improvement methods
◦ Modified Critical Path
◦ Earliest Time First
◦ Dynamic Critical Path
◦ Critical Node Parent Trees
◦ Cone-Based Clustering
◦ Partial Critical Path scheduling

All are O(n^2) to O(n^3) – too costly for use inside a simulated-annealing loop.
Solution – Ripple-List Scheduling

  assign static priority to each node in graph
  initialize time to 0
  Loop while unscheduled nodes exist
      Loop until no nodes can be scheduled on this time step
          update list of ready nodes
          schedule highest-priority node possible
          adjust priority of remaining nodes
      EndLoop
      increment time
  EndLoop
Ripple Factor (Rf)
The degree of a vertex is the number of edges (both incoming and outgoing in the case of a directed graph) incident to it
DG = The largest vertex degree in the entire graph
d = distance between two vertices
  Rf = 1 / (DG)^d
Ripple Factor example with DG = 3: the priorities of nodes one step away get updated by a ripple factor of 1/3^1, those two steps away by 1/3^2, etc.
Priorities are adjusted dynamically, but never jump to another priority band
A maximum ripple distance is applied to cut off updates and save computation (≪ O(n^2))
Balancing Latency across Pipeline Stages through ILP Extraction
Goal: maximize pipelined architecture performance within specified resource constraints
◦ A pipeline can only run as fast as the latency of the slowest stage
◦ An efficient pipeline will balance the latency of each stage as much as possible
◦ Some stages can be redesigned to support additional parallelism; others are fixed
Algorithm for Pipelined Processor DSE

1. Generate the minimal set of ALUs needed for each stage in the pipeline
2. Compute the latency of all stages (generate the architecture)
3. Loop:
   1. Mark the stage with the “worst latency”
   2. Reduce the latency of this stage through exploitation of parallelism (intertwined SA and RLS algorithms, or data-port width extension where applicable) until:
      1. the “worst latency” can be passed to another stage, or
      2. if that is not possible, latency has been reduced as much as possible
4. End loop when the “worst latency” cannot be passed to another stage
Example – generate the minimal architecture for all stages:
◦ Copy:     101 cycles   300 slices
◦ Alter:     21 cycles   390 slices
◦ Evaluate: 233 cycles   317 slices
◦ Accept:    54 cycles  1408 slices

Mark the stage with the “worst latency”.
Example – reduce the latency of this stage through exploitation of parallelism until the “worst latency” can be passed to another stage:
◦ Evaluate: 233 cycles   317 slices
◦ Evaluate:  95 cycles   777 slices (allocation of additional resources)
Example – new numbers for all stages:
◦ Copy:     101 cycles   300 slices
◦ Alter:     21 cycles   390 slices
◦ Evaluate:  95 cycles   777 slices
◦ Accept:    54 cycles  1408 slices

Mark the stage with the “worst latency”.
Example – reduce the latency of this stage through exploitation of parallelism until the “worst latency” can be passed to another stage:
◦ Copy: 101 cycles   300 slices
◦ Copy:  51 cycles   600 slices (widening of memory ports to allow 2-word transfers)
Example – new numbers for all stages:
◦ Copy:      51 cycles   600 slices
◦ Alter:     21 cycles   390 slices
◦ Evaluate:  95 cycles   777 slices
◦ Accept:    54 cycles  1408 slices

After these two steps the slowest stage has dropped from 233 to 95 cycles (≈2.4× pipeline throughput) for 760 additional slices. Repeat the process until FPGA resources are exhausted or no more parallelism can be extracted from the worst-performing stage.
DSE Summary
Stage performance can be improved through:
◦ Allocating additional computational resources (adders, multipliers, etc.) to a stage
◦ Widening memory ports to accelerate block data transfers

Some stages cannot be improved:
◦ If the task does not have any ILP
Performance
Xilinx V4-SX35
The FLEX (flexible processor) can perform either DFG 1 or DFG 2 computations
Designed by taking the union of DFG 1 and DFG 2 data flow graphs
The FLEX processor can switch modes dynamically, depending on computational needs
Branch probabilities from gcov can guide the FLEX design – DFGs executed more frequently should be more optimized
Considerably superior to Partial Dynamic Reconfiguration using Xilinx EAPR
function main called 4 returned 100% blocks executed 100%
        -:    5:{
        -:    5-block  0
call    0 returned 100%
        -:    5-block  1
branch  1 taken 86% (fallthrough)
branch  2 taken 14%
        -:    5-block  2
        -:    5-block  3
        -:    5-block  4
branch  3 taken 86% (fallthrough)
branch  4 taken 14%
        -:    5-block  5
        -:    5-block  6
branch  5 taken 75% (fallthrough)
branch  6 taken 25%
        -:    5-block  7
        -:    5-block  8
branch  7 taken 86% (fallthrough)
branch  8 taken 14%
        -:    5-block  9
        -:    5-block 10
        -:    5-block 11
…
On-going Technology Enhancement (1): FLEX Processor: Code Profiling using gcov
VHDL code can be compared with architecture description
Third-party design-automation software is used for synthesis, placement, debugging, verification, etc.
◦ ChipScope Pro (Xilinx)

Timing closure
◦ Improved metadata
◦ Stringent constraint imposition on DSE