ESLT: The Next Generation of Design Automation Tools
Agenda:
◦ Goal of ESL tools
◦ History
◦ Motivation for ESLT in these times
◦ The USU-ESLT
◦ On-going research
◦ Conclusions
Goal of ESL tools: to automate the generation of SoC solutions from high-level programming languages (HLPLs such as C/C++/Java)
To reduce design time of digital circuits from months to weeks
◦ Initial VHDL generation should complete in minutes
◦ Functional verification/testing may take a few weeks
Tools
◦ Cones (1988)
◦ HardwareC (Stanford)
◦ Transmogrifier C
◦ SystemC
◦ C2Verilog (1998)
◦ Handel-C
◦ Bach C
◦ SpecC
◦ Trident C (LANL)
◦ SPARK (UCI)
◦ CASH (CMU)
◦ Mitrion-C
◦ Impulse C (2004)
◦ Catapult C (2006, Mentor Graphics)
Challenges
◦ C is a sequential programming language
◦ What does a pointer or dynamic memory allocation mean in hardware?
◦ Recursion
◦ Floating-point arithmetic
◦ How is I/O represented?
◦ How are hardware design parameters introduced?

Solutions
◦ Support only a subset of C
◦ User-specified parallelism
◦ User-specified I/O
◦ Extensive use of macros to guide circuit generation
2 Primary Trends
◦ Panic in the microprocessor industry
  Next-generation chips from Intel, AMD, and Apple are all multi-core with integrated heterogeneous components
  The Hennessy/Patterson guideline is not good enough anymore; renewed rigor in computer architecture research
  Systems on a Chip are way too complicated to explore architecture options at RTL
◦ Emergence of FPGAs as a viable computing entity
  Industry-accepted platforms for architecture prototyping and research
  Extremely complicated to explore VLSI architecture options at RTL
The USU-ESLT
◦ Restrict the ESL tool to a small set of algorithms that need acceleration beyond what microprocessors can provide
◦ Take advantage of user expertise in describing a template for the architecture
◦ Let the tool explore low-level architecture optimizations
◦ Take advantage of gcc optimizations
◦ Ability to integrate 3rd-party IP cores
Our Approach: It’s a workbench
void anneal(int *current)
{
    float temperature = 10000.0f;   /* initial temperature; matches the 1.0e+4 in the GIMPLE dump */
    int   current_val, next_val;
    int   next[MAX_EVENTS];

    current_val = RAND_MAX;
    while (temperature > STOP_THRESHOLD) {
        copy(current, next);
        alter(next);
        next_val = evaluate(next);
        accept(&current_val, next_val, current, next, temperature);
        temperature = adjustTemperature();
    }
}
;; Function anneal (anneal)

anneal (current)
{
  int next[10];
  int next_val;
  int current_val;
  float temperature;
  double D.3292;
  float D.3291;
  int D.3290;

  # BLOCK 0
  # PRED: ENTRY (fallthru)
  temperature = 1.0e+4;
  current_val = 2147483647;
  goto <bb 2> (<L1>);
  # SUCC: 2 (fallthru)

  # BLOCK 1
  # PRED: 2 (true)
<L0>:;
  copy (current, &next);
  alter (&next);
  D.3290 = evaluate (&next);
  next_val = D.3290;
  accept (&current_val, next_val, current, &next, temperature);
  D.3291 = adjustTemperature ();
  temperature = D.3291;
  # SUCC: 2 (fallthru)

  # BLOCK 2
  # PRED: 0 (fallthru) 1 (fallthru)
<L1>:;
  D.3292 = (double) temperature;
  if (D.3292 > 1.00000000000000004792173602385929598312941379845e-4) goto <L0>; else goto <L2>;
  # SUCC: 1 (true) 3 (false)

  # BLOCK 3
  # PRED: 2 (false)
<L2>:;
  return;
  # SUCC: EXIT
}
Problem: Given a circuit specification consisting of a set of components (adders / multipliers / etc.), estimate the FPGA resources (slices / BRAMs / DSP48s) used
Solution: Create a fifth-order equation for each (component, resource type) pair, representing usage as a function of data width
• Done using discrete measurement values and MATLAB curve fitting
• A fifth-order equation is necessary for adequate estimation:
  y = C5·n^5 + C4·n^4 + C3·n^3 + C2·n^2 + C1·n + C0
List Scheduling
Also known as “Critical Path Scheduling”
◦ Assign a static priority to each node in the graph
◦ Schedule the nodes according to priority
◦ Static priorities are assigned by measuring the “distance” from the node in question to a sink node

Given a set of resources, it determines the time needed to complete a set of operations represented as a dependency graph.
List Scheduling Example
Schedule a DFG on one multiplier, one adder, and one divider.
Multiplication and division take two cycles each (non-pipelined); addition takes one.
Schedule A (one adder, one multiplier, one divider; 8 time steps):
  Step 0: a
  Step 1: b
  Step 2: c, e1
  Step 3: d, e2
  Step 4: f1
  Step 5: f2
  Step 6: g1
  Step 7: g2
(e, f, and g are two-cycle operations, shown occupying their unit for two consecutive steps)
Schedule B (a different priority order for the same DFG; 9 time steps):
  Step 0: a
  Step 1: c
  Step 2: b
  Step 3: d, e1
  Step 4: e2
  Step 5: f1
  Step 6: f2
  Step 7: g1
  Step 8: g2
List Scheduling
◦ Heuristic method – does not guarantee an optimal schedule
◦ Computational complexity of only O(Tn), where T is the number of time slots and n is the number of nodes to be scheduled

Improvement methods
◦ Modified Critical Path
◦ Earliest Time First
◦ Dynamic Critical Path
◦ Critical Node Parent Trees
◦ Cone-Based Clustering
◦ Partial Critical Path scheduling

All are O(n^2) to O(n^3) – too costly for use inside a simulated-annealing loop.
Solution – Ripple-List Scheduling

  assign static priority to each node in graph
  initialize time to 0
  Loop while unscheduled nodes exist
      Loop until no nodes can be scheduled on this time step
          update list of ready nodes
          schedule highest-priority node possible
          adjust priority of remaining nodes
      EndLoop
      increment time
  EndLoop
Ripple Factor (Rf)
The degree of a vertex is the number of edges (both incoming and outgoing in the case of a directed graph) incident to it
DG = The largest vertex degree in the entire graph
d = distance between two vertices
  Rf = 1 / (DG)^d
Ripple Factor example with DG = 3: the priorities of nodes one step away get updated by a ripple factor of 1/3^1, those two steps away by 1/3^2, etc.
Priorities are adjusted dynamically, but never jump to another priority band
A maximum ripple distance is applied to cut off updates and save computation (≪ O(n^2))
Balancing Latency across Pipeline Stages through ILP Extraction
Goal: maximize pipelined architecture performance within specified resource constraints
◦ A pipeline can only run as fast as the latency of the slowest stage
◦ An efficient pipeline will balance the latency of each stage as much as possible
◦ Some stages can be redesigned to support additional parallelism; others are fixed
Algorithm for Pipelined Processor DSE

1. Generate the minimal set of ALUs needed for each stage in the pipeline
2. Compute the latency of all stages (generate the architecture)
3. Loop:
   1. Mark the stage with the “worst latency”
   2. Reduce the latency of this stage through exploitation of parallelism (intertwined SA and RLS algorithms, or data-port width extension where applicable) until:
      1. the “worst latency” can be passed to another stage, or
      2. if that is not possible, latency has been reduced as much as possible
4. End loop when the “worst latency” cannot be passed to another stage
Example – generate the minimal architecture for all stages:
◦ Copy:     101 cycles   300 slices
◦ Alter:     21 cycles   390 slices
◦ Evaluate: 233 cycles   317 slices
◦ Accept:    54 cycles  1408 slices

Mark the stage with the “worst latency”.
Example – reduce the latency of this stage through exploitation of parallelism until the “worst latency” can be passed to another stage:
◦ Evaluate: 233 cycles   317 slices
◦ Evaluate:  95 cycles   777 slices (allocation of additional resources)
Example – new numbers for all stages:
◦ Copy:     101 cycles   300 slices
◦ Alter:     21 cycles   390 slices
◦ Evaluate:  95 cycles   777 slices
◦ Accept:    54 cycles  1408 slices

Mark the stage with the “worst latency”.
Example – reduce the latency of this stage through exploitation of parallelism until the “worst latency” can be passed to another stage:
◦ Copy: 101 cycles   300 slices
◦ Copy:  51 cycles   600 slices (widening of memory ports to allow 2-word transfers)
Example – new numbers for all stages:
◦ Copy:      51 cycles   600 slices
◦ Alter:     21 cycles   390 slices
◦ Evaluate:  95 cycles   777 slices
◦ Accept:    54 cycles  1408 slices

After these two steps the slowest stage has dropped from 233 to 95 cycles (≈2.4× pipeline throughput) for 760 additional slices. Repeat the process until FPGA resources are exhausted or no more parallelism can be extracted from the worst-performing stage.
DSE Summary
Stage performance can be improved through:
◦ Allocating additional computational resources (adders, multipliers, etc.) to a stage
◦ Widening memory ports to accelerate block data transfers

Some stages cannot be improved:
◦ If the task does not have any ILP
Performance
Xilinx V4-SX35
The FLEX (flexible processor) can perform either DFG 1 or DFG 2 computations
Designed by taking the union of DFG 1 and DFG 2 data flow graphs
The FLEX processor can switch modes dynamically, depending on computational needs
Branch probabilities from gcov can guide the FLEX design – DFGs executed more frequently should be more optimized
Considerably superior to Partial Dynamic Reconfiguration using Xilinx EAPR
function main called 4 returned 100% blocks executed 100%
        -:    5:{
        -:    5-block  0
call    0 returned 100%
        -:    5-block  1
branch  1 taken 86% (fallthrough)
branch  2 taken 14%
        -:    5-block  2
        -:    5-block  3
        -:    5-block  4
branch  3 taken 86% (fallthrough)
branch  4 taken 14%
        -:    5-block  5
        -:    5-block  6
branch  5 taken 75% (fallthrough)
branch  6 taken 25%
        -:    5-block  7
        -:    5-block  8
branch  7 taken 86% (fallthrough)
branch  8 taken 14%
        -:    5-block  9
        -:    5-block 10
        -:    5-block 11
…
On-going Technology Enhancement (1): FLEX Processor: Code Profiling using gcov
VHDL code can be compared with architecture description
Third-party design-automation software is used for synthesis, placement, debugging, verification, etc.
◦ ChipScope Pro (Xilinx)

Timing closure
◦ Improved metadata
◦ Stringent constraint imposition on DSE