Layout Driven Data Communication Optimization
for High Level Synthesis
Ryan Kastner, Wenrui Gong, Xin Hao, Forrest Brewer
Dept. of Electrical and Computer Engineering
University of California, Santa Barbara
Adam Kaplan, Philip Brisk and Majid Sarrafzadeh
Computer Science Department
University of California, Los Angeles
High Level Synthesis
Input: Application description written in *C (C, SystemC, HandelC, SpecC)
for (y_pos=ygrid_start-y_fmid-1,res_pos=0; y_pos<0; y_pos+=ygrid_step) {
  for (x_pos=xgrid_start-x_fmid-1; x_pos<0; x_pos+=xgrid_step,res_pos++) {
    (*reflect)(filt,x_fdim,y_fdim,x_pos,y_pos,temp,FILTER);
    sum=0.0;
    for (y_filt_lin=x_fdim,x_filt=y_im_lin=0; y_filt_lin<=filt_size; y_im_lin+=x_dim,y_filt_lin+=x_fdim)
      for (im_pos=y_im_lin; x_filt<y_filt_lin; x_filt++,im_pos++)
        sum+=image[im_pos]*temp[x_filt];
    result[res_pos] = sum;
  }
  first_col = x_pos+1;
  (*reflect)(filt,x_fdim,y_fdim,0,y_pos,temp,FILTER);
Internal filter of an image convolver
SSA CDFG
Maximize “performance” (area, latency, power, …) subject to input constraints
Output: “Hardware” (RTL Specification)
Target Architectures
“Spatial” architectures
Local control between data path, global data flow between control nodes
Lots of distributed computational units, memory
Coarse/fine grained reconfigurable architectures
Techniques could be used for other architectures
May not make sense
Our design flow has little resource sharing
[Figure: fine grain configurable platform; coarse grain programmable platform]
Obligatory Design Flow Slide
Application Specification
SUIF: Syntactic & Semantic Analysis
AST
Machine SUIF: Compiler Backend
SSA CDFG
Behavioral Synthesis
Logical & Physical Synthesis
1. Create interface
2. Transform instruction list to dataflow graph
3. Transform dataflow graph to behavioral HDL code
4. Synthesize behavioral HDL code to RTL code
5. Create CFG interface
6. Determine structural control and data communication between basic block entities
7. Generate synthesizable RTL code
8. Synthesize RTL code
Basic Block Entity
entity basic_block is…
architecture behavioral of basic_block…
CFG Entity
entity cfg is…
architecture behavioral of cfg…
[Figure: dataflow graph of + and * operations inside a basic block entity; CFG entity connecting Entity 1 through Entity 4]
Design Example
int FAST(real *b, int n) {
  real fn;
  int i, in, nn, n2pow, n4pow, nthpo;
  n2pow = fastlog2(n);
  if(n2pow <= 0)
    return 0;
  nthpo = n;
  fn = nthpo;
  n4pow = n2pow / 2;
  /* radix 2 iteration required; do it now */
  if(n2pow % 2) { nn = 2; in = n / nn; FR2TR(in, b, b + in); }
  else nn = 1;
  /* perform radix 4 iterations */
  for(i = 1; i <= n4pow; i++) { nn *= 4; in = n / nn; FR4TR(in, nn, b, b + in, b + 2 * in, b + 3 * in); }
  /* perform inplace reordering */
  FORD1(n2pow, b);
  FORD2(n2pow, b);
  /* take conjugates */
  for(i = 3; i < n; i += 2)
    b[i] = -b[i];
  return 1;
}
[Figure: control flow graph of FAST, Node 1 through Node 10]
“FAST” function from MediaBench
Some nodes missing - simple computation, merged into others
Lines below show data communication
Characterizing Data Communication
Examples of data communication schemes
Distributed: data communication = wire
Centralized: data communication = memory access
[Figure: two schemes connecting Control Nodes 2, 3 and 4, drawn with a memory (register bank, RAM) and a bus]
Identifying Data Communication
Determine relationship between place(s) where data is defined and where data is used
[Figure: example CFG with definitions and uses of variables a, b and c]
Naïve method: all use-points of a variable depend on all definitions of that variable
Not all use points “use” a variable
Need analysis to minimize the amount of data communication
Global Data Communication = 5 variables
Must determine relationship between where data is generated and where data is used
Problem formulations
[DAC03]: Minimize the total number of bits communicated between all pairs of control nodes
Today: Minimize overall wirelength
SSA (Static Single Assignment)
Changes each variable to have a unique definition point
Must add Φ-nodes to merge definitions
Use of SSA in Compilation
[Figure: the example CFG before and after SSA conversion; definitions are renamed a1, a2, a3, b1, b2, c1, and a4 ← Φ(a2,a3) merges the two reaching definitions of a]
SSA algorithms
Find location of Φ-nodes
Rename variables
Three main SSA algorithms
Minimal, Pruned – Cytron et al.
Semi-pruned – Briggs et al.
Differ in number and location of Φ-nodes
Minimal – insert Φ-nodes at iterated dominance frontier (IDF)
Semi-pruned – insert Φ-node at IDF if variable live outside some basic block
Pruned – insert Φ-node at IDF if variable live at that time
SSA Fundamentals
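As an illustration of the "minimal" variant, the sketch below places Φ-nodes at the iterated dominance frontier of a variable's definition sites. It is a textbook-style sketch, not the authors' implementation: the CFG encoding (a dict from node name to successor list), the function names, and the small diamond-shaped example are all assumptions, and every node is assumed reachable from the entry.

```python
from collections import defaultdict

def predecessors(cfg):
    """Map each node to the set of its predecessors."""
    preds = defaultdict(set)
    for node, succs in cfg.items():
        for s in succs:
            preds[s].add(node)
    return preds

def dominators(cfg, entry):
    """dom[n] = set of nodes that dominate n (simple iterative dataflow)."""
    nodes = set(cfg)
    preds = predecessors(cfg)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = set(nodes)
            for p in preds[n]:
                new &= dom[p]
            new.add(n)
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

def dominance_frontier(cfg, entry):
    """DF(x) = nodes y where x dominates a predecessor of y
    but does not strictly dominate y."""
    dom = dominators(cfg, entry)
    preds = predecessors(cfg)
    df = {n: set() for n in cfg}
    for y in cfg:
        for p in preds[y]:
            for x in dom[p]:
                if x == y or x not in dom[y]:
                    df[x].add(y)
    return df

def phi_blocks_minimal(cfg, entry, def_sites):
    """Blocks needing a phi-node for one variable (minimal SSA)."""
    df = dominance_frontier(cfg, entry)
    phis = set()
    work = list(def_sites)
    while work:
        n = work.pop()
        for y in df[n]:
            if y not in phis:
                phis.add(y)
                work.append(y)  # the phi-node is itself a new definition
    return phis

# Hypothetical diamond CFG: 'a' is defined in both branches n2 and n3.
cfg = {"n1": ["n2", "n3"], "n2": ["n4"], "n3": ["n4"], "n4": []}
print(phi_blocks_minimal(cfg, "n1", {"n2", "n3"}))  # {'n4'}
```

Semi-pruned and pruned SSA would filter this set further using liveness information, which the sketch does not model.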
[Figure: Φ-node placement on the example CFG under each algorithm]
Minimal: a4 ← Φ(a2,a3), b3 ← Φ(b1,b2), c2 ← Φ(c1)
Semi-Pruned: a4 ← Φ(a2,a3), b3 ← Φ(b1,b2)
Pruned: a4 ← Φ(a2,a3)
Results: SSA for Data Comm. Minimization
Edge Weight w(i,j) – number of bits communicated from node i to j
Total Edge Weight (TEW) – corresponds to amount of data communication
$$\mathrm{TEW} = \sum_{i} \sum_{j} w(i,j)$$
[Figure: TEW ratio (vs. pruned) for each “MediaBench” benchmark, logarithmic axis from 0.1 to 100; series: Minimal, Semi-pruned]
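A small sketch of how the total edge weight could be tallied from def-use transfers between control nodes. The tuple layout `(source node, destination node, bits)`, the function names, and the sample numbers are hypothetical and only serve to illustrate summing w(i,j) over all node pairs.

```python
from collections import defaultdict

def edge_weights(transfers):
    """Accumulate w(i, j): bits communicated from control node i to node j."""
    w = defaultdict(int)
    for src, dst, bits in transfers:
        # Assumption: a value used inside its defining node needs no link.
        if src != dst:
            w[(src, dst)] += bits
    return dict(w)

def total_edge_weight(w):
    """TEW: sum of w(i, j) over all ordered pairs of control nodes."""
    return sum(w.values())

# Hypothetical transfers: (source node, destination node, bit width)
transfers = [(1, 2, 32), (1, 3, 32), (2, 3, 16), (3, 3, 8)]
w = edge_weights(transfers)
print(total_edge_weight(w))  # 80
```

Dividing the TEW obtained under one SSA variant by the TEW under pruned SSA gives the kind of ratio the results chart reports.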
Further Minimizing Data Communication
Current SSA algorithms place Φ-nodes temporally
In software compilation, live ranges should be short
Appropriate in hardware?
Temporal Φ-node distribution vs. spatial Φ-node distribution
[Figure: the example CFG with a4 ← Φ(a2,a3) under the two distributions; the panels are labeled TEW = 4 and TEW = 3]
Spatial Φ-nodes Distribution Algorithm
d – number of uses of Φ-node destination
s – number of Φ-node source values
Number of temporal links: $C_T = s + d$
Number of spatial links: $C_S = s \times d$
[Figure: a3 ← Φ(a0,a1,a2) with two uses of a3; s = 3, d = 2]
Optimal assuming “ideal” n-dimensional floorplan
1. Given a CDFG G(N_cfg, E_cfg)
2. perform_SSA(G)
3. calculate_def_use_chains(G)
4. remove_back_edges(G)
5. topological_sort(G)
6. for each node n ∈ N_cfg
7.   for each Φ-node Φ ∈ n
8.     s ← |Φ.sources|
9.     d ← |def_use_chain(Φ.dest)|
10.    if s × d < s + d
11.      move_to_spatial_locations(Φ)
12. restore_back_edges(G)
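The decision in step 10 can be sketched as a comparison of two link counts. This assumes the reading that a temporally placed Φ-node needs s + d links (s sources into the Φ-node, then its result out to d uses) while a spatially distributed one needs s × d links (every source wired to every use); the function names are hypothetical.

```python
def temporal_links(s, d):
    """Links when the phi-node stays at the merge point: s in, d out."""
    return s + d

def spatial_links(s, d):
    """Links when the phi-node is moved to each use: every source to every use."""
    return s * d

def choose_distribution(s, d):
    """Move the phi-node to its uses only when that strictly reduces links."""
    if spatial_links(s, d) < temporal_links(s, d):
        return "spatial"
    return "temporal"

print(choose_distribution(3, 2))  # temporal (6 links vs. 5)
print(choose_distribution(2, 1))  # spatial (2 links vs. 3)
```

Under this reading, the spatial option only wins when the Φ-node has a single source or a single use, since s × d < s + d fails once both are at least 2.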
Physically Aware Compiler Transforms
Consider layout information during compilation
Modify transforms to consider physical info
Ideal: full physical synthesis – extremely accurate, but way too time consuming
Approximate using floorplanning
Much faster
Gives “good enough” high level physical picture
Our previous data comm. work
No physical information
Can lead to negative results
[Figure: application, Hardware Compilation, Floor-planner, Physical Synthesis]
Let’s Get Physical!
Physically Aware Data Communication
Modify placement of Φ-functions to consider wirelength
1. Given a CFG Gcfg(Vcfg, Ecfg)
2. perform_ssa(Gcfg)
3. calculate_def_use_chains(Gcfg)
4. remove_back_edges(Gcfg)
5. topological_sort(Gcfg)
6. foreach vertex v ∈ Vcfg
7.   foreach Φ-node Φ ∈ v
8.     s ← Φ.sources
9.     d ← |def_use_chain(Φ.dest)|
10.    IDF ← iterated_dominance_frontier(s)
11.    PossiblePlacements ← findPlacementOptions(IDF)
12.    place(Φ) ← selectBest(PossiblePlacements)
13.    distribute/duplicate Φ to place(Φ)
Φ-Placement Algorithm
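One way the selection step might score candidate placements is by estimated wirelength on the current floorplan. The sketch below is an assumption-laden illustration, not the authors' cost function: it uses Manhattan distance between hypothetical module centers, treats each placement option as a tuple of CFG nodes that hold a copy of the Φ-node, and routes each use from its nearest copy.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) module centers."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def wirelength(placement, sources, uses, centers, bits=32):
    """Estimated wirelength of one placement option.

    placement: tuple of CFG nodes that each hold a copy of the phi-node.
    Every copy is wired to all sources; each use is fed by its nearest copy.
    """
    total = 0
    for p in placement:
        total += sum(manhattan(centers[s], centers[p]) for s in sources)
    for u in uses:
        total += min(manhattan(centers[p], centers[u]) for p in placement)
    return bits * total

def select_best(possible_placements, sources, uses, centers):
    """Pick the placement option with the smallest estimated wirelength."""
    return min(possible_placements,
               key=lambda pl: wirelength(pl, sources, uses, centers))

# Hypothetical floorplan: the phi-node merges values from n2 and n3, used in n5.
centers = {"n2": (0, 0), "n3": (4, 0), "n4": (2, 1), "n5": (2, 5)}
options = [("n4",), ("n5",)]
print(select_best(options, ["n2", "n3"], ["n5"], centers))  # ('n4',)
```

Swapping the coordinates of n4 and n5 in this example flips the choice to `('n5',)`, which is the sense in which the best option depends on the floorplan.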
1. Given a set of CFG Nodes R
2. Φ-options ← ∅
3. insert(R) into Φ-options
4. foreach instruction i ∈ R
5.   if (i is a destination of Φ-function f)
6.     return Φ-options
7. temp_Φ-options ← ∅
8. foreach non-dominated child c of R
9.   temp_Φ-options ← crossProductJoin(temp_Φ-options, findPlacementOptions(c))
10. return Φ-options ∪ temp_Φ-options
FindPlacementOptions Algorithm
Algorithm in Action
[Figure: the example CFG showing candidate locations for a4 ← Φ(a2,a3), from the traditional (temporal) placement to the spatial [DAC03] placement]
Evaluate all options for Φ-nodes
Replicate when necessary
Limit amount of replication - most often leads to more wirelength
Can play tricks to limit redundant placements
Any of these options could yield the best wirelength
Highly dependent on the floorplan
Algorithm in Action
FAST function from MediaBench testsuite
[Figure: CFG of FAST (nodes N3 and N9, branch edges labeled T/F) showing alternative placements for the Φ-nodes involving nn_4, i_2 and nn_5, i_3]
[Figure: Hardware Compilation, Full Floor-planner, Physical Synthesis]
1. Initial optimization minimizes data communication
2. Full SA based floorplanning
3. Reoptimization based on floorplanning, to minimize wirelength
4. Full SA based floorplanning
[Figure: Floorplan Wirelength per benchmark, logarithmic axis from 1 to 10000000; series: WL (first), WL (second)]
Spectacularly negative results
Full Floorplanning Results
Simple iterative approach
Incremental Floorplanning
Incremental Placement [Coudert et al]: Given an optimized placement and a set of changes to the netlist (e.g., due to technology remapping) modify the placement to improve it.
Equally applicable to floorplanning
[Figure: Initial Floorplan (modules 1, 2, 3, 4, 6), perturbations to floorplan modules (e.g. due to Φ-function movement), the Modified Floorplan, and the Incremental Floorplan with its slicing tree; tree nodes carry value pairs 2/2.3, 9/10.1, 11/12.4, 16/18, 5/5.6, 27/30.4, 32/36]
Our Incremental Floorplanner
[Figure: the Incremental Floorplanner takes the Initial Floorplan and Perturbations and produces the Modified Floorplan]
1. Calculate area & room of each node: bottom up slicing tree traversal
2. Area redistribution
Top down traversal
Increase area if necessary
Not enough space at root
Aspect ratios become too distorted
[Figure: slicing tree with annotated nodes, the Modified Floorplan and the resulting Incremental Floorplan]
Simple, yet effective
Other more complicated algorithms might work better
MediaBench Functions
#   Benchmark         Blocks  (unlabeled)  Links  Weight  Initial WL
1   adpcm coder       33      31           54     2688    35568
2   adpcm decoder     26      23           44     1952    21588
3   internal filter   10      143          60     17088   411637
4   Internal expand   101     94           257    14336   317031
5   compress output   34      17           60     2368    29114
6   mpeg2dec block    62      13           66     2272    34510
7   mpeg2dec vector   16      4            26     1024    4366
8   FAST              14      4            15     704     3714
9   FR4TR             77      87           155    704     340697
10  det               12      5            13     7936    3772
Incremental Floorplanning Results
[Figure: Normalized Wirelength (axis 0 to 1.2) for benchmarks 1 to 10 and their average; series: Initial, Overall Optimal, Overall Incremental, Phi Optimal, Phi Incremental]
“Optimal” Approach:
12% Overall Wirelength Reduction
25% Phi-node Wirelength Reduction
Our Approach:
6% Overall Wirelength Reduction
8% Phi-node Wirelength Reduction
Related Work
Hardware compilation projects using SSA
PDG+SSA form [UCSB]
CASH [CMU]
SA-C [UCR]
Sea Cucumber [BYU]
Physically aware behavioral synthesis techniques
SA for scheduling, binding and floorplanning [Prabhakaran97]
SA for binding and floorplanning [Yung-Ming94]
Scheduling, allocation and binding [Dougherty00]
Fasolt: bus topology [Knapp92]
High level synthesis [Tarafdar00]
Incremental CAD
Problem overview/challenges [Coudert00]
Floorplanning [Crenshaw99]
Conclusions
It’s been a long strange trip…
SSA is a nice IR for hardware compilation
Explicitly shows data flow
Useful for exploiting parallelism
Compiler techniques applied to hardware design can reduce wirelength
They must be aware of physical information
They must use incremental floorplanning
Questions?
(and cue for applause)