
Layout Driven Data Communication Optimization for High Level Synthesis

Ryan Kastner, Wenrui Gong, Xin Hao, Forrest Brewer

Dept. of Electrical and Computer Engineering

University of California, Santa Barbara

Adam Kaplan, Philip Brisk and Majid Sarrafzadeh

Computer Science Department

University of California, Los Angeles

High Level Synthesis

Input: Application description written in *C (C, SystemC, HandelC, SpecC)

    for (y_pos=ygrid_start-y_fmid-1,res_pos=0; y_pos<0; y_pos+=ygrid_step)
      {
      for (x_pos=xgrid_start-x_fmid-1; x_pos<0; x_pos+=xgrid_step,res_pos++)
        {
        (*reflect)(filt,x_fdim,y_fdim,x_pos,y_pos,temp,FILTER);
        sum=0.0;
        for (y_filt_lin=x_fdim,x_filt=y_im_lin=0;
             y_filt_lin<=filt_size;
             y_im_lin+=x_dim,y_filt_lin+=x_fdim)
          for (im_pos=y_im_lin; x_filt<y_filt_lin; x_filt++,im_pos++)
            sum+=image[im_pos]*temp[x_filt];
        result[res_pos] = sum;
        }
      first_col = x_pos+1;
      (*reflect)(filt,x_fdim,y_fdim,0,y_pos,temp,FILTER);

Internal filter of an image convolver

SSA CDFG

Maximize “performance” (area, latency, power, …) subject to input constraints

Output: “Hardware” (RTL Specification)

Target Architectures

“Spatial” architectures
  Local control between data path, global data flow between control nodes
  Lots of distributed computational units, memory
  Coarse/fine grained reconfigurable architectures

Techniques could be used for other architectures
  May not make sense
  Our design flow has little resource sharing

[Figures: fine grain configurable platform; coarse grain programmable platform]

Obligatory Design Flow Slide

Application Specification
SUIF: Syntactic & Semantic Analysis
AST
Machine SUIF: Compiler Backend
SSA CDFG
Behavioral Synthesis
Logical & Physical Synthesis

1. Create interface
2. Transform instruction list to dataflow graph
3. Transform dataflow graph to behavioral HDL code
4. Synthesize behavioral HDL code to RTL code
5. Create CFG interface
6. Determine structural control and data communication between basic block entities
7. Generate synthesizable RTL code
8. Synthesize RTL code

Basic Block Entity
  entity basic_block is…
  architecture behavioral of basic_block…

CFG Entity
  entity cfg is…
  architecture behavioral of cfg…

[Figure: CFG entity connecting basic block entities Entity 1, Entity 2, Entity 3 and Entity 4]

Design Example

    int FAST(real *b, int n) {
      real fn;
      int i, in, nn, n2pow, n4pow, nthpo;
      n2pow = fastlog2(n);
      if(n2pow <= 0)
        return 0;
      nthpo = n;
      fn = nthpo;
      n4pow = n2pow / 2;

      /* radix 2 iteration required; do it now */
      if(n2pow % 2) {
        nn = 2;
        in = n / nn;
        FR2TR(in, b, b + in);
      }
      else
        nn = 1;

      /* perform radix 4 iterations */
      for(i = 1; i <= n4pow; i++) {
        nn *= 4;
        in = n / nn;
        FR4TR(in, nn, b, b + in, b + 2 * in, b + 3 * in);
      }

      /* perform inplace reordering */
      FORD1(n2pow, b);
      FORD2(n2pow, b);

      /* take conjugates */
      for(i = 3; i < n; i += 2)
        b[i] = -b[i];

      return 1;
    }

[Figure: control flow graph of FAST, Node 1 to Node 10]

“FAST” function from MediaBench

Some nodes missing: simple computation, merged into others

Lines below show data communication

Characterizing Data Communication

Examples of data communication schemes

Distributed: data communication = wire
  [Figure: Control Node 2, Control Node 3 and Control Node 4 connected directly]

Centralized: data communication = memory access
  [Figure: Control Node 2, Control Node 3 and Control Node 4 connected through a memory (register bank, RAM)]

Identifying Data Communication

Determine relationship between place(s) where data is defined and where data is used

Naïve method: all use-points of a variable depend on all definitions of that variable
  Not all use points “use” a variable

Need analysis to minimize the amount of data communication

[Figure: example control flow graph with definitions of b and c; Global Data Communication = 5 variables]

Use of SSA in Compilation

Must determine relationship between where data is generated and where data is used

Problem formulations
  [DAC03]: Minimize the total number of bits communicated between all pairs of control nodes
  Today: Minimize overall wirelength

SSA (Static Single Assignment)
  Changes each variable to have a unique definition point
  Must add Φ-nodes to merge definitions

[Figure: example CFG before and after conversion to SSA; after conversion the definitions are a1, a2, a3, b1, b2, c1 and a4 ← Φ(a2,a3)]

SSA Fundamentals

SSA algorithms
  Find location of Φ-nodes
  Rename variables

Three main SSA algorithms
  Minimal, Pruned: Cytron et al.
  Semi-pruned: Briggs et al.

Differ in number and location of Φ-nodes
  Minimal: insert Φ-nodes at iterated dominance frontier (IDF)
  Semi-pruned: insert Φ-node at IDF if variable live outside some basic block
  Pruned: insert Φ-node at IDF if variable live at that time

[Figure: the example CFG (a1, a2, a3, b1, b2, c1) under each algorithm]
  Minimal: a4 ← Φ(a2,a3), b3 ← Φ(b1,b2), c2 ← Φ(c1)
  Semi-Pruned: a4 ← Φ(a2,a3), b3 ← Φ(b1,b2)
  Pruned: a4 ← Φ(a2,a3)
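The difference between the three flavors can be sketched as a filter over the iterated dominance frontier. This is only an illustration in Python: the function name `phi_sites`, the block name `join`, and the liveness facts in the example are assumptions chosen to reproduce the Φ-node counts in the figure, not output of a real SSA pass.

```python
def phi_sites(idf_blocks, var, flavor, non_local_vars=frozenset(), live_in=None):
    """Return the blocks in which a phi-node for `var` would be inserted.

    idf_blocks     -- blocks in the iterated dominance frontier of var's definitions
    flavor         -- "minimal", "semi-pruned" or "pruned"
    non_local_vars -- variables that are live outside some basic block
    live_in        -- dict: block -> set of variables live on entry to that block
    """
    live_in = live_in or {}
    if flavor == "minimal":
        # Every IDF block gets a phi-node, whether or not the value is used.
        return set(idf_blocks)
    if flavor == "semi-pruned":
        # Skip variables that never live across a block boundary.
        return set(idf_blocks) if var in non_local_vars else set()
    if flavor == "pruned":
        # Keep only the IDF blocks where the variable is actually live.
        return {b for b in idf_blocks if var in live_in.get(b, set())}
    raise ValueError(f"unknown SSA flavor: {flavor}")


# Assumed facts for the a/b/c example: all three variables have the same IDF,
# c is block-local, and only a is live at the join.
idf = {"a": {"join"}, "b": {"join"}, "c": {"join"}}
non_local = {"a", "b"}
live_in = {"join": {"a"}}

counts = {
    flavor: sum(len(phi_sites(idf[v], v, flavor, non_local, live_in)) for v in idf)
    for flavor in ("minimal", "semi-pruned", "pruned")
}
```

Under these assumptions `counts` has three Φ-nodes for minimal, two for semi-pruned and one for pruned, matching the three panels.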

Results: SSA for Data Comm. Minimization

Edge Weight w(i,j): number of bits communicated from node i to j

Total Edge Weight (TEW): corresponds to amount of data communication

$TEW = \sum_{(i,j)} w(i,j)$

“MediaBench”marks

[Chart: TEW ratio per benchmark for Minimal and Semi-pruned]
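As a small illustration of the metric, TEW is the sum of the bit widths on every edge between control nodes. The Python sketch below uses a made-up edge list; the node numbers and the 32-bit widths are hypothetical and do not come from the benchmarks.

```python
def total_edge_weight(edges):
    """Sum w(i, j) over all communicating control-node pairs (i, j).

    `edges` maps a (source, destination) pair of control nodes to the
    number of bits communicated from source to destination.
    """
    return sum(edges.values())


# Hypothetical example: three 32-bit values cross between control nodes.
example_edges = {
    (1, 2): 32,
    (1, 3): 32,
    (2, 4): 32,
}

tew = total_edge_weight(example_edges)
```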

Further Minimizing Data Communication

Current SSA algorithms place Φ-nodes temporally
  In software compilation, live ranges should be short
  Appropriate in hardware?

Temporal Φ-node distribution

Spatial Φ-node distribution

[Figure: the example CFG (a1, a2, a3, b1, b2, c1) with a4 ← Φ(a2,a3) shown under different Φ-node placements; the two annotated placements have TEW = 4 and TEW = 3]

Spatial Φ-nodes Distribution Algorithm

d: number of uses of Φ-node destination
s: number of Φ-node source values
Number of temporal links: s + d
Number of spatial links: s × d

Example: a3 ← Φ(a0,a1,a2), s = 3, d = 2

Optimal assuming “ideal” n-dimensional floorplan

1. Given a CDFG G(N_cfg, E_cfg)
2. perform_SSA(G)
3. calculate_def_use_chains(G)
4. remove_back_edges(G)
5. topological_sort(G)
6. for each node n ∈ N_cfg
7.   for each Φ-node Φ ∈ n
8.     s ← |Φ.sources|
9.     d ← |def_use_chain(Φ.dest)|
10.    if s × d < s + d
11.      move_to_spatial_locations(Φ)
12. restore_back_edges(G)
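The test in step 10 can be illustrated with a few lines of Python. The reading used here is an assumption inferred from that comparison: a temporally placed Φ-node needs s + d links (s in, d out), while moving it to its uses needs s × d links (every source wired to every use). The function names are illustrative only.

```python
def temporal_links(s, d):
    # One phi-node: s incoming links from sources plus d outgoing links to uses.
    return s + d


def spatial_links(s, d):
    # Phi-node moved to each use: every source connects to every use.
    return s * d


def prefer_spatial(s, d):
    # Step 10 of the algorithm: move the phi-node only if it saves links.
    return spatial_links(s, d) < temporal_links(s, d)


# The slide's example, a3 <- phi(a0, a1, a2) with two uses.
example = (temporal_links(3, 2), spatial_links(3, 2), prefer_spatial(3, 2))
```

For s = 3 and d = 2 the temporal placement needs 5 links and the spatial one 6, so the Φ-node stays where it is; with a single use (d = 1) the spatial placement always wins.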

Physically Aware Compiler Transforms

Let’s Get Physical!

Consider layout information during compilation
  Modify transforms to consider physical info

Ideal: full physical synthesis
  Extremely accurate, but way too time consuming

Approximate using floorplanning
  Much faster
  Gives “good enough” high level physical picture

Our previous data comm. work
  No physical information
  Can lead to negative results

[Figure: design flow with boxes application, Hardware Compilation, Floor-planner and Physical Synthesis]

Physically Aware Data Communication

Modify placement of Φ-functions to consider wirelength

Φ-Placement Algorithm

1. Given a CFG G_cfg(V_cfg, E_cfg)
2. perform_ssa(G_cfg)
3. calculate_def_use_chains(G_cfg)
4. remove_back_edges(G_cfg)
5. topological_sort(G_cfg)
6. foreach vertex v ∈ V_cfg
7.   foreach Φ-node Φ ∈ v
8.     s ← Φ.sources
9.     d ← |def_use_chain(Φ.dest)|
10.    IDF ← iterated_dominance_frontier(s)
11.    PossiblePlacements ← findPlacementOptions(IDF)
12.    place(Φ) ← selectBest(PossiblePlacements)
13.    distribute/duplicate Φ to place(Φ)
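The slides do not spell out how selectBest scores a placement. The Python sketch below shows one plausible cost, stated as an assumption: each copy of the Φ-node is wired to every source, each use reads from its nearest copy, and wires are measured as Manhattan distance between module centres. All names and coordinates are hypothetical.

```python
def manhattan(p, q):
    """Manhattan distance between two (x, y) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def wirelength(placement, sources, uses, position):
    """Estimate the wirelength of one phi placement option.

    placement -- set of CFG nodes that each hold a copy of the phi-node
    sources   -- CFG nodes that define the phi source values
    uses      -- CFG nodes that use the phi destination
    position  -- dict: CFG node -> (x, y) centre of its floorplan module
    """
    total = 0
    for copy in placement:
        # Every copy needs a wire from every source value.
        total += sum(manhattan(position[s], position[copy]) for s in sources)
    for u in uses:
        # Each use reads the result from its nearest copy.
        total += min(manhattan(position[c], position[u]) for c in placement)
    return total


def select_best(possible_placements, sources, uses, position):
    """Pick the placement option with the smallest estimated wirelength."""
    return min(possible_placements,
               key=lambda p: wirelength(p, sources, uses, position))


# Hypothetical floorplan: two source blocks, the join block, two use blocks.
position = {"s1": (0, 0), "s2": (0, 4), "join": (2, 2), "u1": (6, 0), "u2": (6, 4)}
temporal = frozenset({"join"})        # one phi-node at the join point
spatial = frozenset({"u1", "u2"})     # one copy of the phi-node at each use
best = select_best([temporal, spatial], ["s1", "s2"], ["u1", "u2"], position)
```

With this floorplan the temporal placement costs 20 units against 32 for the spatial one; a different floorplan can reverse the ranking, which is why the choice is made after floorplanning.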

FindPlacementOptions Algorithm

1. Given a set of CFG Nodes R
2. Φ-options ← ∅
3. insert(R) into Φ-options
4. foreach instruction i ∈ R
5.   if (i is a destination of Φ-function f)
6.     return Φ-options
7. temp_Φ-options ← ∅
8. foreach non-dominated child c of R
9.   temp_Φ-options ← crossProductJoin(temp_Φ-options, findPlacementOptions(c))
10. return Φ-options ∪ temp_Φ-options
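The recursion can be sketched in Python under some stated assumptions: a placement is a set of CFG nodes holding copies of the Φ-node, the search stops at any node that consumes the Φ result, and crossProductJoin unions one option from each child. The `children` map and `uses_phi` set are hypothetical inputs, and the join starts from a single empty placement (rather than an empty list) so that the first child's options pass through unchanged.

```python
from itertools import product


def cross_product_join(left, right):
    """Union every placement in `left` with every placement in `right`."""
    return [a | b for a, b in product(left, right)]


def find_placement_options(node, children, uses_phi):
    """List the candidate placements for a phi-node, starting at `node`.

    children -- dict: CFG node -> list of successors to explore (assumed to be
                the non-dominated children named in the pseudocode)
    uses_phi -- set of CFG nodes containing an instruction that uses the phi result
    """
    options = [frozenset({node})]      # option 1: keep the phi-node at `node`
    if node in uses_phi:               # a use lives here: cannot push further down
        return options
    kids = children.get(node, [])
    if not kids:
        return options
    temp = [frozenset()]               # identity element for the join
    for child in kids:
        temp = cross_product_join(
            temp, find_placement_options(child, children, uses_phi))
    return options + temp


# Hypothetical CFG: join -> {left, right}, right -> r1; uses in left and r1.
children = {"join": ["left", "right"], "right": ["r1"]}
uses_phi = {"left", "r1"}
placements = find_placement_options("join", children, uses_phi)
```

For this small graph the result is three options: one Φ-node at `join`, copies at `left` and `right`, or copies at `left` and `r1`.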

Algorithm in Action

Evaluate all options for Φ-nodes

Replicate when necessary
  Limit amount of replication: most often leads to more wirelength
  Can play tricks to limit redundant placements

Any of these options could yield the best wirelength
  Highly dependent on the floorplan

[Figure: the example CFG (a1, a2, a3, b1, b2, c1) with candidate locations for a4 ← Φ(a2,a3), including the Traditional (temporal) and Spatial [DAC03] placements]

Algorithm in Action

FAST function from MediaBench testsuite

[Figures: CFG of FAST with a T/F branch, annotated with the variables nn_4, i_2 and nn_5, i_3]

Full Floorplanning Results

Simple iterative approach
  1. Initial optimization minimizes data communication
  2. Full SA based floorplanning
  3. Reoptimization based on floorplanning to minimize wirelength
  4. Full SA based floorplanning

Spectacularly negative results

[Figure: flow between Hardware Compilation, Full Floor-planner and Physical Synthesis]

[Chart: Floorplan Wirelength per benchmark, axis ticks 100000, 1000000, 10000000; series WL (first) and WL (second)]

Incremental Floorplanning

Incremental Placement [Coudert et al]: Given an optimized placement and a set of changes to the netlist (e.g., due to technology remapping) modify the placement to improve it.

Equally applicable to floorplanning

[Figure: Initial Floorplan plus perturbations to floorplan modules (e.g. due to Φ-function movement) gives the Modified Floorplan, then the Incremental Floorplan; slicing tree nodes labeled 2/2.3, 9/10.1, 11/12.4, 16/18, 5/5.6, 27/30.4, 32/36]

Our Incremental Floorplanner

[Figure: Initial Floorplan and Perturbations enter the Incremental Floorplanner, which produces the Modified Floorplan]

1. Calculate area & room of each node: bottom up slicing tree traversal
2. Area redistribution
   Top down traversal
   Increase area if necessary
     Not enough space at root
     Aspect ratios become too distorted

[Figure: Modified Floorplan and Incremental Floorplan; slicing tree nodes labeled 2/2.3, 9/10.1, 11/12.4, 16/18, 5/5.6, 27/30.4, 32/36]

Simple, yet effective

Other more complicated algorithms might work better
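The two traversals can be sketched in Python on a small slicing tree. This is a simplified illustration, not the floorplanner from the talk: it assumes the figure labels are area/room pairs, infers the tree shape from the way those labels add up, shares room among children in proportion to area, and ignores aspect ratios entirely.

```python
class Node:
    def __init__(self, area=0.0, room=0.0, children=()):
        self.area = area          # area the module (or subtree) needs
        self.room = room          # space currently allotted to it
        self.children = list(children)


def accumulate(node):
    """Bottom-up pass: an internal node's area and room are the sums of its children's."""
    if node.children:
        for child in node.children:
            accumulate(child)
        node.area = sum(c.area for c in node.children)
        node.room = sum(c.room for c in node.children)
    return node


def redistribute(node):
    """Top-down pass: share each node's room among its children in proportion to area."""
    if node.room < node.area:     # not enough space here: grow the region
        node.room = node.area
    for child in node.children:
        child.room = node.room * child.area / node.area
        redistribute(child)


# Leaves taken from the figure labels (area/room): 2/2.3, 9/10.1, 16/18, 5/5.6.
m1, m2, m3, m4 = Node(2, 2.3), Node(9, 10.1), Node(16, 18), Node(5, 5.6)
root = accumulate(Node(children=[m4, Node(children=[Node(children=[m1, m2]), m3])]))
initial = (root.area, root.room)  # roughly (32, 36), as at the root of the figure

# Hypothetical perturbation: module m1 grows from 2 to 4 (e.g. a phi-node moved in).
m1.area = 4
accumulate(root)
redistribute(root)
```

After the perturbation `m1` no longer fits in its original 2.3 units of room, but the root still has slack (34 needed, 36 available), so the top-down pass gives every module at least the area it needs without a full re-floorplan.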

MediaBench Functions

 #   Benchmark         (unlabeled)   Blocks   Links   Weight   Initial WL
 1   adpcm coder            33          31      54     2688       35568
 2   adpcm decoder          26          23      44     1952       21588
 3   internal filter        10         143      60    17088      411637
 4   internal expand       101          94     257    14336      317031
 5   compress output        34          17      60     2368       29114
 6   mpeg2dec block         62          13      66     2272       34510
 7   mpeg2dec vector        16           4      26     1024        4366
 8   FAST                   14           4      15      704        3714
 9   FR4TR                  77          87     155      704      340697
10   det                    12           5      13     7936        3772

Incremental Floorplanning Results

[Chart: normalized wirelength for benchmarks 1 to 10 and their average (avrg); series: Initial, Overall Optimal, Overall Incremental, Phi Optimal, Phi Incremental]

“Optimal” Approach:
  12% Overall Wirelength Reduction
  25% Phi-node Wirelength Reduction

Our Approach:
  6% Overall Wirelength Reduction
  8% Phi-node Wirelength Reduction

Related Work

Hardware compilation projects using SSA
  PDG+SSA form [UCSB]
  CASH [CMU]
  SA-C [UCR]
  Sea Cucumber [BYU]

Physically aware behavioral synthesis techniques
  SA for scheduling, binding and floorplanning [Prabhakaran97]
  SA for binding and floorplanning [Yung-Ming94]
  Scheduling, allocation and binding [Dougherty00]
  Fasolt: bus topology [Knapp92]
  High level synthesis [Tarafdar00]

Incremental CAD
  Problem overview/challenges [Coudert00]
  Floorplanning [Crenshaw99]

Conclusions

It’s been a long strange trip…

SSA a nice IR for hardware compilation
  Explicitly shows data flow
  Useful for exploiting parallelism

Compiler techniques applied to hardware design can reduce wirelength
  They must be aware of physical information
  They must use incremental floorplanning

Questions?

(and cue for applause)
