High-Level Synthesis with LegUp: A Crash Course for Users and Researchers
Jason Anderson, Stephen Brown, Andrew Canis, Jongsok (James) Choi
Dept. of Electrical and Computer Engineering, University of Toronto
11 February 2013, ACM FPGA Symposium, Monterey, CA
Tutorial Outline
• Overview of LegUp and its algorithms (60 min)
• Labs (“hands on” via VirtualBox)
  – Lab 1: Using the LegUp framework (30 min)
  – Break
  – Lab 2: Adding resource constraints (30 min)
  – Lab 3: Changing how LegUp implements hardware (30 min)
Project Motivation
• Hardware design has advantages over software:
  – Speed
  – Energy efficiency
• Hardware design is difficult and the skills are rare:
  – 10 software engineers for every hardware engineer*
• We need a CAD flow that simplifies hardware design for software engineers
*US Bureau of Labour Statistics ‘08
Top-Level Vision
[Diagram: program code is compiled (C compiler) for a self-profiling MIPS processor, which gathers profiling data (execution cycles, power, cache misses). High-level synthesis then targets the suggested program segments to the FPGA fabric, producing hardened program segments and an altered SW binary that calls the HW accelerators.]
  int FIR(int ntaps, int sum) {
    int i;
    for (i = 0; i < ntaps; i++)
      sum += h[i] * z[i];
    return sum;
  }
LegUp: Key Features
• C to Verilog high-level synthesis
• Many benchmarks (incl. 12 CHStone)
• MIPS processor (Tiger)
• Hardware profiler
• Automated verification tests
• Open source, freely downloadable
  – Like ABC (synthesis) or VPR (place & route)
  – 600+ downloads since March 2011
  – http://legup.eecg.utoronto.ca
System Architecture
[Diagram: inside the FPGA, a MIPS processor and hardware accelerators communicate over the Avalon interface with a memory controller and on-chip cache; off-chip memory sits on an Altera DE2 or DE4 board (Cyclone II or Stratix IV).]
High-Level Synthesis Framework
• Leverage the LLVM compiler infrastructure:
  – Language support: C/C++
  – Standard compiler optimizations
  – More on this shortly
• We support a large subset of ANSI C:

  Supported             Unsupported
  ---------             -----------
  Functions             Dynamic memory
  Arrays, structs       Recursion
  Global variables
  Pointer arithmetic
  Floating point
Address Hash (in hardware)

  tAddr += V1
  tAddr += (tAddr << 8)
  tAddr ^= (tAddr >> 4)
  b = (tAddr >> B1) & B2
  a = (tAddr + (tAddr << A1)) >> A2
  fNum = a ^ tab[b]
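The hash above can be sketched as plain C. The slides leave the constants V1, B1, B2, A1, A2 and the contents of tab[] unspecified, so the values below are purely illustrative placeholders, not LegUp's actual parameters:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants: the slide does not give V1, B1, B2, A1, A2
 * or the tab[] contents, so these values are illustrative only. */
#define V1 0x9E3779B9u  /* assumed mixing constant */
#define B1 4u           /* assumed shift into the table-index field */
#define B2 0xFFu        /* assumed mask: a 256-entry table */
#define A1 3u
#define A2 7u

static uint32_t tab[256]; /* assumed per-design lookup table (zeros here) */

/* Hash a function's target address down to a small function number. */
uint32_t addr_hash(uint32_t tAddr) {
    tAddr += V1;
    tAddr += tAddr << 8;
    tAddr ^= tAddr >> 4;
    uint32_t b = (tAddr >> B1) & B2;             /* table index */
    uint32_t a = (tAddr + (tAddr << A1)) >> A2;
    return a ^ tab[b];                           /* fNum */
}
```

Because every step is a shift, add, or xor, the same computation maps directly onto cheap combinational logic in the profiler.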
Hardware Profiler Architecture
[Diagram: the profiler monitors the MIPS instruction bus; an op decoder detects call/ret, a hardware hash maps the target address to a function number, a call stack tracks the current function, and a data counter (incremented when the PC changes) is committed to a counter storage memory with one entry per function. See our IEEE ASAP’11 paper.]
• Monitor instr. bus to detect function call/ret.
• Call: Hash (in HW) from function address to index; push to stack.
• Ret: pop function index from stack.
• Use function indexes to associate profiling data (e.g. cycles, power) with counters.
Processor/Accelerator Hybrid Flow
  int main() {
    ...
    sum = dotproduct(N);
    ...
  }

  int dotproduct(int N) {
    ...
    for (i = 0; i < N; i++) {
      sum += A[i] * B[i];
    }
    return sum;
  }
  #define dotproduct_DATA   (volatile int *) 0xf0000000
  #define dotproduct_STATUS (volatile int *) 0xf0000008
  #define dotproduct_ARG1   (volatile int *) 0xf000000C

  int legup_dotproduct(int N) {
    *dotproduct_ARG1 = (volatile int) N;
    *dotproduct_STATUS = 1;
    return *dotproduct_DATA;
  }
Processor/Accelerator Hybrid Flow
• Mark the function for hardware:

  set_accelerator_function "dotproduct"

• HLS then compiles dotproduct into a HW accelerator; main remains in software on the processor.
Processor/Accelerator Hybrid Flow
• On the software side, the original call becomes a call to the memory-mapped wrapper, which runs on the MIPS processor and drives the accelerator through its registers:

  sum = legup_dotproduct(N);
How Does LegUp Handle Memory and Pointers?
• LegUp stores each array in a separate FPGA BRAM
• The BRAM data width matches the width of the array’s data
• Each BRAM is identified by a 9-bit tag
• Addresses consist of the RAM tag and the array index:

  bits 31–23: 9-bit tag    bits 22–0: 23-bit index

• A shared memory controller uses the tag bits to determine which BRAM to read or write
• The array index is the address passed to the BRAM
Pointer Example
• We have two arrays in the C function:
  – int A[100], B[100]
• Tag 0 is reserved for NULL pointers
• Tag 1 is reserved for off-chip memory
• Assign tag 2 to array A and tag 3 to array B
• Address of A[3]: tag = 2, index = 3
• Address of B[7]: tag = 3, index = 7
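The tag/index split can be sketched as plain bit packing. The widths (9-bit tag in bits 31..23, 23-bit index in bits 22..0) come from the slides; the helper names are mine:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of LegUp's pointer encoding: a 9-bit BRAM tag in bits 31..23
 * and a 23-bit array index in bits 22..0 of a 32-bit address. */
#define INDEX_BITS 23u
#define INDEX_MASK ((1u << INDEX_BITS) - 1u)

uint32_t make_addr(uint32_t tag, uint32_t index) {
    return (tag << INDEX_BITS) | (index & INDEX_MASK);
}

uint32_t addr_tag(uint32_t addr)   { return addr >> INDEX_BITS; }
uint32_t addr_index(uint32_t addr) { return addr & INDEX_MASK; }
```

For the slide's example, the address of A[3] packs as make_addr(2, 3) and B[7] as make_addr(3, 7); the memory controller's tag extraction is just addr_tag().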
Shared Memory Controller
• Both arrays A and B are stored in 100-element BRAMs
• Load from pointer D with tag = 2, index = 13:
[Diagram: the controller uses the tag (2) to select the BRAM holding A and passes the index (13) as the BRAM address, returning A[13].]
Core Benchmarks (+Many More)
• 12 CHStone benchmarks (JIP’09) and Dhrystone
  – Too large/complex for academic HLS tools
• Include golden input/output test vectors
• Not supported by academic tools

  Category    Benchmarks                                  Lines of C code
  Arithmetic  64-bit double precision: add, mult, div, sin   376 – 755
  Encryption  AES, Blowfish, SHA                             716 – 1,406
  Processor   MIPS processor                                 232
  Media       JPEG decoder, Motion, GSM, ADPCM               393 – 1,692
  General     Dhrystone                                      491
Experimental Results
LegUp 1.0 (2011) for Cyclone II
1. Pure software on MIPS
Hybrid (software/hardware):
2. Second most compute-intensive function (and descendants) in H/W
3. Same as 2, but with the most compute-intensive function also in H/W
4. Pure hardware using LegUp
5. Pure hardware using eXCite (commercial tool)
Experimental Results
[Chart: geometric means of execution time and number of LEs for MIPS-SW, LegUp-Hybrid2, LegUp-Hybrid1, LegUp-HW, and eXCite-HW.]
Comparison: LegUp vs. eXCite
• Benchmarks compiled to hardware
• eXCite: commercial high-level synthesis tool
• eXCite couldn’t compile Dhrystone

  Geomean               LegUp    eXCite   LegUp/eXCite
  Circuit runtime (μs)  292      357      0.82 (1.22x)
  Logic elements        15,646   13,101   1.19
  Area-delay product    4.57M    4.68M    0.98
Energy Consumption
[Chart: geometric-mean energy (μJ) for MIPS-SW, LegUp-Hybrid2, LegUp-Hybrid1, LegUp-HW, and eXCite-HW.]
• Pure hardware uses 18x less energy than software
Current Release: LegUp 3.0
• Loop pipelining
• Dual- and multi-ported memory support
• Bitwidth minimization
• Multi-pumping DSP units for area reduction
• Alias analysis for dependency checks
• Parallel accelerators via Pthreads and OpenMP
Results are now considerably better than the LegUp 1.0 release
LegUp 3.0 vs. LegUp 1.0
[Chart: LegUp 3.0 / LegUp 1.0 ratios of wall-clock time, cycles, Fmax, and LEs for each CHStone benchmark (adpcm, aes, blowfish, dfadd, dfdiv, dfmul, dfsin, dhrystone, gsm, jpeg, mips, motion, sha) and the geomean.]
• Wall-clock time: 16% better
• Cycle latency: 31% better
• Fmax: 18% worse
• LEs (area): 28% better
LLVM Compiler and HLS Algorithms
LLVM Compiler
• Open-source compiler framework.– http://llvm.org
• Used by Apple, NVIDIA, AMD, others.• Competitive quality with gcc.• LegUp HLS is a “back-end” of LLVM.
• LLVM: low-level virtual machine.
LLVM Compiler
• LLVM compiles C code into a control flow graph (CFG)
• LLVM performs standard optimizations
  – 50+ different optimizations in LLVM

C program:

  int FIR(int ntaps, int sum) {
    int i;
    for (i = 0; i < ntaps; i++)
      sum += h[i] * z[i];
    return sum;
  }

[Diagram: the LLVM compiler turns the C program into a CFG with basic blocks BB0, BB1, BB2.]
Control Flow Graph
• A control flow graph is composed of basic blocks
• A basic block is a sequence of instructions terminated by exactly one branch
  – It can be represented by an acyclic data flow graph
[Diagram: a CFG with basic blocks BB0, BB1, BB2; one block’s data flow graph chains loads into adds and a final store.]
LLVM Details
• Instructions in basic blocks are primitive computational operations:
  – shift, add, divide, xor, and, etc.
• or control-flow operations:
  – branch, call, etc.
• The CDFG is represented in LLVM’s intermediate representation (IR)
  – The IR is machine-independent assembly code
High-Level Synthesis Flow
[Diagram: a C program passes through the C compiler (LLVM) to produce optimized LLVM IR, then through Allocation, Scheduling, and Binding to RTL generation, yielding synthesizable Verilog. User constraints (timing, resource) and target H/W characterization feed the HLS stages.]
Scheduling
• Scheduling is the task of assigning operations to clock cycles, realized with a finite state machine
[Diagram: a data flow graph of loads, adds, and a store mapped onto FSM states 0–3.]
Binding
• Binding is the task of assigning scheduled operations to functional units in the datapath
[Diagram: the scheduled loads, adds, and store are bound to a 2-port RAM, an adder, and flip-flops in the datapath.]
High-Level Synthesis: Scheduling
SDC Scheduling
• SDC = System of Difference Constraints
  – Cong and Zhang, “An efficient and versatile scheduling algorithm based on SDC formulation,” DAC 2006, pp. 433–438.
• Basic idea: formulate scheduling as a mathematical optimization problem
  – Linear objective function + linear constraints (==, <=, >=)
• The problem is a linear program (LP)
  – Solvable in polynomial time with standard solvers
Define Variables
• For each operation i to schedule, create a variable t_i
• The t_i’s will hold the cycle # in which each op is scheduled
• Here we have: t_add, t_shift, t_sub
[Diagram: a small DFG with an add and a shift feeding a subtract.]
• The data flow graph (DFG) is already accessible in LLVM
Dependency Constraints
• In this example, the subtract can only happen after the add and the shift:
  – t_sub – t_add >= 0
  – t_sub – t_shift >= 0
• Hence the name difference constraints
Handling Clock Period Constraints
• Target clock period: P (e.g., 10 ns)
• For each chain of dependent operations in the DFG, estimate the path delay D using LegUp’s delay models
  – E.g., D from mod -> or = 23 ns
• Compute R = ceiling(D/P) – 1
  – E.g., R = 2
• Add the difference constraint:
  – t_or – t_mod >= 2
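The chaining computation above is a one-liner; a minimal sketch (the function name is mine):

```c
#include <assert.h>

/* Clock-period (chaining) constraint from the slides: given an estimated
 * combinational path delay D and a target period P, the two endpoint
 * operations must be at least R = ceil(D/P) - 1 cycles apart. */
int min_cycle_separation(int delay_ns, int period_ns) {
    int ceil_div = (delay_ns + period_ns - 1) / period_ns; /* ceil(D/P) */
    return ceil_div - 1;
}
```

With the slide's numbers (D = 23 ns, P = 10 ns) this yields R = 2, matching the constraint t_or – t_mod >= 2; a chain that fits within one period (D <= P) yields R = 0, i.e., the ops may share a cycle.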
Resource Constraints
• A restriction on the # of operations of a given type that can execute in a cycle
• Why do we need it?
  – We want to use dual-port RAMs in the FPGA
    • Allow up to 2 load/store operations per cycle
  – Floating point
    • Do not want to instantiate many FP cores of a given type, probably just one
    • Scheduling must honour the # of FP cores available
Resource Constraints in SDC
• Resource-constrained scheduling is NP-hard
• LegUp implements the approach in [Cong & Zhang, DAC 2006]
[Diagram: a DFG of eight additions A–H; say we want to schedule it with only 2 adders in the HW (Lab 2).]
Add SDC Constraints
• Generate a topological ordering of the resource-constrained operations
• Say we are constrained to 2 adders in HW
• Starting at C in the ordering, create the constraint: t_C – t_A > 0
• Next, consider E and add the constraint: t_E – t_B > 0
• Continue to the end of the ordering
• The resulting schedule will have <= 2 adds per cycle

Topological ordering: A B C E F D G H
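The constraint generation above amounts to one pass over the ordering: with R units, the k-th op must start strictly after the (k-R)-th. A minimal sketch under that reading (function name mine, ordering from the slide):

```c
#include <stdio.h>

/* Walk a topological ordering of the resource-constrained ops and, with
 * r units available, force op k to start strictly after op k-r.
 * Prints the difference constraints; returns how many were generated. */
int emit_resource_constraints(const char *order, int n, int r) {
    int constraints = 0;
    for (int k = r; k < n; k++) {
        /* difference constraint: t_order[k] - t_order[k-r] > 0 */
        printf("t_%c - t_%c > 0\n", order[k], order[k - r]);
        constraints++;
    }
    return constraints;
}
```

For the 8-adder example with the ordering A B C E F D G H and r = 2, the first two constraints printed are t_C – t_A > 0 and t_E – t_B > 0, exactly as on the slide.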
ASAP Objective Function
• Minimize the sum of the variables: min Σ_i t_i
• Operations will be scheduled as early as possible, subject to the constraints
• The LP is solvable in polynomial time
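One way to see why this ASAP objective is easy: when every constraint is a difference t_v – t_u >= w over a dependence DAG, the minimal solution is a longest-path pass in topological order. This is an illustrative sketch only (the slides say LegUp hands the LP to a standard solver); the 3-op graph and the weight-1 edges (modelling each producer taking one cycle) are assumptions:

```c
/* Illustrative ASAP scheduling as longest path over difference
 * constraints "t[to] - t[from] >= w", with edges given in topological
 * order. Not LegUp's actual solver. */
void asap(const int *from, const int *to, const int *w,
          int num_edges, int *t, int num_ops) {
    for (int i = 0; i < num_ops; i++)
        t[i] = 0;                         /* start as early as possible */
    for (int e = 0; e < num_edges; e++)   /* relax each edge once */
        if (t[to[e]] < t[from[e]] + w[e])
            t[to[e]] = t[from[e]] + w[e];
}
```

For the add/shift/sub example (ops 0, 1, 2) with edges 0→2 and 1→2 of weight 1, this places add and shift in cycle 0 and sub in cycle 1.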
High-Level Synthesis: Binding
• Weighted bipartite matching-based binding
  – Huang, Chen, Lin, and Hsu, “Data path allocation based on bipartite weighted matching,” DAC 1990, pp. 499–504.
• Finds the minimum-weight matching of a bipartite graph at each step
  – Solved using the Hungarian method (polynomial time)
[Diagram: a bipartite graph with operations on one side, hardware functional units on the other, and edge costs between them.]
Binding
• Bind the following scheduled program (operations in states 0–3)
• Resource sharing: the schedule requires 3 multipliers (the maximum used in any one cycle)
• Bind one cycle at a time, matching that cycle’s operations to the functional units
[Diagram: binding the first through fourth cycles assigns each multiply to one of the 3 multiplier units; by the fourth cycle every unit has been reused across cycles.]
• Required multiplexing: operations sharing a functional unit add multiplexers at its inputs
High-Level Synthesis: Challenges
• It is easy to extract instruction-level parallelism from dependencies within a basic block
• But C code is inherently sequential, and it is difficult to extract higher-level parallelism
• Coarse-grained parallelism:
  – function pipelining
• Fine-grained parallelism:
  – loop pipelining
Loop Pipelining
Motivating Example

  for (int i = 0; i < N; i++) {
    sum[i] = a + b + c + d;
  }

[Diagram: the three additions form a chain scheduled in cycles 1–3.]
• Cycles: 3N
• Adders: 3
• Utilization: 33%
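The cycle counts quoted here and on the next slide can be sketched directly (helper names mine): sequentially each iteration takes the 3-cycle adder chain, while the pipelined version with II = 1 starts one iteration per cycle and needs 2 extra cycles to drain:

```c
/* Cycle counts for the 3-adder chain example. */
int sequential_cycles(int n) { return 3 * n; }  /* 3 cycles/iteration */
int pipelined_cycles(int n)  { return n + 2; }  /* II = 1, 2 fill/drain */
```

For N = 100, that is 300 cycles sequentially versus 102 pipelined, roughly a 3x improvement with the same 3 adders.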
Loop Pipelining
[Diagram: with II = 1, iteration i starts in cycle i+1; each iteration’s three adds overlap with neighbouring iterations, and from cycle 3 onward the pipeline is in steady state.]
• Cycles: N+2 (~1 cycle per iteration)
• Adders: 3
• Utilization: 100% in steady state
Loop Pipelining Example
  for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
  }

• Each iteration requires:
  – 2 loads from memory
  – 1 store
• There are no dependencies between iterations
Loop Pipelining Example
  for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
  }

• Cycle latency of operations:
  – Load: 2 cycles
  – Store: 1 cycle
  – Add: 1 cycle
• Single memory port
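A back-of-the-envelope sketch of one iteration's latency under these numbers, assuming the two loads must serialize on the single memory port and the add and store then follow in sequence (function name mine):

```c
/* One iteration of a[i] = b[i] + c[i] with a single memory port:
 * the two loads serialize, then the add, then the store. */
int iteration_cycles(int load_lat, int store_lat, int add_lat) {
    return 2 * load_lat + add_lat + store_lat;
}
```

With load = 2, add = 1, store = 1 this gives 6 cycles per iteration, consistent with the II = 6 figure quoted a few slides later.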
LLVM Instructions

  for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
  }

  %i.04 = phi i32 [ 0, %bb.nph ], [ %3, %bb ]
  %scevgep5 = getelementptr %b, %i.04
  %0 = load %scevgep5
  %scevgep6 = getelementptr %c, %i.04
  %1 = load %scevgep6
  %2 = add nsw i32 %1, %0
  %scevgep = getelementptr %a, %i.04
  store %2, %scevgep
  %3 = add %i.04, 1
  %exitcond = eq %3, 100
  br %exitcond, %bb2, %bb
Scheduling LLVM Instructions

  for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
  }

• Each iteration requires:
  – 2 loads from memory
  – 1 store
• There are no dependencies between iterations
[Diagram: the cycle-by-cycle schedule of one iteration; the second load cannot start until the first finishes — a memory port conflict.]
Loop Pipelining Example
  for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
  }

• Initiation interval (II): the constant number of cycles between starting successive iterations of the loop
• This loop requires 6 cycles per iteration (II = 6)
• Can we do better?
Minimum Initiation Interval
• Resource minimum II:
  – Due to the limited # of functional units
  – ResMII = (uses of a functional unit) / (# of functional units)
• Recurrence minimum II:
  – Due to loop-carried dependencies
• Minimum II = max(ResMII, RecMII)
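The two bounds combine as a couple of lines of arithmetic (helper names mine; ResMII is rounded up, since an II must be a whole number of cycles):

```c
/* Minimum initiation interval bounds. */
int res_mii(int uses, int units) {
    return (uses + units - 1) / units;   /* ceil(uses / units) */
}

int min_ii(int res_mii_val, int rec_mii_val) {
    return res_mii_val > rec_mii_val ? res_mii_val : rec_mii_val;
}
```

For the running loop, the 3 memory operations (2 loads + 1 store) share 1 port, so ResMII = 3; with no loop-carried dependencies, taking the trivial RecMII of 1 gives a minimum II of 3, matching the next slides.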
Resource Constraints
• Assume unlimited functional units (adders, …)
• The only constraint: the single-ported memory controller
• Reservation table: one row per resource, tracking which cycle slots it occupies
• The resource minimum initiation interval is 3 (3 memory operations, 1 port)
Iterative Modulo Scheduling
• There are no loop-carried dependencies, so Minimum II = ResMII = 3
• Iterative: it is not always possible to schedule the loop at the minimum II
[Flowchart: start with II = minII; attempt to modulo schedule the loop with this II; on failure, set II = II + 1 and retry; on success, done.]
Iterative Modulo Scheduling
• An operation of the loop that executes in cycle i
• also executes in cycles i + k*II, for k = 0 to N-1
• Therefore, to detect resource conflicts, look in the reservation table under slot:
  – (i – 1) mod II + 1
• Hence the name “modulo scheduling”
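The 1-based slot formula is worth spelling out, since it is what makes the reservation table only II entries wide (function name mine):

```c
/* Modulo reservation table slot for an operation scheduled in 1-based
 * cycle i with initiation interval ii: slot = (i-1) mod ii + 1. */
int mrt_slot(int cycle, int ii) {
    return (cycle - 1) % ii + 1;
}
```

For the example on the next slide, a store placed in cycle 6 with II = 3 lands in slot 3; cycles 1 and 4 both map to slot 1, which is exactly how conflicts between overlapped iterations are detected.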
New Pipelined Schedule
Modulo Reservation Table
• The store couldn’t be scheduled in cycle 6
  – Slot = (6 – 1) mod 3 + 1 = 3
  – Already taken by an earlier load
Iterative Modulo Scheduling
• Now we have a valid schedule for II = 3
• We need to construct the loop kernel, prologue, and epilogue
• The loop kernel is what executes when the pipeline is in steady state
  – The kernel is executed every II cycles
• First, we divide the schedule into stages of II cycles each
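Cutting the schedule into stages is simple index arithmetic (helper names mine; the 7-cycle schedule length is an assumption consistent with the 3-stage, II = 3 pipeline shown next):

```c
/* Stages of II cycles each: the op in 1-based cycle i belongs to stage
 * (i-1)/II + 1, and the stage count is ceil(schedule_length / II). */
int num_stages(int sched_len, int ii) { return (sched_len + ii - 1) / ii; }
int stage_of(int cycle, int ii)       { return (cycle - 1) / ii + 1; }
```

With an assumed 7-cycle schedule and II = 3 this gives 3 stages, and the store in cycle 7 falls in stage 3, the last to drain in the epilogue.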
Pipeline Stages
[Diagram: the schedule divided into stages 1, 2, and 3 of II cycles each.]
Pipelined Loop Iterations
[Diagram: iterations i=0..4 overlap, each offset by II = 3 cycles. The first iterations fill the pipeline (prologue), then all three stages are busy on consecutive iterations (kernel, steady state), and the last iterations drain it (epilogue).]
Loop Dependencies
  for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
      a[j] = b[i] + a[j-1];   // depends on the previous iteration

• Loop-carried dependencies may cause a non-zero recurrence minimum II
• Several papers in FPGA 2013 deal with discovering/optimizing loop dependencies
Limitations and Current Research
LegUp HLS Limitations
• HLS will likely do better on the datapath-oriented parts of a design
• Results are likely quite sensitive to how loops are structured in your C code
• It is difficult for HLS to “beat” optimized, structured HW design
FPGA/Altera-Specific Aspects of LegUp
• Memory
  – On-chip (AltSyncRAM), off-chip (DDR2/SDRAM controller)
• IP cores
  – Divider, floating-point units
• On-chip SoC interconnect
  – Avalon interface
• LegUp-generated Verilog is fairly FPGA-agnostic:
  – Not difficult to migrate to target ASICs
Current Research Work
• Impact of compiler optimizations on HLS
• Enhanced parallel accelerator support
  – Combining Pthreads + OpenMP
• Smaller processor
• Improved loop pipelining
• Software fallback for bitwidth-optimized accelerators
• Enhanced GUI to display the CDFG connected with the schedule
Current Work: PCIe Support
• Enable the use of LegUp-generated accelerators in an HPC environment
  – Communicating with an x86 processor via PCIe
• Message passing or memory transfers
  – Software API for fpga_malloc, fpga_free, send, receive
• DE4 / Stratix IV support in the next LegUp release
On to the Labs!