High-Level Synthesis with LegUp: A Crash Course for Users and Researchers
Jason Anderson, Stephen Brown, Andrew Canis, Jongsok (James) Choi
Dept. of Electrical and Computer Engineering, University of Toronto
11 February 2013, ACM FPGA Symposium, Monterey, CA
Tutorial Outline
• Overview of LegUp and its algorithms (60 min)
• Labs (“hands on” via VirtualBox)
  – Lab 1: Using the LegUp framework (30 min)
  – Break
  – Lab 2: Adding resource constraints (30 min)
  – Lab 3: Changing how LegUp implements hardware (30 min)
Project Motivation
• Hardware design has advantages over software:
  – Speed
  – Energy efficiency
• Hardware design is difficult and the skills are rare:
  – 10 software engineers for every hardware engineer*
• We need a CAD flow that simplifies hardware design for software engineers
*US Bureau of Labour Statistics ‘08
Top-Level Vision
[Diagram: program code is compiled (C compiler) for a self-profiling MIPS processor, which gathers profiling data (execution cycles, power, cache misses). High-level synthesis then targets the suggested program segments to the FPGA fabric, producing hardened program segments and an altered SW binary that calls the HW accelerators.]
  int FIR(int ntaps, int sum) {
    int i;
    for (i = 0; i < ntaps; i++)
      sum += h[i] * z[i];
    return sum;
  }
LegUp: Key Features
• C to Verilog high-level synthesis
• Many benchmarks (incl. 12 CHStone)
• MIPS processor (Tiger)
• Hardware profiler
• Automated verification tests
• Open source, freely downloadable
  – Like ABC (synthesis) or VPR (place & route)
  – 600+ downloads since March 2011
  – http://legup.eecg.utoronto.ca
System Architecture
[Diagram: inside the FPGA, a MIPS processor and hardware accelerators communicate over the Avalon interface with a memory controller and on-chip cache; off-chip memory sits on an Altera DE2 or DE4 board (Cyclone II or Stratix IV).]
High-Level Synthesis Framework
• Leverage the LLVM compiler infrastructure:
  – Language support: C/C++
  – Standard compiler optimizations
  – More on this shortly
• We support a large subset of ANSI C:

  Supported             Unsupported
  ---------             -----------
  Functions             Dynamic memory
  Arrays, structs       Recursion
  Global variables
  Pointer arithmetic
  Floating point
Address Hash (in hardware)

  tAddr += V1
  tAddr += (tAddr << 8)
  tAddr ^= (tAddr >> 4)
  b = (tAddr >> B1) & B2
  a = (tAddr + (tAddr << A1)) >> A2
  fNum = a ^ tab[b]
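The hash above can be sketched as plain C. The slides leave the constants V1, B1, B2, A1, A2 and the contents of tab[] unspecified, so the values below are purely illustrative placeholders, not LegUp's actual parameters:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants: the slide does not give V1, B1, B2, A1, A2
 * or the tab[] contents, so these values are illustrative only. */
#define V1 0x9E3779B9u  /* assumed mixing constant */
#define B1 4u           /* assumed shift into the table-index field */
#define B2 0xFFu        /* assumed mask: a 256-entry table */
#define A1 3u
#define A2 7u

static uint32_t tab[256]; /* assumed per-design lookup table (zeros here) */

/* Hash a function's target address down to a small function number. */
uint32_t addr_hash(uint32_t tAddr) {
    tAddr += V1;
    tAddr += tAddr << 8;
    tAddr ^= tAddr >> 4;
    uint32_t b = (tAddr >> B1) & B2;             /* table index */
    uint32_t a = (tAddr + (tAddr << A1)) >> A2;
    return a ^ tab[b];                           /* fNum */
}
```

Because every step is a shift, add, or xor, the same computation maps directly onto cheap combinational logic in the profiler.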
Hardware Profiler Architecture
[Diagram: the profiler monitors the MIPS instruction bus; an op decoder detects call/ret, a hardware hash maps the target address to a function number, a call stack tracks the current function, and a data counter (incremented when the PC changes) is committed to a counter storage memory with one entry per function. See our IEEE ASAP’11 paper.]
• Monitor instr. bus to detect function call/ret.
• Call: Hash (in HW) from function address to index; push to stack.
• Ret: pop function index from stack.
• Use function indexes to associate profiling data (e.g. cycles, power) with counters.
Processor/Accelerator Hybrid Flow
  int main() {
    ...
    sum = dotproduct(N);
    ...
  }

  int dotproduct(int N) {
    ...
    for (i = 0; i < N; i++) {
      sum += A[i] * B[i];
    }
    return sum;
  }
  #define dotproduct_DATA   (volatile int *) 0xf0000000
  #define dotproduct_STATUS (volatile int *) 0xf0000008
  #define dotproduct_ARG1   (volatile int *) 0xf000000C

  int legup_dotproduct(int N) {
    *dotproduct_ARG1 = (volatile int) N;
    *dotproduct_STATUS = 1;
    return *dotproduct_DATA;
  }
Processor/Accelerator Hybrid Flow
• Mark the function for hardware:

  set_accelerator_function "dotproduct"

• HLS then compiles dotproduct into a HW accelerator; main remains in software on the processor.
Processor/Accelerator Hybrid Flow
• On the software side, the original call becomes a call to the memory-mapped wrapper, which runs on the MIPS processor and drives the accelerator through its registers:

  sum = legup_dotproduct(N);
How Does LegUp Handle Memory and Pointers?
• LegUp stores each array in a separate FPGA BRAM
• The BRAM data width matches the width of the array’s data
• Each BRAM is identified by a 9-bit tag
• Addresses consist of the RAM tag and the array index:

  bits 31–23: 9-bit tag    bits 22–0: 23-bit index

• A shared memory controller uses the tag bits to determine which BRAM to read or write
• The array index is the address passed to the BRAM
Pointer Example
• We have two arrays in the C function:
  – int A[100], B[100]
• Tag 0 is reserved for NULL pointers
• Tag 1 is reserved for off-chip memory
• Assign tag 2 to array A and tag 3 to array B
• Address of A[3]: tag = 2, index = 3
• Address of B[7]: tag = 3, index = 7
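The tag/index split can be sketched as plain bit packing. The widths (9-bit tag in bits 31..23, 23-bit index in bits 22..0) come from the slides; the helper names are mine:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of LegUp's pointer encoding: a 9-bit BRAM tag in bits 31..23
 * and a 23-bit array index in bits 22..0 of a 32-bit address. */
#define INDEX_BITS 23u
#define INDEX_MASK ((1u << INDEX_BITS) - 1u)

uint32_t make_addr(uint32_t tag, uint32_t index) {
    return (tag << INDEX_BITS) | (index & INDEX_MASK);
}

uint32_t addr_tag(uint32_t addr)   { return addr >> INDEX_BITS; }
uint32_t addr_index(uint32_t addr) { return addr & INDEX_MASK; }
```

For the slide's example, the address of A[3] packs as make_addr(2, 3) and B[7] as make_addr(3, 7); the memory controller's tag extraction is just addr_tag().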
Shared Memory Controller
• Both arrays A and B are stored in 100-element BRAMs
• Load from pointer D with tag = 2, index = 13:
[Diagram: the controller uses the tag (2) to select the BRAM holding A and passes the index (13) as the BRAM address, returning A[13].]
Core Benchmarks (+Many More)
• 12 CHStone benchmarks (JIP’09) and Dhrystone
  – Too large/complex for academic HLS tools
• Include golden input/output test vectors
• Not supported by academic tools

  Category    Benchmarks                                  Lines of C code
  Arithmetic  64-bit double precision: add, mult, div, sin   376 – 755
  Encryption  AES, Blowfish, SHA                             716 – 1,406
  Processor   MIPS processor                                 232
  Media       JPEG decoder, Motion, GSM, ADPCM               393 – 1,692
  General     Dhrystone                                      491
Experimental Results
LegUp 1.0 (2011) for Cyclone II
1. Pure software on MIPS
Hybrid (software/hardware):
2. Second most compute-intensive function (and descendants) in H/W
3. Same as 2, but with the most compute-intensive function also in H/W
4. Pure hardware using LegUp
5. Pure hardware using eXCite (commercial tool)
Experimental Results
[Chart: geometric means of execution time and number of LEs for MIPS-SW, LegUp-Hybrid2, LegUp-Hybrid1, LegUp-HW, and eXCite-HW.]
Comparison: LegUp vs. eXCite
• Benchmarks compiled to hardware
• eXCite: commercial high-level synthesis tool
• eXCite couldn’t compile Dhrystone

  Geomean               LegUp    eXCite   LegUp/eXCite
  Circuit runtime (μs)  292      357      0.82 (1.22x)
  Logic elements        15,646   13,101   1.19
  Area-delay product    4.57M    4.68M    0.98
Energy Consumption
[Chart: geometric-mean energy (μJ) for MIPS-SW, LegUp-Hybrid2, LegUp-Hybrid1, LegUp-HW, and eXCite-HW.]
• Pure hardware uses 18x less energy than software
Current Release: LegUp 3.0
• Loop pipelining
• Dual- and multi-ported memory support
• Bitwidth minimization
• Multi-pumping DSP units for area reduction
• Alias analysis for dependency checks
• Parallel accelerators via Pthreads and OpenMP
Results are now considerably better than the LegUp 1.0 release
LegUp 3.0 vs. LegUp 1.0
[Chart: LegUp 3.0 / LegUp 1.0 ratios of wall-clock time, cycles, Fmax, and LEs for each CHStone benchmark (adpcm, aes, blowfish, dfadd, dfdiv, dfmul, dfsin, dhrystone, gsm, jpeg, mips, motion, sha) and the geomean.]
• Wall-clock time: 16% better
• Cycle latency: 31% better
• Fmax: 18% worse
• LEs (area): 28% better
LLVM Compiler and HLS Algorithms
LLVM Compiler
• Open-source compiler framework.– http://llvm.org
• Used by Apple, NVIDIA, AMD, others.• Competitive quality with gcc.• LegUp HLS is a “back-end” of LLVM.
• LLVM: low-level virtual machine.
LLVM Compiler
• LLVM compiles C code into a control flow graph (CFG)
• LLVM performs standard optimizations
  – 50+ different optimizations in LLVM

C program:

  int FIR(int ntaps, int sum) {
    int i;
    for (i = 0; i < ntaps; i++)
      sum += h[i] * z[i];
    return sum;
  }

[Diagram: the LLVM compiler turns the C program into a CFG with basic blocks BB0, BB1, BB2.]
Control Flow Graph
• A control flow graph is composed of basic blocks
• A basic block is a sequence of instructions terminated by exactly one branch
  – It can be represented by an acyclic data flow graph
[Diagram: a CFG with basic blocks BB0, BB1, BB2; one block’s data flow graph chains loads into adds and a final store.]
LLVM Details
• Instructions in basic blocks are primitive computational operations:
  – shift, add, divide, xor, and, etc.
• or control-flow operations:
  – branch, call, etc.
• The CDFG is represented in LLVM’s intermediate representation (IR)
  – The IR is machine-independent assembly code
High-Level Synthesis Flow
[Diagram: a C program passes through the C compiler (LLVM) to produce optimized LLVM IR, then through Allocation, Scheduling, and Binding to RTL generation, yielding synthesizable Verilog. User constraints (timing, resource) and target H/W characterization feed the HLS stages.]
Scheduling
• Scheduling is the task of assigning operations to clock cycles, realized with a finite state machine
[Diagram: a data flow graph of loads, adds, and a store mapped onto FSM states 0–3.]
Binding
• Binding is the task of assigning scheduled operations to functional units in the datapath
[Diagram: the scheduled loads, adds, and store are bound to a 2-port RAM, an adder, and flip-flops in the datapath.]
High-Level Synthesis: Scheduling
SDC Scheduling
• SDC = System of Difference Constraints
  – Cong and Zhang, “An efficient and versatile scheduling algorithm based on SDC formulation,” DAC 2006, pp. 433–438.
• Basic idea: formulate scheduling as a mathematical optimization problem
  – Linear objective function + linear constraints (==, <=, >=)
• The problem is a linear program (LP)
  – Solvable in polynomial time with standard solvers
Define Variables
• For each operation i to schedule, create a variable t_i
• The t_i’s will hold the cycle # in which each op is scheduled
• Here we have: t_add, t_shift, t_sub
[Diagram: a small DFG with an add and a shift feeding a subtract.]
• The data flow graph (DFG) is already accessible in LLVM
Dependency Constraints
• In this example, the subtract can only happen after the add and the shift:
  – t_sub – t_add >= 0
  – t_sub – t_shift >= 0
• Hence the name difference constraints
Handling Clock Period Constraints
• Target clock period: P (e.g., 10 ns)
• For each chain of dependent operations in the DFG, estimate the path delay D using LegUp’s delay models
  – E.g., D from mod -> or = 23 ns
• Compute R = ceiling(D/P) – 1
  – E.g., R = 2
• Add the difference constraint:
  – t_or – t_mod >= 2
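The chaining computation above is a one-liner; a minimal sketch (the function name is mine):

```c
#include <assert.h>

/* Clock-period (chaining) constraint from the slides: given an estimated
 * combinational path delay D and a target period P, the two endpoint
 * operations must be at least R = ceil(D/P) - 1 cycles apart. */
int min_cycle_separation(int delay_ns, int period_ns) {
    int ceil_div = (delay_ns + period_ns - 1) / period_ns; /* ceil(D/P) */
    return ceil_div - 1;
}
```

With the slide's numbers (D = 23 ns, P = 10 ns) this yields R = 2, matching the constraint t_or – t_mod >= 2; a chain that fits within one period (D <= P) yields R = 0, i.e., the ops may share a cycle.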
Resource Constraints
• A restriction on the # of operations of a given type that can execute in a cycle
• Why do we need it?
  – We want to use dual-port RAMs in the FPGA
    • Allow up to 2 load/store operations per cycle
  – Floating point
    • Do not want to instantiate many FP cores of a given type, probably just one
    • Scheduling must honour the # of FP cores available
Resource Constraints in SDC
• Resource-constrained scheduling is NP-hard
• LegUp implements the approach in [Cong & Zhang, DAC 2006]
[Diagram: a DFG of eight additions A–H; say we want to schedule it with only 2 adders in the HW (Lab 2).]
Add SDC Constraints
• Generate a topological ordering of the resource-constrained operations
• Say we are constrained to 2 adders in HW
• Starting at C in the ordering, create the constraint: t_C – t_A > 0
• Next, consider E and add the constraint: t_E – t_B > 0
• Continue to the end of the ordering
• The resulting schedule will have <= 2 adds per cycle

Topological ordering: A B C E F D G H
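The constraint generation above amounts to one pass over the ordering: with R units, the k-th op must start strictly after the (k-R)-th. A minimal sketch under that reading (function name mine, ordering from the slide):

```c
#include <stdio.h>

/* Walk a topological ordering of the resource-constrained ops and, with
 * r units available, force op k to start strictly after op k-r.
 * Prints the difference constraints; returns how many were generated. */
int emit_resource_constraints(const char *order, int n, int r) {
    int constraints = 0;
    for (int k = r; k < n; k++) {
        /* difference constraint: t_order[k] - t_order[k-r] > 0 */
        printf("t_%c - t_%c > 0\n", order[k], order[k - r]);
        constraints++;
    }
    return constraints;
}
```

For the 8-adder example with the ordering A B C E F D G H and r = 2, the first two constraints printed are t_C – t_A > 0 and t_E – t_B > 0, exactly as on the slide.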
ASAP Objective Function
• Minimize the sum of the variables: min Σ_i t_i
• Operations will be scheduled as early as possible, subject to the constraints
• The LP is solvable in polynomial time
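One way to see why this ASAP objective is easy: when every constraint is a difference t_v – t_u >= w over a dependence DAG, the minimal solution is a longest-path pass in topological order. This is an illustrative sketch only (the slides say LegUp hands the LP to a standard solver); the 3-op graph and the weight-1 edges (modelling each producer taking one cycle) are assumptions:

```c
/* Illustrative ASAP scheduling as longest path over difference
 * constraints "t[to] - t[from] >= w", with edges given in topological
 * order. Not LegUp's actual solver. */
void asap(const int *from, const int *to, const int *w,
          int num_edges, int *t, int num_ops) {
    for (int i = 0; i < num_ops; i++)
        t[i] = 0;                         /* start as early as possible */
    for (int e = 0; e < num_edges; e++)   /* relax each edge once */
        if (t[to[e]] < t[from[e]] + w[e])
            t[to[e]] = t[from[e]] + w[e];
}
```

For the add/shift/sub example (ops 0, 1, 2) with edges 0→2 and 1→2 of weight 1, this places add and shift in cycle 0 and sub in cycle 1.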
High-Level Synthesis: Binding
• Weighted bipartite matching-based binding
  – Huang, Chen, Lin, and Hsu, “Data path allocation based on bipartite weighted matching,” DAC 1990, pp. 499–504.
• Finds the minimum-weight matching of a bipartite graph at each step
  – Solved using the Hungarian method (polynomial time)
[Diagram: a bipartite graph with operations on one side, hardware functional units on the other, and edge costs between them.]
Binding
• Bind the following scheduled program (operations in states 0–3)
• Resource sharing: the schedule requires 3 multipliers (the maximum used in any one cycle)
• Bind one cycle at a time, matching that cycle’s operations to the functional units
[Diagram: binding the first through fourth cycles assigns each multiply to one of the 3 multiplier units; by the fourth cycle every unit has been reused across cycles.]
• Required multiplexing: operations sharing a functional unit add multiplexers at its inputs
High-Level Synthesis: Challenges
• It is easy to extract instruction-level parallelism from dependencies within a basic block
• But C code is inherently sequential, and it is difficult to extract higher-level parallelism
• Coarse-grained parallelism:
  – function pipelining
• Fine-grained parallelism:
  – loop pipelining
Loop Pipelining
Motivating Example

  for (int i = 0; i < N; i++) {
    sum[i] = a + b + c + d;
  }

[Diagram: the three additions form a chain scheduled in cycles 1–3.]
• Cycles: 3N
• Adders: 3
• Utilization: 33%
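The cycle counts quoted here and on the next slide can be sketched directly (helper names mine): sequentially each iteration takes the 3-cycle adder chain, while the pipelined version with II = 1 starts one iteration per cycle and needs 2 extra cycles to drain:

```c
/* Cycle counts for the 3-adder chain example. */
int sequential_cycles(int n) { return 3 * n; }  /* 3 cycles/iteration */
int pipelined_cycles(int n)  { return n + 2; }  /* II = 1, 2 fill/drain */
```

For N = 100, that is 300 cycles sequentially versus 102 pipelined, roughly a 3x improvement with the same 3 adders.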
Loop Pipelining
[Diagram: with II = 1, iteration i starts in cycle i+1; each iteration’s three adds overlap with neighbouring iterations, and from cycle 3 onward the pipeline is in steady state.]
• Cycles: N+2 (~1 cycle per iteration)
• Adders: 3
• Utilization: 100% in steady state
Loop Pipelining Example
  for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
  }

• Each iteration requires:
  – 2 loads from memory
  – 1 store
• There are no dependencies between iterations
Loop Pipelining Example
  for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
  }

• Cycle latency of operations:
  – Load: 2 cycles
  – Store: 1 cycle
  – Add: 1 cycle
• Single memory port
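A back-of-the-envelope sketch of one iteration's latency under these numbers, assuming the two loads must serialize on the single memory port and the add and store then follow in sequence (function name mine):

```c
/* One iteration of a[i] = b[i] + c[i] with a single memory port:
 * the two loads serialize, then the add, then the store. */
int iteration_cycles(int load_lat, int store_lat, int add_lat) {
    return 2 * load_lat + add_lat + store_lat;
}
```

With load = 2, add = 1, store = 1 this gives 6 cycles per iteration, consistent with the II = 6 figure quoted a few slides later.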
LLVM Instructions

  for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
  }

  %i.04 = phi i32 [ 0, %bb.nph ], [ %3, %bb ]
  %scevgep5 = getelementptr %b, %i.04
  %0 = load %scevgep5
  %scevgep6 = getelementptr %c, %i.04
  %1 = load %scevgep6
  %2 = add nsw i32 %1, %0
  %scevgep = getelementptr %a, %i.04
  store %2, %scevgep
  %3 = add %i.04, 1
  %exitcond = eq %3, 100
  br %exitcond, %bb2, %bb
Scheduling LLVM Instructions

  for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
  }

• Each iteration requires:
  – 2 loads from memory
  – 1 store
• There are no dependencies between iterations
[Diagram: the cycle-by-cycle schedule of one iteration; the second load cannot start until the first finishes — a memory port conflict.]
Loop Pipelining Example
  for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
  }

• Initiation interval (II): the constant number of cycles between starting successive iterations of the loop
• This loop requires 6 cycles per iteration (II = 6)
• Can we do better?
Minimum Initiation Interval
• Resource minimum II:
  – Due to the limited # of functional units
  – ResMII = (uses of a functional unit) / (# of functional units)
• Recurrence minimum II:
  – Due to loop-carried dependencies
• Minimum II = max(ResMII, RecMII)
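The two bounds combine as a couple of lines of arithmetic (helper names mine; ResMII is rounded up, since an II must be a whole number of cycles):

```c
/* Minimum initiation interval bounds. */
int res_mii(int uses, int units) {
    return (uses + units - 1) / units;   /* ceil(uses / units) */
}

int min_ii(int res_mii_val, int rec_mii_val) {
    return res_mii_val > rec_mii_val ? res_mii_val : rec_mii_val;
}
```

For the running loop, the 3 memory operations (2 loads + 1 store) share 1 port, so ResMII = 3; with no loop-carried dependencies, taking the trivial RecMII of 1 gives a minimum II of 3, matching the next slides.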
Resource Constraints
• Assume unlimited functional units (adders, …)
• The only constraint: the single-ported memory controller
• Reservation table: one row per resource, tracking which cycle slots it occupies
• The resource minimum initiation interval is 3 (3 memory operations, 1 port)
Iterative Modulo Scheduling
• There are no loop-carried dependencies, so Minimum II = ResMII = 3
• Iterative: it is not always possible to schedule the loop at the minimum II
[Flowchart: start with II = minII; attempt to modulo schedule the loop with this II; on failure, set II = II + 1 and retry; on success, done.]
Iterative Modulo Scheduling
• An operation of the loop that executes in cycle i
• also executes in cycles i + k*II, for k = 0 to N-1
• Therefore, to detect resource conflicts, look in the reservation table under slot:
  – (i – 1) mod II + 1
• Hence the name “modulo scheduling”
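The 1-based slot formula is worth spelling out, since it is what makes the reservation table only II entries wide (function name mine):

```c
/* Modulo reservation table slot for an operation scheduled in 1-based
 * cycle i with initiation interval ii: slot = (i-1) mod ii + 1. */
int mrt_slot(int cycle, int ii) {
    return (cycle - 1) % ii + 1;
}
```

For the example on the next slide, a store placed in cycle 6 with II = 3 lands in slot 3; cycles 1 and 4 both map to slot 1, which is exactly how conflicts between overlapped iterations are detected.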
New Pipelined Schedule
Modulo Reservation Table
• The store couldn’t be scheduled in cycle 6
  – Slot = (6 – 1) mod 3 + 1 = 3
  – Already taken by an earlier load
Iterative Modulo Scheduling
• Now we have a valid schedule for II = 3
• We need to construct the loop kernel, prologue, and epilogue
• The loop kernel is what executes when the pipeline is in steady state
  – The kernel is executed every II cycles
• First, we divide the schedule into stages of II cycles each
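Cutting the schedule into stages is simple index arithmetic (helper names mine; the 7-cycle schedule length is an assumption consistent with the 3-stage, II = 3 pipeline shown next):

```c
/* Stages of II cycles each: the op in 1-based cycle i belongs to stage
 * (i-1)/II + 1, and the stage count is ceil(schedule_length / II). */
int num_stages(int sched_len, int ii) { return (sched_len + ii - 1) / ii; }
int stage_of(int cycle, int ii)       { return (cycle - 1) / ii + 1; }
```

With an assumed 7-cycle schedule and II = 3 this gives 3 stages, and the store in cycle 7 falls in stage 3, the last to drain in the epilogue.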
Pipeline Stages
[Diagram: the schedule divided into stages 1, 2, and 3 of II cycles each.]
Pipelined Loop Iterations
[Diagram: iterations i=0..4 overlap, each offset by II = 3 cycles. The first iterations fill the pipeline (prologue), then all three stages are busy on consecutive iterations (kernel, steady state), and the last iterations drain it (epilogue).]
Loop Dependencies
  for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
      a[j] = b[i] + a[j-1];   // depends on the previous iteration

• Loop-carried dependencies may cause a non-zero recurrence minimum II
• Several papers in FPGA 2013 deal with discovering/optimizing loop dependencies
Limitations and Current Research
LegUp HLS Limitations
• HLS will likely do better on the datapath-oriented parts of a design
• Results are likely quite sensitive to how loops are structured in your C code
• It is difficult for HLS to “beat” optimized, structured HW design
FPGA/Altera-Specific Aspects of LegUp
• Memory
  – On-chip (AltSyncRAM), off-chip (DDR2/SDRAM controller)
• IP cores
  – Divider, floating-point units
• On-chip SoC interconnect
  – Avalon interface
• LegUp-generated Verilog is fairly FPGA-agnostic:
  – Not difficult to migrate to target ASICs
Current Research Work
• Impact of compiler optimizations on HLS
• Enhanced parallel accelerator support
  – Combining Pthreads + OpenMP
• Smaller processor
• Improved loop pipelining
• Software fallback for bitwidth-optimized accelerators
• Enhanced GUI to display the CDFG connected with the schedule
Current Work: PCIe Support
• Enable the use of LegUp-generated accelerators in an HPC environment
  – Communicating with an x86 processor via PCIe
• Message passing or memory transfers
  – Software API for fpga_malloc, fpga_free, send, receive
• DE4 / Stratix IV support in the next LegUp release
On to the Labs!