Warp Processing – Dynamic Transparent Conversion of Binaries to Circuits

Frank Vahid, Professor
Department of Computer Science and Engineering, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine

Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, and Motorola/Freescale.
Contributing students: Roman Lysecky (PhD 2005, now assistant professor at U. Arizona), Greg Stitt (PhD 2006), Kris Miller (MS 2007), David Sheldon (3rd-year PhD), Ryan Mannion (2nd-year PhD), Scott Sirowy (1st-year PhD)
Frank Vahid, UC Riverside
Outline
- FPGAs: overview; hard to program --> binary-level partitioning
- Warp processing
- Techniques underlying warp processing
- Overall warp processing results
- Directions and summary
FPGAs
- FPGA – Field-Programmable Gate Array: an off-the-shelf chip that evolved in the early 1990s
- Implements a custom circuit just by downloading a stream of bits ("software")
- Basic idea: an N-address memory can implement N-input combinational logic (note: there is no "gate array" inside); the memory is called a lookup table, or LUT
- FPGA "fabric": thousands of small (~3-input) LUTs (larger LUTs are inefficient), thousands of switch matrices (SMs) for programming the interconnections, and possibly additional hard-core components such as multipliers and RAM
- CAD tools automatically map a desired circuit onto the FPGA fabric
[Figure: a 4x2 memory acts as a 2-input, 2-output "lookup table" (LUT): the inputs a and b drive address lines a1 and a0, and the stored data bits d1 and d0 produce outputs F and G. A particular circuit is implemented just by downloading particular bits. The fabric is a grid of LUTs connected through switch matrices (SMs).]
FPGAs: "Programmable" like Microprocessors – Download Bits
- Microprocessor binaries: bits loaded into program memory
- FPGA binaries: bits loaded into LUTs, CLBs, and SMs
- Configurable logic block (CLB): a LUT plus flip-flops
- SM (switch matrix): programmable interconnect that routes a chosen input (e.g., a or b) onto a chosen output
[Figure: side-by-side comparison of a processor loading an instruction binary into program memory and an FPGA loading a configuration bitstream into its grid of CLBs and SMs.]
FPGAs as Coprocessors
- Coprocessor: accelerates an application kernel by implementing it as a circuit
- ASIC coprocessors are known to speed up many application kernels, with energy advantages too (e.g., Henkel '98, Rabaey '98, Stitt/Vahid '04)
- FPGA coprocessors also give speedup/energy benefits (Stitt/Vahid, IEEE D&T '02, IEEE TECS '04)
- Con: more silicon (~20x) and ~4x performance overhead versus an ASIC (Rose, FPGA '06)
- Pro: the platform is fully programmable – shorter time-to-market, smaller non-recurring engineering (NRE) cost, low-cost devices available, and late changes possible (even in-product)
FPGAs as Coprocessors: Surprisingly Competitive with ASICs
- FPGA: 34% energy savings, versus the ASIC's 48% (Stitt/Vahid, IEEE D&T '02, IEEE TECS '04)
- A jet isn't as fast as a rocket, but it sure beats driving
[Figure: bar chart of energy as a percentage of software-only execution, comparing ASIC and FPGA coprocessors across benchmarks.]
FPGA – Why (Sometimes) Better than a Microprocessor

C code for bit reversal:

  x = (x >> 16) | (x << 16);
  x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
  x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
  x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
  x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);

On a processor, compilation yields a long binary sequence:

  sll $v1, $v0, 0x10
  srl $v0, $v0, 0x10
  or  $v0, $v1, $v0
  srl $v1, $v0, 0x8
  and $v1, $v1, $t5
  sll $v0, $v0, 0x8
  and $v0, $v0, $t4
  or  $v0, $v1, $v0
  srl $v1, $v0, 0x4
  and $v1, $v1, $t3
  sll $v0, $v0, 0x4
  and $v0, $v0, $t2
  ...

requiring between 32 and 128 cycles. Hardware for bit reversal is just wiring that routes each bit of the original X value to its reversed position – requiring only 1 cycle (a speedup of 32x to 128x).

In general, FPGAs win because of concurrency, from the bit level up to the task level.
FPGAs: Why (Sometimes) Better than a Microprocessor

C code for an FIR filter:

  for (i=0; i < 128; i++)
    y[i] += c[i] * x[i];

- On a processor: thousands of instructions, several thousand cycles
- On an FPGA: hardware with 128 parallel multipliers feeding an adder tree – about 7 cycles, for a speedup > 100x
FPGAs are Hard to Program
- Synthesis from hardware description languages (HDLs) such as VHDL and Verilog: great for expressing parallelism, but these are non-standard languages and require manual partitioning; SystemC is a good step
- C/C++ partitioning compilers: use a language subset; growing in importance, but the need for a special compiler limits adoption
- There are roughly 100 software writers for every CAD user: only about 15,000 CAD seats worldwide, versus millions of compiler seats
[Figure: toolflow – application source → special compiler (with profiling) → microprocessor binary plus a netlist, which synthesis, technology mapping, and place & route turn into an FPGA binary for the processor/FPGA platform.]
Binary-Level Partitioning Helps
- Binary-level partitioning (Stitt/Vahid, ICCAD '02); a recent commercial product: CriticalBlue [www.criticalblue.com]
- Partition and synthesize starting from the software binary
- Advantages: works with any compiler and any language, multiple sources, assembly/object support, legacy code support; better incorporation into the toolflow as a less disruptive, back-end tool
- Disadvantage: possible quality loss due to the lack of high-level language constructs? (More later)
[Figure: toolflow – a standard compiler (with profiling) produces a software binary; a binary-level partitioner then produces a modified binary plus a netlist for synthesis, technology mapping, and place & route. Traditional partitioning is instead done before compilation.]
Warp Processing
- Observation: dynamic binary recompilation to a different microprocessor architecture is a mature commercial technology – e.g., modern Pentiums translate x86 to VLIW on the fly
- Question: if we can dynamically recompile binaries to other processor architectures, can we dynamically recompile binaries to FPGA circuits?
Warp Processing Idea

The architecture: a microprocessor (µP) with instruction memory and data cache, an FPGA, a profiler, and an on-chip CAD block (also called the dynamic partitioning module, DPM), all on one chip.

1. Initially, the software binary is loaded into instruction memory:

     Mov reg3, 0
     Mov reg4, 0
   loop:
     Shl reg1, reg3, 1
     Add reg5, reg2, reg1
     Ld  reg6, 0(reg5)
     Add reg4, reg4, reg6
     Add reg3, reg3, 1
     Beq reg3, 10, -5
     Ret reg4

2. The microprocessor executes the instructions in the software binary.

3. The profiler monitors the instruction stream and detects critical regions in the binary – here, a critical loop whose add/beq instructions dominate execution time.

4. The on-chip CAD reads in the critical region.

5. The on-chip CAD decompiles the critical region into a control/data flow graph (CDFG):

   reg3 := 0
   reg4 := 0
   loop:
     reg4 := reg4 + mem[reg2 + (reg3 << 1)]
     reg3 := reg3 + 1
     if (reg3 < 10) goto loop
   ret reg4

6. The on-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit – e.g., a tree of adders summing many array elements at once.

7. The on-chip CAD maps the circuit onto the FPGA's CLBs and switch matrices.

8. The on-chip CAD replaces instructions in the binary with instructions that interact with the FPGA, causing performance and energy to "warp" by an order of magnitude or more:

     Mov reg3, 0
     Mov reg4, 0
   loop:
     // instructions that interact with FPGA
     Ret reg4

[Figure: time/energy bars comparing software-only execution with the much smaller "warped" execution.]
Warp Processing Idea
- There will likely be multiple microprocessors per chip, all serviced by one on-chip CAD block.
Warp Processing: Trend Towards Processor/FPGA Programmable Platforms
- FPGAs with hard-core processors: Xilinx Virtex-II Pro (source: Xilinx), Altera Excalibur (source: Altera)
- FPGAs with soft-core processors: Xilinx Spartan (source: Xilinx)
- Computer boards with FPGAs: Cray XD1 (source: FPGA Journal, Apr. '05)
Warp Processing: Trend Towards Processor/FPGA Programmable Platforms
- Programming is a key challenge
- Solution 1: compile a high-level language to custom binaries that use both the microprocessor and the FPGA
- Solution 2: use standard microprocessor binaries and dynamically re-compile (warp) them
  - Cons: less high-level information is available when compiling, so less optimization
  - Pros: available to all software developers, not just specialists; enables data-dependent optimization; most importantly, standard binaries enable an "ecosystem" among tools, architectures, and applications
- The standard-binary (and ecosystem) concept is presently absent in FPGAs and other new programmable platforms
Warp Processing Steps (On-Chip CAD)
The on-chip CAD takes the standard software binary through the following flow:
1. Profiling & partitioning
2. Decompilation
3. Synthesis
4. JIT FPGA compilation (technology mapping, placement, and routing) → FPGA binary
5. Binary updater → updated microprocessor binary
Warp Processing – Profiling and Partitioning
- Applications spend much of their time in a small amount of code: the 90-10 rule; we observed a 75-4 rule for MediaBench and NetBench
- Developed an efficient hardware profiler (Gordon-Ross/Vahid, CASES '04; IEEE Trans. on Computers '06)
- Partitioning is straightforward: try the most critical code first
[Figure: cumulative % of execution time versus % of program size for the top 10 code regions.]
Warp Processing – Decompilation
- Synthesis from a binary has a key challenge: high-level information (e.g., loops, arrays) is lost during compilation
- Direct translation of assembly to a circuit incurs huge overheads – we need to recover the high-level information
[Figure: speedup, energy, and size overheads of a microprocessor/FPGA solution WITHOUT decompilation, versus the microprocessor alone, for g3fax, adpcm, crc, des, engine, jpeg, summin, v42, and their average.]
Warp Processing – Decompilation
Solution: recover the high-level information from the binary via decompilation. Extensive previous work exists (for different purposes); we adapted it and also developed new decompilation methods.

Original C code:

  long f( short a[10] ) {
    long accum = 0;
    for (int i=0; i < 10; i++) {
      accum += a[i];
    }
    return accum;
  }

Corresponding assembly:

    Mov reg3, 0
    Mov reg4, 0
  loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld  reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

Control/data flow graph creation:

  reg3 := 0
  reg4 := 0
  loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
  ret reg4

Data flow analysis:

  reg3 := 0
  reg4 := 0
  loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
  ret reg4

Function recovery:

  long f( long reg2 ) {
    int reg3 = 0;
    int reg4 = 0;
    loop:
    reg4 = reg4 + mem[reg2 + (reg3 << 1)];
    reg3 = reg3 + 1;
    if (reg3 < 10) goto loop;
    return reg4;
  }

Control structure recovery:

  long f( long reg2 ) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
      reg4 += mem[reg2 + (reg3 << 1)];
    }
    return reg4;
  }

Array recovery:

  long f( short array[10] ) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
      reg4 += array[reg3];
    }
    return reg4;
  }

The recovered code and the original are almost identical representations.
New Decompilation Method: Loop Rerolling
- Problem: compiler unrolling of loops (to expose parallelism) causes synthesis problems: huge input (slow synthesis), can't re-unroll to the desired amount, and can't use advanced loop methods (loop pipelining, fusion, splitting, ...)
- Solution: a new decompilation method, loop rerolling: identify the unrolled iterations and compact them back into one iteration

Example – the compiler unrolls:

  for (int i=0; i < 3; i++)
    accum += a[i];

into:

  Ld  reg2, 100(0)
  Add reg1, reg1, reg2
  Ld  reg2, 100(1)
  Add reg1, reg1, reg2
  Ld  reg2, 100(2)
  Add reg1, reg1, reg2

and loop rerolling recovers:

  for (int i=0; i < 3; i++)
    reg1 += array[i];
Loop Rerolling: Identify Unrolled Iterations

Original C code:

  x = x + 1;
  for (i=0; i < 2; i++)
    a[i] = b[i] + 1;
  y = x;

Binary, with the loop unrolled (equivalent to x = x + 1; a[0] = b[0]+1; a[1] = b[1]+1; y = x;):

  Add r3, r3, 1
  Ld  r0, b(0)
  Add r1, r0, 1
  St  a(0), r1
  Ld  r0, b(1)
  Add r1, r0, 1
  St  a(1), r1
  Mov r4, r3

Map each instruction to a symbol (Add => B, Ld => A, St => C, Mov => D), giving the string representation BABCABCD. Then find consecutively repeating instruction sequences – here two unrolled iterations, each ABC (Ld, Add, St) – by locating adjacent nodes with the same substring in a suffix tree, a technique derived from bioinformatics.
Warp Processing – Decompilation Study (FPGA 2005)
Synthesis after decompilation is often quite similar to synthesis from the C source: almost identical performance, with small area overhead.

  Example         Synthesis from C code           Synthesis after decompiling binary   %Time    %Area
                  Cycles   ClkFrq  Time   Area    Cycles   ClkFrq  Time    Area        ovhd.    ovhd.
  bit_correlator  258      118     2.19   15      258      118     2.186   15          0%       0%
  fir             129      125     1.03   359     129      125     1.032   371         0%       3%
  udiv8           281      190     1.48   398     281      190     1.479   398         0%       0%
  prewitt         64516    123     525    2690    64516    123     524.5   4250        0%       58%
  mf9             258      57      4.5    1048    258      57      4.503   1048        0%       0%
  moravec         195072   66      2951   680     195072   70      2791    676         -6%      -1%
  Average:                                                                             -1%      10%
Deriving High-Level Constructs from Binaries: Robustness Study (DATE '04, ICCAD '05)
- A recent study of decompilation robustness in the presence of compiler optimizations and across instruction sets
- Energy savings of 77% / 76% / 87% for MIPS / ARM / MicroBlaze

Columns per configuration: Sw (normalized software-only time), Hw/Sw (normalized partitioned time), S (speedup). Configurations, left to right: MIPS -O1 | MIPS -O3 | ARM -O1 | ARM -O3 | MicroBlaze -O1 | MicroBlaze -O3.

  FIR Filter   1.000 0.089 11.2 | 0.923 0.070 14.2 | 1.000 0.085 11.8 | 0.999 0.084 11.9  | 1.000 0.040 25.3 | 0.549 0.015 68.4
  Beamformer   1.000 0.074 13.5 | 0.853 0.071 14.0 | 1.000 0.149 6.7  | 1.018 0.172 5.8   | 1.000 0.031 32.3 | 0.647 0.032 31.4
  Viterbi      1.000 0.136 7.4  | 0.891 0.152 6.6  | 1.000 0.131 7.6  | 0.957 0.126 7.9   | 1.000 0.060 16.7 | 0.765 0.017 59.0
  Crc          1.000 0.030 33.8 | 0.967 0.019 53.6 | 1.000 0.020 49.5 | 1.105 0.007 134.8 | 1.000 0.012 80.3 | 0.995 0.011 88.6
  Des          1.000 0.275 3.6  | 0.990 0.310 3.2  | 1.000 0.360 2.8  | 1.028 0.401 2.5   | 1.000 0.205 4.9  | 0.998 0.218 4.6
  Summin       1.000 0.111 9.0  | 0.899 0.145 6.9  | 1.000 0.183 5.5  | 0.684 0.128 7.8   | n/a              | n/a
  Brev         1.000 0.120 8.3  | 0.976 0.129 7.7  | 1.000 0.156 6.4  | 1.476 0.153 6.5   | 1.000 0.011 90.2 | 0.951 0.009 106.5
  BITMNP01     1.000 0.114 8.8  | 0.985 0.113 8.8  | 1.000 0.188 5.3  | 0.988 0.186 5.4   | 1.000 0.112 8.9  | 0.999 0.115 8.7
  IDCTRN01     1.000 0.323 3.1  | 0.975 0.323 3.1  | 1.000 0.230 4.4  | 1.005 0.230 4.3   | 1.000 0.258 3.9  | 0.885 0.150 6.7
  PNTRCH01     1.000 0.196 5.1  | 0.945 0.196 5.1  | 1.000 0.325 3.1  | 0.963 0.313 3.2   | n/a              | n/a
  Average:     1.000 0.147 10.4 | 0.940 0.153 12.3 | 1.000 0.183 10.3 | 1.022 0.180 19.0  | 1.000 0.091 32.8 | 0.849 0.071 46.7
  Geo. mean:   1.000 0.124 8.4  | 0.939 0.122 8.7  | 1.000 0.150 7.0  | 1.008 0.134 8.3   | 1.000 0.053 19.0 | 0.831 0.037 27.4
Decompilation is Effective Even with High Compiler-Optimization Levels
- The average speedup of the 10 examples is similar on MIPS for -O1 and -O3 optimizations
- Speedups are similar on ARM for -O1 and -O3, and similar between ARM and MIPS: the complex instructions of the ARM didn't hurt synthesis
- MicroBlaze speedups are much larger: MicroBlaze is a slower microprocessor, and its -O3 optimizations were very beneficial to hardware
[Figure: average-speedup bars for MIPS -O1, MIPS -O3, ARM -O1, ARM -O3, MicroBlaze -O1, and MicroBlaze -O3.]

Publication: New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005.
Decompilation Effectiveness: In-Depth Study
- Performed an in-depth, several-month study with Freescale on an H.264 video decoder – highly-optimized proprietary code, not reference code; a huge difference
- Research question: is synthesis from binaries competitive on highly-optimized code?
- Context: H.264 gives better quality, or smaller files, than MPEG-2 – using more computation
Optimized H.264
- Larger than most benchmarks: H.264 is 16,000 lines; previous work used 100 to several thousand lines
- Highly optimized: many man-hours of manual optimization, 10x faster than the reference code used in previous works
- Different profiling results: previous examples spent ~90% of time in several loops; H.264 spends ~90% of time in ~45 functions – harder to speed up

  Function name                  Instrs  %Time (cumulative)  Speedup (cumulative)
  MotionComp_00                  33      6.8%                1.1
  InvTransform4x4                63      12.5%               1.1
  FindHorizontalBS               47      16.7%               1.2
  GetBits                        51      20.8%               1.3
  FindVerticalBS                 44      24.7%               1.3
  MotionCompChromaFullXFullY     24      28.6%               1.4
  FilterHorizontalLuma           557     32.5%               1.5
  FilterVerticalLuma             481     35.8%               1.6
  FilterHorizontalChroma         133     39.0%               1.6
  CombineCoefsZerosInvQuantScan  69      42.0%               1.7
  memset                         20      44.9%               1.8
  MotionCompensate               167     47.7%               1.9
  FilterVerticalChroma           121     50.3%               2.0
  MotionCompChromaFracXFracY     48      53.0%               2.1
  ReadLeadingZerosAndOne         56      55.6%               2.3
  DecodeCoeffTokenNormal         93      57.5%               2.4
  DeblockingFilterLumaRow        272     59.4%               2.5
  DecodeZeros                    79      61.3%               2.6
  MotionComp_23                  279     63.0%               2.7
  DecodeBlockCoefLevels          56      64.6%               2.8
  MotionComp_21                  281     66.2%               3.0
  FindBoundaryStrengthPMB        44      67.7%               3.1
C vs. Binary Synthesis on Optimized H.264
- Binary partitioning is competitive with source-level partitioning
- Speedups compared to ARM9 software: binary 2.48, C 2.53
- Decompilation recovered nearly all the high-level information needed for partitioning and synthesis
[Figure: speedup versus number of functions in hardware (1 to 51), for C partitioning and binary partitioning; the two curves nearly coincide.]
Warp Processing – Synthesis
- ROCM – Riverside On-Chip Minimizer
- Standard register-transfer synthesis
- Logic synthesis made lean: a combination of approaches from Espresso-II [Brayton et al., 1984; Hassoun & Sasao, 2002] and Presto [Svoboda & White, 1979], with a cost/benefit analysis of operations
- Result: a single expand phase instead of the multiple expand/reduce/irredundant iterations (over the on-set, off-set, and dc-set); eliminating the need to compute the off-set reduces memory usage; on average only 2% larger than the optimal solution
Warp Processing – JIT FPGA Compilation
- Hard: routing is extremely compute- and memory-intensive, and a highly iterative process
- Solution: jointly design the CAD algorithms and the FPGA architecture, guided by cost/benefit analysis
Warp-Targeted FPGA Architecture (DATE '04)
- A CAD-specialized configurable logic fabric
- Simplified switch matrices: directly connected to the adjacent CLB, with all nets routed using only a single pair of channels – allows efficient routing (routing is by far the most time-consuming on-chip CAD task)
- Simplified CLBs: two 3-input, 2-output LUTs; each CLB is connected to the adjacent CLB to simplify routing of carry chains
- Currently being prototyped by Intel (scheduled for the 2006 Q3 shuttle)
[Figure: switch-matrix channel structure and a CLB containing two 3-input, 2-output LUTs with direct connections to adjacent CLBs.]
Warp Processing – Technology Mapping
- ROCTM – technology mapping/packing
- Decompose the hardware circuit into a DAG whose nodes are basic 2-input logic gates (AND, OR, XOR, etc.)
- Hierarchical bottom-up graph clustering: a breadth-first traversal combines nodes to form single-output LUTs, combines LUTs with common inputs to form the final 2-output LUTs, and packs LUT pairs in which the output of one LUT is an input to the second

Publications: Dynamic Hardware/Software Partitioning: A First Approach (DAC '03); A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning (DATE '04)
Warp Processing – Placement
- ROCPLACE – a dependency-based positional placement algorithm
- Identify the critical path and place the critical nodes in the center of the configurable logic fabric
- Use the dependencies between the remaining CLBs to determine their placement, attempting to use adjacent-CLB routing whenever possible

Publications: Dynamic Hardware/Software Partitioning: A First Approach (DAC '03); A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning (DATE '04)
Warp Processing – Routing
- ROCR – Riverside On-Chip Router
- Requires much less memory than VPR, because the routing resource graph is smaller
- 10x faster execution time than timing-driven VPR
- Produces circuits with critical paths 10% shorter than routability-driven VPR
[Figure: memory usage (KB) and execution time (s) per benchmark for routability-driven VPR, timing-driven VPR, and ROCR.]

Publication: Dynamic FPGA Routing for Just-in-Time FPGA Compilation (DAC '04)
Experiments with Warp Processing
- Warp processor: an ARM/MIPS core plus our fabric, with the Riverside on-chip CAD tools mapping the critical region to the configurable fabric; requires less than 2 seconds on a lean embedded processor to perform synthesis and JIT FPGA compilation
- Traditional HW/SW partitioning baseline: an ARM/MIPS core plus a Xilinx Virtex-E FPGA; the software was manually partitioned using VHDL, synthesized with Xilinx ISE 4.1
Warp Processors: Performance Speedup (Most Frequent Kernel Only)
- Average kernel speedup of 41x, versus 21x for the Virtex-E (individual kernels reach 191x, 113x, and 130x)
- The simplicity of the warp-targeted configurable logic results in faster hardware circuits
[Figure: per-benchmark kernel speedups over software-only execution, for the warp processor and the Xilinx Virtex-E.]
Warp Processors: Performance Speedup (Overall, Multiple Kernels)
- Average speedup of 7.4x; energy reduction of 38% – 94%
- Assumes a 100 MHz ARM, with the fabric clocked at the rate determined by synthesis
[Figure: per-benchmark overall speedups over software-only execution.]
Warp Processors – Results: Execution Time and Memory Requirements
- Xilinx ISE: 9.1 s, 60 MB
- On-chip CAD (DPM): 0.2 s, 3.6 MB
- On-chip CAD (DPM) running on a 75 MHz ARM7: 1.4 s, 3.6 MB
Direction: Coding Guidelines for Partitioning?
- The in-depth H.264 study led to a question: why aren't the speedups (from binary or C) closer to "ideal" (zero time per function)?
- We thus examined dozens of benchmarks in more detail: are there simple coding guidelines that result in better speedups when kernels are synthesized to circuits?
[Figure: speedup versus number of functions in hardware (1 to 51) for C partitioning, binary partitioning, and the ideal speedup (zero-time hardware execution); both real curves fall well short of ideal.]
Synthesis-Oriented Coding Guidelines
- Pass by value-return: declare a local array and copy in all the data needed by a function (makes the lack of aliases explicit)
- Function specialization: create a function version having frequent parameter values as constants

Original:

  void f(int width, int height) {
    . . . .
    for (i=0; i < width; i++)
      for (j=0; j < height; j++)
        . . .
    . . .
  }

Rewritten (specialized for width = height = 4):

  void f_4_4() {
    . . . .
    for (i=0; i < 4; i++)
      for (j=0; j < 4; j++)
        . . .
    . . .
  }

The bounds are now explicit, so the loops are unrollable.
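The pass-by-value-return guideline can be illustrated with a small sketch. The function names and the fixed size 10 are assumptions for the example; the point is that the local copy-in/copy-out makes the absence of aliasing explicit, so a synthesis tool can treat the loop iterations as independent:

```c
#include <string.h>

/* Original: operates directly on caller memory; a synthesis tool must
 * conservatively assume dst and src might alias each other. */
void scale_orig(int *dst, const int *src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = 2 * src[i];
}

/* Rewritten (pass by value-return): copy inputs into a local array,
 * compute, then copy the results out.  The local buffer cannot alias
 * anything else, so the iterations are clearly independent and
 * parallelizable in hardware. */
void scale_pvr(int *dst, const int *src) {
    int buf[10];
    memcpy(buf, src, sizeof buf);        /* copy in  */
    for (int i = 0; i < 10; i++)
        buf[i] = 2 * buf[i];             /* parallelizable body */
    memcpy(dst, buf, sizeof buf);        /* copy out */
}
```

In software the extra copies cost a little time; in hardware they become simple loads and stores around a fully parallel datapath.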
Synthesis-Oriented Coding Guidelines
- Algorithmic specialization: use parallelizable hardware algorithms when possible
- Hoisting and sinking of error checking: keep error checking out of loops to enable unrolling
- Lookup table avoidance: use expressions rather than lookup tables

Original:

  int clip[512] = { . . . };
  void f() {
    . . .
    for (i=0; i < 10; i++)
      val[i] = clip[val[i]];
    . . .
  }

Rewritten:

  void f() {
    . . .
    for (i=0; i < 10; i++)
      if (val[i] > 255) val[i] = 255;
      else if (val[i] < 0) val[i] = 0;
    . . .
  }

[Figure: each val[i] gets its own pair of comparators (> 255, < 0) feeding a 3-to-1 mux.] The comparisons can now be parallelized.
Synthesis-Oriented Coding Guidelines
- Use explicit control flow: replace function pointers with if statements and static function calls

Original:

  void (*funcArray[]) (char *data) = { func1, func2, . . . };
  void f(char *data) {
    . . .
    funcPointer = funcArray[i];
    (*funcPointer) (data);
    . . .
  }

Rewritten:

  void f(char *data) {
    . . .
    if (i == 0)
      func1(data);
    else if (i == 1)
      func2(data);
    . . .
  }
Original Rewritten
Coding Guideline Results on H.264
- Simple coding guidelines made a large improvement; the rewritten software is only ~3% slower than the original on a processor
- Binary partitioning remains competitive with C partitioning: speedups of 6.55 (binary) versus 6.56 (C)
- The small remaining difference is caused by switch statements that used indirect jumps
[Figure: speedup versus number of functions in hardware (1 to 51), before and after the rewrite, for both C and binary partitioning, against the ideal (zero-time hardware execution) speedup.]
Coding Guideline Results on Other Benchmarks
- Studied the guidelines further on standard benchmarks: further synthesis speedups (again, independent of the C-versus-binary issue)
- More guidelines remain to be developed
- As compute platforms incorporate FPGAs, might these guidelines become mainstream?
[Figure: speedups for g3fax, mpeg2, jpeg, brev, fir, and crc – software only, hw/sw with the original code, and hw/sw with the guidelines (some bars reach 573x, 1616x, and 842x) – together with the performance and size overheads of the rewritten code on a processor (roughly -88% to +30%).]
Direction: New Applications – Image Processing
- 32x average speedup compared to a microprocessor with a 10x faster clock
- Exploits the parallelism in image processing: window operations contain much fine-grained parallelism, and each pixel can be computed in parallel
- Performance is memory-bandwidth limited: warp processing can output a pixel per cycle for each pixel that can be fetched from memory per cycle, so faster memory will further improve performance
[Figure: speedups for Prewitt, FIR, Wavelet, Max, Blend, Antialias, Brighten, Roberts, Sobel, Emboss, Sharpen, Blur, Gaussian, Burt-Adelson, Median, Kuwahara, and their average.]
Direction: Applications with Process-Level Parallelism
- Parallel code provides further speedup: an average 79x speedup compared to a desktop microprocessor
- Use the FPGA to implement tens or hundreds of processors; instruction-level parallelism can also be exploited
- The warp tools will have to detect coarse-grained parallelism
[Figure: per-application speedups; the largest bars (79.2, 200, 500) exceed the chart's axis.]
Summary
- Showed the feasibility of warp technology: application kernels can be dynamically mapped to an FPGA by a reasonable amount of on-chip compute resources
- Tremendous potential applicability; presently investigating embedded (with Freescale), desktop (with Intel), and server (with IBM) domains
- Radically new FPGA applications may become possible: neural networks that rewire themselves? Network routers whose queuing structures change based on traffic patterns?
- If the technology exists to synthesize circuits dynamically, what can we do with that technology?