Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis
Greg Stitt, Department of Electrical and Computer Engineering
University of Florida
2/55
Introduction
Improved performance enables new applications
Past decade - MP3 players, portable game consoles, cell phones, etc.
Future architectures - Speech/image recognition, self-guiding cars, computational biology, etc.
3/55
Introduction
FPGAs (Field Programmable Gate Arrays) – Implement custom circuits
Speedups of 10x, 100x, even 1000x for scientific and embedded apps [Najjar 04][He, Lu, Sun 05][Levine, Schmit 03][Prasanna 06][Stitt, Vahid 05], …
But, FPGAs are not mainstream
Warp Processing Goal: Bring FPGAs into the mainstream
Make FPGAs “Invisible”
[Figure: performance of uP vs. FPGA]
FPGAs capable of large performance improvements
4/55
Introduction – Hardware/Software Partitioning
C Code for FIR Filter
  for (i=0; i < 128; i++) y[i] += c[i] * x[i]
  ...
  for (i=0; i < 16; i++) y[i] += c[i] * x[i]
  ...
Software only: Compiler -> Processor, ~1000 cycles
Hardware/software partitioning selects performance-critical regions for hardware implementation
[Ernst, Henkel 93] [Gupta, DeMicheli 97] [Vahid, Gajski 94] [Eles et al. 97] [Sangiovanni-Vincentelli 94]
Designer creates custom hardware using a hardware description language (HDL)
[Figure: hardware for loop on Processor + FPGA: a row of multipliers feeding an adder tree]
Hardware/software: ~10 cycles
Speedup = 1000 cycles / 10 cycles = 100x
[Figure: bar charts of time and energy (0 to 100), Sw vs. Hw/Sw]
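The speedup arithmetic above can be written out as a minimal C sketch. The function names `fir_kernel` and `speedup` are illustrative and not from the slides; the kernel is the loop shown on the slide, and its iterations do not depend on each other as long as the arrays do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* The loop from the slide: y[i] += c[i] * x[i] for each element.
   Iterations are independent if y, c and x do not overlap, which is
   what lets a circuit perform the multiplications in parallel. */
void fir_kernel(int *y, const int *c, const int *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += c[i] * x[i];
}

/* Speedup = software cycles / hardware cycles. */
double speedup(double sw_cycles, double hw_cycles)
{
    assert(hw_cycles > 0.0);
    return sw_cycles / hw_cycles;
}
```

With the cycle counts quoted on the slide, `speedup(1000.0, 10.0)` evaluates to 100.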
5/55
Introduction – High-level Synthesis
Problem: Describing a circuit using HDL is time consuming/difficult
Solution: High-level synthesis
Create circuit from high-level code [Gupta, DeMicheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92]
Allows developers to use higher-level specification
Potentially, enables synthesis for software developers
[Figure: tool flow with blocks High-level Code, Hw/Sw Partitioning, Compiler, High-level Synthesis, Libraries/Object Code, Linker, Bitstream; Software runs on the uP, Hardware on the FPGA]
7/55
Introduction – High-level Synthesis
High-level synthesis creates a circuit from high-level code:
  for (i=0; i < 16; i++) y[i] += c[i] * x[i]
[Figure: synthesized circuit: a row of multipliers feeding an adder tree]
8/55
Outline
Introduction
Warp Processing Overview
Enabling Technology – Binary Synthesis
  Key techniques for synthesis from binaries
  Decompilation
Current and Future Directions
  Multi-threaded Warp Processing
  Custom Communication
9/55
Problems with High-Level Synthesis
Problem: High-level synthesis is unattractive to software developers
Requires specialized language
SystemC, NapaC, HandelC, …
Requires specialized compiler
Spark, ROCCC, CatapultC, …
Limited commercial success
Software developers reluctant to change tools
[Figure: non-standard software tool flow. A specialized language and a specialized compiler take the place of the standard high-level code and compiler; the flow produces software for the uP and a bitstream (hardware) for the FPGA]
10/55
Warp Processing – “Invisible” Synthesis
Solution: Make synthesis “invisible”
2 Requirements
Standard software tool flow
  Perform compilation before synthesis
Hide synthesis tool
  Move synthesis on chip
  Similar to dynamic binary translation [Transmeta]
  But, translate to hw
[Figure: standard software tool flow. Compilation is moved before synthesis: high-level code is compiled and linked with libraries/object code into a software binary, and synthesis then produces the software for the uP and the bitstream (hardware) for the FPGA]
11/55
Warp Processing – “Invisible” Synthesis
Warp processor looks like standard uP but invisibly synthesizes hardware
12/55
Warp Processing – “Invisible” Synthesis
Advantages
Supports all languages, compilers, IDEs
  Languages: C, C++, Java, Matlab
  Compilers: gcc, g++, javac, keil
Supports synthesis of assembly code
Supports synthesis of library code
Also, enables dynamic optimizations
[Figure: standard software tool flow from any language and compiler to a software binary, then on-chip synthesis to software (uP) and hardware (FPGA)]
13/55
Warp Processing Background: Basic Idea
[Figure: warp processor with µP, I Mem, D$, Profiler, FPGA and On-chip CAD]
Step 1: Initially, software binary loaded into instruction memory
Software Binary
  Mov reg3, 0
  Mov reg4, 0
loop:
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
  Ret reg4
14/55
Warp Processing Background: Basic Idea
Step 2: Microprocessor executes instructions in software binary
[Figure: time and energy bars for the µP]
15/55
Warp Processing Background: Basic Idea
Step 3: Profiler monitors instructions and detects critical regions in binary
[Figure: the profiler observes a repeating stream of add and beq instructions: Critical Loop Detected]
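The slides do not say how the profiler identifies the critical loop. One plausible approach, shown here purely as an assumption, is to count taken backward branches per target address and report the most frequent target as the loop start. The type `Profiler`, the table size and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROFILE_SLOTS 16   /* hypothetical table size */

typedef struct {
    uint32_t target[PROFILE_SLOTS];  /* loop start address (branch target) */
    uint32_t count[PROFILE_SLOTS];   /* times the backward branch was taken */
    size_t used;                     /* number of occupied slots */
} Profiler;

/* Record one taken branch. Only backward branches (target <= pc) are
   counted, on the assumption that they close loops. */
void profiler_observe(Profiler *p, uint32_t pc, uint32_t target)
{
    assert(p != NULL);
    if (target > pc)
        return;                      /* forward branch: ignored */
    for (size_t i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < PROFILE_SLOTS) {   /* when the table is full, new loops are dropped */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Return the start address of the most frequently executed loop,
   or 0 if no backward branch has been seen. */
uint32_t profiler_critical_loop(const Profiler *p)
{
    assert(p != NULL);
    uint32_t best_target = 0, best_count = 0;
    for (size_t i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) {
            best_count = p->count[i];
            best_target = p->target[i];
        }
    }
    return best_target;
}
```

A zero-initialized `Profiler` is a valid empty table; feeding it the branch at the end of the example loop repeatedly makes that loop's start address the reported critical region.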
16/55
Warp Processing Background: Basic Idea
Step 4: On-chip CAD reads in critical region
17/55
Warp Processing Background: Basic Idea
Step 5: On-chip CAD converts critical region into control data flow graph (CDFG)
  reg3 := 0
  reg4 := 0
loop:
  reg4 := reg4 + mem[reg2 + (reg3 << 1)]
  reg3 := reg3 + 1
  if (reg3 < 10) goto loop
  ret reg4
[Figure: the on-chip CAD block is labeled Dynamic Part. Module (DPM)]
18/55
Warp Processing Background: Basic Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
[Figure: CDFG synthesized into a tree of adders]
19/55
Warp Processing Background: Basic Idea
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: adder circuit mapped onto an FPGA fabric of CLBs and switch matrices (SM)]
20/55
Warp Processing Background: Basic Idea
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more
Updated binary
  Mov reg3, 0
  Mov reg4, 0
loop:
  // instructions that interact with FPGA
  Ret reg4
[Figure: time and energy bars, Software-only vs. “Warped”]
21/55
Expandable Logic
Expandable RAM – System detects RAM during start, improves performance invisibly
Expandable Logic – Warp tools detect amount of FPGA, invisibly adapt application to use less/more hardware
[Figure: left, a uP with cache, RAM and expandable RAM; right, a µP with cache, profiler, warp tools, DMA and several FPGAs; performance bars for each]
22/55
Expandable Logic
Allows for customization of platforms
User can select FPGAs based on the applications used
[Figure: performance for the application Portable Gaming: Unacceptable Performance]
23/55
Expandable Logic
Allows for customization of platforms
User can select FPGAs based on the applications used
User can customize FPGAs to the desired amount of performance
Performance improvement is invisible – doesn’t require new binary from the developer
[Figure: performance for the application Portable Gaming as FPGAs are added]
24/55
Expandable Logic
Allows for customization of platforms
User can select FPGAs based on the applications used
Platform designer doesn’t have to decide on a fixed amount of FPGA
User doesn’t have to pay for FPGA that isn’t needed
[Figure: performance for the application Web Browser with no FPGA: Acceptable Performance]
25/55
Warp Processing Background: Basic Technology
Challenge: CAD tools normally require powerful workstations
Develop extremely efficient on-chip CAD tools
  Requires efficient synthesis
  Requires specialized FPGA, physical design tools (JIT FPGA compilation) [Lysecky FCCM05/DAC04], University of Arizona
[Figure: warp processor (uP, I$, D$, FPGA, Profiler, On-chip CAD) and on-chip CAD flow: Binary, Synthesis, HW, then JIT FPGA compilation (Logic Optimization, Technology Mapping, Placement & Routing), Updated Binary]
26/55
Warp Processing Background: On-Chip CAD
Xilinx ISE: 60 MB, 9.1 s
On-chip CAD: 3.6 MB, 0.2 s
On a 75 MHz ARM7: only 1.4 s
46x improvement, 30% perf. penalty
[Figure: CAD flow stages RT Syn., Synthesis, Log. Opt., Tech. Map, Place, Route; one part of the flow is marked "Manually performed"]
27/55
Warp Processing: Initial Results - Embedded Applications
Average speedup of 6.3x
Achieved completely transparently
Also, energy savings of 66%
[Figure: speedup (0 to 15) per benchmark: brev, g3fax, url, rocm, pktflow, canrdr, bitmnp, tblook, ttsprk, matrix, idct, g721, mpeg2, fir, matmul, Average, Median]
28/55
Outline
Introduction
Warp Processing Overview
Enabling Technology – Binary Synthesis
  Key techniques for synthesis from binaries
  Decompilation
Current and Future Directions
  Multi-threaded Warp Processing
  Custom Communication
29/55
Binary Synthesis
Warp processors perform synthesis from software binary – “binary synthesis”
Problem: No high-level information
  Synthesis needs high-level constructs
  > 10x slowdown
Can we recover high-level information for synthesis?
  Make binary synthesis (and Warp processing) competitive with high-level synthesis
High-level code
  for (i=0; i < 128; i++) y[i] += c[i] * x[i]
  ...
Binary produced by the compiler: no high-level constructs – arrays, loops, etc.
  Addi r1, r0, 0
  Ld r3, 256(r1)
  Ld r4, 512(r1)
  Subi r2, r1, 128
  Jnz r2, -5
[Figure: binary synthesis targeting Processor + FPGA] Hardware can be > 10x to 100x
30/55
Decompilation
We realized decompilation recovers high-level information
  But, generally used for binary translation or source-code recovery
  May not be suitable for synthesis
We studied existing approaches [Cifuentes 94, 99, 01][Mycroft 99,01]
  DisC, dcc, Boomerang, Mocha, SourceAgain
Determined relevant techniques
Adapted existing techniques for synthesis
31/55
Decompilation – Control/Data Flow Graph Recovery
Recovery of control/data flow graph (CDFG)
  Format used by synthesis
  Difficult because of indirect jumps
  Cannot statically analyze control flow
  But, heuristics are over 99% successful on standard benchmarks [Cifuentes 99, 00]
Original C Code
  long f( short a[10] ) {
    long accum = 0;
    for (int i=0; i < 10; i++) {
      accum += a[i];
    }
    return accum;
  }
Corresponding Assembly
  Mov reg3, 0
  Mov reg4, 0
loop:
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
  Ret reg4
Control/Data Flow Graph Creation
  reg3 := 0
  reg4 := 0
loop:
  reg1 := reg3 << 1
  reg5 := reg2 + reg1
  reg6 := mem[reg5 + 0]
  reg4 := reg4 + reg6
  reg3 := reg3 + 1
  if (reg3 < 10) goto loop
  ret reg4
32/55
Decompilation – Data Flow Analysis
Original purpose - remove temporary registers
Area overhead – 130%
Need new techniques for binary synthesis
Data Flow Analysis (result for the same example function)
  reg3 := 0
  reg4 := 0
loop:
  reg4 := reg4 + mem[reg2 + (reg3 << 1)]
  reg3 := reg3 + 1
  if (reg3 < 10) goto loop
  ret reg4
33/55
Decompilation – Data Flow Analysis
Strength Reduction – Compare-with-zero instructions
  Sub reg3, reg4, reg5
  Bz reg3, -5
  Unoptimized DFG: reg3 = reg4 - reg5 (Sub), then reg3 = 0 decides the branch. The subtractor is not needed and wastes area.
  Optimized DFG: reg4 = reg5 decides the branch directly.
Operator Size Reduction
  Lb reg4, 0(reg1)
  Mvi reg5, 16
  Add reg3, reg4, reg5
  Unoptimized DFG: 32-bit reg4 + 32-bit reg5 -> 32-bit reg3 (32-bit adder)
  Optimized DFG: 8-bit reg4 (Load Byte) + 5-bit reg5 (constant 16) -> 8-bit reg3. Only an 8-bit adder is needed.
Area Overhead Reduced to 10%
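Both optimizations can be illustrated with a small C sketch. All function names here are hypothetical. `bits_needed` gives the width of an unsigned constant (16 needs 5 bits, matching the 5-bit reg5 on the slide), `adder_width` follows the slide's choice of sizing the adder to the wider operand, and the two branch functions show that Sub followed by a compare-with-zero yields the same decision as a direct equality compare.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimum number of bits needed to represent an unsigned value. */
unsigned bits_needed(uint32_t value)
{
    unsigned bits = 0;
    while (value != 0) {
        bits++;
        value >>= 1;
    }
    return bits == 0 ? 1 : bits;   /* the value 0 still occupies one bit */
}

/* Adder width when the result is kept at the width of the wider
   operand, as in the slide's 8-bit + 5-bit -> 8-bit example. */
unsigned adder_width(unsigned a_bits, unsigned b_bits)
{
    assert(a_bits > 0 && b_bits > 0);
    return a_bits > b_bits ? a_bits : b_bits;
}

/* Sub + Bz: branch is taken when a - b == 0 (wrapping arithmetic). */
bool branch_sub_zero(uint32_t a, uint32_t b)
{
    return (uint32_t)(a - b) == 0;
}

/* Optimized form: compare the operands directly, no subtractor. */
bool branch_equal(uint32_t a, uint32_t b)
{
    return a == b;
}
```

For any pair of 32-bit inputs the two branch functions agree, which is why the subtractor can be dropped from the circuit.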
34/55
Decompilation – Function Recovery
Recover parameters and return values
  Def-use analysis of prologue/epilogue
  100% success rate
Function Recovery (result for the same example function)
  long f( long reg2 ) {
    int reg3 = 0;
    int reg4 = 0;
  loop:
    reg4 = reg4 + mem[reg2 + (reg3 << 1)];
    reg3 = reg3 + 1;
    if (reg3 < 10) goto loop;
    return reg4;
  }
35/55
Decompilation – Control Structure Recovery
Recover loops, if statements
  Uses interval analysis techniques [Cifuentes 94]
  100% success rate
Control Structure Recovery (result for the same example function)
  long f( long reg2 ) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
      reg4 += mem[reg2 + (reg3 << 1)];
    }
    return reg4;
  }
36/55
Decompilation – Array Recovery
Detect linear memory patterns and row-major ordering calculations
  ~ 95% success rate [Stitt, Guo, Najjar, Vahid 05] [Cifuentes 00]
Array Recovery (result for the same example function)
  long f( short array[10] ) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
      reg4 += array[reg3];
    }
    return reg4;
  }
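A linear memory pattern can be recognized by checking that consecutive accesses are a constant number of bytes apart. The sketch below is an illustrative assumption rather than the authors' algorithm: `detect_stride` is a hypothetical helper that inspects a recorded sequence of addresses, whereas a real decompiler would analyze the address expression itself, for example reg2 + (reg3 << 1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the constant distance in bytes between consecutive accesses,
   or 0 when the distance varies (or fewer than two accesses are given). */
uint32_t detect_stride(const uint32_t *addrs, size_t n)
{
    assert(addrs != NULL);
    if (n < 2)
        return 0;
    uint32_t stride = addrs[1] - addrs[0];
    for (size_t i = 2; i < n; i++) {
        if (addrs[i] - addrs[i - 1] != stride)
            return 0;
    }
    return stride;
}
```

For the example loop the addresses are base + (i << 1), so the stride is 2 bytes, which is consistent with recovering the parameter as an array of `short`.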
37/55
Comparison of Decompiled Code and Original Code
Decompiled code almost identical to original code
  Only difference is variable names
Binary synthesis is competitive with high-level synthesis
Original C Code
  long f( short a[10] ) {
    long accum = 0;
    for (int i=0; i < 10; i++) {
      accum += a[i];
    }
    return accum;
  }
Decompiled Code
  long f( short array[10] ) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
      reg4 += array[reg3];
    }
    return reg4;
  }
Almost Identical Representations
38/55
Binary Synthesis Tool Flow
Initially, high-level source is compiled and linked to form a binary
Decompilation: recovers high-level information needed for synthesis
Binary Updater: modifies binary to use synthesized hardware
Binary synthesis tool: ~30,000 lines of C code
[Figure: tool flow. High-level Source, Compiler and Libraries/Object Code produce a Binary; Binary Synthesis (Decompilation, Profiling, Hw/Sw Estimation, Hw/Sw Partitioning, Synthesis, Binary Updater) produces Hardware Netlists and a Bitstream for the FPGA, and an Updated Binary for the uP]
39/55
Binary Synthesis is Competitive with High-Level Synthesis
Binary synthesis competitive with high-level synthesis
  Binary speedup: 8x, High-level speedup: 8.2x
  High-level synthesis only 2.5% better
Commercial products beginning to appear
  Critical Blue, Binachip
[Figure: speedup (0 to 15), High-level vs. Binary-level, for FIR Filter, Beamformer, Viterbi, Brev, Url, BITMNP01, IDCTRN01, PNTRCH01, Average: small difference in speedup]
40/55
Binary Synthesis with Software Compiler Optimizations
But, binaries generated with few optimizations
  Optimizations for software may hurt hardware
  Need new decompilation techniques
Binary is optimized for software
Hardware synthesized from optimized binary may be inefficient
[Figure: C code -> SW Compiler -> Optimized Binary -> Binary Synthesis -> uP + FPGA]
41/55
Loop Rerolling
Problem: Loop unrolling may cause inefficient hardware
  Longer synthesis times
    Super-linear heuristics
    Unrolling 100 times => synthesis time is 100^2 times longer
  Larger area requirements
  Unrolling by compiler unlikely to match unrolling by synthesis
  Loop structure needed for advanced synthesis techniques
Solution: We introduce loop rerolling to undo loop unrolling
[Figure: synthesis execution times, Non-unrolled vs. Unrolled]
Non-unrolled Loop
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
Unrolled Loop
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
42/55
Loop Rerolling – Identifying Unrolled Loops
Idea - Identify consecutively repeating instruction sequences
Original C Code
  x = x + 1;
  for (i=0; i < 2; i++) a[i]=b[i]+1;
  y=x;
Unrolled Loop
  x = x + 1;
  a[0] = b[0]+1;
  a[1] = b[1]+1;
  y = x;
Binary, Map to String
  Add r3, r3, 1  => B
  Ld r0, b(0)    => A
  Add r1, r0, 1  => B
  St a(0), r1    => C
  Ld r0, b(1)    => A
  Add r1, r0, 1  => B
  St a(1), r1    => C
  Mov r4, r3     => D
String Representation: BABCABCD
Find Consecutive Repeating Substrings: Adjacent Nodes with Same Substring in the Suffix Tree [Ukkonen 95]
Unrolled Loop found: 2 unrolled iterations, each iteration = abc (Ld, Add, St)
[Figure: suffix tree of the string representation]
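The slides locate the repeated sequence with a suffix tree [Ukkonen 95]. As a simpler stand-in, and not the authors' method, the sketch below finds consecutive repeats by brute-force comparison, which is adequate for short instruction strings. The `Repeat` type and `find_tandem_repeat` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    size_t start;   /* index where the first copy begins */
    size_t period;  /* length of one copy (one loop iteration) */
    size_t count;   /* number of consecutive copies; 0 if none found */
} Repeat;

/* Brute-force search for the consecutive repetition that covers the
   most characters. Each character stands for one instruction class. */
Repeat find_tandem_repeat(const char *s)
{
    assert(s != NULL);
    size_t n = strlen(s);
    Repeat best = {0, 0, 0};
    for (size_t period = 1; period * 2 <= n; period++) {
        for (size_t start = 0; start + 2 * period <= n; start++) {
            size_t count = 1;
            /* Extend while the next block equals the first block. */
            while (start + (count + 1) * period <= n &&
                   memcmp(s + start, s + start + count * period, period) == 0)
                count++;
            if (count >= 2 && period * count > best.period * best.count) {
                best.start = start;
                best.period = period;
                best.count = count;
            }
        }
    }
    return best;
}
```

On the string from the slide, `find_tandem_repeat("BABCABCD")` reports start 1, period 3 and count 2, that is, the block ABC (Ld, Add, St) repeated twice.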
43/55
Loop Rerolling
Original C Code
  x = x + 1;
  for (i=0; i < 2; i++) a[i]=b[i]+1;
  y=x;
Unrolled Loop Identification
  Add r3, r3, 1
  Ld r0, b(0)
  Add r1, r0, 1
  St a(0), r1
  Ld r0, b(1)
  Add r1, r0, 1
  St a(1), r1
  Mov r4, r3
1) Determine relationship of constants
2) Replace constants with induction variable expression
  Add r3, r3, 1
  i=0
loop:
  Ld r0, b(i)
  Add r1, r0, 1
  St a(i), r1
  Bne i, 2, loop
  Mov r4, r3
3) Rerolled, decompiled code
  reg3 = reg3 + 1;
  for (i=0; i < 2; i++) array1[i]=array2[i]+1;
  reg4=reg3;
Average Speedup of 1.6x
44/55
Strength Promotion
Problem: Strength reduction may cause inefficient hardware
However, some of the strength reduction was beneficial
Strength promotion lets synthesis decide on strength reduction, not software compiler
Steps
  Identify strength-reduced subgraphs
  Replace with multiplication
  Synthesis reapplies strength reduction to get optimal DFG
[Figure: DFG computing A[i] as a sum of shift-and-add subgraphs over B[i], B[i+1], B[i+2], B[i+3], with shift amounts (3, 1), (4, 1), (5, 1) and (6, 1). The subgraphs are replaced one at a time by multiplications with the constants 10, 18, 34 and 66]
Average Speedup of 1.5x
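The constants in the figure follow from the shift amounts: (x << 3) + (x << 1) = 8x + 2x = 10x, and likewise 18x, 34x and 66x for shifts of 4, 5 and 6 paired with a shift of 1. The C sketch below checks that arithmetic; both function names are hypothetical and the code is not the promotion pass itself.

```c
#include <assert.h>
#include <stdint.h>

/* Strength-reduced form of x * 10, as a software compiler might emit it. */
uint32_t times10_shift_add(uint32_t x)
{
    return (x << 3) + (x << 1);
}

/* Multiplier constant implied by a subgraph (x << s1) + (x << s2). */
uint32_t promoted_constant(unsigned s1, unsigned s2)
{
    assert(s1 < 32 && s2 < 32);
    return (UINT32_C(1) << s1) + (UINT32_C(1) << s2);
}
```

Once the subgraph is expressed as a single multiplication by `promoted_constant(s1, s2)`, the synthesis tool is free to pick whichever implementation is cheaper in hardware.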
45/55
Multiple ISA/Optimization Results
What about aggressive software compiler optimizations?
  May obscure binary, making decompilation impossible
What about different instruction sets?
  Side effects may degrade hardware performance
Results
  Speedups similar on MIPS for –O1 and –O3 optimizations
  Speedups similar on ARM for –O1 and –O3 optimizations
  Speedups similar between ARM and MIPS
    Complex instructions of ARM didn’t hurt synthesis
  MicroBlaze speedups much larger
    MicroBlaze is a slower microprocessor
    -O3 optimizations were very beneficial to hardware
[Figure: speedup (0 to 30) for MIPS -O1, MIPS -O3, ARM -O1, ARM -O3, MicroBlaze -O1, MicroBlaze -O3]
46/55
High-level vs. Binary Synthesis: Proprietary H.264 Decoder
High-level synthesis vs. binary synthesis
  Collaboration with Freescale Semiconductor
H.264 Decoder
  MPEG-4 Part 10 Advanced Video Coding (AVC)
  3x smaller than MPEG-2
  Better quality
[Figure: MPEG2 vs. H.264]
47/55
High-level vs. Binary Synthesis: Proprietary H.264 Decoder
Binary synthesis was competitive with high-level synthesis
  High-level speedup – 6.56x
  Binary speedup – 6.55x
[Figure: speedup (0 to 10) vs. number of functions in hardware (1 to 51), for Speedup (High-level) and Speedup (Binary)]
48/55
Outline
Introduction
Warp Processing Overview
Enabling Technology – Binary Synthesis
  Key techniques for synthesis from binaries
  Decompilation
Current and Future Directions
  Multi-Threaded Warp Processing
  Custom Communication
49/55
Thread Warping - Overview
Architectural Trend – Include more cores on chip
Result – More multi-threaded applications
Function a( )
  for (i=0; i < 10; i++) createThread( b );
OS can only schedule 2 threads
Remaining 8 threads placed in thread queue
Warp tools create custom accelerators for b( )
OS schedules 4 threads to custom accelerators
3x more thread parallelism
[Figure: multi-core warp processor with µPs, OS, Profiler, Warp Tools and Warp FPGA; a( ) and b( ) run on the µPs, the remaining b( ) threads wait in the thread queue and are then mapped to accelerators on the Warp FPGA]
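The 3x figure is consistent with counting concurrently running threads, assuming (this is not stated on the slide) that the 2 threads on the µPs keep running alongside the 4 threads on the accelerators: (2 + 4) / 2 = 3. A minimal C sketch of that arithmetic, with a hypothetical function name:

```c
#include <assert.h>

/* Ratio of threads running concurrently with accelerators to threads
   running concurrently on the processor cores alone. */
double thread_parallelism_gain(unsigned cores, unsigned accelerators)
{
    assert(cores > 0);
    return (double)(cores + accelerators) / (double)cores;
}
```

Under that assumption, `thread_parallelism_gain(2, 4)` evaluates to 3.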
50/55
Thread Warping - Overview
Function a( )
  for (i=0; i < 10; i++) createThread( b );
Profiler detects performance critical loop in b( )
Warp tools create larger/faster accelerators
Potentially > 100x speedup
[Figure: multi-core warp processor with µPs, OS, Profiler, Warp Tools and Warp FPGA holding accelerators for b( )]
51/55
Thread Warping - Results
Thread warping 120x faster than 4-uP (ARM) system
Comparison of thread warping (TW) and multi-core
  Simulated multi-cores ranging from 4 to 64
  Thread warping – 4 cores + FPGA
[Figure: speedup (axis 0 to 50) for Fir, Prewitt, Linear, Moravec, Wavelet, Maxfilter, 3DTrans, N-body, Avg., Geo. Mean; series 4-uP, TW, 8-uP, 16-uP, 32-uP, 64-uP; off-scale bar labels as extracted: 130 502 63 130 38308]
52/55
Warp Processing – Custom Communication
NoC – Network on a Chip provides communication between multiple cores [Benini, DeMicheli][Hemani][Kumar]
Problem: Best topology is application dependent
[Figure: four µPs connected by a NoC; performance of Bus vs. Mesh topologies for App1 and App2]
53/55
Warp Processing – Custom Communication
NoC – Network on a Chip provides communication between multiple cores [Benini, DeMicheli][Hemani][Kumar]
Problem: Best topology is application dependent
Warp processing can dynamically choose topology – 2x to 100x improvement
Collaboration with Rakesh Kumar, University of Illinois, Urbana-Champaign: “Amoebic Computing”
[Figure: four µPs connected through an FPGA that implements the topology; performance of Bus vs. Mesh topologies for App1 and App2]
54/55
Summary
Warp Processing: any language, any compiler, standard binary
  Developer is unaware of FPGA/synthesis
Binary Synthesis
  Decompilation makes possible
JIT FPGA Compilation
Expandable Logic
Warp processing invisibly achieves > 100x speedups
[Figure: warp processor (uP, I$, D$, FPGA, Profiler, On-chip CAD); on-chip flow Binary -> Binary Synthesis -> HW -> JIT FPGA Compilation -> Updated Binary; performance of uP vs. FPGA with expandable logic]
55/55
References
Patent
Warp Processor for Dynamic Hardware/Software Partitioning. F. Vahid, R. Lysecky, G. Stitt. Patent Pending, 2004.
1. Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2002, pp. 164-170.
2. Warp Processors. R. Lysecky, G. Stitt, and F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2006, Volume 11, Number 3, pp. 659-681.
3. Binary Synthesis. G. Stitt and F. Vahid. Accepted for publication in ACM Transactions on Design Automation of Electronic Systems (TODAES).
4. Expandable Logic. G. Stitt and F. Vahid. Submitted to IEEE/ACM Conference on Design Automation (DAC), 2007.
5. New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2005, pp. 547-554.
6. Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, and B. Einloth. IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005, pp. 285-290.
7. A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. IEEE/ACM Design Automation and Test in Europe (DATE), 2005, pp. 396-397.
8. Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky, and F. Vahid. IEEE/ACM Conference on Design Automation (DAC), 2003, pp. 250-255.
Supported by NSF, SRC, Intel, IBM, Xilinx