Warp Processing – Dynamic Transparent Conversion of Binaries to Circuits

Frank Vahid, Professor
Department of Computer Science and Engineering, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine

Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, and Motorola/Freescale.
Contributing students: Roman Lysecky (PhD 2005, now assistant professor at U. Arizona), Greg Stitt (PhD 2006), Kris Miller (MS 2007), David Sheldon (3rd-year PhD), Ryan Mannion (2nd-year PhD), Scott Sirowy (1st-year PhD)
Frank Vahid, UC Riverside
Outline
- FPGAs: overview; hard to program --> binary-level partitioning
- Warp processing
- Techniques underlying warp processing
- Overall warp processing results
- Directions and summary
FPGAs
- FPGA – Field-Programmable Gate Array: an off-the-shelf chip that evolved in the early 1990s
- Implements a custom circuit just by downloading a stream of bits ("software")
- Basic idea: an N-address memory can implement N-input combinational logic (note: there is no "gate array" inside); the memory is called a lookup table, or LUT
- FPGA "fabric": thousands of small (~3-input) LUTs (larger LUTs are inefficient), thousands of switch matrices (SMs) for programming the interconnections, and possibly additional hard-core components such as multipliers and RAM
- CAD tools automatically map a desired circuit onto the FPGA fabric
[Figure: a 4x2 memory acts as a 2-input, 2-output "lookup table" (LUT): the inputs a and b drive address lines a1 and a0, and the stored data bits d1 and d0 produce outputs F and G. A particular circuit is implemented just by downloading particular bits. The fabric is a grid of LUTs connected through switch matrices (SMs).]
FPGAs: "Programmable" like Microprocessors – Download Bits
- Microprocessor binaries: bits loaded into program memory
- FPGA binaries: bits loaded into LUTs, CLBs, and SMs
- Configurable logic block (CLB): a LUT plus flip-flops
- SM (switch matrix): programmable interconnect that routes a chosen input (e.g., a or b) onto a chosen output
[Figure: side-by-side comparison of a processor loading an instruction binary into program memory and an FPGA loading a configuration bitstream into its grid of CLBs and SMs.]
FPGAs as Coprocessors
- Coprocessor: accelerates an application kernel by implementing it as a circuit
- ASIC coprocessors are known to speed up many application kernels, with energy advantages too (e.g., Henkel '98, Rabaey '98, Stitt/Vahid '04)
- FPGA coprocessors also give speedup/energy benefits (Stitt/Vahid, IEEE D&T '02, IEEE TECS '04)
- Con: more silicon (~20x) and ~4x performance overhead versus an ASIC (Rose, FPGA '06)
- Pro: the platform is fully programmable – shorter time-to-market, smaller non-recurring engineering (NRE) cost, low-cost devices available, and late changes possible (even in-product)
FPGAs as Coprocessors: Surprisingly Competitive with ASICs
- FPGA: 34% energy savings, versus the ASIC's 48% (Stitt/Vahid, IEEE D&T '02, IEEE TECS '04)
- A jet isn't as fast as a rocket, but it sure beats driving
[Figure: bar chart of energy as a percentage of software-only execution, comparing ASIC and FPGA coprocessors across benchmarks.]
FPGA – Why (Sometimes) Better than a Microprocessor

C code for bit reversal:

  x = (x >> 16) | (x << 16);
  x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
  x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
  x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
  x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);

On a processor, compilation yields a long binary sequence:

  sll $v1, $v0, 0x10
  srl $v0, $v0, 0x10
  or  $v0, $v1, $v0
  srl $v1, $v0, 0x8
  and $v1, $v1, $t5
  sll $v0, $v0, 0x8
  and $v0, $v0, $t4
  or  $v0, $v1, $v0
  srl $v1, $v0, 0x4
  and $v1, $v1, $t3
  sll $v0, $v0, 0x4
  and $v0, $v0, $t2
  ...

requiring between 32 and 128 cycles. Hardware for bit reversal is just wiring that routes each bit of the original X value to its reversed position – requiring only 1 cycle (a speedup of 32x to 128x).

In general, FPGAs win because of concurrency, from the bit level up to the task level.
FPGAs: Why (Sometimes) Better than a Microprocessor

C code for an FIR filter:

  for (i=0; i < 128; i++)
    y[i] += c[i] * x[i];

- On a processor: thousands of instructions, several thousand cycles
- On an FPGA: hardware with 128 parallel multipliers feeding an adder tree – about 7 cycles, for a speedup > 100x
FPGAs are Hard to Program
- Synthesis from hardware description languages (HDLs) such as VHDL and Verilog: great for expressing parallelism, but these are non-standard languages and require manual partitioning; SystemC is a good step
- C/C++ partitioning compilers: use a language subset; growing in importance, but the need for a special compiler limits adoption
- There are roughly 100 software writers for every CAD user: only about 15,000 CAD seats worldwide, versus millions of compiler seats
[Figure: toolflow – application source → special compiler (with profiling) → microprocessor binary plus a netlist, which synthesis, technology mapping, and place & route turn into an FPGA binary for the processor/FPGA platform.]
Binary-Level Partitioning Helps
- Binary-level partitioning (Stitt/Vahid, ICCAD '02); a recent commercial product: CriticalBlue [www.criticalblue.com]
- Partition and synthesize starting from the software binary
- Advantages: works with any compiler and any language, multiple sources, assembly/object support, legacy code support; better incorporation into the toolflow as a less disruptive, back-end tool
- Disadvantage: possible quality loss due to the lack of high-level language constructs? (More later)
[Figure: toolflow – a standard compiler (with profiling) produces a software binary; a binary-level partitioner then produces a modified binary plus a netlist for synthesis, technology mapping, and place & route. Traditional partitioning is instead done before compilation.]
Warp Processing
- Observation: dynamic binary recompilation to a different microprocessor architecture is a mature commercial technology – e.g., modern Pentiums translate x86 to VLIW on the fly
- Question: if we can dynamically recompile binaries to other processor architectures, can we dynamically recompile binaries to FPGA circuits?
Warp Processing Idea

The architecture: a microprocessor (µP) with instruction memory and data cache, an FPGA, a profiler, and an on-chip CAD block (also called the dynamic partitioning module, DPM), all on one chip.

1. Initially, the software binary is loaded into instruction memory:

     Mov reg3, 0
     Mov reg4, 0
   loop:
     Shl reg1, reg3, 1
     Add reg5, reg2, reg1
     Ld  reg6, 0(reg5)
     Add reg4, reg4, reg6
     Add reg3, reg3, 1
     Beq reg3, 10, -5
     Ret reg4

2. The microprocessor executes the instructions in the software binary.

3. The profiler monitors the instruction stream and detects critical regions in the binary – here, a critical loop whose add/beq instructions dominate execution time.

4. The on-chip CAD reads in the critical region.

5. The on-chip CAD decompiles the critical region into a control/data flow graph (CDFG):

   reg3 := 0
   reg4 := 0
   loop:
     reg4 := reg4 + mem[reg2 + (reg3 << 1)]
     reg3 := reg3 + 1
     if (reg3 < 10) goto loop
   ret reg4

6. The on-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit – e.g., a tree of adders summing many array elements at once.

7. The on-chip CAD maps the circuit onto the FPGA's CLBs and switch matrices.

8. The on-chip CAD replaces instructions in the binary with instructions that interact with the FPGA, causing performance and energy to "warp" by an order of magnitude or more:

     Mov reg3, 0
     Mov reg4, 0
   loop:
     // instructions that interact with FPGA
     Ret reg4

[Figure: time/energy bars comparing software-only execution with the much smaller "warped" execution.]
Warp Processing Idea
- There will likely be multiple microprocessors per chip, all serviced by one on-chip CAD block.
Warp Processing: Trend Towards Processor/FPGA Programmable Platforms
- FPGAs with hard-core processors: Xilinx Virtex-II Pro (source: Xilinx), Altera Excalibur (source: Altera)
- FPGAs with soft-core processors: Xilinx Spartan (source: Xilinx)
- Computer boards with FPGAs: Cray XD1 (source: FPGA Journal, Apr. '05)
Warp Processing: Trend Towards Processor/FPGA Programmable Platforms
- Programming is a key challenge
- Solution 1: compile a high-level language to custom binaries that use both the microprocessor and the FPGA
- Solution 2: use standard microprocessor binaries and dynamically re-compile (warp) them
  - Cons: less high-level information is available when compiling, so less optimization
  - Pros: available to all software developers, not just specialists; enables data-dependent optimization; most importantly, standard binaries enable an "ecosystem" among tools, architectures, and applications
- The standard-binary (and ecosystem) concept is presently absent in FPGAs and other new programmable platforms
Warp Processing Steps (On-Chip CAD)
The on-chip CAD takes the standard software binary through the following flow:
1. Profiling & partitioning
2. Decompilation
3. Synthesis
4. JIT FPGA compilation (technology mapping, placement, and routing) → FPGA binary
5. Binary updater → updated microprocessor binary
Warp Processing – Profiling and Partitioning
- Applications spend much of their time in a small amount of code: the 90-10 rule; we observed a 75-4 rule for MediaBench and NetBench
- Developed an efficient hardware profiler (Gordon-Ross/Vahid, CASES '04; IEEE Trans. on Computers '06)
- Partitioning is straightforward: try the most critical code first
[Figure: cumulative % of execution time versus % of program size for the top 10 code regions.]
Warp Processing – Decompilation
- Synthesis from a binary has a key challenge: high-level information (e.g., loops, arrays) is lost during compilation
- Direct translation of assembly to a circuit incurs huge overheads – we need to recover the high-level information
[Figure: speedup, energy, and size overheads of a microprocessor/FPGA solution WITHOUT decompilation, versus the microprocessor alone, for g3fax, adpcm, crc, des, engine, jpeg, summin, v42, and their average.]
Warp Processing – Decompilation
Solution: recover the high-level information from the binary via decompilation. Extensive previous work exists (for different purposes); we adapted it and also developed new decompilation methods.

Original C code:

  long f( short a[10] ) {
    long accum = 0;
    for (int i=0; i < 10; i++) {
      accum += a[i];
    }
    return accum;
  }

Corresponding assembly:

    Mov reg3, 0
    Mov reg4, 0
  loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld  reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

Control/data flow graph creation:

  reg3 := 0
  reg4 := 0
  loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
  ret reg4

Data flow analysis:

  reg3 := 0
  reg4 := 0
  loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
  ret reg4

Function recovery:

  long f( long reg2 ) {
    int reg3 = 0;
    int reg4 = 0;
    loop:
    reg4 = reg4 + mem[reg2 + (reg3 << 1)];
    reg3 = reg3 + 1;
    if (reg3 < 10) goto loop;
    return reg4;
  }

Control structure recovery:

  long f( long reg2 ) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
      reg4 += mem[reg2 + (reg3 << 1)];
    }
    return reg4;
  }

Array recovery:

  long f( short array[10] ) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
      reg4 += array[reg3];
    }
    return reg4;
  }

The recovered code and the original are almost identical representations.
New Decompilation Method: Loop Rerolling
- Problem: compiler unrolling of loops (to expose parallelism) causes synthesis problems: huge input (slow synthesis), can't re-unroll to the desired amount, and can't use advanced loop methods (loop pipelining, fusion, splitting, ...)
- Solution: a new decompilation method, loop rerolling: identify the unrolled iterations and compact them back into one iteration

Example – the compiler unrolls:

  for (int i=0; i < 3; i++)
    accum += a[i];

into:

  Ld  reg2, 100(0)
  Add reg1, reg1, reg2
  Ld  reg2, 100(1)
  Add reg1, reg1, reg2
  Ld  reg2, 100(2)
  Add reg1, reg1, reg2

and loop rerolling recovers:

  for (int i=0; i < 3; i++)
    reg1 += array[i];
Loop Rerolling: Identify Unrolled Iterations

Original C code:

  x = x + 1;
  for (i=0; i < 2; i++)
    a[i] = b[i] + 1;
  y = x;

Binary, with the loop unrolled (equivalent to x = x + 1; a[0] = b[0]+1; a[1] = b[1]+1; y = x;):

  Add r3, r3, 1
  Ld  r0, b(0)
  Add r1, r0, 1
  St  a(0), r1
  Ld  r0, b(1)
  Add r1, r0, 1
  St  a(1), r1
  Mov r4, r3

Map each instruction to a symbol (Add => B, Ld => A, St => C, Mov => D), giving the string representation BABCABCD. Then find consecutively repeating instruction sequences – here two unrolled iterations, each ABC (Ld, Add, St) – by locating adjacent nodes with the same substring in a suffix tree, a technique derived from bioinformatics.
Warp Processing – Decompilation Study (FPGA 2005)
Synthesis after decompilation is often quite similar to synthesis from the C source: almost identical performance, with small area overhead.

  Example         Synthesis from C code           Synthesis after decompiling binary   %Time    %Area
                  Cycles   ClkFrq  Time   Area    Cycles   ClkFrq  Time    Area        ovhd.    ovhd.
  bit_correlator  258      118     2.19   15      258      118     2.186   15          0%       0%
  fir             129      125     1.03   359     129      125     1.032   371         0%       3%
  udiv8           281      190     1.48   398     281      190     1.479   398         0%       0%
  prewitt         64516    123     525    2690    64516    123     524.5   4250        0%       58%
  mf9             258      57      4.5    1048    258      57      4.503   1048        0%       0%
  moravec         195072   66      2951   680     195072   70      2791    676         -6%      -1%
  Average:                                                                             -1%      10%
Deriving High-Level Constructs from Binaries: Robustness Study (DATE '04, ICCAD '05)
- A recent study of decompilation robustness in the presence of compiler optimizations and across instruction sets
- Energy savings of 77% / 76% / 87% for MIPS / ARM / MicroBlaze

Columns per configuration: Sw (normalized software-only time), Hw/Sw (normalized partitioned time), S (speedup). Configurations, left to right: MIPS -O1 | MIPS -O3 | ARM -O1 | ARM -O3 | MicroBlaze -O1 | MicroBlaze -O3.

  FIR Filter   1.000 0.089 11.2 | 0.923 0.070 14.2 | 1.000 0.085 11.8 | 0.999 0.084 11.9  | 1.000 0.040 25.3 | 0.549 0.015 68.4
  Beamformer   1.000 0.074 13.5 | 0.853 0.071 14.0 | 1.000 0.149 6.7  | 1.018 0.172 5.8   | 1.000 0.031 32.3 | 0.647 0.032 31.4
  Viterbi      1.000 0.136 7.4  | 0.891 0.152 6.6  | 1.000 0.131 7.6  | 0.957 0.126 7.9   | 1.000 0.060 16.7 | 0.765 0.017 59.0
  Crc          1.000 0.030 33.8 | 0.967 0.019 53.6 | 1.000 0.020 49.5 | 1.105 0.007 134.8 | 1.000 0.012 80.3 | 0.995 0.011 88.6
  Des          1.000 0.275 3.6  | 0.990 0.310 3.2  | 1.000 0.360 2.8  | 1.028 0.401 2.5   | 1.000 0.205 4.9  | 0.998 0.218 4.6
  Summin       1.000 0.111 9.0  | 0.899 0.145 6.9  | 1.000 0.183 5.5  | 0.684 0.128 7.8   | n/a              | n/a
  Brev         1.000 0.120 8.3  | 0.976 0.129 7.7  | 1.000 0.156 6.4  | 1.476 0.153 6.5   | 1.000 0.011 90.2 | 0.951 0.009 106.5
  BITMNP01     1.000 0.114 8.8  | 0.985 0.113 8.8  | 1.000 0.188 5.3  | 0.988 0.186 5.4   | 1.000 0.112 8.9  | 0.999 0.115 8.7
  IDCTRN01     1.000 0.323 3.1  | 0.975 0.323 3.1  | 1.000 0.230 4.4  | 1.005 0.230 4.3   | 1.000 0.258 3.9  | 0.885 0.150 6.7
  PNTRCH01     1.000 0.196 5.1  | 0.945 0.196 5.1  | 1.000 0.325 3.1  | 0.963 0.313 3.2   | n/a              | n/a
  Average:     1.000 0.147 10.4 | 0.940 0.153 12.3 | 1.000 0.183 10.3 | 1.022 0.180 19.0  | 1.000 0.091 32.8 | 0.849 0.071 46.7
  Geo. mean:   1.000 0.124 8.4  | 0.939 0.122 8.7  | 1.000 0.150 7.0  | 1.008 0.134 8.3   | 1.000 0.053 19.0 | 0.831 0.037 27.4
Decompilation is Effective Even with High Compiler-Optimization Levels
- The average speedup of the 10 examples is similar on MIPS for -O1 and -O3 optimizations
- Speedups are similar on ARM for -O1 and -O3, and similar between ARM and MIPS: the complex instructions of the ARM didn't hurt synthesis
- MicroBlaze speedups are much larger: MicroBlaze is a slower microprocessor, and its -O3 optimizations were very beneficial to hardware
[Figure: average-speedup bars for MIPS -O1, MIPS -O3, ARM -O1, ARM -O3, MicroBlaze -O1, and MicroBlaze -O3.]

Publication: New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005.
Decompilation Effectiveness: In-Depth Study
- Performed an in-depth, several-month study with Freescale on an H.264 video decoder – highly-optimized proprietary code, not reference code; a huge difference
- Research question: is synthesis from binaries competitive on highly-optimized code?
- Context: H.264 gives better quality, or smaller files, than MPEG-2 – using more computation
Optimized H.264
- Larger than most benchmarks: H.264 is 16,000 lines; previous work used 100 to several thousand lines
- Highly optimized: many man-hours of manual optimization, 10x faster than the reference code used in previous works
- Different profiling results: previous examples spent ~90% of time in several loops; H.264 spends ~90% of time in ~45 functions – harder to speed up

  Function name                  Instrs  %Time (cumulative)  Speedup (cumulative)
  MotionComp_00                  33      6.8%                1.1
  InvTransform4x4                63      12.5%               1.1
  FindHorizontalBS               47      16.7%               1.2
  GetBits                        51      20.8%               1.3
  FindVerticalBS                 44      24.7%               1.3
  MotionCompChromaFullXFullY     24      28.6%               1.4
  FilterHorizontalLuma           557     32.5%               1.5
  FilterVerticalLuma             481     35.8%               1.6
  FilterHorizontalChroma         133     39.0%               1.6
  CombineCoefsZerosInvQuantScan  69      42.0%               1.7
  memset                         20      44.9%               1.8
  MotionCompensate               167     47.7%               1.9
  FilterVerticalChroma           121     50.3%               2.0
  MotionCompChromaFracXFracY     48      53.0%               2.1
  ReadLeadingZerosAndOne         56      55.6%               2.3
  DecodeCoeffTokenNormal         93      57.5%               2.4
  DeblockingFilterLumaRow        272     59.4%               2.5
  DecodeZeros                    79      61.3%               2.6
  MotionComp_23                  279     63.0%               2.7
  DecodeBlockCoefLevels          56      64.6%               2.8
  MotionComp_21                  281     66.2%               3.0
  FindBoundaryStrengthPMB        44      67.7%               3.1
C vs. Binary Synthesis on Optimized H.264
- Binary partitioning is competitive with source-level partitioning
- Speedups compared to ARM9 software: binary 2.48, C 2.53
- Decompilation recovered nearly all the high-level information needed for partitioning and synthesis
[Figure: speedup versus number of functions in hardware (1 to 51), for C partitioning and binary partitioning; the two curves nearly coincide.]
Warp Processing – Synthesis
- ROCM – Riverside On-Chip Minimizer
- Standard register-transfer synthesis
- Logic synthesis made lean: a combination of approaches from Espresso-II [Brayton et al., 1984; Hassoun & Sasao, 2002] and Presto [Svoboda & White, 1979], with a cost/benefit analysis of operations
- Result: a single expand phase instead of the multiple expand/reduce/irredundant iterations (over the on-set, off-set, and dc-set); eliminating the need to compute the off-set reduces memory usage; on average only 2% larger than the optimal solution
Warp Processing – JIT FPGA Compilation
- Hard: routing is extremely compute- and memory-intensive, and a highly iterative process
- Solution: jointly design the CAD algorithms and the FPGA architecture, guided by cost/benefit analysis
Warp-Targeted FPGA Architecture (DATE '04)
- A CAD-specialized configurable logic fabric
- Simplified switch matrices: directly connected to the adjacent CLB, with all nets routed using only a single pair of channels – allows efficient routing (routing is by far the most time-consuming on-chip CAD task)
- Simplified CLBs: two 3-input, 2-output LUTs; each CLB is connected to the adjacent CLB to simplify routing of carry chains
- Currently being prototyped by Intel (scheduled for the 2006 Q3 shuttle)
[Figure: switch-matrix channel structure and a CLB containing two 3-input, 2-output LUTs with direct connections to adjacent CLBs.]
Warp Processing – Technology Mapping
- ROCTM – technology mapping/packing
- Decompose the hardware circuit into a DAG whose nodes are basic 2-input logic gates (AND, OR, XOR, etc.)
- Hierarchical bottom-up graph clustering: a breadth-first traversal combines nodes to form single-output LUTs, combines LUTs with common inputs to form the final 2-output LUTs, and packs LUT pairs in which the output of one LUT is an input to the second

Publications: Dynamic Hardware/Software Partitioning: A First Approach (DAC '03); A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning (DATE '04)
Warp Processing – Placement
- ROCPLACE – a dependency-based positional placement algorithm
- Identify the critical path and place the critical nodes in the center of the configurable logic fabric
- Use the dependencies between the remaining CLBs to determine their placement, attempting to use adjacent-CLB routing whenever possible

Publications: Dynamic Hardware/Software Partitioning: A First Approach (DAC '03); A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning (DATE '04)
Warp Processing – Routing
- ROCR – Riverside On-Chip Router
- Requires much less memory than VPR, because the routing resource graph is smaller
- 10x faster execution time than timing-driven VPR
- Produces circuits with critical paths 10% shorter than routability-driven VPR
[Figure: memory usage (KB) and execution time (s) per benchmark for routability-driven VPR, timing-driven VPR, and ROCR.]

Publication: Dynamic FPGA Routing for Just-in-Time FPGA Compilation (DAC '04)
Experiments with Warp Processing
- Warp processor: an ARM/MIPS core plus our fabric, with the Riverside on-chip CAD tools mapping the critical region to the configurable fabric; requires less than 2 seconds on a lean embedded processor to perform synthesis and JIT FPGA compilation
- Traditional HW/SW partitioning baseline: an ARM/MIPS core plus a Xilinx Virtex-E FPGA; the software was manually partitioned using VHDL, synthesized with Xilinx ISE 4.1
Warp Processors: Performance Speedup (Most Frequent Kernel Only)
- Average kernel speedup of 41x, versus 21x for the Virtex-E (individual kernels reach 191x, 113x, and 130x)
- The simplicity of the warp-targeted configurable logic results in faster hardware circuits
[Figure: per-benchmark kernel speedups over software-only execution, for the warp processor and the Xilinx Virtex-E.]
Warp Processors: Performance Speedup (Overall, Multiple Kernels)
- Average speedup of 7.4x; energy reduction of 38% – 94%
- Assumes a 100 MHz ARM, with the fabric clocked at the rate determined by synthesis
[Figure: per-benchmark overall speedups over software-only execution.]
Warp Processors – Results: Execution Time and Memory Requirements
- Xilinx ISE: 9.1 s, 60 MB
- On-chip CAD (DPM): 0.2 s, 3.6 MB
- On-chip CAD (DPM) running on a 75 MHz ARM7: 1.4 s, 3.6 MB
Direction: Coding Guidelines for Partitioning?
- The in-depth H.264 study led to a question: why aren't the speedups (from binary or C) closer to "ideal" (zero time per function)?
- We thus examined dozens of benchmarks in more detail: are there simple coding guidelines that result in better speedups when kernels are synthesized to circuits?
[Figure: speedup versus number of functions in hardware (1 to 51) for C partitioning, binary partitioning, and the ideal speedup (zero-time hardware execution); both real curves fall well short of ideal.]
Synthesis-Oriented Coding Guidelines
- Pass by value-return: declare a local array and copy in all the data needed by a function (makes the lack of aliases explicit)
- Function specialization: create a function version having frequent parameter values as constants

Original:

  void f(int width, int height) {
    . . . .
    for (i=0; i < width; i++)
      for (j=0; j < height; j++)
        . . .
    . . .
  }

Rewritten (specialized for width = height = 4):

  void f_4_4() {
    . . . .
    for (i=0; i < 4; i++)
      for (j=0; j < 4; j++)
        . . .
    . . .
  }

The bounds are now explicit, so the loops are unrollable.
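The pass-by-value-return guideline can be illustrated with a small sketch. The function names and the fixed size 10 are assumptions for the example; the point is that the local copy-in/copy-out makes the absence of aliasing explicit, so a synthesis tool can treat the loop iterations as independent:

```c
#include <string.h>

/* Original: operates directly on caller memory; a synthesis tool must
 * conservatively assume dst and src might alias each other. */
void scale_orig(int *dst, const int *src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = 2 * src[i];
}

/* Rewritten (pass by value-return): copy inputs into a local array,
 * compute, then copy the results out.  The local buffer cannot alias
 * anything else, so the iterations are clearly independent and
 * parallelizable in hardware. */
void scale_pvr(int *dst, const int *src) {
    int buf[10];
    memcpy(buf, src, sizeof buf);        /* copy in  */
    for (int i = 0; i < 10; i++)
        buf[i] = 2 * buf[i];             /* parallelizable body */
    memcpy(dst, buf, sizeof buf);        /* copy out */
}
```

In software the extra copies cost a little time; in hardware they become simple loads and stores around a fully parallel datapath.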
Synthesis-Oriented Coding Guidelines
- Algorithmic specialization: use parallelizable hardware algorithms when possible
- Hoisting and sinking of error checking: keep error checking out of loops to enable unrolling
- Lookup table avoidance: use expressions rather than lookup tables

Original:

  int clip[512] = { . . . };
  void f() {
    . . .
    for (i=0; i < 10; i++)
      val[i] = clip[val[i]];
    . . .
  }

Rewritten:

  void f() {
    . . .
    for (i=0; i < 10; i++)
      if (val[i] > 255) val[i] = 255;
      else if (val[i] < 0) val[i] = 0;
    . . .
  }

[Figure: each val[i] gets its own pair of comparators (> 255, < 0) feeding a 3-to-1 mux.] The comparisons can now be parallelized.
Synthesis-Oriented Coding Guidelines
- Use explicit control flow: replace function pointers with if statements and static function calls

Original:

  void (*funcArray[]) (char *data) = { func1, func2, . . . };
  void f(char *data) {
    . . .
    funcPointer = funcArray[i];
    (*funcPointer) (data);
    . . .
  }

Rewritten:

  void f(char *data) {
    . . .
    if (i == 0)
      func1(data);
    else if (i == 1)
      func2(data);
    . . .
  }
Original Rewritten
Coding Guideline Results on H.264
- Simple coding guidelines made a large improvement; the rewritten software is only ~3% slower than the original on a processor
- Binary partitioning remains competitive with C partitioning: speedups of 6.55 (binary) versus 6.56 (C)
- The small remaining difference is caused by switch statements that used indirect jumps
[Figure: speedup versus number of functions in hardware (1 to 51), before and after the rewrite, for both C and binary partitioning, against the ideal (zero-time hardware execution) speedup.]
Coding Guideline Results on Other Benchmarks
- Studied the guidelines further on standard benchmarks: further synthesis speedups (again, independent of the C-versus-binary issue)
- More guidelines remain to be developed
- As compute platforms incorporate FPGAs, might these guidelines become mainstream?
[Figure: speedups for g3fax, mpeg2, jpeg, brev, fir, and crc – software only, hw/sw with the original code, and hw/sw with the guidelines (some bars reach 573x, 1616x, and 842x) – together with the performance and size overheads of the rewritten code on a processor (roughly -88% to +30%).]
Direction: New Applications – Image Processing
- 32x average speedup compared to a microprocessor with a 10x faster clock
- Exploits the parallelism in image processing: window operations contain much fine-grained parallelism, and each pixel can be computed in parallel
- Performance is memory-bandwidth limited: warp processing can output a pixel per cycle for each pixel that can be fetched from memory per cycle, so faster memory will further improve performance
[Figure: speedups for Prewitt, FIR, Wavelet, Max, Blend, Antialias, Brighten, Roberts, Sobel, Emboss, Sharpen, Blur, Gaussian, Burt-Adelson, Median, Kuwahara, and their average.]
Direction: Applications with Process-Level Parallelism
- Parallel code provides further speedup: an average 79x speedup compared to a desktop microprocessor
- Use the FPGA to implement tens or hundreds of processors; instruction-level parallelism can also be exploited
- The warp tools will have to detect coarse-grained parallelism
[Figure: per-application speedups; the largest bars (79.2, 200, 500) exceed the chart's axis.]
Summary
- Showed the feasibility of warp technology: application kernels can be dynamically mapped to an FPGA by a reasonable amount of on-chip compute resources
- Tremendous potential applicability; presently investigating embedded (with Freescale), desktop (with Intel), and server (with IBM) domains
- Radically new FPGA applications may become possible: neural networks that rewire themselves? Network routers whose queuing structures change based on traffic patterns?
- If the technology exists to synthesize circuits dynamically, what can we do with that technology?