Portability for FPGA Applications: Warp Processing and SystemC Bytecode

Frank Vahid
Dept. of CS&E, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine

Contributing Ph.D. Students:
Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona)
Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ. of Florida, Gainesville)
Scotty Sirowy (current)
David Sheldon (current)
Chen Huang (current)

This research was supported in part by the National Science Foundation, the Semiconductor Research Corporation, Intel, Freescale, IBM, and Xilinx.


2/64 Frank Vahid, UC Riverside

Portable Applications on PCs

One x86 binary, multiple platforms: Pentium, Atom, Opteron, Dual Core

How? Why?

Portable Applications on PCs

Standard software binary
Dynamic software binary translation

“Ecosystem”: Applications, Tools, Architectures

[Figure: SW binary translation turns an x86 binary into a VLIW binary; processors labeled x86 µP and VLIW µP]

Meanwhile, Circuits on FPGAs Show Large Speedups

Int. Symp. on FPGAs, FCCM, FPL, CODES/ISSS, ICS, MICRO, CASES, DAC, DATE, ICCAD, RAW, …

[Figure: two bar charts of Speedup (axes 0-60 and 0-30); off-scale bars labeled 79.2, 200, and 500]

FPGAs Entering Computing Mainstream

Xilinx Virtex II Pro (source: Xilinx)
SGI Altix supercomputer (UCR: 64 Itaniums plus 2 FPGA RASCs)

AMD Opteron, Intel QuickAssist, Cray, SGI, Mitrionics, IBM Cell (research), Xilinx, Altera

Circuits on FPGAs are Software Binaries

Microprocessor Binaries (Instructions): bits (001010010…) loaded into program memory of a processor
FPGA “Binaries” (Circuits), aka "bitstream": bits (01110100...) loaded into LUTs and SMs of an FPGA

Both are "Software", not "Hardware" (Sep 2007 IEEE Computer)

“Portable Applications” + “FPGAs”

Standard software binary
Dynamic translation

“Ecosystem”: Applications, Tools, Architectures

[Figure: SW binary translation from an x86 binary to a VLIW binary (x86 µP, VLIW µP), and by analogy from an x86 binary to an FPGA binary (x86 µP, FPGA): “Warp Processing”. Speedup axis across x86, VLIW, DSP, FPGA]

Warp Processing

[Figure: platform with µP, I Mem, D$, FPGA, Profiler, and On-chip CAD]

Step 1: Initially, software binary loaded into instruction memory

Software Binary:

      Mov reg3, 0
      Mov reg4, 0
    loop:
      Shl reg1, reg3, 1
      Add reg5, reg2, reg1
      Ld reg6, 0(reg5)
      Add reg4, reg4, reg6
      Add reg3, reg3, 1
      Beq reg3, 10, -5
      Ret reg4

Warp Processing

Step 2: Microprocessor executes instructions in software binary

[Figure: platform and software binary as in Step 1; Time and Energy bars for the µP]

Warp Processing

Step 3: Profiler monitors instructions and detects critical regions in binary

[Figure: Profiler observes a repeating stream of add and beq instructions: Critical Loop Detected]
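The profiling step can be illustrated with a small C sketch. It assumes, which the slide does not state, that the profiler counts taken backward branches per target address in a small table and flags a loop when its counter reaches a threshold. The names `Profiler`, `profiler_observe`, `TABLE_SIZE`, and `HOT_THRESHOLD` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 16      /* hypothetical number of loops tracked at once */
#define HOT_THRESHOLD 8    /* hypothetical count that marks a loop as critical */

typedef struct {
    uint32_t target[TABLE_SIZE];  /* branch target address (loop head) */
    uint32_t count[TABLE_SIZE];   /* times the backward branch was taken */
    int used;                     /* number of valid entries */
} Profiler;

void profiler_init(Profiler *p) { memset(p, 0, sizeof *p); }

/* Observe one taken branch. Returns 1 when the branch closes a loop that
   has just become critical, 0 otherwise. Forward branches are ignored. */
int profiler_observe(Profiler *p, uint32_t pc, uint32_t target)
{
    assert(p != NULL);
    if (target >= pc) return 0;               /* not a backward branch */
    for (int i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return p->count[i] == HOT_THRESHOLD;
        }
    }
    if (p->used < TABLE_SIZE) {               /* start tracking a new loop */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
    return 0;
}
```

In this sketch the loop in the example binary would be reported once its closing branch has been taken `HOT_THRESHOLD` times; a real profiler would likely be a small hardware cache rather than a software table.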

Warp Processing

Step 4: On-chip CAD reads in critical region

[Figure: platform and software binary as in Step 1; Time and Energy bars now include Profiler and On-chip CAD]

Warp Processing

Step 5: On-chip CAD decompiles critical region into control data flow graph (CDFG)

      reg3 := 0
      reg4 := 0
    loop:
      reg4 := reg4 + mem[ reg2 + (reg3 << 1) ]
      reg3 := reg3 + 1
      if (reg3 < 10) goto loop
      ret reg4

Recover loops, arrays, subroutines, etc. – needed to synthesize good circuits

[Figure: platform as in Step 1, with the on-chip CAD block labeled Dynamic Part. Module (DPM)]

Warp Processing

Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit

[Figure: the decompiled loop synthesized into a tree of adders]

Warp Processing

Step 7: On-chip CAD maps circuit onto FPGA

[Figure: adder tree mapped onto FPGA fabric of CLBs and switch matrices (SM)]

Warp Processing

Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more

Software Binary (updated):

      Mov reg3, 0
      Mov reg4, 0
    loop:
      // instructions that interact with FPGA
      Ret reg4

>10x speedups for some apps

Warp speed, Scotty

[Figure: Time and Energy bars, Software-only vs. “Warped”]

Warp Processing Challenges

Can we decompile binaries sufficiently for synthesis?
Can we just-in-time (JIT) compile to FPGAs?

[Figure: on-chip CAD flow: Binary, Profiling & partitioning, Decompilation, CDFG, JIT FPGA compilation, FPGA binary; Binary Updater produces the Microp Binary. Platform: µP, I$, D$, FPGA, Profiler, On-chip CAD]

Decompilation

Recover high-level information from binary: branches, loops, arrays, subroutines, …
Adapted previous methods for processor-processor translation (UQBT)
Developed new synthesis-oriented methods (e.g., “reroll” loops, strength “promotion”)

Original C Code:

    long f( short a[10] ) {
      long accum = 0;
      for (int i = 0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }

Corresponding Assembly:

      Mov reg3, 0
      Mov reg4, 0
    loop:
      Shl reg1, reg3, 1
      Add reg5, reg2, reg1
      Ld reg6, 0(reg5)
      Add reg4, reg4, reg6
      Add reg3, reg3, 1
      Beq reg3, 10, -5
      Ret reg4

Control/Data Flow Graph Creation:

      reg3 := 0
      reg4 := 0
    loop:
      reg1 := reg3 << 1
      reg5 := reg2 + reg1
      reg6 := mem[reg5 + 0]
      reg4 := reg4 + reg6
      reg3 := reg3 + 1
      if (reg3 < 10) goto loop
      ret reg4

Data Flow Analysis:

      reg3 := 0
      reg4 := 0
    loop:
      reg4 := reg4 + mem[ reg2 + (reg3 << 1) ]
      reg3 := reg3 + 1
      if (reg3 < 10) goto loop
      ret reg4

Function Recovery:

    long f( long reg2 ) {
      int reg3 = 0;
      int reg4 = 0;
      loop:
        reg4 = reg4 + mem[reg2 + (reg3 << 1)];
        reg3 = reg3 + 1;
        if (reg3 < 10) goto loop;
      return reg4;
    }

Control Structure Recovery:

    long f( long reg2 ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += mem[reg2 + (reg3 << 1)];
      }
      return reg4;
    }

Array Recovery:

    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }

Original C code and recovered code: Almost Identical Representations
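To make the "almost identical" claim concrete, here is a minimal C sketch that places the source-level function next to the form recovered before array recovery and lets a caller check that they agree. It assumes a byte-addressed memory and a 16-bit load for `Ld`. The function names `f_source` and `f_decompiled` are hypothetical and not from the slides.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Source-level form: sum a 10-element array of 16-bit values. */
long f_source(const int16_t a[10])
{
    long accum = 0;
    for (int i = 0; i < 10; i++) accum += a[i];
    return accum;
}

/* Form recovered from the binary before array recovery: byte-addressed
   memory, base address in reg2, index scaled by 2 with a shift. */
long f_decompiled(const uint8_t *mem, long reg2)
{
    assert(mem != NULL && reg2 >= 0);
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
        int16_t word;
        memcpy(&word, &mem[reg2 + (reg3 << 1)], sizeof word); /* 16-bit load */
        reg4 += word;
    }
    return reg4;
}
```

Calling `f_decompiled((const uint8_t *)a, 0)` on the same array as `f_source(a)` returns the same sum, which is the observation array recovery relies on: a shift by 1 on the index is the scaling for 2-byte elements.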

Decompilation Results vs. C

Synthesis from decompiled binary is competitive with synthesis from C

                 Synthesis from C                  Synthesis from Decompiled Binary
Example          Cycles   ClkFrq  Time    Area     Cycles   ClkFrq  Time    Area     %TimeOverhead  %AreaOverhead
bit_correlator   258      118     2.2     15       258      118     2.2     15       0%             0%
fir              129      125     1.0     359      129      125     1.0     371      0%             3%
udiv8            281      190     1.5     398      281      190     1.5     398      0%             0%
prewitt          64516    123     524.5   2690     64516    123     524.5   4250     0%             58%
mf9              258      57      4.5     1048     258      57      4.5     1048     0%             0%
moravec          195072   66      2951.2  680      195072   70      2790.7  676      -6%            -1%
Avg:                                                                                 -1%            10%

[Figure: bar chart of Speedup per example, From C vs. From binary]

Decompilation Results on Optimized H.264
In-depth Study with Freescale

Again, competitive with synthesis from C

Function Name                   Instr   %Time   Cumulative Speedup
MotionComp_00                   33      6.8%    1.1
InvTransform4x4                 63      12.5%   1.1
FindHorizontalBS                47      16.7%   1.2
GetBits                         51      20.8%   1.3
FindVerticalBS                  44      24.7%   1.3
MotionCompChromaFullXFullY      24      28.6%   1.4
FilterHorizontalLuma            557     32.5%   1.5
FilterVerticalLuma              481     35.8%   1.6
FilterHorizontalChroma          133     39.0%   1.6
CombineCoefsZerosInvQuantScan   69      42.0%   1.7
memset                          20      44.9%   1.8
MotionCompensate                167     47.7%   1.9
FilterVerticalChroma            121     50.3%   2.0
MotionCompChromaFracXFracY      48      53.0%   2.1
ReadLeadingZerosAndOne          56      55.6%   2.3
DecodeCoeffTokenNormal          93      57.5%   2.4
DeblockingFilterLumaRow         272     59.4%   2.5
DecodeZeros                     79      61.3%   2.6
MotionComp_23                   279     63.0%   2.7
DecodeBlockCoefLevels           56      64.6%   2.8
MotionComp_21                   281     66.2%   3.0
FindBoundaryStrengthPMB         44      67.7%   3.1

[Figure: Speedup vs. # of Functions on FPGA (1 to 51); curves: Ideal, From C, From binary]
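The speedup figures in the H.264 table appear consistent with an ideal bound in which the functions moved to the FPGA take zero time, so that only the remaining software fraction is left. This reading is an inference from the numbers, not something the slide states. A minimal C sketch of that bound, with the hypothetical name `ideal_speedup`:

```c
#include <assert.h>

/* Upper bound on application speedup when a fraction of the run time
   (0 <= hw_time_fraction < 1) is moved to hardware that takes zero time. */
double ideal_speedup(double hw_time_fraction)
{
    assert(hw_time_fraction >= 0.0 && hw_time_fraction < 1.0);
    return 1.0 / (1.0 - hw_time_fraction);
}
```

For example, a cumulative time of 67.7% gives 1 / (1 - 0.677), about 3.1, and 50.3% gives about 2.0, matching the last column for those rows.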

Decompilation Effective Even with Compiler Optimizations

Do compiler optimizations hurt decompilation?
(Surprisingly) found optimized code synthesizes to even better circuits

[Figure: Average Speedup of 10 Examples; y-axis: speedup when decompiled binary is partitioned and synthesized to FPGA]

Decompilation

Summary: Decompilation is surprisingly effective at recovering high-level program structures for synthesis
Stitt et al., ICCAD’02, DAC’03, CODES/ISSS’05, ICCAD’05, FPGA’05, TODAES’06, TODAES’07

Ph.D. work of Greg Stitt (Ph.D. UCR 2007, now Asst. Prof. at UF Gainesville)

Warp Processing Challenges

Can we decompile binaries sufficiently for synthesis?
Can we just-in-time (JIT) compile to FPGAs?

Challenge: JIT Compile to FPGA

Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping, e.g.,
  Logic synthesis: run single expand phase
  Technology mapping: bottom-up graph clustering heuristic
  Placement: place critical path first, then adjacent items
  Routing: use resource graph that matches switch matrix / channel structure

[Figure: logic minimization phases Expand, Reduce, Irredundant over the on-set, off-set, and dc-set]

Commercial tool (logic synthesis, tech. map., placement, routing): 60 MB, 9.1 s
Ultra-lean Riverside JIT FPGA tools (drawn to scale): 3.6 MB, 0.2 s
Ultra-lean Riverside JIT FPGA tools on a 75 MHz ARM7: 3.6 MB, 1.4 s

Penalty: 1.3-2x in performance & size (even more might be acceptable)

JIT Compile to FPGA

Summary: Ultra-lean JIT FPGA compiler: 40x speedup, 20x less memory, 1.3x-2x circuit penalty
Lysecky et al., DAC’03, ISSS/CODES’03, DATE’04, DAC’04, DATE’05, FCCM’05, TODAES’06

Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona)

Warp Processing Results
Performance Speedup (Most Frequent Kernel Only)

Average kernel speedup of 41, vs. 200 MHz ARM (1 = ARM-only execution)
Overall application speedup average is 7.4

[Figure: bar chart of speedup with Warp Proc. per benchmark (axis 0-80); off-scale bars labeled 191 and 130. Platform: µP, I$, D$, FPGA, Profiler, On-chip CAD]

Warping Thread-Based Applications

Multi-core platforms, multi-threaded apps

    for (i = 0; i < 10; i++) {
      thread_create( f, i );
    }

OS schedules threads onto available µPs
Remaining threads added to queue
OS invokes on-chip CAD tools to create accelerators for f()
OS schedules threads onto accelerators (possibly dozens), in addition to µPs

Thread warping: use one core to create accelerator for waiting threads
Very large speedups possible – parallelism at bit, arithmetic, and now thread level too

[Figure: Compiler produces Binary; OS runs f() threads on several µPs; On-chip CAD with accelerator library (Acc. Lib) places f() accelerators on the FPGA; Performance bars: uP vs. Warp]

Memory Access Synchronization (MAS)

Must deal with widely known memory bottleneck problem: FPGAs great, but often can’t get data to them fast enough

    void f( int a[], int val ) {
      int result = 0;
      for (int i = 0; i < 10; i++) {
        result += a[i] * val;
      }
      . . . .
    }

    for (i = 0; i < 10; i++) {
      thread_create( thread_function, a, i );
    }

Same array: data for dozens of threads can create bottleneck (accelerators a() and b() on the FPGA fetch from RAM via DMA)
Threaded programs exhibit unique feature: multiple threads often access same or overlapping data
Solution: Fetch data once, broadcast to multiple threads (MAS)

Memory Access Synchronization (MAS)

Detect overlapping memory regions – “windows”

    void f( int a[], int i ) {
      int result = 0;
      result += a[i] + a[i+1] + a[i+2] + a[i+3];
      . . . .
    }

    for (i = 0; i < 100; i++) {
      thread_create( thread_function, a, i );
    }

Each thread accesses different addresses – but addresses may overlap (windows A[0-3], A[1-4], …, A[6-9], …)
Data streamed from RAM (A[0-103]) via DMA to “smart buffer”
Buffer delivers window to each thread

W/O smart buffer: 400 memory accesses
With smart buffer: 104 memory accesses

Synthesis creates active “smart buffer” [Guo/Najjar FPGA04]
  Actively fetches data, stores the reused data, delivers windows to threads
  Active rather than passive component; designed for specific threads
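The access counts for the window example can be reproduced with a short C sketch. It assumes thread i reads the `window` consecutive elements starting at a[i]; the function names are hypothetical. Note that counting distinct elements for i = 0..99 with a window of 4 gives 103, while the slide reports 104, which matches streaming A[0-103]; the slide does not explain the extra element.

```c
#include <assert.h>

/* Accesses when every thread fetches its own window from RAM. */
long accesses_without_buffer(long threads, long window)
{
    assert(threads >= 0 && window >= 0);
    return threads * window;
}

/* Accesses when each distinct element is fetched once and reused.
   Thread i reads a[i .. i+window-1], so the union of all windows is
   a[0 .. threads+window-2]. */
long accesses_with_buffer(long threads, long window)
{
    assert(threads >= 0 && window >= 0);
    if (threads == 0 || window == 0) return 0;
    return threads + window - 1;
}
```

With 100 threads and a window of 4, the first function returns 400. The second returns 103 under the stated assumption, so the saving is roughly a factor of the window size.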

Speedups from Thread Warping

Chose benchmarks with extensive parallelism
Four core (ARM11 400 MHz) base system
Virtex IV FPGA at circuit-specific clock frequency (~100-300 MHz)
Average 130x speedup

[Figure: speedup (axis 0-50) for Fir, Prewitt, Linear, Moravec, Wavelet, Maxfilter, 3DTrans, N-body, Avg., Geo. Mean; series: 4-uP, TW, 8-uP, 16-uP, 32-uP, 64-uP; one value labeled 38]

Still 20x faster than 32-core system (and 11x faster than 64-core)
Simulation pessimistic, actual results likely better
FPGA more flexible
But, FPGA uses additional area. Our FPGA size = ~36 ARM11s

Warp Scenarios

Warping takes time (seconds, minutes, or more) – when useful?

Long-running applications
  Scientific computing, etc.
Recurring applications (save and reuse FPGA configurations)
  Common in embedded systems
  Might view as (long) boot phase
For networked/docked devices, CAD can occur on server (ongoing work)

[Figure: timelines. Long Running Applications: µP, then On-chip CAD, then µP + FPGA, giving a single-execution speedup. Recurring Applications: µP (1st execution) with On-chip CAD, then FPGA executions with speedup]

Why Dynamic?

Static good, but hiding FPGA opens technique to all sw platforms
Standard languages/tools/binaries

Static Compiling to FPGAs: Specialized Language, Specialized Compiler, Binary + Netlist, µP + FPGA
Dynamic Compiling to FPGAs: Any Language, Any Compiler, Binary, µP + On-chip CAD + FPGA

“Ecosystem”: Applications, Tools, Architectures

Synthesis-Friendly Applications

Coding style impacts synthesis results

[Figure: Speedup (0-10) vs. Number of Functions in Hardware (1-51); curves: Ideal Speedup (Zero-time Hw Execution), Speedup After Rewrite (C Partitioning), Speedup from C Partitioning]


Synthesis-Friendly Application Coding Guidelines

Conversion to Constants (CC)

Conversion to Fixed Point (CF)

Conversion to Explicit Data Flow (CEDF)

Conversion to Explicit Memory Accesses (CEMA)

Function Specialization (FS)

Constant Input Enumeration (CIE)

Loop Rerolling (LR)

Conversion to Explicit Control Flow (CECF)

Algorithmic Specialization (AS)

Pass-By-Value Return (PVR)

Coding Guidelines

Conversion to Explicit Control Flow (CECF)

Problem: Function pointers may prevent static control flow analysis
Guideline: Don’t use function pointers. Replace with if-else, static calls
  Makes possible targets explicit

Before (synthesis unlikely to determine possible targets of function pointer):

    void f( int (*fp) (int) ) {
      . . . . .
      for (i=0; i < 10; i++) {
        a[i] = fp(i);
      }
    }

After:

    enum Target { FUNC1, FUNC2, FUNC3 };
    void f( enum Target fp ) {
      . . . . .
      for (i=0; i < 10; i++) {
        if (fp == FUNC1) a[i] = f1(i);
        else if (fp == FUNC2) a[i] = f2(i);
        else a[i] = f3(i);
      }
    }

[Figure: Synthesized Hardware for the function-pointer version: unknown block (?) driving a[i]. Synthesized Circuit for the explicit version: f1(i), f2(i), f3(i) feeding a 3x1 mux selected by fp, driving a[i]]

Speedups from Synthesis-Friendly Coding Guidelines

10 guidelines
For ~1,000 line benchmark: 5-6 changes typical, tens of minutes each
Simple guidelines increased speedup to 6.5x

[Figure: Speedup (0-10) vs. Number of Functions in Hardware (1-51); curves: Ideal Speedup (Zero-time Hw Execution), Speedup After Rewrite (C Partitioning), Speedup from C Partitioning]

Speedups from Synthesis-Friendly Coding Guidelines

Original C code (Powerstone, Mediabench)
  Original average speedup with FPGA: 2.6x (excludes brev)
Refined C code with guidelines
  Average speedup: 8.4x (excludes brev)
  Guidelines led to 3.5x improvement of speedup

[Figure: speedup (axis 0-20) for g3fax, crc, jpeg, brev, fir, mpeg2; series: Sw, Hw/sw with original code, Hw/sw with guidelines; off-scale bars labeled 573 and 842]

“Spatial” Algorithms for FPGAs

Example – Count patterns

Sequential algorithm
  Hash table
  10s of cycles per pattern

    int patterns[1000];
    int counts[1000];
    while (1) {
      WaitForPattern();
      CurrPattern = X;
      hash = HashFct(CurrPattern);
      item = Find(patterns, CurrPattern, hash);
      if (item) { counts[item]++; }
    }

Spatial algorithm
  Pipelined stages
  Essence is the connectivity of components, not the sequencing of instructions

[Figure: CurrPattern on a bus feeding Level 1, Level 2, ..., Level m logic, each stage holding a pattern and a count]

Spatial Algorithms for FPGAs

Spatial algorithm 2: Pipelined binary tree

[Figure: current pattern enters Level 1 and passes down the levels]
  Level 1: logic + memory, 1 pattern, 1 count
  Level 2: logic + memory, 2 patterns, 2 counts
  Level 3: logic + memory, 4 patterns, 4 counts
  ...
  Level n: logic + memory, 2^n patterns, 2^n counts

Example

Possible patterns pre-stored in binary search tree circuit

[Figure, animation frame 1: pipelined binary tree with Stages 1-4 (Level 1: 1 pattern, Level 2: 2 patterns, Level 3: 4 patterns, ..., Level n: 2^n patterns); patterns 73 and 48 entering the pipeline]
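The pipelined binary tree can be simulated in C to show how one pattern can enter per clock while earlier patterns are still moving down the levels. This is a sketch under assumptions: a 3-level tree with the hypothetical stored patterns 48, 23, 75, 11, 30, 60, 90 (the slides do not list the stored values), levels indexed from 0, and hypothetical names `clock_tick`, `stage`, and `counts`.

```c
#include <assert.h>

#define LEVELS 3                      /* hypothetical depth: 1 + 2 + 4 = 7 patterns */

/* One memory per level; level k holds 2^k sorted keys (k = 0 is the root). */
static const int level_mem[LEVELS][1 << (LEVELS - 1)] = {
    {48},
    {23, 75},
    {11, 30, 60, 90},
};
static int counts[LEVELS][1 << (LEVELS - 1)];

typedef struct { int valid, pattern, index; } StageReg;
static StageReg stage[LEVELS];        /* pipeline register in front of each level */

/* Advance the pipeline by one clock. A new pattern may enter at level 0 on
   every call, so up to LEVELS patterns are in flight at once. */
void clock_tick(int has_input, int pattern)
{
    /* Visit levels last to first so each level consumes the register
       contents written during the previous clock. */
    for (int k = LEVELS - 1; k >= 0; k--) {
        StageReg in = stage[k];
        stage[k].valid = 0;
        if (!in.valid) continue;
        int key = level_mem[k][in.index];
        if (in.pattern == key) {
            counts[k][in.index]++;            /* match: bump this node's count */
        } else if (k + 1 < LEVELS) {          /* pass to left or right child */
            stage[k + 1].valid = 1;
            stage[k + 1].pattern = in.pattern;
            stage[k + 1].index = 2 * in.index + (in.pattern > key);
        }
    }
    if (has_input) {
        stage[0].valid = 1;
        stage[0].pattern = pattern;
        stage[0].index = 0;
    }
}
```

Feeding the stream 73, 48, 23, 75, 11 with one `clock_tick(1, p)` per pattern, followed by `LEVELS` calls to `clock_tick(0, 0)` to drain the pipeline, increments the counts for 48, 23, 75, and 11; 73 is not stored in this hypothetical tree and falls off the last level.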

Example

[Figure, animation frames 2-5: patterns 73, 48, 23, 75, 11 enter one per step and advance through Stages 1-4 of the pipelined binary tree; values of 1 appear at some levels in the later frames]

Study of Spatial Algorithms in FCCM

FCCM 2001-2006
70 papers describing fast application on FPGA
Examined 35 in depth (every other one)
  6 used device-specific features
  9 represented expected synthesized circuit from the obvious sequential algorithm
  20 were spatially-oriented applications, e.g., earlier pipelined binary tree

Year   Application               Type
2001   3D Vec. Normalization     Spatial
2001   Efficient CAM             --
2001   Automated Sensor          Temporal
2001   Regular Expression        Spatial
2002   Hyperspectral Image       Spatial
2002   Machine Vision            Spatial
2002   RC4                       Temporal
2002   Set Covering              Spatial
2002   Template Matching         Spatial
2002   Triangle Mesh             Spatial
2003   Congruential Sieves       Temporal
2003   Content Scanning          Temporal
2003   F.P and Square Root       Spatial
2003   Gaussian Noise            Spatial
2003   TRNG                      --
2004   3D FDTD Method            Spatial
2004   Deep Packet Filter        --
2004   Online Floating Point     --
2004   Molecular Dynamics        Spatial
2004   Pattern Matching          Spatial
2004   Seismic Migration         Spatial
2004   Software Deceleration     --
2004   V.M Window                --
2005   Data Mining               Spatial
2005   Cell Automata             Temporal
2005   Particle Graphics         Spatial
2005   Radiosity                 Temporal
2005   Transient Waves           Spatial
2005   Road Traffic              Temporal
2006   All Pairs Shortest Path   Spatial
2006   Apriori Data Mining       Spatial
2006   Molecular Dynamics        Spatial
2006   Gaussian Elimination      Spatial
2006   Radiation Dose            Temporal
2006   Random Variates           Spatial

Portable Spatial Applications?

Current portable microprocessor binaries – sequential
  Extensions for threads, processes, ...
How to support spatial constructs?
  Ports, connections, timing model, ...

SystemC (www.systemc.org)
  Adds libraries and macros, still standard C++
  Sequential and spatial constructs
  Compiling links in the simulation kernel
    Self-executing simulation
  Intended for SoC simulation

Bytecode

Modern portability approach
  Java, C#

[Figure: Compiler produces bytecode; a VM runs it on each platform (Pentium, Atom, Opteron)]

Virtual Machine (VM): Program that executes bytecode
  May JIT compile to native architecture

SystemC Bytecode?

[Figure: SystemC source goes through a Compiler to SystemC bytecode; a VM runs it on each platform (Pentium, FPGA, Opteron + FPGA)]

UCR SystemC Bytecode and Compiler

SystemC:

    class EDGE_DETECTOR : public sc_module {
      // signal declarations
      …
      EDGE_DETECTOR() {
        SC_METHOD(mainComp);
        sensitive << dataReady;

        SC_METHOD(getPixel);
        sensitive << clock.pos();
      }

      void getPixel() {
        …
        dataReady.write(1);
      }

      void mainComp() {
        int i, j;
        for (i = 0; i < 3; i++) {
          for (j = 0; j < 3; j++) {
            sumX = sumX + mem.read() * GX[i][j];
          }
        }
        …
        edge.write(sumX + sumY);
      }
    };

UCR’s SystemC-to-bytecode compiler translates this to UCR’s SystemC bytecode:

    --header
    signal clock : 1
    signal reset : 1
    signal memory_in : 32
    signal fb_data : 32
    signal leds : 4

    process(clock)
      READ $1 memory_in
      ADD $2 $0 3
      ADD $3 $2 $1
      WRITE $3 s1
      ADDI $1 $0 1
      WRITE $1 dataReady
    END

    process(dataReady)
      READ $5 val6
      SW $5 24($0)
      READ $5 val7
      …
      ADDI $10 $0 0
      ADDI $7 $0 0
      ADDI $13 $0 8
      …
    END

Spatial Constructs: signal declarations, processes and their sensitivity
MIPS-like sequential instructions: process bodies
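A virtual machine for this kind of bytecode can be sketched in C as an interpreter that runs one process body and reports which signals changed, so that a scheduler could wake the processes sensitive to them. The instruction encoding, the `Instr` struct, the opcode and signal enums, and `run_process` are all hypothetical; the slides do not describe the actual encoding or VM internals.

```c
#include <assert.h>
#include <stdint.h>

enum Op { OP_READ, OP_WRITE, OP_ADD, OP_ADDI, OP_END };   /* hypothetical opcodes */
enum Sig { SIG_MEMORY_IN, SIG_S1, SIG_DATA_READY, SIG_COUNT };

typedef struct { enum Op op; int a, b, c; } Instr;

/* Run one process body until OP_END. Returns a bitmask of the signals whose
   value changed, which a scheduler would use to wake sensitive processes. */
uint32_t run_process(const Instr *code, int32_t sig[SIG_COUNT])
{
    int32_t reg[32] = {0};
    uint32_t changed = 0;
    for (const Instr *i = code; i->op != OP_END; i++) {
        assert(i->a >= 0 && i->a < 32);
        switch (i->op) {
        case OP_READ:  reg[i->a] = sig[i->b]; break;               /* READ $a sig */
        case OP_ADD:   reg[i->a] = reg[i->b] + reg[i->c]; break;   /* ADD $a $b $c */
        case OP_ADDI:  reg[i->a] = reg[i->b] + i->c; break;        /* ADDI $a $b imm */
        case OP_WRITE:                                             /* WRITE $a sig */
            if (sig[i->b] != reg[i->a]) changed |= 1u << i->b;
            sig[i->b] = reg[i->a];
            break;
        case OP_END:   break;
        }
        reg[0] = 0;   /* $0 stays zero, as in MIPS */
    }
    return changed;
}
```

Encoding the `process(clock)` body from the slide (treating its `ADD $2 $0 3` as an add-immediate) and running it with `memory_in` set to 39 leaves 42 in `s1` and 1 in `dataReady`, and the returned mask marks both signals as changed, which is what would trigger `process(dataReady)`.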


SystemC Bytecode for FPGAs

Demo

SystemC Bytecode Emulator

Bytecode uploadable via USB drive
Accelerators speed up emulation

[Figure: emulator on FPGA running SystemC bytecode: Main Processor, Instruction Memory, Read Signal Memory, Write Signal Memory, Input Memory, Output Memory, UART, Buttons, LEDs, USB Interface]

SystemC Bytecode Accelerators

Implementation
  MIPS-like multicycle RISC datapath
  100 MHz clock
  ~33 million instr/sec
  Communicates with core emulator via memory-mapped registers
  Area: ~5000 slices
  # of accelerators limited to # of masters allowed on bus
  ~1200 lines of VHDL

[Figure: emulator on FPGA running SystemC bytecode (Main Processor, Instruction Memory, Read/Write Signal Memory, Input/Output Memory, UART, Buttons, LEDs, USB Interface) with Accelerators 1-3; each accelerator: RISC Datapath, Register File, Local Mem, and bus/start/load logic]

Dynamic SystemC Accelerator Management

Only a limited number of SystemC accelerators can fit on an FPGA fabric
Dynamically map processes to accelerators based on process usage
Involves online algorithms

[Figure: Image Filter Example; time (ms, axis 0-5) for Random, Biased, and Periodic sequences under: Virtual machine, Big FPGA/no com, Big FPGA, Static preloaded, Greedy, AG; off-scale values labeled 42, 44, 43 and 11, 12, 10]
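One way to picture usage-based mapping is a simple greedy online policy, sketched here in C. This is an illustrative assumption, not necessarily the Greedy or AG strategy evaluated in the chart: each invocation bumps a usage counter, and a process that is not resident replaces the least-used resident process once its own count is higher. `Mapper`, `mapper_invoke`, `NUM_PROCS`, and `NUM_ACCEL` are hypothetical.

```c
#include <assert.h>

#define NUM_PROCS 8   /* hypothetical number of SystemC processes */
#define NUM_ACCEL 3   /* hypothetical number of accelerator slots */

typedef struct {
    long usage[NUM_PROCS];     /* invocations seen so far per process */
    int resident[NUM_ACCEL];   /* process id held by each slot, -1 if empty */
} Mapper;

void mapper_init(Mapper *m)
{
    for (int p = 0; p < NUM_PROCS; p++) m->usage[p] = 0;
    for (int s = 0; s < NUM_ACCEL; s++) m->resident[s] = -1;
}

/* Record one invocation of process `proc`. Returns 1 if it ran on an
   accelerator, 0 if it fell back to the emulator. Slots fill in order and
   are never emptied, so a free slot is always after the occupied ones. */
int mapper_invoke(Mapper *m, int proc)
{
    assert(proc >= 0 && proc < NUM_PROCS);
    m->usage[proc]++;
    int victim = 0;
    for (int s = 0; s < NUM_ACCEL; s++) {
        if (m->resident[s] == proc) return 1;          /* already mapped */
        if (m->resident[s] == -1) {                    /* free slot: load for next time */
            m->resident[s] = proc;
            return 0;
        }
        if (m->usage[m->resident[s]] < m->usage[m->resident[victim]]) victim = s;
    }
    /* Greedy replacement: evict the least-used resident if this one is hotter. */
    if (m->usage[proc] > m->usage[m->resident[victim]]) m->resident[victim] = proc;
    return 0;
}
```

The sketch ignores reconfiguration cost; a policy that accounts for the time to load an accelerator would need a stricter replacement condition.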

Just-in-Time Synthesis

Send SystemC bytecode to synthesis server
Server returns FPGA Specific Bitstream
Dynamically reconfigure some or all of the FPGA
Possible to even perform synthesis on-chip – “warp processing” (previous UCR work)

[Figure: emulator on FPGA running SystemC bytecode (Main Processor, Instruction Memory, Read/Write Signal Memory, Input/Output Memory, UART, Buttons, LEDs) with Accelerators 1-3]


Transmuting Coprocessors

Demo

FPGA is a Size-Limited Coprocessing Resource

FPGA implements coprocessors
App executions change. Must decide which coprocessors should be FPGA-resident at a given time – transmuting coprocessors

Flow between user device and server over the Internet:
  Upload app profile info (user app profile info, e.g., DOOM: 23sec, Blowfish: 6sec)
  Select coproc. set, generate new FPGA bitstream (CP library, CP selection, CP placement)
  Send back new bitstream, re-program FPGA
  Speedup with previous apps

[Figure: user device with µP, FPGA, DMA, Bus, I/O, Memory; server producing New FPGA binary; labels: A software update, A coprocessor update]

Transmuting Coprocessor Demo

Three image filters:
  Blur filter (S/L): Blur the image
  Sobel filter (S/L): Find the edge of the image
  Emboss filter (S/L): Emboss the image

Platform: Virtex 2P (XC2VP30): PPC + Coprocessors
  PPC Frequency: 100 MHz
  Coproc. Frequency: 50 MHz

Size (slice)   Small   Large
Blur           30      120
Sobel          228     912
Emboss         81      324

[Figure: bar chart of Time (axis 0-120) for MP, Small CP, Large CP; labels 30x and 120x]

Demo architecture

Image (128*128 pixels and 24-bit color): 24 BRAMs
Soft version: Read (Image BRAM), Execution (PPC), Write (Display BRAM)
Coprocessor version: Read (Image BRAM), Execution (Coproc), Write (Display BRAM)
Dock: send the profile information through UART.

[Figure: PPC, Instruction BRAM, Image BRAM, Display BRAM, Coproc, and peripherals (UART, Push button, VGA control driving VGA display, interface to external) on the PLB; tool labels EDK and ISE]

Coprocessor configurations

Microprocessor only
Small blur + small sobel
Small blur + small emboss
Small sobel + small emboss
Large blur
Large sobel
Large emboss

Choose the configuration according to app profile info.

[Figure: Virtex2P with PPC, Peripherals, Memory, and a coprocessor region holding one of: Blur (S) + Sobel (S), Blur (S) + Emboss (S), Sobel (S) + Emboss (S), Blur (L), Sobel (L), Emboss (L)]
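Choosing among the seven configurations from profile information can be sketched in C as picking the configuration with the lowest estimated total time. The cost model is an assumption: each filter's profiled software time is divided by a per-version speedup supplied by the caller. The names `configs`, `choose_config`, and the `speedup` parameter are hypothetical.

```c
#include <assert.h>

enum { BLUR, SOBEL, EMBOSS, NUM_FILTERS };
enum { SW = 0, SMALL = 1, LARGE = 2 };        /* implementation of a filter */

/* The seven configurations: which version of each filter is resident. */
static const int configs[7][NUM_FILTERS] = {
    {SW, SW, SW},          /* microprocessor only */
    {SMALL, SMALL, SW},    /* small blur + small sobel */
    {SMALL, SW, SMALL},    /* small blur + small emboss */
    {SW, SMALL, SMALL},    /* small sobel + small emboss */
    {LARGE, SW, SW},       /* large blur */
    {SW, LARGE, SW},       /* large sobel */
    {SW, SW, LARGE},       /* large emboss */
};

/* Pick the configuration with the lowest estimated total time.
   sw_time[f] is the profiled software time of filter f; speedup[v] is an
   assumed speedup for version v (speedup[SW] should be 1). */
int choose_config(const double sw_time[NUM_FILTERS], const double speedup[3])
{
    int best = 0;
    double best_time = -1.0;
    for (int c = 0; c < 7; c++) {
        double t = 0.0;
        for (int f = 0; f < NUM_FILTERS; f++) {
            assert(speedup[configs[c][f]] > 0.0);
            t += sw_time[f] / speedup[configs[c][f]];
        }
        if (best_time < 0.0 || t < best_time) { best_time = t; best = c; }
    }
    return best;
}
```

With illustrative speedups of 4 (small) and 16 (large), a profile dominated equally by blur and sobel selects the small blur + small sobel configuration, while a profile dominated by blur alone selects large blur.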

Dynamic Enables Expandable Logic Concept

Expandable RAM – System detects RAM during start, improves performance invisibly
Expandable Logic – Warp tools detect amount of FPGA, invisibly adapt application to use less/more hardware.

[Figure: µP with Cache, RAM, and Expandable RAM; µP with Cache, Profiler, Warp Tools, DMA, and several FPGAs (Expandable Logic); Performance bars relative to uP]

Summary

FPGAs entering mainstream
Portability of applications is important
Dynamic binary translation to FPGAs – Warp processing
  Shown feasible; extensive future work
Trends towards FPGA ubiquity
Microprocessor binaries need extensions for spatial constructs
  One approach: SystemC bytecode and virtual machine
  Can also be warped for circuit-speed

http://www.cs.ucr.edu/~vahid/pubs