Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis Greg Stitt Department of Electrical and Computer Engineering University of Florida

Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis

Greg Stitt

Department of Electrical and Computer Engineering

University of Florida

2/55

Introduction

Improved performance enables new applications
• Past decade: MP3 players, portable game consoles, cell phones, etc.
• Future architectures: speech/image recognition, self-guiding cars, computational biology, etc.

3/55

Introduction

FPGAs (Field Programmable Gate Arrays) – Implement custom circuits
• 10x, 100x, even 1000x speedups for scientific and embedded apps [Najjar 04][He, Lu, Sun 05][Levine, Schmit 03][Prasanna 06][Stitt, Vahid 05]
• But, FPGAs are not mainstream

Warp Processing Goal: Bring FPGAs into the mainstream
• Make FPGAs “invisible”

[Figure: Performance of uP vs. FPGA. FPGAs capable of large performance improvements.]

4/55

Introduction – Hardware/Software Partitioning

C Code for FIR Filter (shown twice on the slide, with loop bounds of 128 and 16):

    for (i=0; i < 128; i++)
        y[i] += c[i] * x[i]
    ...

    for (i=0; i < 16; i++)
        y[i] += c[i] * x[i]
    ...

Software only (Compiler, Processor): loop takes ~1000 cycles

Hardware/software partitioning selects performance critical regions for hardware implementation [Ernst, Henkel 93] [Gupta, DeMicheli 97] [Vahid, Gajski 94] [Eles et al. 97] [Sangiovanni-Vincentelli 94]

Processor + FPGA: designer creates custom hardware using hardware description language (HDL)

[Figure: hardware for loop, a row of multipliers (*) feeding a tree of adders (+)]

[Figure: Time and Energy bar charts (0 to 100), Sw vs. Hw/Sw]

Hardware for loop takes ~10 cycles
Speedup = 1000 cycles / 10 cycles = 100x
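
The loop and the speedup arithmetic on this slide can be sketched in C. This is only an illustration under stated assumptions: the function names `fir_update`, `region_speedup`, and `overall_speedup` are hypothetical, and the whole-application formula (Amdahl's law) is an addition that the slide itself does not state.

```c
#include <stddef.h>

/* The loop body from the slide: one multiply-accumulate per element. */
void fir_update(int *y, const int *c, const int *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += c[i] * x[i];
}

/* Speedup of the partitioned region: software cycles / hardware cycles. */
double region_speedup(double sw_cycles, double hw_cycles)
{
    return sw_cycles / hw_cycles;
}

/* Whole-application speedup when only `fraction` of the run time is
 * accelerated by `region` (Amdahl's law; not stated on the slide). */
double overall_speedup(double fraction, double region)
{
    return 1.0 / ((1.0 - fraction) + fraction / region);
}
```

With the slide's numbers, `region_speedup(1000.0, 10.0)` gives the quoted 100x for the loop itself; the application as a whole speeds up less unless the loop dominates the run time.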

5/55

Introduction – High-level Synthesis

Problem: Describing circuit using HDL is time consuming/difficult

Solution: High-level synthesis
• Create circuit from high-level code [Gupta, DeMicheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92]
• Allows developers to use higher-level specification
• Potentially, enables synthesis for software developers

[Figure: tool flow. Labels: High-level Code, Hw/Sw Partitioning, Compiler, High-level Synthesis, Software, Hardware, Libraries/Object Code, Linker, Bitstream, uP, FPGA]

6/55

Introduction – High-level Synthesis

Problem: Describing circuit using HDL is time consuming/difficult

Solution: High-level synthesis
• Create circuit from high-level code [Gupta, DeMicheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92]
• Allows developers to use higher-level specification
• Potentially, enables synthesis for software developers

[Figure: tool flow. Labels: High-level Code, High-level Synthesis, Software, Hardware, Libraries/Object Code, Linker, Bitstream, uP, FPGA]

7/55

Introduction – High-level Synthesis

Problem: Describing circuit using HDL is time consuming/difficult

Solution: High-level synthesis
• Create circuit from high-level code [Gupta, DeMicheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92]
• Allows developers to use higher-level specification
• Potentially, enables synthesis for software developers

High-level code given to High-level Synthesis:

    for (i=0; i < 16; i++)
        y[i] += c[i] * x[i]

[Figure: synthesized circuit, a row of multipliers (*) feeding a tree of adders (+)]

8/55

Outline
• Introduction
• Warp Processing Overview
• Enabling Technology – Binary Synthesis
    • Key techniques for synthesis from binaries
        • Decompilation
• Current and Future Directions
    • Multi-threaded Warp Processing
    • Custom Communication

9/55

Problems with High-Level Synthesis

Problem: High-level synthesis is unattractive to software developers
• Requires specialized language: SystemC, NapaC, HandelC, …
• Requires specialized compiler: Spark, ROCCC, CatapultC, …
• Limited commercial success
• Software developers reluctant to change tools

[Figure: Non-Standard Software Tool Flow. Labels: Specialized Language, Specialized Compiler, Synthesis, Software, Hardware, Libraries/Object Code, Linker, Bitstream, uP, FPGA]

10/55

Warp Processing – “Invisible” Synthesis

Solution: Make synthesis “invisible”

2 Requirements
• Standard software tool flow
    • Perform compilation before synthesis
• Hide synthesis tool
    • Move synthesis on chip
    • Similar to dynamic binary translation [Transmeta]
    • But, translate to hw

[Figure: Standard Software Tool Flow, with compilation moved before synthesis. Labels: High-level Code, Compiler, Libraries/Object Code, Linker, Software Binary, Synthesis, Software, Hardware, Bitstream, uP, FPGA]

11/55

Warp Processing – “Invisible” Synthesis

Solution: Make synthesis “invisible”

2 Requirements
• Standard software tool flow
    • Perform compilation before synthesis
• Hide synthesis tool
    • Move synthesis on chip
    • Similar to dynamic binary translation [Transmeta]
    • But, translate to hw

[Figure: standard tool flow. Labels: High-level Code, Compiler, Libraries/Object Code, Linker, Software Binary, Synthesis, Software, Hardware, Bitstream, uP, FPGA]

Warp processor looks like standard uP but invisibly synthesizes hardware

12/55

Warp Processing – “Invisible” Synthesis

Advantages
• Supports all languages, compilers, IDEs
• Supports synthesis of assembly code
• Supports synthesis of library code
• Also, enables dynamic optimizations

[Figure: standard tool flow. Languages: C, C++, Java, Matlab. Compilers: gcc, g++, javac, keil. Other labels: Libraries/Object Code, Linker, Software Binary, Synthesis, Software, Hardware, Bitstream, uP, FPGA]

Warp processor looks like standard uP but invisibly synthesizes hardware

13/55

Warp Processing Background: Basic Idea

[Figure: warp processor with µP, I Mem, D$, Profiler, FPGA, and On-chip CAD]

Step 1: Initially, software binary loaded into instruction memory

Software Binary:

    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

14/55

Warp Processing Background: Basic Idea

[Figure: warp processor with µP, I Mem, D$, Profiler, FPGA, and On-chip CAD; Time and Energy bars for the µP]

Step 2: Microprocessor executes instructions in software binary

Software Binary: same listing as on slide 13.

15/55

Warp Processing Background: Basic Idea

[Figure: warp processor with µP, I Mem, D$, Profiler, FPGA, and On-chip CAD; Time and Energy bars]

Step 3: Profiler monitors instructions and detects critical regions in binary

Software Binary: same listing as on slide 13.

[Figure: the Profiler observes a repeating stream of add and beq instructions. Critical Loop Detected.]
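
One simple way to picture this step is a table that counts taken backward branches, since each one closes a loop iteration. The sketch below assumes that scheme; the slide does not describe the profiler's internals, and every name here (`struct profiler`, `profiler_observe`, `profiler_hottest`, `PROFILE_SLOTS`) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PROFILE_SLOTS 16   /* hypothetical table size */

struct loop_counter {
    uint32_t target;   /* address the backward branch jumps to (loop head) */
    uint32_t count;    /* number of times that branch was taken */
};

struct profiler {
    struct loop_counter slots[PROFILE_SLOTS];
    size_t used;
};

/* Record one taken branch. Only backward branches (target <= pc) are
 * counted, because they are the ones that close a loop. */
void profiler_observe(struct profiler *p, uint32_t pc, uint32_t target)
{
    if (target > pc)
        return;                           /* forward branch: ignore */
    for (size_t i = 0; i < p->used; i++) {
        if (p->slots[i].target == target) {
            p->slots[i].count++;
            return;
        }
    }
    if (p->used < PROFILE_SLOTS) {        /* new loop head, if room remains */
        p->slots[p->used].target = target;
        p->slots[p->used].count = 1;
        p->used++;
    }
}

/* Return the loop head with the highest count, or 0 if none was seen. */
uint32_t profiler_hottest(const struct profiler *p)
{
    uint32_t best_target = 0;
    uint32_t best_count = 0;
    for (size_t i = 0; i < p->used; i++) {
        if (p->slots[i].count > best_count) {
            best_count = p->slots[i].count;
            best_target = p->slots[i].target;
        }
    }
    return best_target;
}
```

Feeding it the nine taken `Beq` branches of the example loop would make that loop head the hottest entry, which is the region handed to the on-chip CAD in the next step.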

16/55

Warp Processing Background: Basic Idea

[Figure: warp processor with µP, I Mem, D$, Profiler, FPGA, and On-chip CAD; Time and Energy bars]

Step 4: On-chip CAD reads in critical region

Software Binary: same listing as on slide 13.

17/55

Warp Processing Background: Basic Idea

[Figure: warp processor with µP, I Mem, D$, Profiler, FPGA, and Dynamic Part. Module (DPM) running the On-chip CAD; Time and Energy bars]

Step 5: On-chip CAD converts critical region into control data flow graph (CDFG)

Software Binary: same listing as on slide 13.

CDFG:

    reg3 := 0
    reg4 := 0

    loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop

    ret reg4

18/55

Warp Processing Background: Basic Idea

[Figure: warp processor with µP, I Mem, D$, Profiler, FPGA, and Dynamic Part. Module (DPM) running the On-chip CAD; Time and Energy bars]

Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit

Software Binary: same listing as on slide 13. CDFG: same as on slide 17.

[Figure: synthesized circuit, a tree of adders (+)]

19/55

Warp Processing Background: Basic Idea

[Figure: warp processor with µP, I Mem, D$, Profiler, FPGA, and Dynamic Part. Module (DPM) running the On-chip CAD; Time and Energy bars]

Step 7: On-chip CAD maps circuit onto FPGA

Software Binary: same listing as on slide 13. CDFG: same as on slide 17.

[Figure: the adder-tree circuit mapped onto the FPGA fabric of CLB and SM blocks]

20/55

Warp Processing Background: Basic Idea

[Figure: warp processor with µP, I Mem, D$, Profiler, FPGA, and Dynamic Part. Module (DPM) running the On-chip CAD; the adder-tree circuit is mapped onto the FPGA fabric of CLB and SM blocks]

Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more

Updated Software Binary:

    Mov reg3, 0
    Mov reg4, 0
    loop:
    // instructions that interact with FPGA
    Ret reg4

[Figure: Time and Energy bars, Software-only vs. “Warped”]

21/55

[Figure: warp processor (µP, Cache, Profiler, Warp Tools, DMA, RAM) with Expandable RAM and Expandable Logic (additional FPGAs); Performance of uP alone vs. with added FPGAs]

Expandable RAM – System detects RAM during start, improves performance invisibly

Expandable Logic – Warp tools detect amount of FPGA, invisibly adapt application to use less/more hardware.

22/55

Expandable Logic

Allows for customization of platforms
• User can select FPGAs based on used applications

[Figure: Performance for Application “Portable Gaming”: Unacceptable Performance]

23/55

Expandable Logic

Allows for customization of platforms
• User can select FPGAs based on used applications

[Figure: Performance for Application “Portable Gaming” as FPGAs are added]

• User can customize FPGAs to the desired amount of performance
• Performance improvement is invisible – doesn’t require new binary from the developer

24/55

Expandable Logic

Allows for customization of platforms
• User can select FPGAs based on used applications

[Figure: Performance for Application “Web Browser” with No-FPGA: Acceptable Performance]

• Platform designer doesn’t have to decide on fixed amount of FPGA.
• User doesn’t have to pay for FPGA that isn’t needed

25/55

Warp Processing Background: Basic Technology

[Figure: warp processor with uP, I$, D$, FPGA, Profiler, and On-chip CAD]

Challenge: CAD tools normally require powerful workstations
• Develop extremely efficient on-chip CAD tools
    • Requires efficient synthesis
    • Requires specialized FPGA, physical design tools (JIT FPGA compilation) [Lysecky FCCM05/DAC04], University of Arizona

[Figure: on-chip CAD flow. Binary, Synthesis, HW, JIT FPGA compilation (Logic Optimization, Technology Mapping, Placement & Routing), Updated Binary]

26/55

Warp Processing Background: On-Chip CAD

Tool          Memory    Time
Xilinx ISE    60 MB     9.1 s
On-chip CAD   3.6 MB    0.2 s

• On-chip CAD: 46x improvement, 30% perf. penalty
• On a 75 MHz ARM7: only 1.4 s

[Figure: per-stage comparison across RT Syn., Log. Opt., Tech. Map, Place, Route; one part of the flow is marked “Manually performed”]

27/55

Warp Processing: Initial Results - Embedded Applications

• Average speedup of 6.3x
    • Achieved completely transparently
• Also, energy savings of 66%

[Figure: bar chart of Speedup (0 to 15) by Benchmarks: brev, g3fax, url, rocm, pktflow, canrdr, bitmnp, tblook, ttsprk, matrix, idct, g721, mpeg2, fir, matmul, Average, Median]

28/55

Outline
• Introduction
• Warp Processing Overview
• Enabling Technology – Binary Synthesis
    • Key techniques for synthesis from binaries
        • Decompilation
• Current and Future Directions
    • Multi-threaded Warp Processing
    • Custom Communication

29/55

Binary Synthesis

• Warp processors perform synthesis from software binary – “binary synthesis”
• Problem: No high-level information
    • Synthesis needs high-level constructs
    • > 10x slowdown
• Can we recover high-level information for synthesis?
    • Make binary synthesis (and Warp processing) competitive with high-level synthesis

High-level code:

    for (i=0; i < 128; i++)
        y[i] += c[i] * x[i]
    ...

Binary produced by the Compiler (no high-level constructs – arrays, loops, etc.):

    Addi r1, r0, 0
    Ld r3, 256(r1)
    Ld r4, 512(r1)
    Subi r2, r1, 128
    Jnz r2, -5

[Figure: Binary Synthesis maps the binary onto Processor + FPGA]

Hardware can be > 10x to 100x slower

30/55

Decompilation

• We realized decompilation recovers high-level information
    • But, generally used for binary translation or source-code recovery
    • May not be suitable for synthesis
• We studied existing approaches [Cifuentes 94, 99, 01][Mycroft 99, 01]
    • DisC, dcc, Boomerang, Mocha, SourceAgain
• Determined relevant techniques
• Adapted existing techniques for synthesis

31/55

Decompilation – Control/Data Flow Graph Recovery

• Recovery of control/data flow graph (CDFG)
    • Format used by synthesis
• Difficult because of indirect jumps
    • Cannot statically analyze control flow
    • But, heuristics are over 99% successful on standard benchmarks [Cifuentes 99, 00]

Original C Code:

    long f( short a[10] ) {
        long accum = 0;
        for (int i=0; i < 10; i++) {
            accum += a[i];
        }
        return accum;
    }

Corresponding Assembly:

    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

Control/Data Flow Graph Creation:

    reg3 := 0
    reg4 := 0

    loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop

    ret reg4

32/55

Decompilation – Data Flow Analysis

• Original purpose - remove temporary registers
• Area overhead – 130%
• Need new techniques for binary synthesis

Original C code and corresponding assembly: same as on slide 31.

After Data Flow Analysis:

    reg3 := 0
    reg4 := 0

    loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop

    ret reg4

33/55

Decompilation – Data Flow Analysis

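Strength Reduction – Compare-with-zero instructions

    Sub reg3, reg4, reg5
    Bz reg3, -5

• Unoptimized DFG: reg4 and reg5 feed a Sub that produces reg3, which is compared with 0 to decide the branch. The subtractor is not needed and wastes area.
• Optimized DFG: reg4 is compared directly with reg5 (reg4 = reg5) to decide the branch.

Operator Size Reduction

    Lb reg4, 0(reg1)
    Mvi reg5, 16
    Add reg3, reg4, reg5

• Unoptimized DFG: 32-bit reg4 + 32-bit reg5, using a 32-bit adder to produce a 32-bit reg3.
• Optimized DFG: reg4 comes from a load byte (8-bit) and reg5 holds the constant 16 (5-bit), so only an 8-bit adder is needed, producing an 8-bit reg3.

Area Overhead Reduced to 10%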
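
The widths quoted for operator size reduction can be reproduced with a small calculation. This is a sketch under assumptions rather than the tool's actual analysis: the helper names are hypothetical, and the adder is sized to the wider operand as the slide does, ignoring a possible carry-out bit.

```c
#include <stdint.h>

/* Minimum number of bits needed to hold an unsigned constant (at least 1). */
unsigned bits_for_constant(uint32_t value)
{
    unsigned bits = 1;
    while ((value >>= 1) != 0)
        bits++;
    return bits;
}

/* Adder width taken as the wider of the two operand widths.
 * Assumption: the carry-out is dropped, matching the slide's 8-bit result;
 * a conservative analysis would allow one extra bit. */
unsigned adder_width(unsigned width_a, unsigned width_b)
{
    return width_a > width_b ? width_a : width_b;
}
```

Here `bits_for_constant(16)` yields the 5-bit width shown for reg5, and `adder_width(8, 5)` yields the 8-bit adder.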

34/55

Decompilation – Function Recovery

• Recover parameters and return values
    • Def-use analysis of prologue/epilogue
    • 100% success rate

Original C code and corresponding assembly: same as on slide 31.

After Function Recovery:

    long f( long reg2 ) {
        int reg3 = 0;
        int reg4 = 0;
        loop:
        reg4 = reg4 + mem[reg2 + (reg3 << 1)];
        reg3 = reg3 + 1;
        if (reg3 < 10) goto loop;
        return reg4;
    }

35/55

Decompilation – Control Structure Recovery

• Recover loops, if statements
    • Uses interval analysis techniques [Cifuentes 94]
    • 100% success rate

Original C code and corresponding assembly: same as on slide 31.

After Control Structure Recovery:

    long f( long reg2 ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += mem[reg2 + (reg3 << 1)];
        }
        return reg4;
    }

36/55

Decompilation – Array Recovery

• Detect linear memory patterns and row-major ordering calculations
• ~ 95% success rate [Stitt, Guo, Najjar, Vahid 05] [Cifuentes 00]

Original C code and corresponding assembly: same as on slide 31.

After Array Recovery:

    long f( short array[10] ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += array[reg3];
        }
        return reg4;
    }
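
To illustrate what a linear memory pattern means, the sketch below checks whether a sequence of accessed addresses advances by one constant stride, which suggests an array with that element size. It is a swapped-in simplification: the decompiler presumably analyzes address expressions such as `reg2 + (reg3 << 1)` rather than a concrete address trace, and the names `struct array_pattern` and `detect_linear_pattern` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct array_pattern {
    int is_linear;     /* 1 if every step equals the same nonzero stride */
    uint32_t base;     /* first address touched (candidate array base) */
    int32_t stride;    /* bytes between accesses (candidate element size) */
};

/* Check a trace of n accessed addresses for a constant nonzero stride. */
struct array_pattern detect_linear_pattern(const uint32_t *addr, size_t n)
{
    struct array_pattern r = {0, 0, 0};
    if (n < 2)
        return r;                          /* too short to judge */
    r.base = addr[0];
    r.stride = (int32_t)(addr[1] - addr[0]);
    if (r.stride == 0)
        return r;                          /* same address twice: not an array walk */
    for (size_t i = 2; i < n; i++) {
        if ((int32_t)(addr[i] - addr[i - 1]) != r.stride)
            return r;                      /* stride changed: not linear */
    }
    r.is_linear = 1;
    return r;
}
```

For the example loop, the addresses `reg2 + 0`, `reg2 + 2`, `reg2 + 4`, and so on have a stride of 2 bytes, consistent with the recovered `short array[10]`.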

37/55

Comparison of Decompiled Code and Original Code

• Decompiled code almost identical to original code
    • Only difference is variable names
• Binary synthesis is competitive with high-level synthesis

Original C Code:

    long f( short a[10] ) {
        long accum = 0;
        for (int i=0; i < 10; i++) {
            accum += a[i];
        }
        return accum;
    }

Decompiled Code:

    long f( short array[10] ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += array[reg3];
        }
        return reg4;
    }

Almost Identical Representations

38/55

Binary Synthesis Tool Flow

[Figure: tool flow. High-level Source, Compiler, Libraries/Object Code, Binary, then Binary Synthesis (Profiling, Decompilation, Hw/Sw Estimation, Hw/Sw Partitioning). Software side: Binary Updater, Updated Binary, uP. Hardware side: Synthesis, Hardware Netlists, Bitstream, FPGA.]

• Initially, high-level source is compiled and linked to form a binary
• Decompilation recovers high-level information needed for synthesis
• Binary Updater modifies binary to use synthesized hardware
• Binary synthesis tool: ~30,000 lines of C code

39/55

Binary Synthesis is Competitive with High-Level Synthesis

[Figure: bar chart of Speedup (0 to 15), High-level vs. Binary-level, for FIR Filter, Beamformer, Viterbi, Brev, Url, BITMNP01, IDCTRN01, PNTRCH01, Average. Small difference in speedup.]

• Binary synthesis competitive with high-level synthesis
    • Binary speedup: 8x, High-level speedup: 8.2x
    • High-level synthesis only 2.5% better
• Commercial products beginning to appear
    • Critical Blue, Binachip
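
As a quick check of the 2.5% figure, assuming it is the relative difference between the two average speedups quoted on this slide:

```latex
\[
\frac{8.2 - 8.0}{8.0} = 0.025 = 2.5\%
\]
```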

40/55

Binary Synthesis with Software Compiler Optimizations

• But, the binaries so far were generated with few optimizations
    • Optimizations for software may hurt hardware
    • Need new decompilation techniques

[Figure: C code, SW Compiler, Optimized Binary, Binary Synthesis, uP + FPGA. Binary is optimized for software; hardware synthesized from optimized binary may be inefficient.]

41/55

Loop Rerolling

Problem: Loop unrolling may cause inefficient hardware
• Longer synthesis times
    • Super-linear heuristics
    • Unrolling 100 times => synthesis time is 100^2 times longer
• Larger area requirements
    • Unrolling by compiler unlikely to match unrolling by synthesis
• Loop structure needed for advanced synthesis techniques

Solution: We introduce loop rerolling to undo loop unrolling

[Figure: Synthesis Execution Times, Non-unrolled vs. Unrolled]

Non-unrolled Loop:

    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5

Unrolled Loop:

    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1

42/55

Loop Rerolling – Identifying Unrolled Loops

Idea - Identify consecutively repeating instruction sequences

Original C Code:

    x = x + 1;
    for (i=0; i < 2; i++)
        a[i]=b[i]+1;
    y=x;

Unrolled Loop:

    x = x + 1;
    a[0] = b[0]+1;
    a[1] = b[1]+1;
    y = x;

Binary, with each instruction mapped to a character (Map to String):

    Add r3, r3, 1    => B
    Ld r0, b(0)      => A
    Add r1, r0, 1    => B
    St a(0), r1      => C
    Ld r0, b(1)      => A
    Add r1, r0, 1    => B
    St a(1), r1      => C
    Mov r4, r3       => D

String Representation: BABCABCD

Find Consecutive Repeating Substrings: Adjacent Nodes with Same Substring
• Unrolled loop found: 2 unrolled iterations
• Each iteration = abc (Ld, Add, St)

[Figure: Suffix Tree of the string representation [Ukkonen 95]]
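
The idea of finding consecutively repeating substrings can be tried on the slide's string with a brute-force scan. Note the swap: the slide uses a suffix tree [Ukkonen 95], whereas this sketch simply tries every start and period, which is far slower on long strings but easier to follow. The names `struct tandem_repeat` and `find_tandem_repeat` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* `count` back-to-back copies of a `period`-long substring, starting at `start`. */
struct tandem_repeat {
    size_t start;
    size_t period;
    size_t count;    /* 0 when no substring repeats at least twice in a row */
};

/* Brute-force search for the consecutive repeat covering the most characters. */
struct tandem_repeat find_tandem_repeat(const char *s)
{
    struct tandem_repeat best = {0, 0, 0};
    size_t n = strlen(s);

    for (size_t period = 1; period <= n / 2; period++) {
        for (size_t start = 0; start + 2 * period <= n; start++) {
            size_t count = 1;
            /* Extend while the next block equals the first block. */
            while (start + (count + 1) * period <= n &&
                   memcmp(s + start, s + start + count * period, period) == 0)
                count++;
            if (count >= 2 && count * period > best.count * best.period) {
                best.start = start;
                best.period = period;
                best.count = count;
            }
        }
    }
    return best;
}
```

On `"BABCABCD"` the scan reports a repeat starting at index 1 with period 3 and count 2, that is, the two `ABC` (Ld, Add, St) iterations identified on the slide.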

43/55

Loop Rerolling

Original C Code:

    x = x + 1;
    for (i=0; i < 2; i++)
        a[i]=b[i]+1;
    y=x;

Unrolled Loop Identification:

    Add r3, r3, 1
    Ld r0, b(0)
    Add r1, r0, 1
    St a(0), r1
    Ld r0, b(1)
    Add r1, r0, 1
    St a(1), r1
    Mov r4, r3

1) Determine relationship of constants (here, the array offsets 0 and 1 in the two iterations)

2) Replace constants with induction variable expression

    Add r3, r3, 1
    i=0
    loop:
    Ld r0, b(i)
    Add r1, r0, 1
    St a(i), r1
    Bne i, 2, loop
    Mov r4, r3

3) Rerolled, decompiled code

    reg3 = reg3 + 1;
    for (i=0; i < 2; i++)
        array1[i]=array2[i]+1;
    reg4=reg3;

Average Speedup of 1.6x

44/55

Strength Promotion

Problem: Strength reduction may cause inefficient hardware
• However, some of the strength reduction was beneficial

Strength promotion lets synthesis decide on strength reduction, not software compiler
1) Identify strength-reduced subgraphs
2) Replace with multiplication
3) Synthesis reapplies strength reduction to get optimal DFG

[Figure: DFG computing A[i] as a sum of four shift-and-add subgraphs: (B[i] << 3) + (B[i] << 1), (B[i+1] << 4) + (B[i+1] << 1), (B[i+2] << 5) + (B[i+2] << 1), (B[i+3] << 6) + (B[i+3] << 1). The subgraphs are replaced one at a time with multiplications by 10, 18, 34, and 66.]

Average Speedup of 1.5x
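
The replacement of a shift-and-add subgraph with a multiplication rests on a simple identity, sketched below for the two-shift case that appears in the slide's figure. The function names are hypothetical, and real strength-reduced code can take other shapes (for example shifts combined with subtraction) that this sketch does not cover.

```c
#include <stdint.h>

/* Constant c such that (x << shift_a) + (x << shift_b) == x * c.
 * Both shifts are assumed to be below 32. */
uint32_t promoted_multiplier(unsigned shift_a, unsigned shift_b)
{
    return (UINT32_C(1) << shift_a) + (UINT32_C(1) << shift_b);
}

/* Strength-reduced form, as a software compiler might emit it. */
uint32_t shift_add(uint32_t x, unsigned shift_a, unsigned shift_b)
{
    return (x << shift_a) + (x << shift_b);
}

/* Promoted form handed to synthesis: a single multiplication. */
uint32_t promoted(uint32_t x, unsigned shift_a, unsigned shift_b)
{
    return x * promoted_multiplier(shift_a, shift_b);
}
```

The shift pairs (3, 1), (4, 1), (5, 1), and (6, 1) give the multipliers 10, 18, 34, and 66 shown in the figure.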

45/55

Multiple ISA/Optimization Results

• What about aggressive software compiler optimizations?
    • May obscure binary, making decompilation impossible
• What about different instruction sets?
    • Side effects may degrade hardware performance

[Figure: bar chart of Speedup (0 to 30) for MIPS -O1, MIPS -O3, ARM -O1, ARM -O3, MicroBlaze -O1, MicroBlaze -O3]

• Speedups similar on MIPS for –O1 and –O3 optimizations
• Speedups similar on ARM for –O1 and –O3 optimizations
• Speedups similar between ARM and MIPS
    • Complex instructions of ARM didn’t hurt synthesis
• MicroBlaze speedups much larger
    • MicroBlaze is a slower microprocessor
    • -O3 optimizations were very beneficial to hardware

46/55

High-level vs. Binary Synthesis: Proprietary H.264 Decoder

• High-level synthesis vs. binary synthesis
    • Collaboration with Freescale Semiconductor
• H.264 Decoder
    • MPEG-4 Part 10 Advanced Video Coding (AVC)
    • 3x smaller than MPEG-2
    • Better quality

[Figure: MPEG2 vs. H.264]

47/55

High-level vs. Binary Synthesis: Proprietary H.264 Decoder

• Binary synthesis was competitive with high-level synthesis
    • High-level speedup – 6.56x
    • Binary speedup – 6.55x

[Figure: Speedup (0 to 10) vs. Number of Functions in Hardware (1 to 51), with curves Speedup (High-level) and Speedup (Binary). Binary synthesis competitive with high-level synthesis.]

48/55

Outline
• Introduction
• Warp Processing Overview
• Enabling Technology – Binary Synthesis
    • Key techniques for synthesis from binaries
        • Decompilation
• Current and Future Directions
    • Multi-Threaded Warp Processing
    • Custom Communication

49/55

Thread Warping - Overview

• Architectural Trend – Include more cores on chip
• Result – More multi-threaded applications

Function a( ):

    for (i=0; i < 10; i++)
        createThread( b );

[Figure: multi-core warp processor with µPs, OS, Profiler, Warp Tools, Warp FPGA, and a Thread Queue holding b( ) threads]

• OS can only schedule 2 threads
• Remaining 8 threads placed in thread queue
• Warp tools create custom accelerators for b( )
• OS schedules 4 threads to custom accelerators
• 3x more thread parallelism

50/55

Thread Warping - Overview

Function a( ):

    for (i=0; i < 10; i++)
        createThread( b );

[Figure: multi-core warp processor with µPs, OS, Profiler, Warp Tools, and Warp FPGA holding b( ) accelerators]

• Profiler detects performance critical loop in b( )
• Warp tools create larger/faster accelerators
• Potentially > 100x speedup

51/55

Thread Warping - Results

Comparison of thread warping (TW) and multi-core
• Simulated multi-cores ranging from 4 to 64
• Thread warping – 4 cores + FPGA
• Thread warping 120x faster than 4-uP (ARM) system

[Figure: bar chart (axis 0 to 50; off-scale bars labeled 130, 502, 63, 130, 38, 308) for Fir, Prewitt, Linear, Moravec, Wavelet, Maxfilter, 3DTrans, N-body, Avg., Geo. Mean. Series: 4-uP, TW, 8-uP, 16-uP, 32-uP, 64-uP.]

52/55

Warp Processing – Custom Communication

• NoC – Network on a Chip provides communication between multiple cores [Benini, DeMicheli][Hemani][Kumar]
• Problem: Best topology is application dependent

[Figure: four µPs; Performance of Bus vs. Mesh topologies for App1 and App2]

53/55

Warp Processing – Custom Communication

• NoC – Network on a Chip provides communication between multiple cores [Benini, DeMicheli][Hemani][Kumar]
• Problem: Best topology is application dependent
• Warp processing can dynamically choose topology – 2x to 100x improvement

[Figure: four µPs connected through an FPGA; Performance of Bus vs. Mesh topologies for App1 and App2]

Collaboration with Rakesh Kumar, University of Illinois, Urbana-Champaign: “Amoebic Computing”

54/55

Summary

[Figure: warp processor with uP, I$, D$, FPGA, Profiler, and On-chip CAD]

• Developer is unaware of FPGA/synthesis
    • Any Language, Any Compiler, Standard Binary
• On-chip CAD: Binary Synthesis and JIT FPGA Compilation turn the Binary into HW and an Updated Binary
    • Decompilation makes binary synthesis possible
• Expandable Logic

[Figure: Performance of uP vs. Warp Processing]

Warp processing invisibly achieves > 100x speedups

55/55

References

Patent
• Warp Processor for Dynamic Hardware/Software Partitioning. F. Vahid, R. Lysecky, G. Stitt. Patent Pending, 2004.

1. Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2002, pp. 164-170.

2. Warp Processors. R. Lysecky, G. Stitt, and F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2006, Volume 11, Number 3, pp. 659-681.

3. Binary Synthesis. G. Stitt and F. Vahid. Accepted for publication in ACM Transactions on Design Automation of Electronic Systems (TODAES).

4. Expandable Logic. G. Stitt and F. Vahid. Submitted to IEEE/ACM Conference on Design Automation (DAC), 2007.

5. New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2005, pp. 547-554.

6. Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, and B. Einloth. IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005, pp. 285-290.

7. A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. IEEE/ACM Design Automation and Test in Europe (DATE), 2005, pp. 396-397.

8. Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky, and F. Vahid. IEEE/ACM Conference on Design Automation (DAC), 2003, pp. 250-255.

Supported by NSF, SRC, Intel, IBM, Xilinx