38
SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter A. Milder 1 ([email protected]) Franz Franchetti, James C. Hoe, and Markus Pueschel 2 Department of ECE Department of ECE Carnegie Mellon University CMU/ECE/Hoe, February 2013, slide1 now with 1 SUNY Stonybrook and 2 ETH

SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

SPIRAL DSP Transform Compiler:Application Specific Hardware Synthesis

Peter A. Milder1 ([email protected])Franz Franchetti, James C. Hoe, and Markus Pueschel2

Department of ECEDepartment of ECECarnegie Mellon University

CMU/ECE/Hoe, February 2013, slide‐1

now with 1SUNY Stonybrook and 2ETH 

Page 2: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

The SPIRAL ProjectThe SPIRAL Project• High performance implementations of linear DSP 

f (DFT DCT DWT fil )transforms (DFT, DCT, DWT, filters, etc) are an important class of design problemsH d d i d t i i t i k d i• Hand design and tuning is tricky and expensive– needs both math and implementation knowledge

i i d di– time‐consuming and tedious – needs to repeat effort for every new context

• SPIRAL research goal: A flexible push‐button design generator that produces SW & HW implementations comparable with expert hand design

CMU/ECE/Hoe, February 2013, slide‐2

comparable with expert hand design

Page 3: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Why we can do better than hand designy g

• SPIRAL is only focused on linear DSP transforms

• These transforms are highly structured, highly regular and very well understood mathematicallyegu a a d e y e u de stood at e at ca y

• Algorithmic implementations of a transform can be d f ll i k f lenumerated following a known set of rules

• For a given objective function and mapping target, aFor a given objective function and mapping target, a computer generates a solution at least as good as the best human effortby trying enough 

CMU/ECE/Hoe, February 2013, slide‐3

implementations 

Page 4: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

SPIRAL Framework

SPIRAL

I want a DFT of size 1024on a {Xilinx, P4, Cell....} 

SPIRAL automationstarts here

where mostwhere mosttools beginautomatingthe problem

CMU/ECE/Hoe, February 2013, slide‐4

Principle 1: Domain knowledge in the systemPrinciple 2: Optimization at a high level of abstraction

Page 5: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

www.spiral.net/hardware/dftgen.html

CMU/ECE/Hoe, February 2013, slide‐5

Page 6: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

High‐Level, Quality, and SpecializationHigh Level, Quality, and Specialization

High level:High‐level:tools know

better than you

RTL Synthesis: general purpose

y

RTL Synthesis: general‐purposebut special handling of

structures like FSM, arith, etc.

Place and Route: works the same

, ,

CMU/ECE/Hoe, February 2013, slide‐6

Place‐and‐Route: works the sameno matter what design

Page 7: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

OutlineOutline• SPIRAL Formula Framework

• SPIRAL for HW FFT cores

• SPIRAL for HW FFT “un”‐coreSPIRAL for HW FFT  un core

CMU/ECE/Hoe, February 2013, slide‐7

Page 8: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Linear TransformsLinear Transforms• Linear transform is a matrix‐vector multiplication

– computing by definition takes O(N2) operations– the matrix has structure

• E.g. discrete Fourier transform: y = DFTN xy0y1.

x0x1.

k 0 .. N-1

j

yj.

= xk.

.Njkie 2j

0 ..

CMU/ECE/Hoe, February 2013, slide‐8

.yN-1

.xN-1

N-1

Page 9: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

“Fast” AlgorithmsFast  Algorithms• a “fast” algorithm factors the matrix into a sequence of structured sparse matricesstructured, sparse matricescheaper sparse multiplies  O(N log(N)) operations

• E g Cooley Tukey Factorization of DFT• E.g. Cooley‐Tukey Factorization of DFT4

11

1111

11

1111

111111ii

11

1

1111

111

1

1111

11

111111

11

iii

ii

• Matrix formula representation

44

CMU/ECE/Hoe, February 2013, slide‐9

4222

42224 LDFTIDIDFTDFT

Page 10: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Factorization RulesFactorization RulesE.g. Cooley‐Tukeyg y y

mnnmn

mnnmnmn LDFTIDIDFTDFT

11

– DFT2 is – D is a diagonal matrix of twiddle factors

1111

– L is a stride permutation matrix– AB=[aj,kB] is the tensor (or kronecker) product

A In

a0,0a0,0a0,000 a0,1a0,1a0,10

0

a1,0a1 00 a1,1a1 1

00BBB

e g I B

CMU/ECE/Hoe, February 2013, slide‐10

A In a1,0a1,00a1,1a1,10

0

B

e.g., In B

Page 11: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

“Fast” Fourier Transform AlgorithmsFast  Fourier Transform Algorithms• Recursively factorize by the Cooley‐Tukey rule until only leaf cases remain (e g DFT for radix r)only leaf cases remain (e.g. DFTr for radix‐r)

8242

82428 LDFTIDIDFTDFT

• Exponential number of alternatives

82

4222

42222

8242 LLDFTIDIDFTIDIDFT

Exponential number of alternatives 8DFT

2DFT 4DFT

8DFT

4DFT 2DFT

• Each ruletree corresponds a different algorithm

2DFT 4DFT2DFT 2DFT2DFT 2DFT

CMU/ECE/Hoe, February 2013, slide‐11

• Each ruletree corresponds a different algorithm• All cost O(N log(N))

Page 12: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

A System of Transforms and Rulesy

2

)(2 2/1,1 FdiagDCT II

QnIV

nII

nII

n FIDCTDCTPDCT 22/)(

2/)(

2/)(

DDCTSDCT IIn

IVn )()(

IV )(r

IVn MMDCT 1

)(

PDFTIDIDFTDFT

CDSTDCTBDFT In

Inn )( )(

2/)(2/

50+ transforms150+ rules

))(()()( // hFIIIhF ddnkdk

dnn

PDFTIDIDFTDFT mnmnnm

EhCirchF )()(

( )n

WHT I WHT I

EWIPIWDWTWDWT knnnn )())(()( 2/2/2/

EhCirchFn )()(

CMU/ECE/Hoe, February 2013, slide‐12

1 1 12 2 2 21

( )n n n n n ni i i t

iWHT I WHT I

Page 13: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Algorithmic Design Spaceg g psize # of DFT # of DCT‐IV

248

1640

1101268

163264

40296

27744162570361280

12631242

1924443362734381512163135424264

128256512

162570361280~1.01  1027~2.31  1061~2 86 10133

7343815121631354242~1.07  1038~2.30  1076~1 06 10153512 2.86  10133 1.06  10153

Different characteristics: data flow, numerical stability operation ordering working set size

CMU/ECE/Hoe, February 2013, slide‐13

stability, operation ordering, working set size, datapath regularity

Page 14: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Design Space: SW DCT 32 on P4Design Space: SW DCT 32 on P4 Histogram of 10,000 randomly selected algorithms

histogram by runtime(P4, 3.2 GHz)

CMU/ECE/Hoe, February 2013, slide‐14

histogram by num. accuracy

Page 15: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

OutlineOutline• SPIRAL Formula Framework

• SPIRAL for HW FFT cores

• SPIRAL for HW FFT “un”‐coreSPIRAL for HW FFT  un core

CMU/ECE/Hoe, February 2013, slide‐15

Page 16: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Formula to HW (Combinational)Formula to HW (Combinational)• Given                        where       is:

– apply then– apply     , then – is a permutation permute– apply     ,     times in parallel

i di l l– is a diagonal scale 

B A

A×7

×8

CMU/ECE/Hoe, February 2013, slide‐16

A ×4

×2

Page 17: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

DFT8 Datapath Example

xx

x

x x

x

x

x

x

x

CMU/ECE/Hoe, February 2013, slide‐17

82

4222

42222

82428 LLDFTIDIDFTIDIDFTDFT

(formula is applied from right to left)

Page 18: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Pease DFT8 Example8 p

x x x

x x xx

x

x

x

x

xx x x

x x x

stage 1 stage 2 stage 3

CMU/ECE/Hoe, February 2013, slide‐18

Page 19: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

How about good HW?g• Matrix formulas have a natural mapping to dataflow and hence combinational datapathdataflow and hence combinational datapath

• However, real hardware designs must fit a given resource constraintresource constraint  sequential datapath that reuse available HW

identify repeated kernels– identify repeated kernels– instantiate kernels under resource constraints

h d l t ti t i t ti t d– schedule computation to reuse instantiated kernels

We want to do the analysis and mapping at

CMU/ECE/Hoe, February 2013, slide‐19

We want to do the analysis and mapping at formula level, with high‐level algorithm knowledge

Page 20: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Tensor as Streaming Parallelismg

partially streamedfully parallel fully streamed

CMU/ECE/Hoe, February 2013, slide‐20

Page 21: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Pease DFT Example: DFTPease DFT Example:  DFT8x x x

x x xx

x

x

x

x

xx x x

x x x

stage 1 stage 2 stage 3

CMU/ECE/Hoe, February 2013, slide‐21

Page 22: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Pease DFT Example: DFT StreamingPease DFT Example:  DFT8 Streaming

x x xx x x

x x x

f(L82)f(L8

2) f(L82)f(L8

2) f(L82)f(L8

2) f(L82)f(R8)

x x x

x x xf(L8

2)f(L82) f(L8

2)f(L82) f(L8

2)f(L82) f(L8

2)f(R8)

stage 1 stage 2 stage 3

CMU/ECE/Hoe, February 2013, slide‐22

Page 23: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Regular Structure for HW• Simple regular structure embodied in Pease FFT

Regular Structure for HW

• Example:

CMU/ECE/Hoe, February 2013, slide‐23

Page 24: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Formally representing horizontal reuseFormally representing horizontal reuse

hr hrhr

not horizontally horizontally partially horizontally

CMU/ECE/Hoe, February 2013, slide‐24

yreused

yreused

p y yreused

Page 25: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Iterative Reuse of Logicg

latencyt

CMU/ECE/Hoe, February 2013, slide‐25

latencyFine-grained control over cost/latency tradeoff

cost

Page 26: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Example: rewriting rulesf ifor streaming reuse

CMU/ECE/Hoe, February 2013, slide‐26

26

Page 27: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Applicability to other transforms?pp y• DFT radix 2

1

0

22222 11

k

ii

k

kkk LDFTITR

• DFT radix 2r

0i

1/

0

22222

rk

i

k

rkrrkk LDFTITR

• 2‐D DFTnxn 1

2

nnnn DFTIL

0i

nxn

• WHT

0i

nnn

1/

2rk

k

kk LWHTIWHT

• DCT (type II) Hk

kkk

PLLLADP 221

2

0

222i

rkrrk LWHTI

CMU/ECE/Hoe, February 2013, slide‐27

• DCT (type II) H

iik kkk PLLLADP 2

22

22

0

22 11

Page 28: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

FPGA: Area vs ThroughputFPGA: Area vs. Throughput

Pareto optimal

49x slices132x throughput

CMU/ECE/Hoe, February 2013, slide‐28

28

Page 29: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

OutlineOutline• SPIRAL Formula Framework

• SPIRAL for HW FFT cores

• SPIRAL for HW FFT “un”‐coreSPIRAL for HW FFT  un core

CMU/ECE/Hoe, February 2013, slide‐29

Page 30: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

A 2D‐FFT AlgorithmA 2D‐FFT Algorithm• Row‐column algorithm:g

2D-

Row StageColumn Stage

Dataset:Dataset:(Logical abstractionof the 2D dataset)

… …

CMU/ECE/Hoe, February 2013, slide‐30

Page 31: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Off‐chip Data SetsOff chip Data Sets

off‐chipDRAM

DFTnk l

on‐chipSRAMDRAM kernelSRAM

• Need to balance– kernel processing bandwidth– off‐chip memory bandwidth

CMU/ECE/Hoe, February 2013, slide‐31

– on‐chip storage capacity

Page 32: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Inefficient DRAM Access Patterns• Row‐wise traversal ‐> Sequential accesses• Column‐wise traversal ‐> Large strided accesses

row‐major 2D array

lin

0

n

ear mem

Row buffersizem

 space…

CMU/ECE/Hoe, February 2013, slide‐32

nn2

Page 33: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

How to Optimize the Access Patterns p

0k2

row‐major  “blocked”

nk

linear mn

kRow buffer

sizemem

 spa

n

2

ace

CMU/ECE/Hoe, February 2013, slide‐33

n2in row‐buffersized chunks  [Akin, et al., FCCM 2012]

Page 34: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

Design Generator w/ Tensor Formalismg /

row column algorithm

row stagecolumn stage

2D row‐column algorithm

symmetric algorithm

2D-

symmetric algorithm

symmetric algorithm y gwith tiling

read tileslinearizehi

FFT processingtransposeand re‐tile

write tilescolumn‐wise

CMU/ECE/Hoe, February 2013, slide‐34

row‐wiseon‐chipand re‐tileon‐chip

column wise

[Akin, et al., FCCM 2012]

Page 35: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

2D‐FFT (double) Raw Performance( )

CMU/ECE/Hoe, February 2013, slide‐35

Problem Size[Akin, et al., FCCM 2012]

Page 36: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

2D‐FFT (double) BW Efficiency( ) y

CMU/ECE/Hoe, February 2013, slide‐36

Problem Size[Akin, et al., FCCM 2012]

Page 37: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

2D‐FFT (double) Power Efficiency( ) y

CMU/ECE/Hoe, February 2013, slide‐37

Problem Size[Akin, et al., FCCM 2012]

Page 38: SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

ConclusionsConclusions• Encapsulating domain knowledge in a domain p g gspecific tool for high‐level design automation

• SPIRAL– mathematical approach to DSP transform implementation (cores and “un”‐core)

– generalizable to other linear DSP transforms– as good as best expert designer

• Thank you

CMU/ECE/Hoe, February 2013, slide‐38