Download pdf - FFT Compiler

1

FFT Compiler:From Math to Efficient Hardware

James C. HoeDepartment of ECE

Carnegie Mellon University

joint work with Peter A. Milder, Franz Franchetti, and Markus Pueschel in the SPIRAL project

with support from NSF ACR-0234293, ITR/NGS-0325687, DARPA NBCH-105000,Intel

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 2CMU/ECE/SPIRAL/Hoe

The SPIRAL Project

♦ High performance implementations of linear DSP transforms (DFT, DCT, DWT, filters, etc) are an important class of design problems

♦ Hand design and tuning is tricky and expensive- needs both math and implementation knowledge- time-consuming and tedious - needs to repeat effort for every new context

♦ SPIRAL research goal: A flexible push-button design generation framework that produces SW and HW implementations equal or better than expert hand design

Today I focus on FFT for HW

2


Why we can do better than hand design

♦ SPIRAL is only focused on linear DSP transforms

♦ These transforms are highly structured, highly regular and very well understood mathematically

♦ Algorithmic implementations of a transform can be enumerated following a known set of rules

♦ For a given objective function and mapping target, a computer generates a solution at least as good as the best human effortby trying enough implementations


Outline

♦ SPIRAL Formula Framework

♦ SPIRAL for HW FFT cores

♦ Conclusions

3


Linear Transforms

♦ Linear transform is matrix-vector multiplication- computing by definition takes O(N2) operations- the matrix has structure

♦ E.g. discrete Fourier transform: y = DFTN ⋅ x

y0y1.yj..

yN-1

=

x0x1.xk..

xN-1

.Njkie π2−

k → 0 .. N-1

j →0 .. N

-1


“Fast” Algorithms♦ a “fast” algorithm factors the matrix into a sequence of

structured, sparse matricescheaper sparse multiplies ⇒ O(N log(N)) operations

♦ E.g. Cooley-Tukey Factorization of DFT4

♦ Matrix formula representation

⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅

−⋅⋅⋅⋅

⋅⋅−⋅⋅

⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅

−⋅⋅⋅−⋅

⋅⋅⋅⋅

=

−−−−−−

11

11

1111

1111

11

1

1111

1111

111111

111111

iii

ii

( ) ( ) mnnmn

mnnmnmn LDFTIDIDFTDFT ⋅⋅

⋅ ⊗⊗=

4


“Fast” Fourier Transform Algorithms♦ Recursively factorize by the Cooley-Tukey rule until only

leaf cases remain (e.g. DFTr for radix-r)

♦ Exponential number of alternatives

♦ Each ruletree corresponds a different algorithm♦ All cost O(N log(N))

( ) ( )( ) ( ) ( )( )( ) 8

24222

42222

8242

8242

82428

LLDFTIDIDFTIDIDFT

LDFTIDIDFTDFT

⊗⊗⊗⊗=

⊗⊗=

8DFT

2DFT 4DFT2DFT 2DFT

8DFT

4DFT 2DFT2DFT 2DFT


Formula to HW (Combinational)

♦ Given where is:- apply , then - is a permutation permute- apply , times in parallel- is a diagonal scale

A

A

B A

×4

×2

×7

×8

5


How about good HW?

♦ Matrix formulas have a natural mapping to dataflow and hence combinational datapath

♦ However, real hardware designs must fit a given resource constraint ⇒ sequential datapath that reuse available HW- identify repeated kernels- instantiate kernels under resource constraints- schedule computation to reuse instantiated kernels

We want to do the analysis, mapping and scheduling at formula level, with high-level algorithm knowledge


♦ Simple regular structure embodied in Pease FFT

♦ Example:

Regular Structure for HW

6


Pease DFT Example: DFT8

x

x

x

x

x

x

x

x

x

x

x

x

stage 1 stage 2 stage 3

(formula is applied from right to left)


Horizontal folding

♦ a baseline datapath for DFT (sans bit-reverse)♦ degree of freedom: vertical parallelism

- parameter pp

inputbypass

register

pp

7


Vertical (V-)folding according to p

latencyFineFine--grained control over cost/latency tradeoffgrained control over cost/latency tradeoff

cost


DFTgen [Milder, et al., DAC’05, FPGA’06]

functionality

controlarea/latencytradeoff

resourcemodel

8


Cost Performance Tradeoff

DFT64 DFT256 DFT1024

slices slices slices

late

ncy

(use

c)


♦ Simple regular structure embodied in formula

Structures of Interest for Parameterized Cores

♦ Product: Horizontal folding

♦ Tensor Product: parallelism or streaming vertical folding

♦ Stride permutation: rewiring or buffer/delay

9


Applicability to other transforms?

♦ DFT radix 2

♦ DFT radix 2r

♦ 2-D DFTnxn

♦ WHT

♦ DCT (type II) [ ] Hk

iik k

k

k

k

k

k

PLLLADP2

22

22

1

0

22 11 −−∏

−

=−

( )[ ]∏=

⊗1

0

2

inn

nn DFTIL

( )[ ]∏−

=−− ⊗

1

0

22222 11

k

ii

k

kkk LDFTITR

( )[ ]∏−

=−− ⊗

1/

0

22222

rk

ii

k

rkrrkk LDFTITR

( )[ ]∏−

=−− ⊗

1/

0

2222

rk

i

k

rkrrk LWHTI


Conclusions

♦ Mathematical approach to DSP transform implementation

♦ Encapsulating domain knowledge in a domain specific tool for 100% automatic design generation and optimization

♦ Generalizable to other linear DSP transforms

♦ Thank you (Arvind)