FFT Compiler: From Math to Efficient Hardware
James C. Hoe
Department of ECE
Carnegie Mellon University
joint work with Peter A. Milder, Franz Franchetti, and Markus Pueschel in the SPIRAL project
with support from NSF ACR-0234293, ITR/NGS-0325687, DARPA NBCH-105000, Intel
Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 2CMU/ECE/SPIRAL/Hoe
The SPIRAL Project
♦ High performance implementations of linear DSP transforms (DFT, DCT, DWT, filters, etc) are an important class of design problems
♦ Hand design and tuning is tricky and expensive
  - needs both math and implementation knowledge
  - time-consuming and tedious
  - needs to repeat effort for every new context
♦ SPIRAL research goal: a flexible push-button design generation framework that produces SW and HW implementations equal to or better than expert hand design
Today I focus on FFT for HW
Why we can do better than hand design
♦ SPIRAL is only focused on linear DSP transforms
♦ These transforms are highly structured, highly regular and very well understood mathematically
♦ Algorithmic implementations of a transform can be enumerated following a known set of rules
♦ For a given objective function and mapping target, a computer can generate a solution at least as good as the best human effort by trying enough implementations
Outline
♦ SPIRAL Formula Framework
♦ SPIRAL for HW FFT cores
♦ Conclusions
Linear Transforms
♦ Linear transform is matrix-vector multiplication
  - computing by definition takes O(N^2) operations
  - the matrix has structure
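♦ E.g. discrete Fourier transform: y = DFT_N ⋅ x
\[
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_j \\ \vdots \\ y_{N-1} \end{bmatrix}
=
\Big[\, e^{-2\pi i\, jk/N} \,\Big]_{j,k \,=\, 0,\dots,N-1}
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_k \\ \vdots \\ x_{N-1} \end{bmatrix}
\]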
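A minimal Python sketch of computing the DFT directly from the definition above, to make the O(N^2) cost concrete. The function name `dft_by_definition` is illustrative and not part of SPIRAL.

```python
import cmath

def dft_by_definition(x):
    """Compute y = DFT_N * x as a plain matrix-vector product.

    Entry (j, k) of the matrix is exp(-2*pi*i*j*k/N), so the double
    loop below performs N*N complex multiply-adds.
    """
    n = len(x)
    y = []
    for j in range(n):          # one output per matrix row
        acc = 0j
        for k in range(n):      # one term per matrix column
            acc += cmath.exp(-2j * cmath.pi * j * k / n) * x[k]
        y.append(acc)
    return y
```

For example, a constant input such as `[1, 1, 1, 1]` puts all of its energy into `y[0]`; the other outputs are zero up to rounding.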
“Fast” Algorithms
♦ A “fast” algorithm factors the matrix into a sequence of structured, sparse matrices
  - cheaper sparse multiplies ⇒ O(N log(N)) operations
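♦ E.g. Cooley-Tukey factorization of DFT_4
\[
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & -i & -1 & i \\
1 & -1 & 1 & -1 \\
1 & i & -1 & -i
\end{bmatrix}
=
\begin{bmatrix}
1 & \cdot & 1 & \cdot \\
\cdot & 1 & \cdot & 1 \\
1 & \cdot & -1 & \cdot \\
\cdot & 1 & \cdot & -1
\end{bmatrix}
\begin{bmatrix}
1 & \cdot & \cdot & \cdot \\
\cdot & 1 & \cdot & \cdot \\
\cdot & \cdot & 1 & \cdot \\
\cdot & \cdot & \cdot & -i
\end{bmatrix}
\begin{bmatrix}
1 & 1 & \cdot & \cdot \\
1 & -1 & \cdot & \cdot \\
\cdot & \cdot & 1 & 1 \\
\cdot & \cdot & 1 & -1
\end{bmatrix}
\begin{bmatrix}
1 & \cdot & \cdot & \cdot \\
\cdot & \cdot & 1 & \cdot \\
\cdot & 1 & \cdot & \cdot \\
\cdot & \cdot & \cdot & 1
\end{bmatrix}
\]
♦ Matrix formula representation
\[
DFT_{mn} = (DFT_m \otimes I_n) \cdot D^{mn}_n \cdot (I_m \otimes DFT_n) \cdot L^{mn}_m
\]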
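The Cooley-Tukey factorization can be checked numerically. The sketch below builds the four factors as dense matrices in plain Python and compares their product with the DFT matrix. The helper names are illustrative, and two conventions are assumptions of this sketch: the stride permutation reads its input at stride m, and the diagonal holds the twiddle factors ω_{mn}^{ij} at position i·n + j.

```python
import cmath

def omega(n):
    """Primitive n-th root of unity, exp(-2*pi*i/n)."""
    return cmath.exp(-2j * cmath.pi / n)

def dft_matrix(n):
    w = omega(n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]

def identity(n):
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]

def kron(a, b):
    """Tensor (Kronecker) product of two matrices."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def matmul(a, b):
    return [[sum(a[r][t] * b[t][c] for t in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def twiddle(m, n):
    """Diagonal matrix with omega_{mn}^(i*j) at position i*n + j (assumed layout)."""
    w = omega(m * n)
    d = [w ** (i * j) for i in range(m) for j in range(n)]
    size = m * n
    return [[d[r] if r == c else 0 for c in range(size)] for r in range(size)]

def stride_perm(m, n):
    """Stride permutation: output[i*n + j] = input[j*m + i] (assumed convention)."""
    size = m * n
    p = [[0] * size for _ in range(size)]
    for i in range(m):
        for j in range(n):
            p[i * n + j][j * m + i] = 1
    return p

def cooley_tukey(m, n):
    """(DFT_m (x) I_n) * D * (I_m (x) DFT_n) * L as one dense matrix."""
    f = matmul(kron(dft_matrix(m), identity(n)), twiddle(m, n))
    f = matmul(f, kron(identity(m), dft_matrix(n)))
    return matmul(f, stride_perm(m, n))

def max_abs_diff(a, b):
    return max(abs(a[r][c] - b[r][c])
               for r in range(len(a)) for c in range(len(a[0])))
```

With these helpers, `max_abs_diff(cooley_tukey(2, 2), dft_matrix(4))` is zero up to rounding, and the same holds for other splits such as `(2, 4)` or `(3, 5)`.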
“Fast” Fourier Transform Algorithms
♦ Recursively factorize by the Cooley-Tukey rule until only leaf cases remain (e.g. DFT_r for radix-r)
♦ Exponential number of alternatives
♦ Each ruletree corresponds to a different algorithm
♦ All cost O(N log(N))
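\[
\begin{aligned}
DFT_8 &= (DFT_2 \otimes I_4)\, D^8_4\, (I_2 \otimes DFT_4)\, L^8_2 \\
      &= (DFT_2 \otimes I_4)\, D^8_4\, \big(I_2 \otimes \big((DFT_2 \otimes I_2)\, D^4_2\, (I_2 \otimes DFT_2)\, L^4_2\big)\big)\, L^8_2
\end{aligned}
\]
Ruletrees for DFT_8:
DFT_8 → (DFT_2, DFT_4 → (DFT_2, DFT_2))
DFT_8 → (DFT_4 → (DFT_2, DFT_2), DFT_2)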
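To illustrate how quickly the space of ruletrees grows, the sketch below enumerates every Cooley-Tukey ruletree for a power-of-two DFT. It assumes DFT_2 as the only leaf case; the tuple representation and the function name `ruletrees` are illustrative and are not SPIRAL's internal data structures.

```python
def ruletrees(n):
    """Return all Cooley-Tukey ruletrees for DFT_n, n a power of two >= 2.

    A leaf is the string "DFT2"; an inner node is a tuple
    (name, left_subtree, right_subtree) for the split n = m * (n // m).
    """
    if n == 2:
        return ["DFT2"]
    trees = []
    m = 2
    while m < n:                      # try every split n = m * (n / m)
        for left in ruletrees(m):
            for right in ruletrees(n // m):
                trees.append(("DFT%d" % n, left, right))
        m *= 2
    return trees
```

`ruletrees(8)` returns the two trees shown on the slide. The counts for N = 4, 8, 16, 32 are 1, 2, 5, 14.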
Formula to HW (Combinational)
♦ Given y = F ⋅ x, where F is:
  - A ⋅ B: apply B, then A
  - L, a permutation: permute
  - I_n ⊗ A: apply A, n times in parallel
  - D, a diagonal: scale
[Figure: example combinational datapath built from blocks A and B, with constant multipliers ×4, ×2, ×7, ×8]
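The mapping from formula constructs to dataflow can be mimicked in software. In the sketch below each construct (product, permutation, parallel copies of a block, diagonal scaling) becomes a function on a list of values, and DFT_4 is assembled from them. All names are illustrative, and DFT_2 ⊗ I_2 is expressed as L (I_2 ⊗ DFT_2) L, which is an identity chosen for this sketch rather than something stated on the slide.

```python
def chain(*stages):
    """Product of factors, written left to right and applied right to left."""
    def run(x):
        for stage in reversed(stages):
            x = stage(x)
        return x
    return run

def permute(perm):
    """Permutation as pure wiring: output[i] = input[perm[i]]."""
    return lambda x: [x[p] for p in perm]

def parallel(count, block, width):
    """I_count (x) A: apply block A to `count` consecutive chunks of `width` values."""
    def run(x):
        out = []
        for c in range(count):
            out.extend(block(x[c * width:(c + 1) * width]))
        return out
    return run

def scale(diagonal):
    """Diagonal matrix: multiply each value by its own constant."""
    return lambda x: [d * v for d, v in zip(diagonal, x)]

def butterfly(x):
    """DFT_2 on two values."""
    return [x[0] + x[1], x[0] - x[1]]

l_4_2 = permute([0, 2, 1, 3])          # stride permutation on 4 points
two_butterflies = parallel(2, butterfly, 2)

# DFT_4 = (DFT_2 (x) I_2) * D * (I_2 (x) DFT_2) * L,
# with DFT_2 (x) I_2 written as L * (I_2 (x) DFT_2) * L.
dft4 = chain(l_4_2, two_butterflies, l_4_2,
             scale([1, 1, 1, -1j]), two_butterflies, l_4_2)
```

Calling `dft4([1, 2, 3, 4])` yields `[10, -2+2j, -2, -2-2j]`, the 4-point DFT of the input under the e^{-2πi jk/N} convention.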
How about good HW?
♦ Matrix formulas have a natural mapping to dataflow and hence combinational datapath
♦ However, real hardware designs must fit a given resource constraint ⇒ sequential datapath that reuses available HW
  - identify repeated kernels
  - instantiate kernels under resource constraints
  - schedule computation to reuse instantiated kernels
We want to do the analysis, mapping and scheduling at formula level, with high-level algorithm knowledge
Regular Structure for HW
♦ Simple regular structure embodied in Pease FFT
♦ Example:
Pease DFT Example: DFT8
[Figure: DFT8 dataflow drawn as three stages: stage 1, stage 2, stage 3]
(formula is applied from right to left)
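A software sketch of a Pease-style (constant-geometry) radix-2 FFT, in which every stage reads, combines, and writes data with the same wiring and only the twiddle constants change. It assumes a decimation-in-frequency arrangement with the bit reversal applied at the output; the slide's exact stage formula may differ, and the function names are illustrative.

```python
import cmath

def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def pease_fft(x):
    """Constant-geometry radix-2 FFT; every stage uses identical wiring."""
    n = len(x)
    k = n.bit_length() - 1
    if n < 2 or n != 1 << k:
        raise ValueError("length must be a power of two, at least 2")
    half = n // 2
    data = list(x)
    for stage in range(k):
        nxt = [0j] * n
        for j in range(half):
            # Same access pattern in every stage: read j and j + N/2,
            # write the butterfly outputs to 2j and 2j + 1.
            a, b = data[j], data[j + half]
            exponent = (j >> stage) << stage      # stage-dependent twiddle
            w = cmath.exp(-2j * cmath.pi * exponent / n)
            nxt[2 * j] = a + b
            nxt[2 * j + 1] = (a - b) * w
        data = nxt
    # Results come out in bit-reversed order; undo that at the end.
    return [data[bit_reverse(i, k)] for i in range(n)]

def dft_reference(x):
    """O(N^2) DFT by definition, used only to check the result."""
    n = len(x)
    return [sum(cmath.exp(-2j * cmath.pi * j * t / n) * x[t] for t in range(n))
            for j in range(n)]
```

Because the loop body is the same in every stage, one stage's worth of hardware can in principle be reused k times, which is the property the folding slides rely on.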
Horizontal folding
♦ a baseline datapath for DFT (sans bit-reverse)
♦ degree of freedom: vertical parallelism
  - parameter p
[Figure: horizontally folded datapath of width p, with input, bypass, and register]
Vertical (V-)folding according to p
Fine-grained control over cost/latency tradeoff
[Figure: latency versus cost]
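A toy model of the cost/latency tradeoff, under assumptions that are not taken from the slides: p is read as the number of DFT_2 butterflies instantiated, one stage is reused for all k stages (horizontal folding), and pipeline depth, permutation buffers, and twiddle storage are ignored. The function name `folding_tradeoff` is hypothetical.

```python
def folding_tradeoff(k):
    """List (p, cycles) pairs for DFT of size N = 2**k.

    Assumed model: each of the k stages needs N/2 butterflies' worth of
    work, and p butterflies run per cycle, so cycles = k * (N/2) / p.
    """
    n = 1 << k
    points = []
    p = 1
    while p <= n // 2:
        cycles = k * (n // 2) // p
        points.append((p, cycles))
        p *= 2
    return points
```

For DFT_8 this gives `[(1, 12), (2, 6), (4, 3)]`: doubling the instantiated butterflies halves the cycle count in this simplified model.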
DFTgen [Milder, et al., DAC’05, FPGA’06]
[Figure: DFTgen with labels: functionality, control area/latency tradeoff, resource model]
Cost Performance Tradeoff
[Figure: latency (usec) versus slices, one plot each for DFT64, DFT256, and DFT1024]
Structures of Interest for Parameterized Cores
♦ Simple regular structure embodied in formula
♦ Product: horizontal folding
♦ Tensor product: parallelism or streaming (vertical folding)
♦ Stride permutation: rewiring or buffer/delay
Applicability to other transforms?
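♦ DFT radix 2
\[
R_{2^k} \prod_{i=0}^{k-1} \left[ T_i \left( I_{2^{k-1}} \otimes DFT_2 \right) L^{2^k}_{2^{k-1}} \right]
\]
♦ DFT radix 2^r
\[
R_{2^k} \prod_{i=0}^{k/r-1} \left[ T_i \left( I_{2^{k-r}} \otimes DFT_{2^r} \right) L^{2^k}_{2^{k-r}} \right]
\]
♦ 2-D DFT_{n×n}
\[
\prod_{i=0}^{1} \left[ \left( I_n \otimes DFT_n \right) L^{n^2}_{n} \right]
\]
♦ WHT
\[
\prod_{i=0}^{k/r-1} \left[ \left( I_{2^{k-r}} \otimes WHT_{2^r} \right) L^{2^k}_{2^{k-r}} \right]
\]
♦ DCT (type II)
[formula not recoverable from the extracted slide text]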
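As a sketch of how the same regular structure carries over to another transform, the code below computes a Walsh-Hadamard transform by applying one fixed stage k times. It assumes the radix-2 constant-geometry form, in which each stage is a stride permutation followed by parallel WHT_2 butterflies, with no twiddle factors and no bit reversal. The function names are illustrative.

```python
def wht_stage(x):
    """One constant-geometry stage: pair x[j] with x[j + N/2], write to 2j, 2j + 1."""
    n = len(x)
    half = n // 2
    y = [0] * n
    for j in range(half):
        y[2 * j] = x[j] + x[j + half]
        y[2 * j + 1] = x[j] - x[j + half]
    return y

def wht_constant_geometry(x):
    """WHT of a power-of-two-length input by repeating the same stage k times."""
    k = len(x).bit_length() - 1
    if len(x) != 1 << k:
        raise ValueError("length must be a power of two")
    for _ in range(k):
        x = wht_stage(x)
    return x

def wht_by_definition(x):
    """Reference WHT: entry (r, c) of the matrix is (-1)**popcount(r & c)."""
    n = len(x)
    return [sum((-1) ** bin(r & c).count("1") * x[c] for c in range(n))
            for r in range(n)]
```

For the input `[1, 2, 3, 4]` both functions return `[10, -2, -4, 0]`.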
Conclusions
♦ Mathematical approach to DSP transform implementation
♦ Encapsulating domain knowledge in a domain specific tool for 100% automatic design generation and optimization
♦ Generalizable to other linear DSP transforms
♦ Thank you (Arvind)