Efficient FFTs On VIRAM Randi Thomas and Katherine Yelick Computer Science Division University of California, Berkeley IRAM Winter 2000 Retreat {randit, yelick} @cs.berkeley.edu


Page 1: Efficient FFTs On VIRAM

Efficient FFTs On VIRAM

Randi Thomas and Katherine Yelick

Computer Science Division
University of California, Berkeley

IRAM Winter 2000 Retreat

{randit, yelick} @cs.berkeley.edu

Page 2: Efficient FFTs On VIRAM

Outline

• What is the FFT and Why Study it?

• VIRAM Implementation Assumptions

• About the FFT

• The “Naïve” Algorithm

• 3 Optimizations to the “Naïve” Algorithm

• 32 bit Floating Point Performance Results

• 16 bit Fixed Point Performance Results

• Conclusions and Future Work

Page 3: Efficient FFTs On VIRAM

What is the FFT?

The Fast Fourier Transform converts

a time-domain function

into

a frequency spectrum

Page 4: Efficient FFTs On VIRAM

Why Study The FFT?

• 1D Fast Fourier Transforms (FFTs) are:

– Critical for many signal processing problems

– Used widely for filtering in Multimedia Applications

» Image Processing

» Speech Recognition

» Audio & video

» Graphics

– Important in many Scientific Applications

– The building block for 2D/3D FFTs

All of these are VIRAM target applications!

Page 5: Efficient FFTs On VIRAM

Outline

• What is the FFT and Why Study it?

• VIRAM Implementation Assumptions

• About the FFT

• The “Naïve” Algorithm

• 3 Optimizations to the “Naïve” Algorithm

• 32 bit Floating Point Performance Results

• 16 bit Fixed Point Performance Results

• Conclusions and Future Work

Page 6: Efficient FFTs On VIRAM

VIRAM Implementation Assumptions

• System on the chip:

– Scalar processor: 200 MHz "vanilla" MIPS core

– Embedded DRAM: 32 MB, 16 banks, no subbanks

– Memory crossbar: 25.6 GB/s

– Vector processor: 200 MHz

– I/O: 4 x 100 MB/sec

Page 7: Efficient FFTs On VIRAM

VIRAM Implementation Assumptions

• Vector processor has four 64-bit pipelines (lanes)

– Each lane has:
» 2 integer functional units
» 1 floating point functional unit

– All functional units have a 1-cycle multiply-add operation

– Each lane can be subdivided into:
» two 32-bit virtual lanes
» four 16-bit virtual lanes

64-bit mode:  LANE 1 | LANE 2 | LANE 3 | LANE 4          (four 64-bit lanes)
32-bit mode:  VL 1 | VL 2 | ... | VL 8                   (eight 32-bit virtual lanes)
16-bit mode:  VL 1 | VL 2 | ... | VL 16                  (sixteen 16-bit virtual lanes)
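The lane subdivision above determines how many elements are processed per cycle; as a minimal sketch (plain Python, assuming the 4 physical lanes and 64-bit lane width from these slides):

```python
def virtual_lanes(data_bits, lanes=4, lane_bits=64):
    """Virtual lanes available when each 64-bit physical lane is split
    down to the working data width (4 physical lanes, per the slides)."""
    return lanes * (lane_bits // data_bits)

# 64-bit data -> 4 lanes; 32-bit data -> 8 virtual lanes; 16-bit -> 16
```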

Page 8: Efficient FFTs On VIRAM

Peak Performance

• Peak performance of this VIRAM implementation:

                                 32-bit Single Precision | 32-bit Integer | 16-bit Integer
Ops/cycle (all multiply-adds):   16 floating point       | 32 integer     | 64 integer
Ops/cycle (no multiply-adds):     8 floating point       | 16 integer     | 32 integer
Peak (all multiply-adds):        3.2 GFLOP/s             | 6.4 GOP/s      | 12.8 GOP/s
Peak (no multiply-adds):         1.6 GFLOP/s             | 3.2 GOP/s      | 6.4 GOP/s

• Implemented:
– A 32 bit floating point version (8 lanes, 8 FUs)
– A 16 bit fixed point version (16 lanes, 32 FUs)
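The peak numbers in the table follow from functional units x virtual lanes x clock; a sketch of that arithmetic (an assumed model: each functional unit retires one operation per cycle, or two when every operation is a multiply-add):

```python
CLOCK_HZ = 200e6  # 200 MHz vector processor, per the assumptions slide

def peak_ops_per_sec(fus_per_lane, vlanes, all_multiply_adds):
    """Peak ops/s: every functional unit in every virtual lane retires one
    op per cycle, or two when all operations are multiply-adds."""
    per_cycle = fus_per_lane * vlanes * (2 if all_multiply_adds else 1)
    return per_cycle * CLOCK_HZ

# 32-bit FP: 1 FP FU x 8 virtual lanes -> 1.6 GFLOP/s, 3.2 GFLOP/s with madds
# 16-bit int: 2 int FUs x 16 virtual lanes -> 6.4 GOP/s, 12.8 GOP/s with madds
```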

Page 9: Efficient FFTs On VIRAM

Outline

• What is the FFT and Why Study it?

• VIRAM Implementation Assumptions

• About the FFT

• The “Naïve” Algorithm

• 3 Optimizations to the “Naïve” Algorithm

• 32 bit Floating Point Performance Results

• 16 bit Fixed Point Performance Results

• Conclusions and Future Work

Page 10: Efficient FFTs On VIRAM

Computing the DFT (Discrete FT)

• Given the N-element vector x, its 1D DFT is another N-element vector y, given by the formula:

y_j = sum_{k=0}^{N-1} ω_N^{jk} · x_k,   j = 0, 1, ..., (N-1)

– where ω_N^{jk} = e^{2πijk/N} is the jk-th root of unity

– N is referred to as the number of points

• The FFT (Fast FT)
– Uses algebraic identities to compute the DFT in O(N log N) steps
– The computation is organized into log2 N stages
» for the radix-2 FFT
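Evaluating the formula above directly costs O(N^2); a minimal sketch in Python, using the slide's root-of-unity convention ω_N^{jk} = e^{2πijk/N}:

```python
import cmath

def dft(x):
    """Direct O(N^2) evaluation of y_j = sum_k w_N^{jk} * x_k,
    with w_N^{jk} = e^{2*pi*i*j*k/N} as on the slide."""
    n = len(x)
    return [sum(x[k] * cmath.exp(2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]
```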

Page 11: Efficient FFTs On VIRAM

Computing A Complex FFT

• Basic computation for a radix-2 FFT:

X0'   = X0 + ω·XN/2
XN/2' = X0 - ω·XN/2

– Xi are the data points
– ω is a "root of unity"

• The basic computation on VIRAM for floating point data points:
– 2 multiply-adds + 2 multiplies + 4 adds =
– 8 operations

• 2 GFLOP/s is the VIRAM peak performance for this mix of instructions
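The butterfly above in code form (a sketch; the single complex product ω·XN/2 is shared by both outputs, which is what yields the slide's count of 2 multiply-adds + 2 multiplies + 4 adds = 8 real operations):

```python
def butterfly(x0, xh, w):
    """Radix-2 butterfly: X0' = X0 + w*XN/2, XN/2' = X0 - w*XN/2.
    The complex product t = w*xh is computed once and reused, so on
    VIRAM the pair costs 2 multiply-adds + 2 multiplies + 4 adds."""
    t = w * xh
    return x0 + t, x0 - t
```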

Page 12: Efficient FFTs On VIRAM

Vector Terminology

• The Maximum Vector Length (MVL):
– The maximum # of elements 1 vector register can hold
– Set automatically by the architecture
– Based on the data width the algorithm is using:
» 64-bit data: MVL = 32 elements/vector register
» 32-bit data: MVL = 64 elements/vector register
» 16-bit data: MVL = 128 elements/vector register

• The Vector Length (VL):
– The total number of elements to be computed
– Set by the algorithm: the inner for-loop

Page 13: Efficient FFTs On VIRAM

One More (FFT) Term!

• A butterfly group (BG):

– A set of elements that can be computed upon in 1 FFT stage using:

» The same basic computation

AND

» The same root of unity

– The number of elements in a stage’s BG determines the Vector Length (VL) for that stage

Page 14: Efficient FFTs On VIRAM

Outline

• What is the FFT and Why Study it?

• VIRAM Implementation Assumptions

• About the FFT

• The “Naïve” Algorithm

• 3 Optimizations to the “Naïve” Algorithm

• 32 bit Floating Point Performance Results

• 16 bit Fixed Point Performance Results

• Conclusions and Future Work

Page 15: Efficient FFTs On VIRAM

Cooley-Tukey FFT Algorithm

[Diagram: a 16-element Cooley-Tukey FFT dataflow across two vector registers, shown over time; Stage 1 has VL = 8, Stage 2 VL = 4, Stage 3 VL = 2, Stage 4 VL = 1]

vr1 + vr2 = 1 butterfly group; VL = vector length

Page 16: Efficient FFTs On VIRAM

Vectorizing the FFT

– Diagram illustrates "naïve" vectorization

– A stage vectorizes well when VL ≥ MVL

– Poor HW utilization when VL is small (< MVL)

– Later stages of the FFT have shorter vector lengths:

» the # of elements in one butterfly group is smaller in the later stages

[Diagram: same dataflow as the previous slide: Stage 1 VL = 8, Stage 2 VL = 4, Stage 3 VL = 2, Stage 4 VL = 1, over time]
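The stage structure in the diagram can be sketched as an iterative radix-2 FFT in plain Python. This uses the Gentleman-Sande (decimation-in-frequency) butterfly, which matches the slides' in-order input, halving vector lengths, and bit-reversed results; the inner loop over k is the vectorizable part, and its trip count is the stage's VL. Sign convention follows the earlier ω_N = e^{2πi/N}.

```python
import cmath

def fft(x):
    """Iterative radix-2 FFT of a power-of-two-length input.
    The butterfly-group size (and hence the natural vector length VL)
    halves each stage; results emerge bit-reversed and are reordered
    at the end, like the slides' bit reversal of the results."""
    n = len(x)
    x = [complex(v) for v in x]
    half = n // 2                      # stage 1: VL = n/2
    while half >= 1:
        w1 = cmath.exp(2j * cmath.pi / (2 * half))
        for start in range(0, n, 2 * half):   # loop over butterfly groups
            for k in range(half):             # vectorizable inner loop: VL = half
                a = x[start + k]
                b = x[start + k + half]
                x[start + k] = a + b
                x[start + k + half] = (a - b) * w1 ** k
        half //= 2                     # VL halves: the short-vector problem
    # undo the bit-reversed output ordering
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        out[int(format(i, '0%db' % bits)[::-1], 2)] = x[i]
    return out
```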

Page 17: Efficient FFTs On VIRAM

[Chart: MFLOPS (0-2200) per FFT stage (1-10) for 128-, 256-, 512-, and 1024-point FFTs; IRAM peak performance (2000 MFLOPS) marked; early stages run at VL = 64 = MVL, later stages fall to VL = 8 = #lanes]

Naïve Algorithm: What Happens When Vector Lengths Get Short?

• Performance peaks (1.4-1.8 GFLOP/s) if vector lengths are ≥ MVL
• For all FFT sizes, 94% to 99% of the total time is spent doing the last 6 stages, when VL < MVL (= 64)
– For the 1024 point FFT, only 60% of the work is done in the last 6 stages
• Performance drops significantly when vector lengths ≤ # lanes (= 8)

32 bit Floating Point


Page 18: Efficient FFTs On VIRAM

Outline

• What is the FFT and Why Study it?

• VIRAM Implementation Assumptions

• About the FFT

• The “Naïve” Algorithm

• 3 Optimizations to the “Naïve” Algorithm

• 32 bit Floating Point Performance Results

• 16 bit Fixed Point Performance Results

• Conclusions and Future Work

Page 19: Efficient FFTs On VIRAM

Optimization #1: Add auto-increment

• Automatically adds an increment to the current address in order to obtain the next address

• Auto-increment helps to:
– Reduce the scalar code overhead

• Useful:
– To jump to the next butterfly group in an FFT stage
– For processing a sub-image of a larger image, in order to jump to the appropriate pixel in the next row

Page 20: Efficient FFTs On VIRAM

[Chart: MFLOPS (0-250) vs. FFT size (4-1024 points) for the naïve algorithm with and without auto-increment]

Optimization #1: Add auto-increment

– Small gain from auto-increment

» For the 1024 point FFT:
• 202 MFLOP/s without auto-increment
• 225 MFLOP/s with auto-increment

– Still, 94-99% of the time is spent in the last 6 stages, where VL < 64

– Conclusion: auto-increment helps, but scalar overhead is not the main source of the inefficiency

32 bit Floating Point

Page 21: Efficient FFTs On VIRAM

Optimization #2: Memory Transposes

• Reorganize the data layout in memory to maximize the vector length in later FFT stages
– View the 1D vector as a 2D matrix
– Reorganization is equivalent to a matrix transpose

• Transposing the data in memory only works for N ≥ (2 * MVL)

• Transposing in memory adds significant overhead
– Increased memory traffic
» cost too high to make it worthwhile
– Multiple transposes exacerbate the situation:

FFT Sizes   | Number of Transposes Needed
> 2048      | 1
512 - 2048  | 2
256         | 3
128         | 5

Page 22: Efficient FFTs On VIRAM

Optimization #3: Register Transposes

• Rearrange the elements in the vector registers
– Provides a way to swap elements between 2 registers
– What we want to swap (after stage 1, VL = MVL = 8):

Stage 2 (SWAP):
vr1: 0 1 2 3 4 5 6 7
vr2: 8 9 10 11 12 13 14 15

– This behavior is hard to implement with one instruction in hardware

Stage 3 (SWAP; VL = 4, BGs = 2):
vr1: 0 1 2 3 8 9 10 11
vr2: 4 5 6 7 12 13 14 15

Stage 4 (SWAP; VL = 2, BGs = 4):
vr1: 0 1 4 5 8 9 12 13
vr2: 2 3 6 7 10 11 14 15
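The swap pattern in the three diagrams can be modeled in plain Python (a sketch: within each window of 2*block elements, vr1's upper block trades with vr2's lower block; block = 4 for stage 2, 2 for stage 3, 1 for stage 4):

```python
def stage_swap(vr1, vr2, block):
    """Trade vr1's high block with vr2's low block in every 2*block window."""
    n1, n2 = list(vr1), list(vr2)
    for j in range(0, len(vr1), 2 * block):
        n1[j + block:j + 2 * block] = vr2[j:j + block]
        n2[j:j + block] = vr1[j + block:j + 2 * block]
    return n1, n2

# Applying stage_swap with block = 4 and then 2 reproduces the register
# contents shown for stages 2 and 3.
```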

Page 23: Efficient FFTs On VIRAM

Optimization #3: Register Transposes

• Two instructions were added to the VIRAM Instruction Set Architecture (ISA):
– vhalfup and vhalfdn: both move elements one-way between vector registers

• Vhalfup/dn:
– Are extensions of already existing ISA support for fast in-register reductions
– Required minimal additional hardware support
» mostly control lines
– Much simpler and less costly than a general element permutation instruction
» Rejected in the early VIRAM design phase
– An elegant, inexpensive, powerful solution to the short vector length problem of the later stages of the FFT

Page 24: Efficient FFTs On VIRAM

Optimization #3: Register Transposes

Stage 1:
vr1: 0 1 2 3 4 5 6 7
vr2: 8 9 10 11 12 13 14 15

• Three steps to swap elements:
– move: copy vr1 into vr3            (vr3: 0 1 2 3 4 5 6 7)
– vhalfup: move vr2's low half into vr1's high half   (vr1: 0 1 2 3 8 9 10 11)
» vr1 now done
– vhalfdn: move vr3's high half into vr2's low half   (vr2: 4 5 6 7 12 13 14 15)
» vr2 now done
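The three-step swap can be modeled with list slices (a sketch at half-register granularity; the real instructions presumably also support the smaller distances needed in later stages, which this model omits):

```python
def vhalfup(dst, src):
    """Model of vhalfup: src's low half moves into dst's high half."""
    h = len(dst) // 2
    return dst[:h] + src[:h]

def vhalfdn(dst, src):
    """Model of vhalfdn: src's high half moves into dst's low half."""
    h = len(dst) // 2
    return src[h:] + dst[h:]

def three_step_swap(vr1, vr2):
    """move + vhalfup + vhalfdn, as on the slide."""
    vr3 = list(vr1)             # step 1: copy vr1 into vr3
    vr1 = vhalfup(vr1, vr2)     # step 2: vr1 now done
    vr2 = vhalfdn(vr2, vr3)     # step 3: vr2 now done
    return vr1, vr2
```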

Page 25: Efficient FFTs On VIRAM

Optimization #3: Final Algorithm

• The optimized algorithm has two phases:
– The naïve algorithm is used for stages whose VL ≥ MVL
– Vhalfup/dn code is used on:
» Stages whose VL < MVL = the last log2(MVL) stages

• Vhalfup/dn:
– Eliminates the short vector length problem
» Allows all vector computations to have VL equal to MVL
• Multiple butterfly groups done with 1 basic operation
– Eliminates all loads/stores between these stages

• The optimized vhalf algorithm does:
– Auto-increment, software pipelining, code scheduling
– The bit reversal rearrangements of the results
– Single precision, floating point, complex, radix-2 FFTs

Page 26: Efficient FFTs On VIRAM

[Chart: MFLOPS (0-2200) per FFT stage (1-10) for 128-, 256-, 512-, and 1024-point FFTs with the register-transpose algorithm; IRAM peak performance (2000 MFLOPS) marked]

• Every vector instruction operates with VL = MVL
– For all stages
– Keeps the vector pipeline fully utilized

• Time spent in the last 6 stages drops to 60% to 80% of the total time

Optimization #3: Register Transposes

32 bit Floating Point

Page 27: Efficient FFTs On VIRAM

Outline

• What is the FFT and Why Study it?

• VIRAM Implementation Assumptions

• About the FFT

• The “Naïve” Algorithm

• 3 Optimizations to the “Naïve” Algorithm

• 32 bit Floating Point Performance Results

• 16 bit Fixed Point Performance Results

• Conclusions and Future Work

Page 28: Efficient FFTs On VIRAM

[Chart: execution time (microseconds, 0-250) vs. FFT size (128-1024 points) for Naive, Naive without bit reversal, and Vhalfup/dn]

• Both naïve versions utilize the auto-increment feature
– 1 does bit reversal, the other does not

• Vhalfup/dn with and without bit reversal are identical
• Bit reversing the results slows the naïve algorithm, but not vhalfup/dn

Performance Results

32 bit Floating Point

Page 29: Efficient FFTs On VIRAM

[Chart: same execution-time comparison as the previous slide: time (microseconds) vs. FFT size for Naive, Naive without bit reversal, and Vhalfup/dn]

• The performance gap testifies:
– To the effectiveness of the vhalfup/dn algorithm in fully utilizing the vector unit
– To the importance of the new vhalfup/dn instructions

Performance Results

32 bit Floating Point

Page 30: Efficient FFTs On VIRAM

[Chart: the same time-vs-size comparison, annotated with 1024-point FFT times:
TMS320C67x: 124 us
TigerSHARC: 41 us
CRI Pathfinder-1: 22.3 us
CRI Pulsar: 27.9 us
Wildstar: 25 us
PPC604e: 87 us
Pentium/200: 151 us
VIRAM: 37 us]

• VIRAM is competitive with high-end specialized floating point DSPs
– Could match or exceed the performance of these DSPs if the VIRAM architecture were implemented commercially

Performance Results

32 bit Floating Point

Page 31: Efficient FFTs On VIRAM

Outline

• What is the FFT and Why Study it?

• VIRAM Implementation Assumptions

• About the FFT

• The “Naïve” Algorithm

• 3 Optimizations to the “Naïve” Algorithm

• 32 bit Floating Point Performance Results

• 16 bit Fixed Point Performance Results

• Conclusions and Future Work

Page 32: Efficient FFTs On VIRAM

16 bit Fixed Point Implementation

• Resources:
– 16 lanes (each 16 bits wide)
» Two integer functional units per lane
» 32 operations/cycle
– MVL = 128 elements

• Fixed point multiply-add not utilized:

– 8 bit operands are too small
» 8 bits * 8 bits = 16 bit product

– 32 bit product is too big
» 16 bits * 16 bits = 32 bit product

Page 33: Efficient FFTs On VIRAM

16 bit Fixed Point Implementation (2)

• The basic computation takes:
– 4 multiplies + 4 adds + 2 subtracts = 10 operations
– 6.4 GOP/s is the peak performance for this mix

• To prevent overflow, two bits are shifted right (and lost) at each stage:

Input:  Sbbb bbbb bbbb bbbb.
Output: Sbbb bbbb bbbb bbbb bb.     (the last two bits are shifted out)
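A sketch of the 16-bit arithmetic described above: the full product of two 16-bit operands needs 32 bits (which is why the fixed-point multiply-add goes unused), and two bits are shifted off per stage to prevent overflow. The Q15 scaling convention here is an illustrative assumption.

```python
def q15_mul(a, b):
    """16x16 -> 32-bit product, scaled back to 16 bits (Q15 convention,
    an assumed scaling): this widening is why VIRAM's fixed-point
    multiply-add cannot be used with 16-bit operands."""
    return (a * b) >> 15

def scale_stage(vals):
    """Shift two bits off every value after a butterfly stage, as the
    slide describes; two bits of precision are lost per stage."""
    return [v >> 2 for v in vals]

# q15_mul(16384, 16384) -> 8192   (0.5 * 0.5 = 0.25 in Q15)
```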

Page 34: Efficient FFTs On VIRAM

[Chart: execution time (microseconds, 0-200) vs. FFT size (128-1024 points) for the 16 bit fixed point and 32 bit floating point implementations, annotated with 1024-point reference times:
TMS320C67x: 124 us
TigerSHARC: 41 us
CRI Pathfinder-1: 22.3 us
CRI Pulsar: 27.9 us
Wildstar: 25 us
PPC604e: 87 us
Pentium/200: 151 us
VIRAM: 37 us]

• Fixed point is faster than floating point on VIRAM
– 1024 pt = 28.3 us versus 37 us

• This implementation attains 4 GOP/s for the 1024 pt FFT and is:
– An unoptimized work in progress!

Performance Results

16 bit Fixed Point

Page 35: Efficient FFTs On VIRAM

[Chart: execution time (microseconds, 0-50) vs. FFT size (128-1024 points) for the 16 bit fixed point and 32 bit floating point implementations, annotated with reference times:
TigerSHARC: 4.4 us (fixed pt.)
Pentium III (400 MHz): 4.64 us (16 bit int)
CRI Pathfinder-1: 22.3 us
CRI Pulsar: 27.9 us
Wildstar: 25 us
VIRAM-FP: 37 us
TigerSHARC: 41 us (floating pt.)]

• Again, VIRAM is competitive with high-end specialized DSPs

– CRI Scorpio, a 24 bit complex fixed point FFT DSP:
» 1024 pt = 7 microseconds

Performance Results

16 bit Fixed Point

Page 36: Efficient FFTs On VIRAM

Outline

• What is the FFT and Why Study it?

• VIRAM Implementation Assumptions

• About the FFT

• The “Naïve” Algorithm

• 3 Optimizations to the “Naïve” Algorithm

• 32 bit Floating Point Performance Results

• 16 bit Fixed Point Performance Results

• Conclusions and Future Work

Page 37: Efficient FFTs On VIRAM

Conclusions

• Optimizations to eliminate short vector lengths are necessary for doing the FFT

• VIRAM is capable of performing FFTs at performance levels comparable to or exceeding those of high-end floating point DSPs. It achieves this performance via:
– A highly tuned algorithm designed specifically for VIRAM
– A set of simple, powerful ISA extensions that underlie it
– The efficient parallelism of vector processing embedded in a high-bandwidth on-chip DRAM memory

Page 38: Efficient FFTs On VIRAM

Conclusions (2)

• Performance of FFTs on VIRAM has the potential to improve significantly over the results presented here:

– 32-bit fixed point FFTs could run up to 2 times faster than the floating point versions

– Compared to 32-bit fixed point FFTs, 16-bit fixed point FFTs could run up to:
» 8x faster (with multiply-add ops)
» 4x faster (with no multiply-add ops)

– Adding a second floating point functional unit would make floating point performance comparable to the 32-bit fixed point performance

– 4 GOP/s for the unoptimized fixed point implementation (6.4 GOP/s is peak!)

Page 39: Efficient FFTs On VIRAM

Conclusions (3)

• Since VIRAM includes both general-purpose CPU capability and DSP muscle, it shares the same space in the emerging market of hybrid CPU/DSPs as:
– Infineon TriCore
– Hitachi SuperH-DSP
– Motorola/Lucent StarCore
– Motorola PowerPC G4 (7400)

• VIRAM's vector processor plus embedded DRAM design may have further advantages over more traditional processors in:
– Power
– Area
– Performance

Page 40: Efficient FFTs On VIRAM

Future Work

• On the current fixed point implementation:
– Further optimizations and tests

• Explore the tradeoffs between precision & accuracy and performance by implementing:

– A hybrid of the current implementation which alternates the number of bits shifted off each stage
» 2 1 1 1 2 1 1 1 ...

– A 32 bit integer version which uses 16 bit data
» If the data occupies the 16 most significant bits of the 32 bits, then there are 16 zeros to shift off:

Sbbb bbbb bbbb bbbb 0000 0000 0000 0000
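The 16-bit-data-in-32-bit-words idea can be sketched directly: placing each sample in the top half of a 32-bit word leaves 16 zero bits for the per-stage shifts to consume before any of the original precision is lost.

```python
def widen16(v):
    """Place a 16-bit sample pattern in the high half of a 32-bit word:
    Sbbb bbbb bbbb bbbb 0000 0000 0000 0000 (the slide's layout)."""
    return (v & 0xFFFF) << 16
```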

Page 41: Efficient FFTs On VIRAM

Backup Slides

Page 42: Efficient FFTs On VIRAM

[Chart (backup): execution time (microseconds, 0-250) vs. FFT size (128-1024 points) for the naive algorithm with and without auto-increment]

Page 43: Efficient FFTs On VIRAM

Why Vectors For IRAM?

• Low complexity architecture
– means lower power and area

• Takes advantage of on-chip memory bandwidth
– 100x the bandwidth of workstation memory hierarchies

• High performance for apps with fine-grained parallelism

• Delayed pipeline hides memory latency
– Therefore no cache is necessary
» further conserves power and area

• Greater code density than VLIW designs like:
– TI's TMS320C6000
– Motorola/Lucent StarCore
– AD's TigerSHARC
– Siemens (Infineon) Carmel