Parallel Systolic FFT Architectures

  • 7/30/2019 Parallel Systolic FFT Architectures

    1/13

    ClariPhy Confidential

    Parallel Systolic FFT Architectures for High-Speed,

    High Throughput Frequency-Domain Filtering

    October 12, 2012

    Oscar E. Agazzi

    Overview

    Introduction

    Systolic FFT architecture (radix 2)

    Parallel systolic architectures

    Storage requirements

    Other considerations

    Conclusions

    Introduction (1)

    This presentation investigates high-speed, high-throughput architectures for FFTs.

    The main problem addressed is how to simplify the complex interconnection pattern that results from butterflies in FFT implementations derived (directly or indirectly) from FFT flow diagrams.

    Systolic architectures greatly simplify the interconnections, at the expense of increased storage requirements.

    Systolic architectures per se may not be sufficient to achieve the throughput and speed required by the BCD filter in the CL10010.

    Systolic architectures may need to be combined with parallel processing and some degree of traditional, butterfly-based architecture.

    Introduction (2)

    The work presented here is largely based on the systolic FFT architecture described in reference [1]. However, no good references have been found on how to combine systolic implementations with parallel processing.

    The approach presented here may be similar to the one described in [2], but that reference is not explicit enough to allow its work to be replicated.

    For simplicity, this presentation considers only radix-2 FFTs; additional savings may be achieved by using higher-radix FFTs.

    O'Leary [1] reports that savings may be achieved by using radix-4 transforms.

    Systolic FFT Architecture (radix 2)

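    [Figure: radix-2 systolic FFT pipeline, example for N=8. Labels: Input 1, Input 2; stages Delay 4, Delay 2, Delay 1, each with an adder (+) and a subtractor (-); complex multipliers (X) with twiddle factors W0, W1, W2, W3 and W0, W2; Top Output, Bottom Output.]

    [Figure: I/O timing. Time axis marked N/2, N, 3N/2, 2N, 5N/2; traces labelled BLOCK A, BLOCK B, BLOCK C, BLOCK D, POS A, POS B, POS C, NEG A, NEG B, NEG C.]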

    Complexity vs. FFT Size N

    FFT Size   Memory (Complex Words)   Complex Multipliers   Complex Adders
    N          ~3N/2                    log2(N)-1             2log2(N)
    8          12                       2                     6
    16         24                       3                     8
    32         48                       4                     10
    64         96                       5                     12
    128        192                      6                     14
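
The complexity figures for the radix-2 systolic FFT follow directly from the three formulas in the general row (~3N/2 memory words, log2(N)-1 complex multipliers, 2log2(N) complex adders). The short sketch below tabulates them; the function name `systolic_fft_cost` is only illustrative and is not part of the original presentation.

```python
def systolic_fft_cost(n: int) -> dict:
    """Approximate resources of one radix-2 systolic FFT of size n.

    Uses the formulas quoted on the slide: ~3N/2 complex words of memory,
    log2(N)-1 complex multipliers and 2*log2(N) complex adders.
    """
    stages = n.bit_length() - 1          # exact log2 for a power of two
    if n < 2 or (1 << stages) != n:
        raise ValueError("n must be a power of two >= 2")
    return {
        "memory_words": 3 * n // 2,
        "complex_multipliers": stages - 1,
        "complex_adders": 2 * stages,
    }


if __name__ == "__main__":
    for size in (8, 16, 32, 64, 128):
        print(size, systolic_fft_cost(size))
```

The multiplier and adder counts grow only logarithmically with N, while the memory grows linearly; this is the trade-off the systolic approach makes.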

    Discussion

    The systolic processor has an extremely simple interconnection pattern.

    Although memory size grows linearly with N, it is quite manageable for N=64 or even N=128, which are the likely sizes for a parallel/systolic FFT processor for the CL10010 BCD filter.

    Notice that the processor shown in the previous slide can process two independent FFTs at the same time.

    The inputs must be skewed in time by N/2 (this requires additional buffering).

    The outputs come sequentially (aligning the outputs also requires additional buffering).

    The outputs come in bit-reversed order.
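
To make the output ordering concrete, the sketch below lists the bit-reversed index sequence for the N=8 example. It assumes the usual radix-2 convention, in which the output at position i carries the frequency bin whose index is i with its log2(N) bits reversed; the helper names are illustrative.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n: int) -> list:
    """Frequency-bin index found at each output position of an n-point FFT."""
    bits = n.bit_length() - 1            # log2(n) for a power of two
    return [bit_reverse(i, bits) for i in range(n)]


if __name__ == "__main__":
    print(bit_reversed_order(8))         # [0, 4, 2, 6, 1, 5, 3, 7]
```

Bit reversal is its own inverse, so applying the same permutation a second time restores natural order.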

    FFT Parallelization

    In the following discussion we use a numerical example to make the discussion more concrete.

    We assume that the FFT size is N=8192 and the desired throughput is 64 GS/s.

    We also assume that the input comes in blocks of D=128 consecutive samples.

    Therefore a complete FFT block of 8192 samples can be thought of as a matrix of samples with 64 rows and 128 columns.

    The FFT processor must accept blocks of 128 samples (where each block is a row of the matrix) at a rate of 500 MHz.

    The discussion can be easily generalized to other FFT sizes N and decimation factors D.
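
The numbers in this example can be checked with a few lines of arithmetic. The sketch below derives the matrix dimensions and the block rate from N, D and the sample rate; the variable and function names are illustrative only.

```python
N_FFT = 8192              # FFT size N
D = 128                   # samples per input block (decimation factor D)
SAMPLE_RATE_HZ = 64e9     # desired throughput, 64 GS/s

rows = N_FFT // D                      # rows of the sample matrix
block_rate_hz = SAMPLE_RATE_HZ / D     # rate at which 128-sample blocks arrive


def sample_index(row: int, col: int) -> int:
    """Index n of the sample stored at (row, col): n = 128*row + col."""
    return row * D + col


if __name__ == "__main__":
    print(f"matrix: {rows} rows x {D} columns")
    print(f"block rate: {block_rate_hz / 1e6:.0f} MHz")
    print("last sample index:", sample_index(rows - 1, D - 1))
```

Changing `N_FFT` or `D` gives the corresponding figures for other FFT sizes and decimation factors.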

    FFT Parallelization (cont.)

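    The parallelization of the FFT is based on the following factorization:

    \[ X(k) = \sum_{n=0}^{8191} x(n)\, W_{8192}^{nk}, \qquad W_N = \exp\left(-j\frac{2\pi}{N}\right) \]

    This can be expressed as:

    \[ X(k) = \sum_{r=0}^{127} W_{8192}^{rk} \sum_{m=0}^{63} x(128m + r)\, W_{64}^{mk} = \sum_{r=0}^{127} W_{8192}^{rk}\, X_r(k) \]

    where $X_r(k) = \sum_{m=0}^{63} x(128m + r)\, W_{64}^{mk}$ is the 64-point FFT of column $r$ of the sample matrix.

    Writing $k = 64p + q$ with $p = 0, \dots, 127$ and $q = 0, \dots, 63$, and observing that $X_r(k)$ is periodic in $k$ with period 64, we can write:

    \[ X(64p + q) = \sum_{r=0}^{127} W_{8192}^{r(64p+q)}\, X_r(q) = \sum_{r=0}^{127} W_{128}^{rp}\, W_{8192}^{rq}\, X_r(q) \]

    Finally:

    \[ X(64p + q) = \mathrm{FFT}_{128}\left\{ W_{8192}^{rq}\, X_r(q) \right\} \]

    where the FFT is taken with respect to index r.

    The implementation of this factorization is shown in the following slide.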
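
The factorization can be checked numerically. The sketch below is a software model only, not the hardware architecture: it computes 64-point FFTs down the columns of the 64 x 128 sample matrix, applies the scaling factors W_8192^(rq), takes 128-point FFTs with respect to r, and compares the result with a direct 8192-point FFT. It assumes NumPy and the forward-transform convention W_N = exp(-j2*pi/N), which is the one `numpy.fft.fft` uses; the function name `parallel_fft` is illustrative.

```python
import numpy as np

N = 8192          # overall FFT size
D = 128           # samples per input block (columns of the matrix)
M = N // D        # 64 rows, i.e. the leaf FFT size


def parallel_fft(x):
    """N-point DFT via 64-point leaf FFTs, scalers and 128-point output FFTs."""
    x = np.asarray(x, dtype=complex)
    samples = x.reshape(M, D)                 # samples[m, r] = x(128*m + r)
    # Leaf FFTs over m: leaves[q, r] = X_r(q)
    leaves = np.fft.fft(samples, axis=0)
    # Scalers: multiply X_r(q) by W_8192^(r*q)
    q = np.arange(M)[:, None]
    r = np.arange(D)[None, :]
    scaled = leaves * np.exp(-2j * np.pi * q * r / N)
    # Output FFTs over r: out[q, p] = X(64*p + q)
    out = np.fft.fft(scaled, axis=1)
    # Reorder so that element k = 64*p + q sits at position k
    return out.T.reshape(N)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
    error = np.max(np.abs(parallel_fft(x) - np.fft.fft(x)))
    print("max abs error vs direct FFT:", error)
```

In this model the result is returned in natural order for easy comparison; the hardware described in the presentation delivers it in bit-reversed order instead.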

    Parallel/Systolic Processor

    [Figure: block diagram of the parallel/systolic processor. Input (fs = 64 GHz) -> Serial-to-Parallel Converter (fD = 500 MHz) -> FFT Leaf 0, FFT Leaf 1, ..., FFT Leaf 63 -> Scalers -> 128-Point FFT -> FFT Output: 64 blocks of 128 samples each.]

    Discussion

    The only complex interconnections in this processor occur in the 128-point output FFT.

    However, this FFT is relatively small, so its interconnections should not be a problem.

    By comparison, consider that the BCD filter in the CL4010 uses an FFT size of 512.

    The FFT required by the processor proposed here is 4 times smaller, and the technology is more advanced than in the CL4010.

    The processor described here lends itself to an extremely regular and simple layout.

    The output comes in the form of a matrix of complex numbers with 64 rows and 128 columns, with both columns and rows in bit-reversed order.

    It is not necessary to reorder them because the IFFT can automatically reverse the order of both rows and columns.

    Frequency-domain filtering can be implemented in bit-reversed order.
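
The reason no reordering is needed is that frequency-domain filtering is an element-wise product, which commutes with any permutation applied identically to the spectrum and to the filter coefficients. The sketch below illustrates this on a small one-dimensional case with NumPy. It models a bit-reversed-output FFT by permuting `numpy.fft.fft` output and models a bit-reversed-input IFFT by undoing the permutation first; both are stand-ins, not the hardware, and the function names are illustrative. The filtering shown is plain circular convolution.

```python
import numpy as np


def bit_reversal_permutation(n: int) -> np.ndarray:
    """Index array mapping each position to its bit-reversed index."""
    bits = n.bit_length() - 1
    return np.array([int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)])


def filter_in_bit_reversed_order(x, h):
    """Circularly filter x with h while keeping spectra in bit-reversed order."""
    n = len(x)
    perm = bit_reversal_permutation(n)
    x_spec = np.fft.fft(x)[perm]      # stand-in for an FFT with bit-reversed output
    h_spec = np.fft.fft(h, n)[perm]   # filter response stored in the same order
    y_spec = x_spec * h_spec          # element-wise product needs no reordering
    # Stand-in for an IFFT that accepts bit-reversed input: bit reversal is
    # its own inverse, so indexing with perm again restores natural order.
    return np.fft.ifft(y_spec[perm])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(16)
    h = rng.standard_normal(16)
    reference = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h))
    print(np.allclose(filter_in_bit_reversed_order(x, h), reference))
```

The only requirement is that the filter coefficients are stored in the same bit-reversed order as the FFT output.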

    Hardware Requirements

    Hardware Component                                        Number of Units
    Memory (Complex Words)                                    10240
    Memory (Bits) (assumes average word length is 24 bits)    491520
    Complex Multipliers                                       896
    Complex Adders                                            1216

    Assumptions

    Numbers are per polarization and per FFT block.

    Assuming 2 polarizations and IFFT similar to FFT, numbers in table should be quadrupled.

    Pipeline registers not included.

    Output FFT requires (N/2)log2(N) complex multipliers and an equal number of complex adders.

    Scaler requires 128 complex multipliers.
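
The multiplier and adder totals can be reproduced from the per-component figures. The sketch below is one reading of the numbers, not a breakdown stated in the presentation: it assumes 64 leaf processors of size 64, each costing log2(N)-1 multipliers and 2log2(N) adders, plus the 128 scaler multipliers and a 128-point output FFT costing (N/2)log2(N) multipliers and adders. It also assumes that the bit count treats a complex word as two 24-bit components, which is the interpretation that reproduces 491520.

```python
LEAVES = 64                # FFT Leaf 0 .. FFT Leaf 63
LEAF_SIZE = 64             # points per leaf FFT
OUTPUT_FFT_SIZE = 128      # points in the output FFT
SCALER_MULTIPLIERS = 128   # stated on the slide
MEMORY_WORDS = 10240       # stated on the slide (complex words)
BITS_PER_COMPONENT = 24    # assumption: 24 bits each for real and imaginary


def log2_int(n: int) -> int:
    """Exact log2 of a power of two."""
    return n.bit_length() - 1


leaf_multipliers = LEAVES * (log2_int(LEAF_SIZE) - 1)
leaf_adders = LEAVES * 2 * log2_int(LEAF_SIZE)
output_fft_units = (OUTPUT_FFT_SIZE // 2) * log2_int(OUTPUT_FFT_SIZE)

total_multipliers = leaf_multipliers + SCALER_MULTIPLIERS + output_fft_units
total_adders = leaf_adders + output_fft_units
memory_bits = MEMORY_WORDS * 2 * BITS_PER_COMPONENT

if __name__ == "__main__":
    print("complex multipliers:", total_multipliers)
    print("complex adders:", total_adders)
    print("memory bits:", memory_bits)
    # Two polarizations, with an IFFT similar to the FFT: quadruple everything.
    print("multipliers for full filter:", 4 * total_multipliers)
```

Under these assumptions the totals match the table, which suggests the table was built the same way; the memory word count is taken as given because its breakdown is not stated.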

    Conclusions

    A systolic architecture can considerably simplify the routing of large-block-size, high-throughput, high-speed FFTs.

    In deep-submicron CMOS technologies, interconnections have a large impact on power dissipation; it is therefore important to use regular architectures that lead to an efficient layout and to minimize interconnections.

    In this presentation we have proposed an architecture that has the potential to meet the requirements of the CL10010.

    However, significant work still needs to be done to explore alternative values of parameters such as DSP clock speed, parallelization factor, size of the front-end FFTs (FFT Leaves) versus size of the back-end FFT, radices other than 2, etc.

    It is believed that this work can lead to a very efficient implementation of the BCD filter in the CL10010.

    References

    [1] G. C. O'Leary, "Nonrecursive Digital Filtering Using Cascade Fast Fourier Transformers," IEEE Transactions on Audio and Electroacoustics, Vol. AU-18, No. 2, June 1970, pp. 177-183

    [2] P. Jackson et al., "A Systolic FFT Architecture for Real Time FPGA Systems," MIT Lincoln Laboratory publication, September 29, 2004

    [3] T. Woodward, private communication

    [4] A. V. Oppenheim, Applications of Digital Signal Processing, Prentice Hall, 1978, Chapter 5 (Applications of Digital Signal Processing to Radar)