
    2338 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 55, NO. 5, MAY 2007

    Novel Memory Reference Reduction Methods for FFT Implementations on DSP Processors

    Yuke Wang, Yiyan (Felix) Tang, Yingtao Jiang, Member, IEEE, Jin-Gyun Chung, Member, IEEE, Sang-Seob Song, Member, IEEE, and Myoung-Seob Lim, Member, IEEE

    Abstract: Memory references in digital signal processors (DSPs) are expensive due to their long latencies and high power consumption. Implementing fast Fourier transform (FFT) algorithms on a DSP involves many memory references to access butterfly inputs and twiddle factors. Conventional FFT implementations make redundant memory references to load identical twiddle factors for butterflies from different stages of the FFT diagram. In this paper, we present novel memory reference reduction methods that minimize the memory references due to twiddle factors when implementing various FFT algorithms on a DSP. The proposed methods first group the butterflies with identical twiddle factors from different stages of the FFT diagram and compute them before computing butterflies with other twiddle factors, and then reduce the number of twiddle factor lookups by taking advantage of the properties of twiddle factors. Consequently, each twiddle factor is loaded only once and the number of memory references due to twiddle factors is minimized. We have applied the proposed methods to implement the radix-2 DIF FFT algorithm on the TI TMS320C64x DSP. Experimental results show that, compared with the conventional implementation, the proposed methods achieve an average 76.4% reduction in the number of memory references, a 53.5% saving in the memory space used for twiddle factors, and an average 36.5% reduction in the number of clock cycles needed to compute the radix-2 DIF FFT. Similar performance gains are reported for radix-2 DIT FFT implementations using the new methods.

    Index Terms: Digital signal processor (DSP), fast Fourier transform (FFT), memory reference.

    I. INTRODUCTION

    IN THE field of digital signal processing, the discrete Fourier transform (DFT) plays an important role in the analysis, design, and implementation of discrete-time signal-processing algorithms and systems [1], [2]. For instance, the DFT can be used to calculate a signal's frequency response, and to serve as an intermediate step in more elaborate signal processing techniques. The DFT of a discrete signal $x(n)$ can be directly computed by

    $$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}$$

    where $k = 0, 1, \ldots, N-1$, $x(n)$ and $X(k)$ are sequences of complex numbers, and $W_N = e^{-j2\pi/N}$.

    Manuscript received September 16, 2004; revised June 16, 2006. This research was supported by the Ministry of Information and Communication (MIC), South Korea, under the Information Technology Research Center (ITRC) support program supervised by the Institute of Information Technology Assessment (IITA). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Shuvra S. Bhattacharyya.

    Y. Wang is with the Department of Computer Science, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688 USA (e-mail: [email protected]).

    Y. Tang was with the Department of Computer Science, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688 USA. He is now with the 3DSP Corporation, Irvine, CA 92618 USA (e-mail: [email protected]).

    Y. Jiang is with the Department of Electrical and Computer Engineering, University of Nevada, Las Vegas, Las Vegas, NV 89154-4026 USA (e-mail: [email protected]).

    J.-G. Chung, S.-S. Song, and M.-S. Lim are with the Division of Electrical and Information Engineering, Chonbuk National University, Jeonbuk 561-756, Korea (e-mail: [email protected]; [email protected]; [email protected]).

    Digital Object Identifier 10.1109/TSP.2007.892722

    The fast Fourier transforms (FFTs) are a class of efficient algorithms to compute the DFT. FFT algorithms are based on the principle of decomposing the computation of a DFT into sequences of smaller DFTs. The first efficient FFT algorithm was discovered by Gauss in the early 19th century and rediscovered by Cooley and Tukey [3] in the 1960s. Later advances in FFT research include the higher radix FFT [4], the mixed-radix FFT [5], the prime-factor FFT [6], the Winograd Fourier transform algorithm (WFTA) [7], the split-radix FFT [8], [9], the recursive FFT [10], and the combination of decimation-in-time (DIT) and decimation-in-frequency (DIF) FFT algorithms [11]. Most of these algorithms are illustrated with similar FFT diagrams, which follow from the recursive nature of the FFT algorithms and are constructed from a basic butterfly structure, such as the 16-point radix-2 DIT FFT diagram shown in Fig. 1. The complex coefficient $W_N^k = e^{-j2\pi k/N}$ is called the twiddle factor in the butterfly structure of the FFT diagram.

    FFT algorithms can be implemented on multiple platforms. For example, FFT algorithms have been implemented on application-specific integrated circuits (ASICs) as FFT processors [12]. Hardware designs of FFT processors are often tailored to fit high-speed or low-power specifications but lack flexibility. FFT algorithms have also been implemented in software on general-purpose processors as building blocks of simulation or data processing systems [13]. Software-based implementations on general-purpose processors are flexible but typically much slower than hardware implementations based on comparable hardware technologies. Digital signal processors (DSPs) are a specific type of processor optimized for digital signal processing applications such as FIR filters, IIR filters, and FFTs. Software implementations of FFT algorithms on DSPs are becoming more popular than ASIC and general-purpose processor-based implementations because they offer excellent tradeoffs among cost, performance, flexibility, and implementation complexity.

    However, implementing FFT algorithms effectively on DSPs is not trivial. It has been recognized that memory references in DSPs are expensive due to their long latencies and high power consumption. For example, in the TI TMS320C64x DSP [15], a memory load operation takes five pipeline execution phases to complete, which corresponds to four delay slots in the execution time. Implementations of FFT algorithms on DSPs involve many memory references to access butterfly inputs and twiddle factors. In general, an $N$-pt radix-2 FFT diagram can be divided into $\log_2 N$ stages, each of which contains a column of $N/2$ butterflies. Conventional implementations of FFT algorithms compute butterflies in the natural order of the FFT diagram, i.e., in the order of the stages. The butterflies within each stage can be computed either in parallel or serially. Many butterflies with identical twiddle factors can be found in multiple stages of the FFT diagram. For example, seven butterflies with the twiddle factor $W_{16}^{4}$ can be found in Stage 2 to Stage 4 of the 16-pt radix-2 DIT FFT diagram in Fig. 1(b). Hence, memory reference methods that load identical twiddle factors only once would reduce the total memory reference time and the power consumption as well.

    1053-587X/$25.00 © 2007 IEEE

    Fig. 1. 16-pt radix-2 DIT FFT diagram. (a) Basic radix-2 DIT FFT butterfly. (b) Complete 16-pt radix-2 DIT FFT diagram.
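    The count in the example above can be checked with a short Python sketch. It assumes the textbook radix-2 DIT layout with stages numbered from 1, in which stage s uses the twiddle exponents j * N / 2^s; the function name `dit_twiddle_usage` is illustrative and does not come from the paper.

```python
def dit_twiddle_usage(n):
    """Map twiddle exponent k (of W_n^k) -> {stage: number of butterflies}.

    Assumption: stage s (numbered from 1) has n / 2**s blocks, and each block
    holds butterflies j = 0 .. 2**(s-1) - 1 with twiddle exponent j * n / 2**s.
    """
    stages = n.bit_length() - 1              # log2(n) for a power of two
    usage = {}
    for s in range(1, stages + 1):
        blocks = n >> s                      # butterflies sharing one twiddle in stage s
        for j in range(1 << (s - 1)):
            k = j * (n >> s)
            usage.setdefault(k, {})[s] = blocks
    return usage


usage = dit_twiddle_usage(16)
print(usage[4])                  # {2: 4, 3: 2, 4: 1}
print(sum(usage[4].values()))    # 7
```

    Under these assumptions, W_16^4 is used by four butterflies in Stage 2, two in Stage 3, and one in Stage 4, which gives the seven butterflies mentioned in the text.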

    In this paper, we propose novel memory reference reduction methods to minimize the memory references due to twiddle factors in FFT implementations on DSPs. The proposed methods first group the butterflies with identical twiddle factors from different stages of the FFT diagram and compute them before computing butterflies with other twiddle factors. Hence, each twiddle factor is loaded only once and the redundant memory references for identical twiddle factors are removed. The memory reference reduction methods further take advantage of the properties of twiddle factors to reduce the number of twiddle factor lookups, so that even more butterflies can be computed by loading one twiddle factor from memory. We have applied the memory reference reduction methods to implement the radix-2 DIF and DIT FFT algorithms on the TI TMS320C64x DSP. Experimental results show that the number of memory references and the amount of memory space for twiddle factors are greatly reduced, and that the number of clock cycles needed to compute the radix-2 DIF FFT algorithm is also reduced. Our methods can be applied to other kinds of FFT algorithms as well.

    In the following, Section II gives the background of the DIF/DIT FFT algorithms and an example of a conventional FFT implementation on a DSP. Section III describes how to implement radix-2 DIF/DIT FFT algorithms on a DSP with the memory reference reduction methods. Experimental results on the TI TMS320C64x DSP are shown in Section IV, and conclusions are drawn in Section V.

    II. BACKGROUND

    In this section, we first briefly present the basic ideas of the two most widely used FFT algorithms: the DIT FFT and the DIF FFT. We then show the implementation of the radix-2 DIF FFT from TI's DSP library [15] as a typical example of a conventional implementation of FFT algorithms on a DSP.

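    The DFT of a discrete signal $x(n)$ can be directly computed as

    $$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk} \qquad (1)$$

    where $k = 0, 1, \ldots, N-1$, $x(n)$ and $X(k)$ are sequences of complex numbers, and $W_N = e^{-j2\pi/N}$.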
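    For reference, the direct computation of the DFT can be sketched in a few lines of Python. This is only an illustration of the defining sum, assuming the usual convention W_N = exp(-j 2 pi / N); it needs on the order of N^2 complex multiplications, which is the cost the FFT algorithms discussed next avoid. The function name `dft` is illustrative.

```python
import cmath


def dft(x):
    """Direct DFT: X(k) = sum over n of x(n) * W_N^(n*k), with W_N = exp(-2j*pi/N)."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[i] * w ** (i * k) for i in range(n)) for k in range(n)]


print([abs(v) for v in dft([1, 0, 0, 0])])   # a unit impulse has a flat magnitude spectrum
```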

    DIT and DIF FFT algorithms are obtained by decomposing the input sequence $x(n)$ and the output sequence $X(k)$ in (1) into successively smaller subsequences, respectively. For example, the radix-2 DIT and DIF FFT algorithms can be obtained by splitting $x(n)$ and $X(k)$ into odd- and even-indexed terms, respectively. The computation of the radix-2 DIT and DIF FFT algorithms can be represented by radix-2 DIT and DIF FFT diagrams, which are shown in Fig. 1(b) and Fig. 2(b), respectively.

    Fig. 2. (a) Basic radix-2 DIF FFT butterfly. (b) Complete 16-pt radix-2 DIF FFT diagram.
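    The two decompositions can be illustrated with a pair of recursive Python sketches. They are textbook formulations rather than the in-place DSP code discussed later, and the names `fft_dit` and `fft_dif` are illustrative: the DIT version splits the input into even- and odd-indexed samples, while the DIF version splits the output into even- and odd-indexed bins. Both assume the input length is a power of two.

```python
import cmath


def fft_dit(x):
    """Radix-2 decimation in time: split the INPUT into even/odd samples."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_dit(x[0::2])
    odd = fft_dit(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle applied before the add/subtract
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_dif(x):
    """Radix-2 decimation in frequency: split the OUTPUT into even/odd bins."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    top = [x[i] + x[i + half] for i in range(half)]
    # twiddle applied after the subtract
    bottom = [(x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n) for i in range(half)]
    out = [0j] * n
    out[0::2] = fft_dif(top)       # even-indexed outputs X(2r)
    out[1::2] = fft_dif(bottom)    # odd-indexed outputs X(2r+1)
    return out
```

    Both functions return the same spectrum; they differ only in where the twiddle multiplication sits relative to the butterfly additions, which is what distinguishes the diagrams in Fig. 1 and Fig. 2.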

    The computation order of the butterflies in conventional FFT implementations on DSPs is based on the partitioning of the FFT diagrams. In general, the FFT diagram can be partitioned into several stages. Each stage contains a constant number of butterflies. For example, the $N$-pt radix-2 DIT/DIF FFT diagram can be partitioned into $\log_2 N$ stages, each of which contains $N/2$ butterflies. The butterflies within a stage have no data dependencies with each other but have data dependencies with butterflies in other stages. For example, the butterflies in Stage 2 of the FFT diagram in Fig. 2 have no data dependencies with each other but have data dependencies with butterflies in both Stage 1 and Stage 3. The butterflies in the same stage of the FFT diagram can be further partitioned into groups. Each group contains all butterflies sharing an identical twiddle factor within the same stage. In particular, with the stages numbered from 0 as in Fig. 3, the butterflies in Stage $s$ of the $N$-pt radix-2 DIT FFT diagram are divided into $2^s$ groups, while Stage $s$ of the $N$-pt radix-2 DIF FFT diagram contains $N/2^{s+1}$ groups. Fig. 3 illustrates the partitioning of the 16-pt radix-2 DIT and DIF FFT diagrams.

    Based on the partitioning of the radix-2 DIT and DIF FFT diagrams, the butterflies can be computed following the index order of the stages and groups. The butterflies in the same group are computed from top to bottom. Butterflies with identical twiddle factors are computed in multiple stages of the FFT diagrams in Fig. 3. For example, seven butterflies with the twiddle factor $W_{16}^{4}$ are computed in Stage 1 to Stage 3 of the 16-pt radix-2 DIT FFT diagram in Fig. 3. Hence, identical twiddle factors are accessed multiple times in conventional FFT implementations.

    Fig. 4 shows the C code taken from TI's DSP library [15], which implements the $N$-pt radix-2 DIF FFT algorithm, where the value of $N$ is given as an input to the C code.

    The C code in Fig. 4 has a three-loop structure: 1) the outermost loop, the $k$-loop, counts the stages and iterates $\log_2 N$ times; 2) the second loop, the $j$-loop, counts the groups within each stage and decides which twiddle factor is to be loaded; and 3) the innermost loop, the $i$-loop, computes the butterflies within each group. The variables $k$ and $j$ indicate the stage and group number, respectively. The variables $i$ and $l$ indicate the upper and lower input indexes of the butterfly computed by the innermost loop, and $ia$ indicates the twiddle factor to be loaded. Since conventional implementations strictly follow the natural order of the FFT diagram, identical twiddle factors are loaded multiple times when computing butterflies from different stages of the FFT diagram. For example, the C code in Fig. 4 loads the twiddle factor $W_{16}^{4}$ in the $j$-loop when computing butterflies in both Stage 1 and Stage 2 of the 16-pt radix-2 DIF FFT diagram.
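    The three-loop structure described above can be sketched in Python as follows. This is not TI's library code: it is a floating-point illustration of the same stage/group/butterfly loop nest, with a counter added to show how often the twiddle table is read. The function names and the counter `loads` are assumptions made for this example.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def dif_fft_conventional(x):
    """Stage-by-stage radix-2 DIF FFT.

    Returns (spectrum in natural order, number of twiddle-table loads).
    len(x) must be a power of two.
    """
    n = len(x)
    bits = n.bit_length() - 1
    a = list(x)
    table = [cmath.exp(-2j * cmath.pi * t / n) for t in range(n // 2)]
    loads = 0
    n2, ie = n, 1
    k = n
    while k > 1:                           # stage loop
        n1, n2 = n2, n2 // 2
        ia = 0
        for j in range(n2):                # group loop: one twiddle per group
            w = table[ia]
            loads += 1
            ia += ie
            for i in range(j, n, n1):      # butterflies of this group
                l = i + n2
                u, v = a[i], a[l]
                a[i] = u + v
                a[l] = (u - v) * w
        ie *= 2
        k //= 2
    # an in-place DIF FFT leaves its result in bit-reversed order
    return [a[bit_reverse(q, bits)] for q in range(n)], loads


spectrum, loads = dif_fft_conventional([1, 2, 3, 4] * 4)
print(loads)    # 15 table reads for a 16-pt transform
```

    For a 16-pt transform the table is read 8 + 4 + 2 + 1 = 15 times, although it holds only eight distinct twiddle factors; this is the redundancy the following section removes.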

    III. FFT IMPLEMENTATIONS WITH THE NOVEL MEMORY REFERENCE REDUCTION METHODS

    In order to remove redundant memory references due to identical twiddle factors, we propose novel memory reference reduction methods to implement FFT algorithms such that each twiddle factor is loaded only once, by grouping butterflies with identical twiddle factors together. Furthermore, the proposed methods minimize the number of twiddle factors needed in FFT diagrams by taking advantage of properties of the twiddle factors. The memory reference reduction methods work for implementations of many kinds of FFT algorithms. As examples, we demonstrate applications of the memory reference reduction methods to the two most popular FFT algorithms: the radix-2 DIF and DIT FFT algorithms.


    Fig. 3. Partitioning of the 16-pt radix-2 DIT and DIF FFT. (a) Partitioning of 16-pt radix-2 DIT FFT diagram. (b) Partitioning of 16-pt radix-2 DIF FFT diagram.

    A. Grouping of Butterflies With Identical Twiddle Factors

    In this subsection, we use the radix-2 DIF FFT diagram to demonstrate how to group and compute together the butterflies with identical twiddle factors from different stages.

    For the radix-2 DIF FFT algorithm, a butterfly in Stage $s$ is composed of the inputs $x(i)$ and $x(i + N/2^s)$ and the twiddle factor $W_N^{(i \bmod N/2^s)\,2^{s-1}}$. Fig. 5 shows the butterfly at Stage $s$ of an $N$-pt radix-2 DIF FFT diagram with the corresponding twiddle factor.

    For example, in the second stage of the 16-pt radix-2 DIF FFT diagram shown in Fig. 2, the butterfly with the inputs $x(1)$ and $x(5)$ uses the twiddle factor $W_{16}^{(1 \bmod 4) \cdot 2} = W_{16}^{2}$.

    Fig. 4. C code of radix-2 DIF FFT from [15].

    Fig. 5. Single butterfly at Stage s in radix-2 DIF FFT diagram with twiddle factor.

    Theorem 1: In Stage $s$ of the $N$-pt radix-2 DIF FFT diagram, there are $N/2^s$ different twiddle factors, which can be represented by $W_N^{2^{s-1}k}$, where $k = 0, 1, \ldots, N/2^s - 1$. Among them, the twiddle factors of the form $W_N^{2^{s-1}(2k+1)}$, for $k = 0, 1, \ldots, N/2^{s+1} - 1$, will not show up in Stage $s+1$ or any later stage.
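    Theorem 1 can be checked numerically with a small Python sketch. It assumes stages numbered from 1 and the standard radix-2 DIF layout, in which stage s uses every twiddle exponent that is a multiple of 2^(s-1) and smaller than N/2. The helper names are illustrative.

```python
def dif_stage_twiddles(n):
    """Twiddle exponents used in each stage (1-based) of an n-pt radix-2 DIF diagram."""
    stages = n.bit_length() - 1
    return {s: [j << (s - 1) for j in range(n >> s)] for s in range(1, stages + 1)}


def last_stage_of(n):
    """Map each twiddle exponent to the last stage in which it appears."""
    last = {}
    for s, exponents in dif_stage_twiddles(n).items():
        for k in exponents:
            last[k] = s            # later stages overwrite earlier ones
    return last


print(dif_stage_twiddles(16))
# {1: [0, 1, 2, 3, 4, 5, 6, 7], 2: [0, 2, 4, 6], 3: [0, 4], 4: [0]}
print(last_stage_of(16))
# {0: 4, 1: 1, 2: 2, 3: 1, 4: 3, 5: 1, 6: 2, 7: 1}
```

    The odd exponents disappear after Stage 1, the exponents 2 and 6 after Stage 2, and the exponent 4 after Stage 3, while W_16^0 is present in every stage.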

    The butterflies within a stage with identical twiddle factors can be grouped and computed in any order without violating the data dependencies in the original radix-2 DIF FFT diagram. For example, the butterflies with twiddle factors $W_{16}^{1}$, $W_{16}^{3}$, $W_{16}^{5}$, and $W_{16}^{7}$ are only found in Stage 1 of the 16-pt radix-2 DIF FFT diagram. These butterflies can be grouped and computed in any order without affecting the computations of other butterflies. In addition, the butterflies with twiddle factors $W_{16}^{2}$ and $W_{16}^{6}$ do not exist in any stage later than Stage 2 of the 16-pt radix-2 DIF FFT diagram. Hence, they can be grouped and computed in any order after the butterflies with twiddle factors $W_{16}^{1}$, $W_{16}^{3}$, $W_{16}^{5}$, and $W_{16}^{7}$ are computed. Similarly, the butterflies with twiddle factor $W_{16}^{4}$ can be grouped and computed in any order in the 16-pt radix-2 DIF FFT diagram after the butterflies with twiddle factors $W_{16}^{2}$ and $W_{16}^{6}$ are computed. The butterflies with the twiddle factor $W_{16}^{4}$ do not exist in stages later than Stage 3. Following this principle, the computation of the $N$-pt radix-2 DIF FFT diagram can be done in $\log_2 N$ steps. Each step groups and computes the butterflies whose twiddle factor appears in all stages up to the stage of interest and will not occur in the later stages of the FFT diagram. The butterflies within a step can be computed in any order, except for the butterflies with the twiddle factor $W_N^{0}$. Butterflies with the twiddle factor $W_N^{0}$ appear in all the stages of the FFT diagram and have data dependencies between the stages.


    Fig. 6. Grouping butterflies with identical twiddle factors together in radix-2 DIF FFT diagram.

    Reduced Memory Reference FFT algorithm: Based on Theorem 1, the $N$-pt radix-2 DIF FFT diagram can be computed in $\log_2 N$ steps as follows.

    Step 1: Compute the butterflies with twiddle factors that will not occur after Stage 1 of the FFT diagram. Compute the $N/4$ butterflies with twiddle factors $W_N^{2k+1}$, where $k = 0, 1, \ldots, N/4 - 1$, in Stage 1 of the $N$-pt radix-2 DIF FFT diagram.

    Step 2: Compute the butterflies with twiddle factors that will not occur after Stage 2 of the FFT diagram. The butterflies with twiddle factors that will not occur after Stage 2 of the $N$-pt radix-2 DIF FFT diagram include: 1) $N/4$ butterflies in the second stage of the DIF FFT diagram and 2) $N/8$ butterflies in the first stage. These butterflies have twiddle factors $W_N^{2(2k+1)}$, where $k = 0, 1, \ldots, N/8 - 1$.

    Step $s$: Compute the butterflies with twiddle factors that will not occur after Stage $s$ of the FFT diagram, where $s = 1, 2, \ldots, \log_2 N - 1$. The butterflies with twiddle factors that will not occur after Stage $s$ of the $N$-pt radix-2 DIF FFT diagram include $N/4$ butterflies in Stage $s$, $N/8$ butterflies in Stage $s-1$, and so on down to $N/2^{s+1}$ butterflies in the first stage of the radix-2 DIF FFT diagram. These butterflies have twiddle factors $W_N^{2^{s-1}(2k+1)}$, where $k = 0, 1, \ldots, N/2^{s+1} - 1$.

    Step $\log_2 N$: Compute the butterflies with twiddle factor $W_N^{0}$. In total, $N - 1$ butterflies with twiddle factor $W_N^{0}$ in an $N$-pt radix-2 DIF FFT diagram are computed.

    In this way, each twiddle factor is loaded exactly once during the computation of the $N$-pt radix-2 DIF FFT diagram. To illustrate the above steps, we can redraw the 16-pt radix-2 DIF FFT diagram of Fig. 2 as Fig. 6, where the butterflies with identical twiddle factors are grouped together. All butterflies with the twiddle factor $W_{16}^{0}$ in Fig. 6 can be computed without multiplications in Step 4.
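    The grouped computation order can be sketched in Python as below. This is an illustrative reconstruction under stated assumptions, not the code of Fig. 12: stages and steps are numbered from 1, the twiddles handled in step t are taken to be those whose exponent is 2^(t-1) times an odd number, and W_N^0 is handled last, stage by stage, without reading the table. The names `dif_fft_grouped`, `run`, and `loads` are hypothetical.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def dif_fft_grouped(x):
    """Radix-2 DIF FFT that visits the butterflies twiddle by twiddle.

    Returns (spectrum in natural order, number of twiddle-table loads).
    len(x) must be a power of two.
    """
    n = len(x)
    m = n.bit_length() - 1                       # number of stages, log2(n)
    a = list(x)
    table = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    loads = 0

    def run(k, s, w):
        # every butterfly of stage s whose twiddle factor is W_n^k
        n1 = n >> (s - 1)                        # block size in stage s
        n2 = n1 >> 1                             # distance between the two inputs
        for i in range(k >> (s - 1), n, n1):
            u, v = a[i], a[i + n2]
            a[i] = u + v
            a[i + n2] = (u - v) * w

    for step in range(1, m):                     # steps 1 .. log2(n) - 1
        low = 1 << (step - 1)
        for k in range(low, n // 2, 2 * low):    # k = low * (odd number)
            w = table[k]                         # the only load of W_n^k
            loads += 1
            for s in range(1, step + 1):         # W_n^k occurs in stages 1 .. step
                run(k, s, w)

    for s in range(1, m + 1):                    # last step: W_n^0 = 1, in stage order
        run(0, s, 1)                             # real code would skip the multiply

    return [a[bit_reverse(k, m)] for k in range(n)], loads


spectrum, loads = dif_fft_grouped([1, 2, 3, 4] * 4)
print(loads)    # 7 table reads for a 16-pt transform
```

    The set of butterflies is the same as in the stage-by-stage order; only the visiting order changes. The reordering is valid because a butterfly in stage s with exponent k only depends on butterflies of stage s-1 whose exponents were handled in an earlier step.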

    B. Reduction of the Number of Necessary Lookups of Twiddle Factors

    The method in Section III-A reduces the number of memory accesses for each twiddle factor in implementing the $N$-pt radix-2 DIF FFT algorithm. Furthermore, the memory references can be minimized by using the properties of the twiddle factors to reduce the number of twiddle factors that have to be looked up. For example, the butterflies in Step 2 of the FFT diagram in Fig. 6 are computed with twiddle factors $W_{16}^{2}$ and $W_{16}^{6}$. The twiddle factor $W_{16}^{6}$ can be replaced by $W_{16}^{2}$ with a simple derivation

    $$W_{16}^{6} = W_{16}^{4}\, W_{16}^{2} = -j\, W_{16}^{2}$$

    Hence, only the twiddle factor $W_{16}^{2}$ is needed in Step 2. Similarly, twiddle factors $W_{16}^{5}$ and $W_{16}^{7}$ can be replaced by $W_{16}^{1}$ and $W_{16}^{3}$, respectively. Hence, only twiddle factors $W_{16}^{1}$ and $W_{16}^{3}$ are necessary in Step 1 of the FFT diagram in Fig. 6. In general, we have the following property:

    $$W_N^{k+N/4} = -j\, W_N^{k} \qquad (2)$$

    where $W_N^{k} = e^{-j2\pi k/N}$, $j = \sqrt{-1}$, and $k = 0, 1, \ldots, N/4 - 1$.

    The above property of twiddle factors can be applied to any FFT algorithm and reduces the number of twiddle factors that need to be stored in memory. More butterflies can be computed by loading one twiddle factor than by grouping the butterflies with identical twiddle factors alone.

    Fig. 7. Computing two butterflies together in one stage of the radix-2 DIF FFT diagram.
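    The saving in table size can be illustrated with a short Python sketch. It relies on the identity W_N^(k + N/4) = -j * W_N^k, which is an assumption about the property the text refers to, and keeps only the first N/4 table entries; the function names are illustrative.

```python
import cmath


def quarter_table(n):
    """Twiddle factors W_n^k for k = 0 .. n/4 - 1 only."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 4)]


def twiddle_from_quarter(n, k, table):
    """Return W_n^k for 0 <= k < n/2 using only the quarter-size table."""
    q = n // 4
    if k < q:
        return table[k]
    w = table[k - q]
    # -j * (a + jb) = b - ja: swap the parts and negate the new imaginary part
    return complex(w.imag, -w.real)


table = quarter_table(16)
print(len(table))                                   # 4 stored entries instead of 8
print(abs(twiddle_from_quarter(16, 6, table)
          - cmath.exp(-2j * cmath.pi * 6 / 16)) < 1e-12)   # True
```

    Because multiplying by -j only swaps the real and imaginary parts and flips one sign, the second half of the table never has to be stored or read.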

    Theorem 2: Considering the two butterflies at Stage of

    radix-2 DIF FFT diagram shown in Fig. 7(a), both butterflies

    can be computed together by loading only one twiddle factor

    , as shown in Fig. 7(b).

    Proof: Based on Fig. 5, input in Stage

    pairs with input to form one butterfly, and the

    twiddle factor used in the butterfly is

    Since and

    , we have

    if , the result of

    the above equation is

    else if , the result

    becomes

    Therefore, we have

    , which implies that

    and

    The twiddle factors and

    are complex numbers with separated real and imaginary parts,

    which are stored separately in memory. Therefore, by loading

    and ,

    we can compute both butterflies.
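    As an illustration of computing two butterflies from one loaded twiddle factor, the following Python sketch uses separate real and imaginary parts, as a fixed-point implementation would. It assumes that the two butterflies sit half a span apart within one stage, so that their twiddle exponents differ by N/4 and the second twiddle equals -j times the first; the function name and argument layout are hypothetical.

```python
def dif_butterfly_pair(a, i, span, wr, wi):
    """Compute two radix-2 DIF butterflies in place from one twiddle (wr + j*wi).

    Butterfly 1 uses inputs a[i] and a[i + span] with twiddle w = wr + j*wi.
    Butterfly 2 uses inputs a[i + span//2] and a[i + span//2 + span]
    with twiddle -j*w, whose real part is wi and imaginary part is -wr.
    """
    h = span // 2

    u, v = a[i], a[i + span]
    d = u - v
    a[i] = u + v
    a[i + span] = complex(d.real * wr - d.imag * wi,
                          d.real * wi + d.imag * wr)

    u, v = a[i + h], a[i + h + span]
    d = u - v
    a[i + h] = u + v
    # (dr + j*di) * (wi - j*wr): the same two loaded values, with the roles swapped
    a[i + h + span] = complex(d.real * wi + d.imag * wr,
                              d.imag * wi - d.real * wr)
```

    For the first stage of a 16-pt transform, calling the function with i = 1 and span = 8 and the parts of W_16^1 computes the butterflies that would otherwise need both W_16^1 and W_16^5.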

    After the number of necessary twiddle factors is reduced, the FFT diagram in Fig. 6, where the butterflies with identical twiddle factors are grouped together, can be further redrawn as Fig. 8. Only three twiddle factors need to be looked up in Fig. 8, compared with seven in the original 16-pt radix-2 DIF FFT diagram in Fig. 2.

    The method proposed in this subsection, which reduces the number of twiddle factors to be looked up and computes two butterflies together by loading one twiddle factor, is different from the radix-4 FFT algorithm [1]. The radix-4 FFT algorithm saves complex multiplications with twiddle factors by combining two adjacent stages in the radix-2 FFT diagram. The method proposed in this subsection does not save complex multiplications in the radix-2 FFT diagram but reduces the number of memory lookups needed for the twiddle factors. Moreover, the proposed method can be applied to radix-4 FFT algorithms to reduce the number of twiddle factors that need to be looked up in the radix-4 FFT diagram. Fig. 9 shows an example of reducing the number of necessary twiddle factors in the 16-pt radix-4 DIF FFT diagram.

    The twiddle factors in the conventional 16-pt radix-4 DIF FFT diagram shown in Fig. 9(c) are $W_{16}^{1}$, $W_{16}^{2}$, $W_{16}^{3}$, $W_{16}^{4}$, $W_{16}^{6}$, and $W_{16}^{9}$. By taking advantage of the properties of the twiddle factors, the twiddle factors that need to be looked up in the 16-pt radix-4 DIF FFT diagram are reduced to $W_{16}^{1}$, $W_{16}^{2}$, and $W_{16}^{3}$, as shown in Fig. 9(d).

    C. Application of Memory Reference Reduction Methods on Radix-2 DIT FFT

    The new memory reference reduction methods can be applied to implement many existing FFT algorithms. As an example, we apply the memory reference reduction methods to implement the radix-2 DIT FFT algorithm. Due to the difference between the DIT and DIF FFT diagrams, the methods described in Sections III-A and III-B cannot be applied to the radix-2 DIT FFT diagram directly. To apply the method described in Section III-A to an $N$-pt radix-2 DIT FFT diagram, the butterflies with twiddle factor $W_N^{0}$ are grouped and computed together in Step 1, before computing butterflies with other twiddle factors. In each following Step $s$, the butterflies with twiddle factors $W_N^{(2k+1)N/2^s}$, where $k = 0, 1, \ldots, 2^{s-2} - 1$, are computed. By grouping the butterflies with identical twiddle factors together, we can redraw the 16-pt radix-2 DIT FFT diagram as Fig. 10.

    Fig. 8. 16-pt radix-2 DIF FFT diagram after the proposed methods are applied.

    Fig. 9. Reducing the number of twiddle factors needed to be looked up in the 16-pt radix-4 DIF FFT diagram. (a) Radix-4 DIF FFT butterfly. (b) Simplified representation of the radix-4 DIF FFT butterfly. (c) Conventional 16-pt radix-4 DIF FFT diagram. (d) 16-pt radix-4 DIF FFT diagram with the number of twiddle factors to be looked up reduced.

    Fig. 10. Grouping butterflies with identical twiddle factors together in radix-2 DIT FFT diagram.

    Fig. 11. 16-pt radix-2 DIT FFT diagram after the proposed methods are applied.

    To apply the method described in Section III-B to an $N$-pt radix-2 DIT FFT diagram, we first compute all the butterflies with twiddle factor $W_N^{0}$ in Stage 1. Then, we compute the remaining butterflies with twiddle factor $W_N^{0}$ together with the butterflies with twiddle factor $W_N^{N/4}$. Finally, the remaining butterflies in the FFT diagram are computed following the principle of Section III-B. After the number of twiddle factors to be looked up is reduced, the FFT diagram in Fig. 10 can be further redrawn as Fig. 11. Only three twiddle factors need to be looked up in Fig. 11, compared with seven in the original 16-pt radix-2 DIT FFT diagram in Fig. 1.

    IV. PERFORMANCE EVALUATION FOR THE NOVEL MEMORY REFERENCE REDUCTION METHODS

    The number of memory references due to twiddle factors in conventional implementations of $N$-pt radix-2 DIF or DIT FFT algorithms is $N - 1$, which equals the number of groups in the FFT diagram. Grouping the butterflies with identical twiddle factors together reduces the number of memory references due to twiddle factors from $N - 1$ to one per distinct twiddle factor. After the number of necessary twiddle factors is minimized using the properties of the twiddle factors, memory references are needed only for the reduced set of twiddle factors to implement $N$-pt radix-2 FFT algorithms.

    Fig. 12. DIF FFT code 1 groups the butterflies with identical twiddle factors together only.
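    The three counts can be tallied with a small Python sketch. It is a counting exercise under explicit assumptions, not a reproduction of Table I: every group in the conventional order costs one load, the grouped order costs one load per distinct twiddle exponent, and the reduced order folds exponents that differ by N/4 onto one table entry. All three numbers include the trivial factor W_N^0. The function name is illustrative, and N is assumed to be a power of two of at least 4.

```python
def twiddle_load_counts(n):
    """Return (conventional, grouped, reduced) twiddle load counts for an n-pt radix-2 DIF FFT."""
    stages = n.bit_length() - 1
    # exponents used in stage s (1-based): multiples of 2**(s-1) below n/2
    per_stage = {s: {j << (s - 1) for j in range(n >> s)} for s in range(1, stages + 1)}
    conventional = sum(len(exponents) for exponents in per_stage.values())
    distinct = set().union(*per_stage.values())
    folded = {k % (n // 4) for k in distinct}     # exponents k and k + n/4 share one entry
    return conventional, len(distinct), len(folded)


print(twiddle_load_counts(16))    # (15, 8, 4)
```

    For N = 16, leaving out the trivial W_16^0 from the last two numbers gives 7 and 3, the counts quoted in the text for Fig. 2 and Fig. 8.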

    We have applied the memory reference reduction methods to implement the radix-2 DIF and DIT FFT on the TI TMS320C64x DSP, which is a fixed-point DSP with an enhanced very long instruction word (VLIW) architecture. The C64x DSP has eight functional units that can execute a maximum of eight operations in parallel, two register files with 32 32-bit registers each, and a 32-bit internal communication bandwidth.

    Four pieces of C code are compiled with the maximum compiler effort (-o3) and executed in TI Code Composer Studio (CCS) v2.1 [14], which is the software development and simulation environment for the TI TMS320C64x DSP. TI's DIF FFT code is the radix-2 DIF FFT code in Fig. 4, taken from TI's DSP library [15]. DIF FFT code 1 in Fig. 12 only groups the butterflies with identical twiddle factors together in the radix-2 DIF FFT diagram, without reducing the number of twiddle factors that need to be looked up. DIF FFT code 2 in Fig. 13 is written based on the radix-2 DIF FFT diagram in Fig. 8, where the memory reference reduction methods are applied. Besides these three codes, the radix-2 DIT FFT code with the memory reference reduction methods is shown in Fig. 14, which is based on the radix-2 DIT FFT diagram in Fig. 11. The performance figures of the four codes are compared in Table I, including the number of memory references due to twiddle factors, the amount of memory storage for twiddle factors, and the number of clock cycles needed to compute FFTs of different sizes. The number of clock cycles each code takes to compute the FFTs is measured precisely using the breakpoint function in CCS.

    The experimental results show that the radix-2 DIF FFT implementation that only groups the butterflies with identical twiddle factors together achieves an average 50.9% reduction in the number of memory references due to twiddle factors and an average 29.7% reduction in the number of clock cycles, compared with the conventional implementation taken from TI's library. Furthermore, when the number of twiddle factors that need to be looked up is also reduced, an average 76.4% reduction in the number of memory references due to twiddle factors, an average 53.5% saving in memory space for twiddle factors, and an average 36.5% reduction in the number of clock cycles are achieved, compared with the conventional implementation taken from TI's library. The performance of the radix-2 DIT FFT implementation with the memory reference reduction methods is slightly better than that of the radix-2 DIF FFT implementation.

    V. CONCLUSION

    In this paper, we propose novel memory reference reduction methods to minimize the number of memory references due to twiddle factors in FFT implementations on DSPs. The proposed methods first group the butterflies with identical twiddle factors from different stages of the FFT diagram and compute them together, and then reduce the total number of necessary twiddle factors by taking advantage of the properties of twiddle factors. Consequently, each twiddle factor is loaded only once and the number of memory references due to twiddle factors is minimized. Experimental results show that, compared with the conventional implementation, the proposed methods achieve an average 76.4% reduction in the number of memory references, a 53.5% saving in memory space for twiddle factors, and an average 36.5% reduction in the number of clock cycles needed to compute the radix-2 DIF FFT on a DSP.

    Fig. 13. DIF FFT code 2 with the memory reference reduction methods based on Fig. 8.

    Fig. 14. DIT FFT code with the memory reference reduction methods based on Fig. 11.

    TABLE I
    PERFORMANCE COMPARISON OF THE IMPLEMENTATIONS

    ACKNOWLEDGMENT

    The authors would like to thank the anonymous reviewers for their careful reading and valuable comments that improved the quality of this paper. A reviewer has also brought to our attention that C. M. Rader of MIT, in 1965, wrote an FFT program which used the idea in Section III-A, but he did not publish it.

    REFERENCES

    [1] C. S. Burrus and T. W. Parks, DFT/FFT and Convolution Algorithms and Implementation. New York: Wiley, 1985.

    [2] A. V. Oppenheim and C. M. Rader, Discrete-Time Signal Processing, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 1999, 0137549202.

    [3] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol. 19, pp. 297-301, 1965.

    [4] G. D. Bergland, "A radix-eight fast-Fourier transform subroutine for real-valued series," IEEE Trans. Electroacoust., vol. AE-17, no. 2, pp. 138-144, Jun. 1969.

    [5] R. C. Singleton, "An algorithm for computing the mixed radix fast Fourier transform," IEEE Trans. Audio Electroacoust., vol. AE-17, no. 2, pp. 93-103, Jun. 1969.

    [6] D. P. Kolba and T. W. Parks, "A prime factor FFT algorithm using high-speed convolution," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-25, no. 4, pp. 281-294, Aug. 1977.

    [7] S. Winograd, "On computing the discrete Fourier transform," Math. Comput., vol. 32, no. 141, pp. 175-199, Jan. 1978.

    [8] P. Duhamel and H. Hollmann, "Split radix FFT algorithm," Electron. Lett., vol. 20, pp. 14-16, Jan. 5, 1984.

    [9] D. Takahashi, "An extended split-radix FFT algorithm," IEEE Signal Process. Lett., vol. 8, no. 5, pp. 145-147, May 2001.

    [10] A. R. Varkonyi-Koczy, "A recursive fast Fourier transform algorithm," IEEE Trans. Circuits Syst. II, vol. 42, no. 9, pp. 614-616, Sep. 1995.

    [11] A. Saidi, "Decimation-in-time-frequency FFT algorithm," in Proc. ICASSP, Apr. 1994, pp. III:453-III:456.

    [12] B. M. Baas, "A low-power, high-performance, 1024-point FFT processor," IEEE J. Solid-State Circuits, vol. 34, no. 3, pp. 380-387, Mar. 1999.

    [13] Matlab Function Reference: FFT. Mathworks, Inc. [Online]. Available: http://www.mathworks.com/access/helpdesk/help/techdoc/ref/fft.shtml?BB=1

    [14] TMS320C6000 Programmer's Guide (Rev. G), Texas Instruments, Aug. 1, 2002, SPRU198G.

    [15] TMS320C64x DSP Library Programmer's Reference (Rev. B), Texas Instruments, Oct. 23, 2003, SPRU565A.

    Yuke Wang received the B.Sc. degree from the University of Science and Technology of China, Hefei, China, in 1989, and the M.Sc. and Ph.D. degrees from the University of Saskatchewan, Saskatoon, Canada, in 1992 and 1996, respectively.

    He has held faculty positions at Concordia University, Montreal, QC, Canada, and Florida Atlantic University, Boca Raton. Currently, he is an Associate Professor in the Computer Science Department, University of Texas at Dallas, Richardson. He has also held visiting assistant professor positions at the University of Minnesota, the University of Maryland, and the University of California at Berkeley. His research interests include VLSI design of circuits and systems for DSP and communication, computer-aided design, and computer architectures. He has published more than 20 papers in IEEE/ACM Transactions.

    Dr. Wang served as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, PART II (2002–2003), as an Editor of the IEEE TRANSACTIONS ON VLSI SYSTEMS (2001–2002), and as an Editor of Applied Signal Processing and a few other journals.

    Yiyan (Felix) Tang received the B.Sc. degree in electrical engineering from South China University of Technology, Guangzhou, China, in 2000, and the M.Sc. degree in computer engineering and the Ph.D. degree in computer science from the University of Texas at Dallas, Richardson, in 2002 and 2005, respectively.

    Since 2005, he has been with the 3DSP Corporation, Irvine, CA, where he works on the design and implementation of wireless communication systems on digital signal processors. His current research interests lie in the efficient and effective design and implementation of wireless communication and signal processing systems on digital signal processors.

    Yingtao Jiang (M'01) received the B.Eng. degree in biomedical engineering and electronics from Chongqing University, Chongqing, China, the M.A.Sc. degree in electrical engineering from Concordia University, Montreal, QC, Canada, and the Ph.D. degree in computer science from the University of Texas at Dallas, Richardson, in 1993, 1997, and 2001, respectively.

    He is currently an Assistant Professor in the Department of Electrical and Computer Engineering, University of Nevada, Las Vegas. His research interests include algorithms, VLSI architectures, and circuit-level techniques for the design of DSP, networking, and telecommunications systems, computer architectures, and biomedical signal processing, instrumentation, and medical informatics.


    Jin-Gyun Chung (S'90–M'98) received the B.S. degree in electronic engineering from Chonbuk National University, Chonju, Korea, in 1985 and the M.S. and Ph.D. degrees in electrical engineering from the University of Minnesota, Minneapolis, in 1991 and 1994, respectively.

    Since 1995, he has been with the Department of Electronic and Information Engineering, Chonbuk National University, where he is currently a Professor. His research interests are in the area of VLSI architectures and algorithms for signal processing and communication systems, which include the design of high-speed and low-power algorithms for arithmetic circuits, OFDM systems, and communication systems for automobiles.

    Sang-Seob Song (S'78–M'81) received the B.S. degree in electrical engineering from Chonbuk National University in 1978 and the M.S. and Ph.D. degrees in electrical and computer engineering from the Korea Advanced Institute of Science and Technology, Daejeon, Korea, and the University of Manitoba, Winnipeg, MB, Canada, in 1980 and 1990, respectively.

    Since 1981, he has been with the Department of Electronic and Information Engineering, Chonbuk National University, Jeonbuk, Korea, where he is currently a Professor. His research interests are in the area of high-speed modems, which includes channel coding and modulation.

    Myoung-Seob Lim (S'85–M'90) received the B.S. degree in electronic engineering from Yonsei University, Seoul, Korea, in 1980 and the M.S. and Ph.D. degrees in electrical engineering from Yonsei University in 1982 and 1990, respectively.

    He worked at the Electronic Telecommunication Research Institute from 1985 to 1996. Since 1996, he has been with the Department of Electronic and Information Engineering, Chonbuk National University, Jeonbuk, Korea, where he is currently a Professor. His research interests are in the design of CDMA and OFDM communication systems, including performance analysis, bandwidth-efficient modulation, and synchronization, and also CAN for in-vehicle networks.