
2338 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 55, NO. 5, MAY 2007

Novel Memory Reference Reduction Methods for FFT Implementations on DSP Processors

Yuke Wang, Yiyan (Felix) Tang, Yingtao Jiang, Member, IEEE, Jin-Gyun Chung, Member, IEEE, Sang-Seob Song, Member, IEEE, and Myoung-Seob Lim, Member, IEEE

Abstract: Memory references in digital signal processors (DSPs) are expensive due to their long latencies and high power consumption. Implementing fast Fourier transform (FFT) algorithms on a DSP involves many memory references to access butterfly inputs and twiddle factors. Conventional FFT implementations require redundant memory references to load identical twiddle factors for butterflies from different stages in the FFT diagrams. In this paper, we present novel memory reference reduction methods to minimize memory references due to twiddle factors when implementing various FFT algorithms on DSPs. The proposed methods first group the butterflies with identical twiddle factors from different stages in the FFT diagrams and compute them before computing other butterflies with different twiddle factors, and then reduce the number of twiddle factor lookups by taking advantage of the properties of twiddle factors. Consequently, each twiddle factor is loaded only once and the number of memory references due to twiddle factors is minimized. We have applied the proposed methods to implement the radix-2 DIF FFT algorithm on the TI TMS320C64x DSP. Experimental results show that, compared with the conventional implementation, the proposed methods achieve an average 76.4% reduction in the number of memory references, a 53.5% saving in the memory space for twiddle factors, and an average 36.5% reduction in the number of clock cycles to compute the radix-2 DIF FFT on the DSP. Similar performance gains are reported for implementing radix-2 DIT FFT algorithms using the new methods.

Index Terms: Digital signal processor (DSP), fast Fourier transform (FFT), memory reference.

I. INTRODUCTION

IN THE field of digital signal processing, the discrete Fourier transform (DFT) plays an important role in the analysis, design, and implementation of discrete-time signal-processing algorithms and systems [1], [2]. For instance, the DFT can be used to calculate a signal's frequency response, and to serve as an intermediate step in more elaborate signal processing techniques. The DFT of a discrete signal $x(n)$ can be directly computed by

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1$$

where $x(n)$ and $X(k)$ are sequences of complex numbers, and $W_N = e^{-j2\pi/N}$.

Manuscript received September 16, 2004; revised June 16, 2006. This research was supported by the Ministry of Information and Communication (MIC), South Korea, under the Information Technology Research Center (ITRC) support program supervised by the Institute of Information Technology Assessment (IITA). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Shuvra S. Bhattacharyya.

Y. Wang is with the Department of Computer Science, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688 USA (e-mail: [email protected]).

Y. Tang was with the Department of Computer Science, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688 USA. He is now with the 3DSP Corporation, Irvine, CA 92618 USA (e-mail: [email protected]).

Y. Jiang is with the Department of Electrical and Computer Engineering, University of Nevada, Las Vegas, Las Vegas, NV 89154-4026 USA (e-mail: [email protected]).

J.-G. Chung, S.-S. Song, and M.-S. Lim are with the Division of Electrical and Information Engineering, Chonbuk National University, Jeonbuk 561-756, Korea (e-mail: [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TSP.2007.892722

The fast Fourier transforms (FFTs) are a class of efficient algorithms to compute the DFT. The FFT algorithms are based on the principle of decomposing the computation of the DFT into sequences of smaller DFTs. The first efficient FFT algorithm was discovered by Gauss in the early 19th century and rediscovered by Cooley and Tukey [3] in the 1960s. Later advances in the research of FFT algorithms include the higher radix FFT [4], the mixed-radix FFT [5], the prime-factor FFT [6], the Winograd Fourier transform algorithm (WFTA) [7], the split-radix FFT [8], [9], the recursive FFT [10], and the combination of decimation-in-time (DIT) and decimation-in-frequency (DIF) FFT algorithms [11]. Most of these algorithms illustrate the FFT with similar FFT diagrams, which evolve from the recursive nature of the FFT algorithms and are constructed from a basic butterfly structure, such as the 16-point radix-2 DIT FFT diagram shown in Fig. 1. The complex coefficient $W_N^{k} = e^{-j2\pi k/N}$ is called the twiddle factor in the butterfly structure in the FFT diagram.

FFT algorithms can be implemented on multiple platforms. For example, FFT algorithms have been implemented on application-specific integrated circuits (ASICs) as FFT processors [12]. Hardware designs of FFT processors are often tailored to fit high-speed or low-power specifications but lack flexibility. FFT algorithms have also been implemented in software on general-purpose processors as building blocks of simulation or data processing systems [13]. Software-based implementations on general-purpose processors are flexible but typically much slower than hardware implementations based on comparable hardware technologies. Digital signal processors (DSPs) are a specific type of processor optimized for digital signal processing applications such as FIR filters, IIR filters, and FFTs. Software implementations of FFT algorithms on DSPs are becoming more popular than ASIC and general-purpose processor-based implementations because they offer excellent tradeoffs among cost, performance, flexibility, and implementation complexity.

However, implementing FFT algorithms effectively on DSPs is not trivial. It has been recognized that memory references in DSPs are expensive due to their long latencies and high power consumption. For example, in the TI TMS320C64x DSP [15], the memory load operation takes five pipeline execution phases to complete, which corresponds to four delay slots in the execution time. The implementations of FFT algorithms on DSPs involve many memory references to access butterfly inputs and twiddle factors. In general, an $N$-pt radix-2 FFT diagram can be divided into $\log_2 N$ stages, each of which contains a column of $N/2$ butterflies. Conventional implementations of FFT algorithms compute butterflies in the natural order of the FFT diagram, i.e., the order of stages. The butterflies within each stage can be computed either in parallel or serially. Many butterflies with identical twiddle factors can be found in multiple stages of the FFT diagrams. For example, seven butterflies with the twiddle factor $W_{16}^{4}$ can be found in Stage 2 to Stage 4 of the 16-pt radix-2 DIT FFT diagram in Fig. 1(b). Hence, memory reference methods that load identical twiddle factors only once would reduce both the total memory reference time and the power consumption.

1053-587X/$25.00 © 2007 IEEE

Fig. 1. 16-pt radix-2 DIT FFT diagram. (a) Basic radix-2 DIT FFT butterfly. (b) Complete 16-pt radix-2 DIT FFT diagram.

In this paper, we propose novel memory reference reduction methods to minimize the memory references due to twiddle factors in FFT implementations on DSPs. The proposed methods first group the butterflies with identical twiddle factors from different stages of the FFT diagram and compute them before computing other butterflies with different twiddle factors. Hence, each twiddle factor is loaded only once and the redundant memory references for identical twiddle factors are removed. The memory reference reduction methods further take advantage of the properties of twiddle factors to reduce the number of twiddle factor lookups, so that even more butterflies can be computed by loading one twiddle factor from memory. We have applied the memory reference reduction methods to implement the radix-2 DIF and DIT FFT algorithms on the TI TMS320C64x DSP. Experimental results show that the number of memory references and the amount of memory space for twiddle factors are greatly reduced, and that the number of clock cycles to compute the radix-2 DIF FFT algorithm is also reduced. Our methods can be applied to other kinds of FFT algorithms as well.

In the following, Section II gives the background of the DIF/DIT FFT algorithms and an example of a conventional FFT implementation on a DSP. Section III describes how to implement radix-2 DIF/DIT FFT algorithms on a DSP with the memory reference reduction methods. Experimental results on the TI TMS320C64x DSP are shown in Section IV, and conclusions are drawn in Section V.

II. BACKGROUND

In this section, we first briefly present the basic ideas of the two most widely used FFT algorithms: the DIT FFT and the DIF FFT. We then show the implementation of the radix-2 DIF FFT from TI's DSP library [15] as a typical example of a conventional implementation of FFT algorithms on a DSP.

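The DFT of a discrete signal $x(n)$ can be directly computed as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \qquad (1)$$

where $x(n)$ and $X(k)$ are sequences of complex numbers, and $W_N = e^{-j2\pi/N}$.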
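As a point of reference for the FFT variants discussed below, the direct evaluation of (1) can be sketched in a few lines. This is a minimal illustration in Python rather than DSP code; the function name `dft` and the test signal are assumptions made for the example.

```python
import cmath

def dft(x):
    """Direct O(N^2) evaluation of X(k) = sum_n x(n) * W_N^(n*k)."""
    n_points = len(x)
    w_n = cmath.exp(-2j * cmath.pi / n_points)  # W_N = exp(-j*2*pi/N)
    return [
        sum(x[n] * w_n ** (n * k) for n in range(n_points))
        for k in range(n_points)
    ]

# A unit impulse has a flat spectrum: every X(k) equals 1.
spectrum = dft([1, 0, 0, 0])
```

Every output sample touches all $N$ inputs, which is the cost the FFT decompositions avoid.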

DIT and DIF FFT algorithms are obtained by decomposing the input sequence $x(n)$ and the output sequence $X(k)$ in (1) into successively smaller subsequences, respectively. For example, the radix-2 DIT and DIF FFT algorithms can be obtained by splitting $x(n)$ and $X(k)$ into odd and even indexed terms, respectively. The computation of the radix-2 DIT and DIF FFT algorithms can be represented by radix-2 DIT and DIF FFT diagrams, which are shown in Fig. 1(b) and Fig. 2(b), respectively.

Fig. 2. (a) Basic radix-2 DIF FFT butterfly. (b) Complete 16-pt radix-2 DIF FFT diagram.

The computation order of the butterflies in conventional FFT implementations on DSPs is based on the partitioning of the FFT diagrams. In general, the FFT diagram can be partitioned into several stages. Each stage contains a constant number of butterflies. For example, the $N$-pt radix-2 DIT/DIF FFT diagram can be partitioned into $\log_2 N$ stages, each of which contains $N/2$ butterflies. The butterflies within a stage have no data dependencies with each other but have data dependencies with butterflies in other stages. For example, the butterflies in Stage 2 of the FFT diagram in Fig. 2 have no data dependencies with each other but have data dependencies with butterflies in both Stage 1 and Stage 3. The butterflies in the same stage of the FFT diagram can be further partitioned into groups. Each group contains all butterflies sharing an identical twiddle factor within the same stage. In particular, the butterflies in Stage $s$ of the $N$-pt radix-2 DIT FFT diagram are divided into $2^{s-1}$ groups, while Stage $s$ of the $N$-pt radix-2 DIF FFT diagram contains $N/2^{s}$ groups. Fig. 3 illustrates the partitioning of the 16-pt radix-2 DIT and DIF FFT diagrams.

Based on the partitioning of the radix-2 DIT and DIF FFT diagrams, the butterflies can be computed following the index order of the stages and groups. The butterflies in the same group are computed from top to bottom. Butterflies with identical twiddle factors are computed in multiple stages of the FFT diagrams in Fig. 3. For example, seven butterflies with the twiddle factor $W_{16}^{4}$ are computed in Stage 2 to Stage 4 of the 16-pt radix-2 DIT FFT diagram in Fig. 3. Hence, identical twiddle factors are accessed multiple times in conventional FFT implementations.
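The stage and group structure described above can be checked numerically. The following sketch is illustrative only and assumes the standard radix-2 layouts, in which group $j$ of a stage with $g$ groups uses the exponent $j \cdot N/(2g)$; the helper name `stage_twiddle_exponents` is hypothetical.

```python
def stage_twiddle_exponents(n_points, kind):
    """Per stage (numbered from 1), list the exponent k of W_N^k per butterfly.

    kind is "DIT" or "DIF".
    """
    stages = n_points.bit_length() - 1          # log2(N) for a power of two
    per_stage = []
    for stage in range(1, stages + 1):
        # Groups = distinct twiddle factors in this stage.
        groups = 2 ** (stage - 1) if kind == "DIT" else n_points >> stage
        # Butterflies per group; the exponent step between groups has
        # the same value, N / (2 * groups).
        per_group = n_points // (2 * groups)
        exponents = []
        for j in range(groups):
            exponents += [j * per_group] * per_group
        per_stage.append(exponents)
    return per_stage

dit = stage_twiddle_exponents(16, "DIT")
dif = stage_twiddle_exponents(16, "DIF")
groups_dit = [len(set(stage)) for stage in dit]   # [1, 2, 4, 8]
groups_dif = [len(set(stage)) for stage in dif]   # [8, 4, 2, 1]
w4_in_dit = [stage.count(4) for stage in dit]     # [0, 4, 2, 1]
```

Under these assumptions, $W_{16}^{4}$ is used by $4 + 2 + 1 = 7$ butterflies of the DIT diagram, and a stage-ordered implementation loads one twiddle factor per group, $1 + 2 + 4 + 8 = 15$ loads in total.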

Fig. 4 shows the C code taken from TI's DSP library [15], which implements the $N$-pt radix-2 DIF FFT algorithm, where the value of $N$ is given as an input to the C code.

The C code in Fig. 4 shows a three-loop structure: 1) the outermost loop, the $k$-loop, counts the stages and loops $\log_2 N$ times; 2) the second loop, the $j$-loop, counts the groups within each stage and decides which twiddle factor is to be loaded; and 3) the innermost loop, the $i$-loop, computes the butterflies within each group. The variables $k$ and $j$ indicate the stage and group number, respectively. The variables $i$ and $l$ indicate the upper and lower input indexes of the butterfly computed by the innermost loop, and $ia$ indicates the twiddle factor to be loaded. Since the conventional implementations strictly follow the natural order of the FFT diagram, identical twiddle factors are loaded multiple times when computing butterflies from different stages of the FFT diagram. For example, the C code in Fig. 4 loads a twiddle factor such as $W_{16}^{2}$ at the $j$-loop when computing butterflies in both Stage 1 and Stage 2 of the 16-pt radix-2 DIF FFT diagram.
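The three-loop, stage-ordered structure can be sketched as follows. This is not TI's library code: it is a floating-point Python illustration written for this discussion, with hypothetical names, and it only mirrors the stage / group / butterfly loop nesting while counting how often the twiddle table is read.

```python
import cmath

def dif_fft_conventional(x):
    """Stage-ordered radix-2 DIF FFT. Returns (output, twiddle_loads).

    The output is left in bit-reversed order, as is usual for in-place DIF.
    """
    n = len(x)
    x = list(x)
    table = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    loads = 0
    span = n   # block size N / 2^(s-1) in the current stage
    step = 1   # twiddle exponent step 2^(s-1) in the current stage
    while span > 1:                        # stage loop
        half = span // 2
        for j in range(half):              # group loop: one load per group
            w = table[j * step]
            loads += 1
            for i in range(j, n, span):    # butterfly loop
                l = i + half
                upper, lower = x[i], x[l]
                x[i] = upper + lower
                x[l] = (upper - lower) * w
        span = half
        step *= 2
    return x, loads

def bit_reversed(x):
    """Reorder a bit-reversed result into natural order."""
    bits = len(x).bit_length() - 1
    return [x[int(format(k, "0{}b".format(bits))[::-1], 2)]
            for k in range(len(x))]

output, loads = dif_fft_conventional([1, 0, 0, 0, 0, 0, 0, 0])
```

In this sketch the table is read once per group, so an $N$-pt transform performs $N/2 + N/4 + \cdots + 1 = N - 1$ twiddle loads even though the table holds only $N/2$ distinct values.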

III. FFT IMPLEMENTATIONS WITH THE NOVEL MEMORY REFERENCE REDUCTION METHODS

In order to remove redundant memory references due to identical twiddle factors, we propose novel memory reference reduction methods to implement FFT algorithms such that each twiddle factor is loaded only once, by grouping butterflies with identical twiddle factors together. Furthermore, the proposed methods minimize the number of twiddle factors needed in FFT diagrams by taking advantage of properties of the twiddle factors. The memory reference reduction methods work for implementations of many kinds of FFT algorithms. As examples, we demonstrate applications of the memory reference reduction methods to the two most popular FFT algorithms: the radix-2 DIF and DIT FFT algorithms.


Fig. 3. Partitioning of the 16-pt radix-2 DIT and DIF FFT. (a) Partitioning of 16-pt radix-2 DIT FFT diagram. (b) Partitioning of 16-pt radix-2 DIF FFT diagram.

A. Grouping of Butterflies With Identical Twiddle Factors

In this subsection, we use the radix-2 DIF FFT diagram to demonstrate how to group and compute together the butterflies with identical twiddle factors from different stages.

For the radix-2 DIF FFT algorithm, a butterfly in Stage $s$ is composed of the inputs $x(i)$ and $x(i + N/2^{s})$, and the twiddle factor $W_N^{(i \bmod N/2^{s})\,2^{s-1}}$. Fig. 5 shows the butterfly at Stage $s$ of an $N$-pt radix-2 DIF FFT diagram with the corresponding twiddle factor.

For example, in the second stage of the 16-pt radix-2 DIF FFT diagram shown in Fig. 2, the butterfly with the inputs $x(1)$ and $x(1 + 16/2^{2}) = x(5)$ uses the twiddle factor $W_{16}^{(1 \bmod 4)\cdot 2} = W_{16}^{2}$.

Fig. 4. C code of radix-2 DIF FFT from [15].

Fig. 5. Single butterfly at Stage s in radix-2 DIF FFT diagram with twiddle factor.

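Theorem 1: In Stage $s$ of the $N$-pt radix-2 DIF FFT diagram, there are $N/2^{s}$ different twiddle factors that can be represented by $W_N^{k\,2^{s-1}}$, where $k = 0, 1, \ldots, N/2^{s} - 1$. Among them, the $N/2^{s+1}$ twiddle factors of the form $W_N^{(2k+1)\,2^{s-1}}$, for $k = 0, 1, \ldots, N/2^{s+1} - 1$, will not show up in Stage $s+1$ or any later stage.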

The butterflies within a stage with identical twiddle factors can be grouped and computed in any order without destroying the data dependencies in the original radix-2 DIF FFT diagram. For example, the butterflies with twiddle factors $W_{16}^{1}$, $W_{16}^{3}$, $W_{16}^{5}$, and $W_{16}^{7}$ are only found in Stage 1 of the 16-pt radix-2 DIF FFT diagram. These butterflies can be grouped and computed in any order without affecting the computations of other butterflies. In addition, the butterflies with twiddle factors $W_{16}^{2}$ and $W_{16}^{6}$ do not exist in any stage later than Stage 2 of the 16-pt radix-2 DIF FFT diagram. Hence, they can be grouped and computed in any order after the butterflies with twiddle factors $W_{16}^{1}$, $W_{16}^{3}$, $W_{16}^{5}$, and $W_{16}^{7}$ are computed. Similarly, the butterflies with twiddle factor $W_{16}^{4}$ can be grouped and computed in any order in the 16-pt radix-2 DIF FFT diagram after the butterflies with twiddle factors $W_{16}^{2}$ and $W_{16}^{6}$ are computed. The butterflies with the twiddle factor $W_{16}^{4}$ do not exist in stages later than Stage 3.

Following this principle, the computation of the $N$-pt radix-2 DIF FFT diagram can be done in $\log_2 N$ steps. Each step groups and computes the butterflies whose twiddle factors appear in stages up to the stage of interest and will not occur in later stages of the FFT diagram. The butterflies within a step can be computed in any order, except for the butterflies with the twiddle factor $W_N^{0}$. Butterflies with the twiddle factor $W_N^{0}$ appear in all the stages of the FFT diagram and have data dependencies between the stages.
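The retirement pattern stated in Theorem 1 can be tabulated directly. The sketch below is an illustration only: it assumes that the twiddle exponents of DIF Stage $s$ are the multiples of $2^{s-1}$ below $N/2$, and the function name `retiring_exponents` is hypothetical.

```python
def retiring_exponents(n_points):
    """Map each DIF stage s to the exponents k of W_N^k last used in stage s."""
    stages = n_points.bit_length() - 1
    result = {}
    for s in range(1, stages + 1):
        # Exponents used in stage s: multiples of 2^(s-1).
        current = {m << (s - 1) for m in range(n_points >> s)}
        # Exponents still used in stage s+1 (and therefore possibly later).
        if s < stages:
            later = {m << s for m in range(n_points >> (s + 1))}
        else:
            later = set()
        result[s] = sorted(current - later)
    return result

steps = retiring_exponents(16)
# steps == {1: [1, 3, 5, 7], 2: [2, 6], 3: [4], 4: [0]}
```

Each list is the set of twiddle factors handled in one step, so the lists together cover every exponent from 0 to $N/2 - 1$ exactly once.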




Fig. 6. Grouping butterflies with identical twiddle factors together in radix-2 DIF FFT diagram.

Reduced Memory Reference FFT algorithm: Based on Theorem 1, the $N$-pt radix-2 DIF FFT diagram can be computed in $\log_2 N$ steps as follows.

Step 1: Compute the butterflies with twiddle factors that will not occur after Stage 1 of the FFT diagram.
Compute the $N/4$ butterflies with twiddle factors $W_N^{2k+1}$, where $k = 0, 1, \ldots, N/4 - 1$, in Stage 1 of the $N$-pt radix-2 DIF FFT diagram.

Step 2: Compute the butterflies with twiddle factors that will not occur after Stage 2 of the FFT diagram.
The butterflies with twiddle factors that will not occur after Stage 2 of the $N$-pt radix-2 DIF FFT diagram include: 1) $N/4$ butterflies in the second stage of the DIF FFT diagram and 2) $N/8$ butterflies in the first stage. These butterflies have twiddle factors $W_N^{2(2k+1)}$, where $k = 0, 1, \ldots, N/8 - 1$.

Step $s$: Compute the butterflies with twiddle factors that will not occur after Stage $s$ of the FFT diagram, where $s = 1, 2, \ldots, \log_2 N - 1$.
The butterflies with twiddle factors that will not occur after Stage $s$ of the $N$-pt radix-2 DIF FFT diagram include $N/4$ butterflies in Stage $s$, $N/8$ butterflies in Stage $s-1$, and so on, down to $N/2^{s+1}$ butterflies in the first stage of the radix-2 DIF FFT diagram. These butterflies have twiddle factors $W_N^{2^{s-1}(2k+1)}$, where $k = 0, 1, \ldots, N/2^{s+1} - 1$.

Step $\log_2 N$: Compute the butterflies with twiddle factor $W_N^{0}$.
In total, $N - 1$ butterflies with twiddle factor $W_N^{0}$ in an $N$-pt radix-2 DIF FFT diagram are computed.

In this way, each twiddle factor is loaded exactly once during the computation of the $N$-pt radix-2 DIF FFT diagram. To illustrate the above $\log_2 N$ steps, we can redraw the 16-pt radix-2 DIF FFT diagram of Fig. 2 as Fig. 6, where the butterflies with identical twiddle factors are grouped together. All butterflies with the twiddle factor $W_{16}^{0}$ in Fig. 6 can be computed without multiplications in Step 4.
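The steps above can be sketched as follows. This is a floating-point Python illustration of the grouping idea under the exponent formulas used in this section, not the DSP code of Fig. 12; all names are hypothetical. Each nontrivial twiddle factor is read from the table once and applied to every butterfly that uses it, in Stage 1 through Stage $s$, before the next one is read.

```python
import cmath

def dif_fft_grouped(x):
    """Radix-2 DIF FFT computed twiddle by twiddle. Returns (output, loads).

    Output is in bit-reversed order; loads counts table reads (W_N^0 excluded).
    """
    n = len(x)
    x = list(x)
    stages = n.bit_length() - 1
    table = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    loads = 0

    def butterflies(exponent, stage, w):
        span = n >> (stage - 1)        # block size N / 2^(stage-1)
        half = span // 2               # distance between butterfly inputs
        r = exponent >> (stage - 1)    # offset i mod N/2^stage of this group
        for i in range(r, n, span):
            upper, lower = x[i], x[i + half]
            x[i] = upper + lower
            x[i + half] = (upper - lower) * w

    # Steps 1 .. log2(N)-1: twiddles W_N^(2^(s-1) * (2k+1)).
    for step in range(1, stages):
        for k in range(n >> (step + 1)):
            exponent = (2 * k + 1) << (step - 1)
            w = table[exponent]        # the only read of this twiddle factor
            loads += 1
            for stage in range(1, step + 1):
                butterflies(exponent, stage, w)

    # Final step: W_N^0 butterflies, stage by stage, no table read needed.
    for stage in range(1, stages + 1):
        butterflies(0, stage, 1)
    return x, loads

def bit_reversed(x):
    """Reorder a bit-reversed result into natural order."""
    bits = len(x).bit_length() - 1
    return [x[int(format(k, "0{}b".format(bits))[::-1], 2)]
            for k in range(len(x))]

output, loads = dif_fft_grouped([1, 0, 0, 0, 0, 0, 0, 0])
```

Because the trivial factor $W_N^{0}$ is not read from the table in this sketch, it performs $N/2 - 1$ reads (7 for $N = 16$); counting $W_N^{0}$ as a load would give $N/2$.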

B. Reduction of the Number of Necessary Lookups of Twiddle Factors

The method in Section III-A reduces the number of memory accesses for each twiddle factor in implementing the $N$-pt radix-2 DIF FFT algorithm. Furthermore, the memory references can be minimized by reducing the number of twiddle factors to be looked up, using the properties of the twiddle factors. For example, the butterflies in Step 2 of the FFT diagram in Fig. 6 are computed with twiddle factors $W_{16}^{2}$ and $W_{16}^{6}$. The twiddle factor $W_{16}^{6}$ can be replaced by $-jW_{16}^{2}$ with a simple derivation:

$$W_{16}^{6} = W_{16}^{4}\, W_{16}^{2} = e^{-j\pi/2}\, W_{16}^{2} = -j\,W_{16}^{2}.$$

Hence, only the twiddle factor $W_{16}^{2}$ is needed in Step 2. Similarly, twiddle factors $W_{16}^{5}$ and $W_{16}^{7}$ can be replaced by $-jW_{16}^{1}$ and $-jW_{16}^{3}$, respectively. Hence, only twiddle factors $W_{16}^{1}$ and $W_{16}^{3}$ are necessary in Step 1 of the FFT diagram in Fig. 6. In general, we have the following property:

$$W_N^{k + N/4} = W_N^{N/4}\, W_N^{k} = -j\,W_N^{k} \qquad (2)$$

where $k = 0, 1, \ldots, N/4 - 1$ and $j = \sqrt{-1}$.

The above property of twiddle factors can be applied to any FFT algorithm to reduce the number of twiddle factors that need to be stored in memory. More butterflies can be computed by loading one twiddle factor than by only grouping the butterflies with identical twiddle factors together.

Fig. 7. Computing two butterflies together in one stage of the radix-2 DIF FFT diagram.

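Theorem 2: Considering the two butterflies at Stage $s$ of the radix-2 DIF FFT diagram shown in Fig. 7(a), both butterflies can be computed together by loading only one twiddle factor $W_N^{(i \bmod N/2^{s})\,2^{s-1}}$, as shown in Fig. 7(b).

Proof: Based on Fig. 5, input $x(i + N/2^{s+1})$ in Stage $s$ pairs with input $x(i + N/2^{s+1} + N/2^{s})$ to form one butterfly, and the twiddle factor used in the butterfly is

$$W_N^{((i + N/2^{s+1}) \bmod N/2^{s})\,2^{s-1}}.$$

Since $0 \le i \bmod N/2^{s} < N/2^{s}$ and $(N/2^{s+1})\,2^{s-1} = N/4$, we have the following two cases for the exponent. If $i \bmod N/2^{s} < N/2^{s+1}$, the exponent in the above equation is

$$(i \bmod N/2^{s})\,2^{s-1} + N/4;$$

else, if $i \bmod N/2^{s} \ge N/2^{s+1}$, the exponent becomes

$$(i \bmod N/2^{s})\,2^{s-1} - N/4.$$

Therefore, we have

$$W_N^{((i + N/2^{s+1}) \bmod N/2^{s})\,2^{s-1}} = \mp j\, W_N^{(i \bmod N/2^{s})\,2^{s-1}}$$

with the upper sign in the first case and the lower sign in the second, which implies that

$$\mathrm{Re}\!\left[W_N^{((i + N/2^{s+1}) \bmod N/2^{s})\,2^{s-1}}\right] = \pm\,\mathrm{Im}\!\left[W_N^{(i \bmod N/2^{s})\,2^{s-1}}\right]$$

and

$$\mathrm{Im}\!\left[W_N^{((i + N/2^{s+1}) \bmod N/2^{s})\,2^{s-1}}\right] = \mp\,\mathrm{Re}\!\left[W_N^{(i \bmod N/2^{s})\,2^{s-1}}\right].$$

The twiddle factors $W_N^{(i \bmod N/2^{s})\,2^{s-1}}$ and $W_N^{((i + N/2^{s+1}) \bmod N/2^{s})\,2^{s-1}}$ are complex numbers with separate real and imaginary parts, which are stored separately in memory. Therefore, by loading $\mathrm{Re}[W_N^{(i \bmod N/2^{s})\,2^{s-1}}]$ and $\mathrm{Im}[W_N^{(i \bmod N/2^{s})\,2^{s-1}}]$, we can compute both butterflies.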
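A small sketch of the pairing follows. It is an illustration only and assumes the relation $W_N^{k+N/4} = -jW_N^{k}$, so that the second twiddle factor is obtained by swapping the loaded real and imaginary parts and negating one of them; the function name and argument layout are hypothetical.

```python
import cmath

def paired_butterflies(x, i, stage, w_re, w_im):
    """Compute, in place, the two Stage-`stage` DIF butterflies that start at
    inputs i and i + N/2^(stage+1), from one loaded twiddle (w_re, w_im).

    Assumes i mod N/2^stage < N/2^(stage+1) and stage < log2(N).
    """
    n = len(x)
    half = n >> stage          # N / 2^stage: distance between butterfly inputs
    quarter = half >> 1        # N / 2^(stage+1): offset of the second butterfly
    w = complex(w_re, w_im)                 # loaded twiddle W_N^e
    w_shifted = complex(w_im, -w_re)        # W_N^(e + N/4) = -j * W_N^e
    for start, twiddle in ((i, w), (i + quarter, w_shifted)):
        upper, lower = x[start], x[start + half]
        x[start] = upper + lower
        x[start + half] = (upper - lower) * twiddle

# Stage 1 of a 16-pt DIF FFT: the butterflies at inputs 1 and 5 use
# W_16^1 and W_16^5, and only W_16^1 is loaded here.
data = [complex(k, -k) for k in range(16)]
w1 = cmath.exp(-2j * cmath.pi * 1 / 16)
paired_butterflies(data, 1, 1, w1.real, w1.imag)
```

No extra multiplication is introduced by the swap: each butterfly still performs one complex multiplication, and only the number of table reads changes.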

After the number of necessary twiddle factors is reduced, the FFT diagram in Fig. 6, where the butterflies with identical twiddle factors are grouped together, can be further redrawn as Fig. 8. Only three twiddle factors need to be looked up in Fig. 8, compared with seven in the original 16-pt radix-2 DIF FFT diagram in Fig. 2.

The method proposed in this subsection, which reduces the number of twiddle factors to be looked up and computes two butterflies together by loading one twiddle factor, is different from the radix-4 FFT algorithm [1]. The radix-4 FFT algorithm saves complex multiplications with twiddle factors by combining two adjacent stages in the radix-2 FFT diagram. The method proposed in this subsection does not save complex multiplications in the radix-2 FFT diagram but reduces the number of memory lookups needed for the twiddle factors. Moreover, the proposed method can be applied to radix-4 FFT algorithms to reduce the number of twiddle factors that need to be looked up in the radix-4 FFT diagram. Fig. 9 shows an example of reducing the number of necessary twiddle factors in the 16-pt radix-4 DIF FFT diagram.

The twiddle factors in the conventional 16-pt radix-4 DIF FFT diagram shown in Fig. 9(c) are $W_{16}^{1}$, $W_{16}^{2}$, $W_{16}^{3}$, $W_{16}^{4}$, $W_{16}^{6}$, and $W_{16}^{9}$. By taking advantage of the properties of the twiddle factors, the twiddle factors that need to be looked up in the 16-pt radix-4 DIF FFT diagram are reduced to $W_{16}^{1}$, $W_{16}^{2}$, and $W_{16}^{3}$, as shown in Fig. 9(d).
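The radix-4 reduction can be reproduced with a short calculation. The sketch assumes the standard radix-4 DIF first-stage twiddle factors $W_N^{nq}$ with $n = 0, \ldots, N/4 - 1$ and $q = 0, \ldots, 3$, and it assumes that any exponent can be brought below $N/4$ because $W_N^{N/4} = -j$, $W_N^{N/2} = -1$, and $W_N^{3N/4} = j$ only swap or negate the real and imaginary parts. Function names are hypothetical.

```python
def radix4_first_stage_exponents(n_points):
    """Nontrivial exponents k of W_N^k after the first radix-4 DIF stage."""
    used = {(n * q) % n_points
            for n in range(n_points // 4)
            for q in range(4)}
    return sorted(used - {0})

def reduced_exponents(exponents, n_points):
    """Exponents that still need a table lookup after folding by N/4."""
    return sorted({e % (n_points // 4) for e in exponents} - {0})

conventional = radix4_first_stage_exponents(16)   # [1, 2, 3, 4, 6, 9]
reduced = reduced_exponents(conventional, 16)     # [1, 2, 3]
```

For $N = 16$ the second radix-4 stage needs no twiddle multiplication, so the first-stage set is the whole set for that diagram.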

C. Application of Memory Reference Reduction Methods on Radix-2 DIT FFT

Fig. 8. 16-pt radix-2 DIF FFT diagram after the proposed methods are applied.

Fig. 9. Reducing the number of twiddle factors needed to be looked up in the 16-pt radix-4 DIF FFT diagram. (a) Radix-4 DIF FFT butterfly. (b) Simplified representation of the radix-4 DIF FFT butterfly. (c) Conventional 16-pt radix-4 DIF FFT diagram. (d) 16-pt radix-4 DIF FFT diagram with the number of twiddle factors to be looked up reduced.

The new memory reference reduction methods can be applied to implement many existing FFT algorithms. As an example, we apply the memory reference reduction methods to implement the radix-2 DIT FFT algorithm. Due to the differences between the DIT and DIF FFT diagrams, the methods described in Sections III-A and III-B cannot be applied to the radix-2 DIT FFT diagram directly. To apply the method described in Section III-A to an $N$-pt radix-2 DIT FFT diagram, the butterflies with twiddle factor $W_N^{0}$ are grouped and computed together in Step 1, before the butterflies with other twiddle factors. In each following Step $s$, the butterflies with twiddle factors $W_N^{(2k+1)N/2^{s}}$, where $k = 0, 1, \ldots, 2^{s-2} - 1$ and $s = 2, 3, \ldots, \log_2 N$, are computed. By grouping the butterflies with identical twiddle factors together, we can redraw the 16-pt radix-2 DIT FFT diagram as Fig. 10.

Fig. 10. Grouping butterflies with identical twiddle factors together in radix-2 DIT FFT diagram.

To apply the method described in Section III-B to an $N$-pt radix-2 DIT FFT diagram, we first compute all the butterflies with twiddle factor $W_N^{0}$ in Stage 1. Then, we compute the remaining butterflies with twiddle factor $W_N^{0}$ and the butterflies with twiddle factor $W_N^{N/4}$ together. Finally, the remaining butterflies in the FFT diagram are computed following the principle of Section III-B. After the number of twiddle factors to be looked up is reduced, the FFT diagram in Fig. 10 can be further redrawn as Fig. 11. Only three twiddle factors need to be looked up in Fig. 11, compared with seven in the original 16-pt radix-2 DIT FFT diagram in Fig. 1.

Fig. 11. 16-pt radix-2 DIT FFT diagram after the proposed methods are applied.
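The DIT step order can be listed in the same spirit. The sketch below is illustrative and rests on an assumption: in a radix-2 DIT diagram the exponents of Stage $s$ are multiples of $N/2^{s}$, so a twiddle factor first appears in the stage where its exponent is an odd multiple of $N/2^{s}$. The function name is hypothetical.

```python
def dit_step_exponents(n_points):
    """List, step by step, the exponents k of W_N^k handled in each DIT step."""
    stages = n_points.bit_length() - 1
    steps = [[0]]                       # Step 1: all W_N^0 butterflies
    for s in range(2, stages + 1):
        unit = n_points >> s            # N / 2^s
        steps.append([(2 * k + 1) * unit for k in range(1 << (s - 2))])
    return steps

order = dit_step_exponents(16)
# order == [[0], [4], [2, 6], [1, 3, 5, 7]]
```

The order is the reverse of the DIF case: $W_{16}^{4}$ is introduced earliest and the odd exponents last, because DIT twiddle factors persist into later stages instead of retiring early.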

IV. PERFORMANCE EVALUATION FOR THE NOVEL MEMORY REFERENCE REDUCTION METHODS

The number of memory references due to twiddle factors in conventional implementations of $N$-pt radix-2 DIF or DIT FFT algorithms is $N - 1$, which equals the number of groups in the FFT diagram. Grouping the butterflies with identical twiddle factors together reduces the number of memory references due to twiddle factors from $N - 1$ to $N/2$. After the number of necessary twiddle factors is minimized using the properties of the twiddle factors, only $N/4$ memory references for twiddle factors are needed to implement $N$-pt radix-2 FFT algorithms.

Fig. 12. DIF FFT code 1 groups the butterflies with identical twiddle factors together only.

We have applied the memory reference reduction methods to implement the radix-2 DIF and DIT FFT on the TI TMS320C64x DSP, which is a fixed-point DSP with an enhanced very long instruction word (VLIW) architecture. The C64x DSP has eight functional units that can execute a maximum of eight operations in parallel, two register files with 32 32-bit registers each, and 32-bit internal communication bandwidth.

Four pieces of C code are compiled with the maximum compiler effort (-o3) and executed in TI Code Composer Studio (CCS) v2.1 [14], which is the software development and simulation environment for the TI TMS320C64x DSP. TI's DIF FFT code is the radix-2 DIF FFT code in Fig. 4, taken from TI's DSP library [15]. DIF FFT code 1 in Fig. 12 only groups the butterflies with identical twiddle factors together in the radix-2 DIF FFT diagram, without reducing the number of twiddle factors that need to be looked up. DIF FFT code 2 in Fig. 13 is written based on the radix-2 DIF FFT diagram in Fig. 8, where the memory reference reduction methods are applied. Besides these three codes, the radix-2 DIT FFT code with the memory reference reduction methods is shown in Fig. 14, which is based on the radix-2 DIT FFT diagram in Fig. 11. The performance figures of the four codes are compared in Table I for FFTs of different sizes, including the number of memory references due to twiddle factors, the amount of memory storage for twiddle factors, and the number of clock cycles to compute the FFT. The number of clock cycles each code takes to compute the FFTs is measured precisely using the breakpoint function in CCS.

The experimental results show that the radix-2 DIF FFT implementation that only groups the butterflies with identical twiddle factors together achieves an average 50.9% reduction in the number of memory references due to twiddle factors and an average 29.7% reduction in the number of clock cycles, compared with the conventional implementation taken from TI's library. Furthermore, when the number of twiddle factors that need to be looked up is also reduced, an average 76.4% reduction in the number of memory references due to twiddle factors, an average 53.5% saving in memory space for twiddle factors, and an average 36.5% reduction in the number of clock cycles can be achieved, compared with the conventional implementation taken from TI's library. The performance of the radix-2 DIT FFT implementation with the memory reference reduction methods is slightly better than that of the radix-2 DIF FFT implementation.

V. CONCLUSION

In this paper, we propose novel memory reference reduction methods to minimize the number of memory references due to twiddle factors in FFT implementations on DSPs. The proposed methods first group the butterflies with identical twiddle factors from different stages in the FFT diagram and compute them together, and then reduce the total number of necessary twiddle factors by taking advantage of the properties of twiddle factors. Consequently, each twiddle factor is loaded only once and the number of memory references due to twiddle factors is minimized. Experimental results show that, compared with the conventional implementation, the proposed methods achieve an average 76.4% reduction in the number of memory references, a 53.5% saving in memory space for twiddle factors, and an average 36.5% reduction in the number of clock cycles to compute the radix-2 DIF FFT on the DSP.

Fig. 13. DIF FFT code 2 with the memory reference reduction methods based on Fig. 8.

Fig. 14. DIT FFT code with the memory reference reduction methods based on Fig. 11.

TABLE I
PERFORMANCE COMPARISON OF THE IMPLEMENTATIONS

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their careful reading and valuable comments that improved the quality of this paper. A reviewer has also brought to our attention that C. M. Rader of MIT, in 1965, wrote an FFT program which used the idea in Section III-A, but he did not publish it.

REFERENCES

[1] C. S. Burrus and T. W. Parks, DFT/FFT and Convolution Algorithms and Implementation. New York: Wiley, 1985.

[2] A. V. Oppenheim and C. M. Rader, Discrete-Time Signal Processing, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 1999, 0137549202.

[3] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol. 19, pp. 297–301, 1965.

[4] G. D. Bergland, "A radix-eight fast-Fourier transform subroutine for real-valued series," IEEE Trans. Electroacoust., vol. AE-17, no. 2, pp. 138–144, Jun. 1969.

[5] R. C. Singleton, "An algorithm for computing the mixed radix fast Fourier transform," IEEE Trans. Audio Electroacoust., vol. AE-17, no. 2, pp. 93–103, Jun. 1969.

[6] D. P. Kolba and T. W. Parks, "A prime factor FFT algorithm using high-speed convolution," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-25, no. 4, pp. 281–294, Aug. 1977.

[7] S. Winograd, "On computing the discrete Fourier transform," Math. Comput., vol. 32, no. 141, pp. 175–199, Jan. 1978.

[8] P. Duhamel and H. Hollmann, "Split radix FFT algorithm," Electron. Lett., vol. 20, pp. 14–16, Jan. 5, 1984.

[9] D. Takahashi, "An extended split-radix FFT algorithm," IEEE Signal Process. Lett., vol. 8, no. 5, pp. 145–147, May 2001.

[10] A. R. Varkonyi-Koczy, "A recursive fast Fourier transform algorithm," IEEE Trans. Circuits Syst. II, vol. 42, no. 9, pp. 614–616, Sep. 1995.

[11] A. Saidi, "Decimation-in-time-frequency FFT algorithm," in Proc. ICASSP, Apr. 1994, pp. III:453–III:456.

[12] B. M. Baas, "A low-power, high-performance, 1024-point FFT processor," IEEE J. Solid-State Circuits, vol. 34, no. 3, pp. 380–387, Mar. 1999.

[13] Matlab Function Reference: FFT. Mathworks, Inc. [Online]. Available: http://www.mathworks.com/access/helpdesk/help/techdoc/ref/fft.shtml?BB=1

[14] TMS320C6000 Programmer's Guide (Rev. G), Texas Instruments, Aug. 1, 2002, SPRU198G.

[15] TMS320C64x DSP Library Programmer's Reference (Rev. B), Texas Instruments, Oct. 23, 2003, SPRU565A.

Yuke Wang received the B.Sc. degree from the University of Science and Technology of China, Hefei, China, in 1989, and the M.Sc. degree and the Ph.D. degree from the University of Saskatchewan, Saskatoon, Canada, in 1992 and 1996, respectively.

He has held faculty positions at Concordia University, Montreal, QC, Canada, and Florida Atlantic University, Boca Raton. Currently, he is an Associate Professor in the Computer Science Department, University of Texas at Dallas, Richardson. He has also held visiting assistant professor positions at the University of Minnesota, the University of Maryland, and the University of California at Berkeley. His research interests include VLSI design of circuits and systems for DSP and communication, computer-aided design, and computer architectures. He has published more than 20 papers in IEEE/ACM Transactions.

Dr. Wang served as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, PART II (2002–2003), as an Editor of the IEEE TRANSACTIONS ON VLSI SYSTEMS (2001–2002), as an Editor of Applied Signal Processing, and of a few other journals.

Yiyan (Felix) Tang received the B.Sc. degree in electrical engineering from South China University of Technology, Guangzhou, China, in 2000, and the M.Sc. degree in computer engineering and the Ph.D. degree in computer science from the University of Texas at Dallas, Richardson, in 2002 and 2005, respectively.

Since 2005, he has been with the 3DSP Corporation, Irvine, CA, where he works on the design and implementation of wireless communication systems on digital signal processors. His current research interests lie in the efficient and effective design and implementation of wireless communication and signal processing systems on digital signal processors.

Yingtao Jiang (M'01) received the B.Eng. degree in biomedical engineering and electronics from Chongqing University, Chongqing, China, the M.A.Sc. degree in electrical engineering from Concordia University, Montreal, QC, Canada, and the Ph.D. degree in computer science from the University of Texas at Dallas, Richardson, in 1993, 1997, and 2001, respectively.

He is currently an Assistant Professor in the Department of Electrical and Computer Engineering, University of Nevada, Las Vegas. His research interests include algorithms, VLSI architectures, and circuit-level techniques for the design of DSP, networking, and telecommunications systems, computer architectures, and biomedical signal processing, instrumentation, and medical informatics.

Jin-Gyun Chung (S'90–M'98) received the B.S. degree in electronic engineering from Chonbuk National University, Chonju, Korea, in 1985 and the M.S. and Ph.D. degrees in electrical engineering from the University of Minnesota, Minneapolis, in 1991 and 1994, respectively.

Since 1995, he has been with the Department of Electronic and Information Engineering, Chonbuk National University, where he is currently a Professor. His research interests are in the area of VLSI architectures and algorithms for signal processing and communication systems, which include the design of high-speed and low-power algorithms for arithmetic circuits, OFDM systems, and communication systems for automobiles.

Sang-Seob Song (S'78–M'81) received the B.S. degree in electrical engineering from Chonbuk National University in 1978 and the M.S. and Ph.D. degrees in electrical and computer engineering from the Korea Advanced Institute of Science and Technology, Daejeon, Korea, and the University of Manitoba, Winnipeg, MB, Canada, in 1980 and 1990, respectively.

Since 1981, he has been with the Department of Electronic and Information Engineering, Chonbuk National University, Jeonbuk, Korea, where he is currently a Professor. His research interests are in the area of high-speed modems, which includes channel coding and modulation.

Myoung-Seob Lim (S'85–M'90) received the B.S. degree in electronic engineering from Yonsei University, Seoul, Korea, in 1980 and the M.S. and Ph.D. degrees in electrical engineering from Yonsei University in 1982 and 1990, respectively.

He worked at the Electronic Telecommunication Research Institute from 1985 to 1996. Since 1996, he has been with the Department of Electronic and Information Engineering, Chonbuk National University, Jeonbuk, Korea, where he is currently a Professor. His research interests are in the area of design of CDMA and OFDM communication systems, which include performance analysis, bandwidth-efficient modulation, and synchronization, and also CAN for in-vehicle networks.
