8/3/2019 Noval Memory reference reduction
2338 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 55, NO. 5, MAY 2007
Novel Memory Reference Reduction Methods for FFT Implementations on DSP Processors
Yuke Wang, Yiyan (Felix) Tang, Yingtao Jiang, Member, IEEE, Jin-Gyun Chung, Member, IEEE, Sang-Seob Song, Member, IEEE, and Myoung-Seob Lim, Member, IEEE
Abstract: Memory references in digital signal processors (DSPs) are expensive due to their long latencies and high power consumption. Implementing fast Fourier transform (FFT) algorithms on a DSP involves many memory references to access butterfly inputs and twiddle factors. Conventional FFT implementations require redundant memory references to load identical twiddle factors for butterflies from different stages in the FFT diagrams. In this paper, we present novel memory reference reduction methods to minimize memory references due to twiddle factors when implementing various FFT algorithms on a DSP. The proposed methods first group the butterflies with identical twiddle factors from different stages in the FFT diagrams and compute them before computing other butterflies with different twiddle factors, and then reduce the number of twiddle factor lookups by taking advantage of the properties of twiddle factors. Consequently, each twiddle factor is loaded only once and the number of memory references due to twiddle factors can be minimized. We have applied the proposed methods to implement the radix-2 DIF FFT algorithm on the TI TMS320C64x DSP. Experimental results show that the proposed methods achieve an average 76.4% reduction in the number of memory references, a 53.5% saving in the memory space for twiddle factors, and an average 36.5% reduction in the number of clock cycles needed to compute the radix-2 DIF FFT on the DSP, compared with the conventional implementation. Similar performance gains are reported for implementing radix-2 DIT FFT algorithms using the new methods.
Index Terms: Digital signal processor (DSP), fast Fourier transform (FFT), memory reference.
I. INTRODUCTION
IN THE field of digital signal processing, the discrete Fourier transform (DFT) plays an important role in the analysis, design, and implementation of discrete-time signal-processing algorithms and systems [1], [2]. For instance, the DFT can be used to calculate a signal's frequency response and to serve as an intermediate step in more elaborate signal processing techniques. The DFT of a discrete signal $x(n)$ can be directly computed by

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1$$

where $x(n)$ and $X(k)$ are sequences of complex numbers, and $W_N = e^{-j2\pi/N}$.

Manuscript received September 16, 2004; revised June 16, 2006. This research was supported by the Ministry of Information and Communication (MIC), South Korea, under the Information Technology Research Center (ITRC) support program supervised by the Institute of Information Technology Assessment (IITA). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Shuvra S. Bhattacharyya.
Y. Wang is with the Department of Computer Science, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688 USA.
Y. Tang was with the Department of Computer Science, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688 USA. He is now with the 3DSP Corporation, Irvine, CA 92618 USA.
Y. Jiang is with the Department of Electrical and Computer Engineering, University of Nevada, Las Vegas, Las Vegas, NV 89154-4026 USA.
J.-G. Chung, S.-S. Song, and M.-S. Lim are with the Division of Electrical and Information Engineering, Chonbuk National University, Jeonbuk 561-756, Korea.
Digital Object Identifier 10.1109/TSP.2007.892722
The fast Fourier transforms (FFTs) are a class of efficient algorithms to compute the DFT. The FFT algorithms are based on the principle of decomposing the computation of the DFT into sequences of smaller DFTs. The first efficient FFT algorithm was discovered by Gauss in the early 19th century and rediscovered by Cooley and Tukey [3] in the 1960s. Later advances in the research of FFT algorithms include the higher radix FFT [4], the mixed-radix FFT [5], the prime-factor FFT [6], the Winograd (WFTA) FFT [7], the split-radix FFT [8], [9], the recursive FFT [10], and the combination of decimation-in-time (DIT) and decimation-in-frequency (DIF) FFT algorithms [11]. Most of these algorithms illustrate the FFT with similar FFT diagrams, which evolve from the recursive nature of the FFT algorithms and are constructed from a basic butterfly structure, such as the 16-point radix-2 DIT FFT diagram shown in Fig. 1. The complex coefficient $W_N^{k}$ in the butterfly structure of the FFT diagram is called the twiddle factor.
FFT algorithms can be implemented on multiple platforms. For example, FFT algorithms have been implemented on application-specific integrated circuits (ASICs) as FFT processors [12]. Hardware designs of FFT processors are often tailored to fit high-speed or low-power specifications but lack flexibility. FFT algorithms have also been implemented in software on general-purpose processors as building blocks of simulation or data processing systems [13]. Software-based implementations on general-purpose processors are flexible but typically much slower than hardware implementations based on comparable hardware technologies. Digital signal processors (DSPs) are a specific type of processor optimized for digital signal processing applications such as FIR filters, IIR filters, and FFTs. Software implementations of FFT algorithms on DSPs are becoming more popular than ASIC and general-purpose processor-based implementations because they offer excellent tradeoffs among cost, performance, flexibility, and implementation complexity.
However, implementing FFT algorithms effectively on DSPs is not trivial. It has been recognized that memory references in DSPs are expensive due to their long latencies and high power consumption. For example, in the TI TMS320C64x DSP [15], the memory load operation takes five pipeline execution phases to complete, which corresponds to four delay slots in the execution time. Implementations of FFT algorithms on DSPs involve many memory references to access butterfly inputs and twiddle factors. In general, an $N$-pt radix-2 FFT diagram can be divided into $\log_2 N$ stages, each of which contains a column of $N/2$ butterflies. Conventional implementations of FFT algorithms compute butterflies in the natural order of the FFT diagram, i.e., the order of stages. The butterflies within each stage can be computed either in parallel or serially. Many butterflies with identical twiddle factors can be found in multiple stages of the FFT diagrams. For example, seven butterflies with the twiddle factor $W_{16}^{4}$ can be found in Stage 2 to Stage 4 of the 16-pt radix-2 DIT FFT diagram in Fig. 1(b). Hence, methods that load each distinct twiddle factor only once would reduce the total memory reference time and the power consumption as well.

Fig. 1. 16-pt radix-2 DIT FFT diagram. (a) Basic radix-2 DIT FFT butterfly. (b) Complete 16-pt radix-2 DIT FFT diagram.

1053-587X/$25.00 © 2007 IEEE
In this paper, we propose novel memory reference reduction methods to minimize the memory references due to twiddle factors in FFT implementations on DSPs. The proposed methods first group the butterflies with identical twiddle factors from different stages of the FFT diagram and compute them before computing other butterflies with different twiddle factors. Hence, each twiddle factor is loaded only once and the redundant memory references for identical twiddle factors are removed. The memory reference reduction methods further take advantage of the properties of twiddle factors to reduce the number of twiddle factor lookups so that even more butterflies can be computed by loading one twiddle factor from memory. We have applied the memory reference reduction methods to implement the radix-2 DIF and DIT FFT algorithms on the TI TMS320C64x DSP. Experimental results show that the number of memory references and the amount of memory space for twiddle factors are greatly reduced, and that the number of clock cycles needed to compute the radix-2 DIF FFT algorithm is also reduced. Our methods can be applied to other kinds of FFT algorithms as well.
In the following, Section II gives the background of the DIF/DIT FFT algorithms and an example of a conventional FFT implementation on a DSP. Section III describes how to implement radix-2 DIF/DIT FFT algorithms on a DSP with the memory reference reduction methods. Experimental results on the TI TMS320C64x DSP are shown in Section IV, and conclusions are drawn in Section V.
II. BACKGROUND
In this section, we first briefly present the basic ideas of the two most widely used FFT algorithms: the DIT FFT and the DIF FFT. We then show the implementation of the radix-2 DIF FFT from TI's DSP library [15] as a typical example of a conventional implementation of FFT algorithms on a DSP.
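The DFT of a discrete signal $x(n)$ can be directly computed as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1 \qquad (1)$$

where $x(n)$ and $X(k)$ are sequences of complex numbers, and $W_N = e^{-j2\pi/N}$.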
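As a point of reference for the fast algorithms that follow, the direct evaluation of (1) can be sketched in a few lines. This is only an illustrative sketch: the helper names `twiddle` and `dft_direct` are assumptions introduced here, not names from the paper or from TI's library.

```python
import cmath

def twiddle(N, e):
    """Return W_N^e = exp(-j*2*pi*e/N) (hypothetical helper name)."""
    return cmath.exp(-2j * cmath.pi * e / N)

def dft_direct(x):
    """Evaluate (1) term by term; this costs N*N complex multiplications."""
    N = len(x)
    return [sum(x[n] * twiddle(N, n * k) for n in range(N)) for k in range(N)]

# A unit impulse has a flat spectrum: every X(k) equals 1.
X = dft_direct([1, 0, 0, 0])
```

Every term of every output needs a twiddle factor, which is why the number and order of twiddle factor accesses matter once the computation is reorganized into butterflies.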
DIT and DIF FFT algorithms are obtained by decomposing the input sequence $x(n)$ and the output sequence $X(k)$ in (1), respectively, into successively smaller subsequences. For example, the radix-2 DIT and DIF FFT algorithms can be obtained by splitting $x(n)$ and $X(k)$, respectively, into odd- and even-indexed terms. The computation of the radix-2 DIT and DIF FFT algorithms can be represented by the radix-2 DIT and DIF FFT diagrams, which are shown in Fig. 1(b) and Fig. 2(b), respectively.

Fig. 2. (a) Basic radix-2 DIF FFT butterfly. (b) Complete 16-pt radix-2 DIF FFT diagram.
The computation order of the butterflies in conventional FFT implementations on DSPs is based on the partitioning of the FFT diagrams. In general, the FFT diagram can be partitioned into several stages, each containing a constant number of butterflies. For example, the $N$-pt radix-2 DIT/DIF FFT diagram can be partitioned into $\log_2 N$ stages, each of which contains $N/2$ butterflies. The butterflies within a stage have no data dependencies on each other but have data dependencies on butterflies in other stages. For example, the butterflies in Stage 2 of the FFT diagram in Fig. 2 have no data dependencies on each other but have data dependencies on butterflies in both Stage 1 and Stage 3. The butterflies in the same stage of the FFT diagram can be further partitioned into groups. Each group contains all butterflies sharing an identical twiddle factor within the same stage. In particular, with the stages numbered $s = 0, 1, \ldots, \log_2 N - 1$, the butterflies in Stage $s$ of the $N$-pt radix-2 DIT FFT diagram are divided into $2^{s}$ groups, while Stage $s$ of the $N$-pt radix-2 DIF FFT diagram contains $N/2^{s+1}$ groups. Fig. 3 illustrates the partitioning of the 16-pt radix-2 DIT and DIF FFT diagrams.
Based on the partitioning of the radix-2 DIT and DIF FFT diagrams, the butterflies can be computed following the index order of the stages and groups. The butterflies in the same group are computed from top to bottom. Butterflies with identical twiddle factors are computed in multiple stages of the FFT diagrams in Fig. 3. For example, seven butterflies with the twiddle factor $W_{16}^{4}$ are computed in Stage 1 to Stage 3 of the 16-pt radix-2 DIT FFT diagram in Fig. 3. Hence, identical twiddle factors are accessed multiple times in conventional FFT implementations.
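The repetition can be made concrete with a small counting sketch. It assumes the standard radix-2 DIT twiddle assignment and numbers the stages from 1, as in Fig. 1(b); the function name `dit_twiddle_usage` is hypothetical. Under those assumptions, $W_{16}^{4}$ is one twiddle factor that is used by exactly seven butterflies spread over three stages.

```python
def dit_twiddle_usage(N):
    """Map twiddle exponent e (of W_N^e) to {stage: number of butterflies}."""
    stages = N.bit_length() - 1
    usage = {}
    for t in range(1, stages + 1):      # stages numbered from 1
        half = 1 << (t - 1)             # butterflies per block in Stage t
        blocks = N >> t                 # number of blocks in Stage t
        for j in range(half):
            e = j * (N >> t)            # exponent used at position j of a block
            usage.setdefault(e, {})[t] = blocks
    return usage

usage = dit_twiddle_usage(16)
print(usage[4])                 # {2: 4, 3: 2, 4: 1}
print(sum(usage[4].values()))   # 7
```

With stages numbered from 0 instead, the same seven butterflies fall in Stages 1 to 3.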
Fig. 4 shows the C code taken from TI's DSP library [15], which implements the $N$-pt radix-2 DIF FFT algorithm, where the value of $N$ is given as an input to the C code.
The C code in Fig. 4 has a three-loop structure: 1) the outermost loop counts the stages and runs $\log_2 N$ times; 2) the middle loop counts the groups within each stage and decides which twiddle factor is to be loaded; and 3) the innermost loop computes the butterflies within each group. The loop variables of the two outer loops indicate the stage and group number, respectively. Two index variables indicate the upper and lower input indexes of the butterfly computed by the innermost loop, and a further index indicates the twiddle factor to be loaded. Since conventional implementations strictly follow the natural order of the FFT diagram, identical twiddle factors are loaded multiple times when computing butterflies from different stages of the FFT diagram. For example, the C code in Fig. 4 loads the twiddle factor $W_{16}^{4}$ in the middle loop when computing butterflies in both Stage 1 and Stage 2 of the 16-pt radix-2 DIF FFT diagram.
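The three-loop structure described above can be sketched as follows. This is not TI's code from Fig. 4: it is a floating-point Python sketch with hypothetical names (`dif_fft_conventional`, `bit_reverse`), written only to show where the twiddle factor load sits and how often it executes. The output is left in bit-reversed order, as is usual for an in-place DIF FFT.

```python
import cmath

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of index i."""
    return int(format(i, "0{}b".format(bits))[::-1], 2)

def dif_fft_conventional(x):
    """Stage-by-stage radix-2 DIF FFT.
    Returns (output in bit-reversed order, number of twiddle loads)."""
    x = list(x)
    N = len(x)
    loads = 0
    span = N        # size of one block in the current stage
    step = 1        # twiddle exponent stride in the current stage
    while span > 1:                         # stage loop: runs log2(N) times
        half = span // 2
        for j in range(half):               # group loop: one twiddle per group
            w = cmath.exp(-2j * cmath.pi * (j * step) / N)
            loads += 1                      # stands in for one table load
            for i in range(j, N, span):     # butterfly loop within the group
                l = i + half
                a, b = x[i], x[l]
                x[i] = a + b
                x[l] = (a - b) * w
        span = half
        step *= 2
    return x, loads
```

For $N = 16$ the load counter ends at $8 + 4 + 2 + 1 = 15$, one load per group, even though only eight distinct twiddle factors occur in the diagram.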
III. FFT IMPLEMENTATIONS WITH THE NOVEL MEMORY REFERENCE REDUCTION METHODS
In order to remove redundant memory references due to identical twiddle factors, we propose novel memory reference reduction methods to implement FFT algorithms such that each twiddle factor is loaded only once, by grouping butterflies with identical twiddle factors together. Furthermore, the proposed methods minimize the number of twiddle factors needed in FFT diagrams by taking advantage of the properties of the twiddle factors. The memory reference reduction methods work for implementations of many kinds of FFT algorithms. As examples, we demonstrate applications of the memory reference reduction methods to the two most popular FFT algorithms: the radix-2 DIF and DIT FFT algorithms.
Fig. 3. Partitioning of the 16-pt radix-2 DIT and DIF FFT. (a) Partitioning of 16-pt radix-2 DIT FFT diagram. (b) Partitioning of 16-pt radix-2 DIF FFT diagram.
A. Grouping of Butterflies With Identical Twiddle Factors
In this subsection, we use the radix-2 DIF FFT diagram to demonstrate how to group and compute together the butterflies with identical twiddle factors from different stages.
For the radix-2 DIF FFT algorithm, a butterfly in Stage $s$ is composed of the inputs $x(i)$ and $x(i + N/2^{s})$ and the twiddle factor $W_N^{(i \bmod N/2^{s})\,2^{s-1}}$. Fig. 5 shows the butterfly at Stage $s$ of an $N$-pt radix-2 DIF FFT diagram with the corresponding twiddle factor.
For example, in the second stage of the 16-pt radix-2 DIF FFT diagram shown in Fig. 2, the butterfly with the inputs $x(1)$ and $x(5)$ uses the twiddle factor $W_{16}^{(1 \bmod 4)\cdot 2} = W_{16}^{2}$.

Fig. 4. C code of radix-2 DIF FFT from [15].
Fig. 5. Single butterfly at Stage $s$ in radix-2 DIF FFT diagram with twiddle factor.
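Theorem 1: In Stage $s$ of the $N$-pt radix-2 DIF FFT diagram, there are $N/2^{s}$ different twiddle factors that can be represented by $W_N^{2^{s-1}k}$, where $k = 0, 1, \ldots, N/2^{s} - 1$. Among them, the twiddle factors of the form $W_N^{2^{s-1}(2m+1)}$, for $m = 0, 1, \ldots, N/2^{s+1} - 1$, will not show up in Stage $s+1$ or any later stage.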
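Theorem 1 can be checked by brute force for a small diagram. The sketch below assumes stages numbered from 1 and the usual DIF exponent pattern, in which Stage $s$ uses the exponents $2^{s-1}k$; the function names are hypothetical.

```python
def dif_stage_exponents(N, s):
    """Exponents e of the twiddle factors W_N^e used in Stage s (numbered from 1)."""
    return {(2 ** (s - 1)) * k for k in range(N >> s)}

def last_stage_of(N, e):
    """Last DIF stage in which W_N^e appears."""
    stages = N.bit_length() - 1
    return max(s for s in range(1, stages + 1) if e in dif_stage_exponents(N, s))

N = 16
for s in range(1, 5):
    retiring = sorted(e for e in dif_stage_exponents(N, s) if last_stage_of(N, e) == s)
    print(s, retiring)
# 1 [1, 3, 5, 7]
# 2 [2, 6]
# 3 [4]
# 4 [0]
```

The exponents that retire after Stage $s$ are exactly the odd multiples of $2^{s-1}$, which is the set of twiddle factors the theorem singles out.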
The butterflies within a stage with identical twiddle factors can be grouped and computed in any order without destroying the data dependencies in the original radix-2 DIF FFT diagram. For example, the butterflies with twiddle factors $W_{16}^{1}$, $W_{16}^{3}$, $W_{16}^{5}$, and $W_{16}^{7}$ are only found in Stage 1 of the 16-pt radix-2 DIF FFT diagram. These butterflies can be grouped and computed in any order without affecting the computations of other butterflies. In addition, the butterflies with twiddle factors $W_{16}^{2}$ and $W_{16}^{6}$ do not exist in any stage later than Stage 2 of the 16-pt radix-2 DIF FFT diagram. Hence, they can be grouped and computed in any order after the butterflies with twiddle factors $W_{16}^{1}$, $W_{16}^{3}$, $W_{16}^{5}$, and $W_{16}^{7}$ are computed. Similarly, the butterflies with twiddle factor $W_{16}^{4}$ can be grouped and computed in any order in the 16-pt radix-2 DIF FFT diagram after the butterflies with twiddle factors $W_{16}^{2}$ and $W_{16}^{6}$ are computed. The butterflies with the twiddle factor $W_{16}^{4}$ do not exist in stages later than Stage 3.
Following this principle, the computation of the $N$-pt radix-2 DIF FFT diagram can be done in $\log_2 N$ steps. Each step groups and computes the butterflies whose twiddle factors appear in the stages up to the stage of interest and will not occur in later stages of the FFT diagram. The butterflies within a step can be computed in any order, except for the butterflies with the twiddle factor $W_N^{0}$. Butterflies with the twiddle factor $W_N^{0}$ appear in all the stages of the FFT diagram and have data dependencies between the stages.
Fig. 6. Grouping butterflies with identical twiddle factors together in radix-2 DIF FFT diagram.
Reduced Memory Reference FFT algorithm: Based on Theorem 1, the $N$-pt radix-2 DIF FFT diagram can be computed in $\log_2 N$ steps as follows.
Step 1: Compute the butterflies with twiddle factors that will not occur after Stage 1 of the FFT diagram. These are the $N/4$ butterflies with twiddle factors $W_N^{2m+1}$, where $m = 0, 1, \ldots, N/4 - 1$, in Stage 1 of the $N$-pt radix-2 DIF FFT diagram.
Step 2: Compute the butterflies with twiddle factors that will not occur after Stage 2 of the FFT diagram. These butterflies include: 1) $N/4$ butterflies in the second stage of the DIF FFT diagram and 2) $N/8$ butterflies in the first stage. These butterflies have twiddle factors $W_N^{2(2m+1)}$, where $m = 0, 1, \ldots, N/8 - 1$.
Step $s$: Compute the butterflies with twiddle factors that will not occur after Stage $s$ of the FFT diagram, where $1 \le s \le \log_2 N - 1$. These butterflies include $N/4$ butterflies in Stage $s$, $N/8$ butterflies in Stage $s-1$, and so on down to $N/2^{s+1}$ butterflies in the first stage of the radix-2 DIF FFT diagram. These butterflies have twiddle factors $W_N^{2^{s-1}(2m+1)}$, where $m = 0, 1, \ldots, N/2^{s+1} - 1$.
Step $\log_2 N$: Compute the butterflies with twiddle factor $W_N^{0}$. In total, $N - 1$ butterflies with twiddle factor $W_N^{0}$ in an $N$-pt radix-2 DIF FFT diagram are computed.
In this way, each twiddle factor is loaded exactly once during the computation of the $N$-pt radix-2 DIF FFT diagram. To illustrate the above $\log_2 N$ steps, we can redraw the 16-pt radix-2 DIF FFT diagram of Fig. 2 as Fig. 6, where the butterflies with identical twiddle factors are grouped together. All butterflies with the twiddle factor $W_N^{0} = 1$ in Fig. 6 can be computed without multiplications in Step $\log_2 N$.
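The steps above can be sketched as a working in-place FFT. This is a floating-point Python illustration under stated assumptions, not the code of Fig. 12: stages are numbered from 1, the output is left in bit-reversed order, the `cmath.exp` call stands in for a table load, and the names `dif_fft_grouped` and `butterflies` are hypothetical.

```python
import cmath

def dif_fft_grouped(x):
    """Radix-2 DIF FFT that visits butterflies twiddle factor by twiddle factor.
    Returns (output in bit-reversed order, number of twiddle loads)."""
    x = list(x)
    N = len(x)
    stages = N.bit_length() - 1
    loads = 0

    def butterflies(e, t, w):
        # Compute every Stage-t butterfly whose twiddle factor is W_N^e.
        half = N >> t
        j = e >> (t - 1)                    # position inside each block
        for i in range(j, N, 2 * half):
            a, b = x[i], x[i + half]
            x[i], x[i + half] = a + b, (a - b) * w

    for s in range(1, stages):              # Steps 1 .. log2(N) - 1
        for m in range(N >> (s + 1)):
            e = (1 << (s - 1)) * (2 * m + 1)
            w = cmath.exp(-2j * cmath.pi * e / N)   # single load of W_N^e
            loads += 1
            for t in range(1, s + 1):       # reuse it in Stages 1 .. s
                butterflies(e, t, w)
    for t in range(1, stages + 1):          # last step: W_N^0 = 1, in stage order
        butterflies(0, t, 1)
    return x, loads
```

For $N = 16$ the counter ends at 7, because $W_{16}^{0} = 1$ is applied without a load, against 15 loads for the stage-by-stage order. The last step keeps the stage order because the $W_N^{0}$ butterflies depend on each other across stages.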
B. Reduction of the Number of Necessary Lookups of Twiddle Factors
The method in Section III-A reduces the number of memory accesses for each twiddle factor in implementing the $N$-pt radix-2 DIF FFT algorithm. The memory references can be reduced further by using the properties of the twiddle factors to cut the number of twiddle factors that must be looked up. For example, the butterflies in Step 2 of the FFT diagram in Fig. 6 are computed with twiddle factors $W_{16}^{2}$ and $W_{16}^{6}$. The twiddle factor $W_{16}^{6}$ can be replaced by $W_{16}^{2}$ with a simple derivation:

$$W_{16}^{6} = W_{16}^{4}\,W_{16}^{2} = -j\,W_{16}^{2}$$

Hence, only the twiddle factor $W_{16}^{2}$ is needed in Step 2. Similarly, twiddle factors $W_{16}^{5}$ and $W_{16}^{7}$ can be replaced by $-j\,W_{16}^{1}$ and $-j\,W_{16}^{3}$, respectively. Hence, only twiddle factors $W_{16}^{1}$ and $W_{16}^{3}$ are necessary in Step 1 of the FFT diagram in Fig. 6. In general, we have the following property:

$$W_N^{k+N/4} = W_N^{N/4}\,W_N^{k} = -j\,W_N^{k} \qquad (2)$$

where $W_N^{N/4} = e^{-j\pi/2} = -j$, $k = 0, 1, \ldots, N/4 - 1$, and $N$ is a multiple of 4.
The above property of twiddle factors can be applied to any FFT algorithm to reduce the number of twiddle factors that need to be stored in memory. More butterflies can then be computed by loading one twiddle factor than by only grouping the butterflies with identical twiddle factors together.

Fig. 7. Computing two butterflies together in one stage of the radix-2 DIF FFT diagram.
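One identity of this kind that holds for every $N$ divisible by 4 is $W_N^{k+N/4} = -j\,W_N^{k}$. Assuming this quarter-period relation is the property being used, the sketch below checks it numerically and shows that the derived factor needs only a swap and a sign change of the stored real and imaginary parts, with no multiplication. The helper name `twiddle` is hypothetical.

```python
import cmath

def twiddle(N, e):
    """Return W_N^e = exp(-j*2*pi*e/N) (hypothetical helper name)."""
    return cmath.exp(-2j * cmath.pi * e / N)

N = 16
for k in range(N // 4):
    w = twiddle(N, k)
    # -j * (a + jb) = b - ja: swap the stored parts and negate one of them.
    derived = complex(w.imag, -w.real)
    assert abs(derived - twiddle(N, k + N // 4)) < 1e-12
```

For $N = 16$ this covers the pairs $(W^{1}, W^{5})$, $(W^{2}, W^{6})$, and $(W^{3}, W^{7})$, together with $W^{0} = 1$ and $W^{4} = -j$, which need no table entry at all.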
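Theorem 2: Considering the two butterflies at Stage $s$ of the radix-2 DIF FFT diagram shown in Fig. 7(a), both butterflies can be computed together by loading only one twiddle factor $W_N^{(i \bmod N/2^{s})\,2^{s-1}}$, as shown in Fig. 7(b).
Proof: Based on Fig. 5, input $x(i)$ in Stage $s$ pairs with input $x(i + N/2^{s})$ to form one butterfly, and the twiddle factor used in the butterfly is

$$W = W_N^{(i \bmod N/2^{s})\,2^{s-1}}.$$

The second butterfly takes the input $x(i + N/2^{s+1})$, and its twiddle factor is

$$W' = W_N^{((i + N/2^{s+1}) \bmod N/2^{s})\,2^{s-1}}.$$

Since $(N/2^{s+1})\,2^{s-1} = N/4$ and $W_N^{\pm N/4} = \mp j$, we have the following: if $(i \bmod N/2^{s}) < N/2^{s+1}$, the result of the above equation is

$$W' = W_N^{(i \bmod N/2^{s})\,2^{s-1} + N/4} = -j\,W;$$

else if $(i \bmod N/2^{s}) \ge N/2^{s+1}$, the result becomes

$$W' = W_N^{(i \bmod N/2^{s})\,2^{s-1} - N/4} = j\,W.$$

Therefore, we have $W' = \mp j\,W$, which implies that

$$\mathrm{Re}[W'] = \pm\,\mathrm{Im}[W] \quad \text{and} \quad \mathrm{Im}[W'] = \mp\,\mathrm{Re}[W].$$

The twiddle factors $W$ and $W'$ are complex numbers with separate real and imaginary parts, which are stored separately in memory. Therefore, by loading $\mathrm{Re}[W]$ and $\mathrm{Im}[W]$, we can compute both butterflies.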
After the number of necessary twiddle factors is reduced, the FFT diagram in Fig. 6, where the butterflies with identical twiddle factors are grouped together, can be further redrawn as in Fig. 8. Only three twiddle factors need to be looked up in Fig. 8, compared with seven in the original 16-pt radix-2 DIF FFT diagram in Fig. 2.
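Combining the grouping of Section III-A with the lookup reduction can be sketched as below. The sketch assumes the quarter-period relation $W_N^{k+N/4} = -j\,W_N^{k}$, treats $W_N^{0} = 1$ and $W_N^{N/4} = -j$ as constants that are never fetched, and uses hypothetical names; it is a floating-point illustration and not the code of Fig. 13.

```python
import cmath

def dif_fft_reduced(x):
    """Grouped radix-2 DIF FFT that looks up only W_N^e for 0 < e < N/4.
    Returns (output in bit-reversed order, number of twiddle lookups)."""
    x = list(x)
    N = len(x)
    stages = N.bit_length() - 1
    lookups = 0

    def butterflies(e, t, w):
        # Compute every Stage-t butterfly whose twiddle factor is W_N^e.
        half = N >> t
        j = e >> (t - 1)
        for i in range(j, N, 2 * half):
            a, b = x[i], x[i + half]
            x[i], x[i + half] = a + b, (a - b) * w

    for s in range(1, stages - 1):          # steps whose exponents pair up
        for m in range(N >> (s + 2)):
            e = (1 << (s - 1)) * (2 * m + 1)            # e < N/4
            w = cmath.exp(-2j * cmath.pi * e / N)       # the only lookup
            lookups += 1
            w_shift = complex(w.imag, -w.real)          # W_N^(e+N/4) = -j*W_N^e
            for t in range(1, s + 1):
                butterflies(e, t, w)
                butterflies(e + N // 4, t, w_shift)
    for t in range(1, stages):              # W_N^(N/4) = -j: no lookup needed
        butterflies(N // 4, t, -1j)
    for t in range(1, stages + 1):          # W_N^0 = 1: no lookup, stage order
        butterflies(0, t, 1)
    return x, lookups
```

For $N = 16$ the lookup counter ends at 3 (for $W_{16}^{1}$, $W_{16}^{3}$, and $W_{16}^{2}$), which matches the three lookups stated for Fig. 8.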
The method proposed in this subsection, which reduces the number of twiddle factors to be looked up and computes two butterflies together by loading one twiddle factor, is different from the radix-4 FFT algorithm [1]. The radix-4 FFT algorithm saves complex multiplications with twiddle factors by combining two adjacent stages in the radix-2 FFT diagram. The method proposed in this subsection does not save complex multiplications in the radix-2 FFT diagram but reduces the number of memory lookups needed for the twiddle factors. Moreover, the proposed method can be applied to radix-4 FFT algorithms to reduce the number of twiddle factors that need to be looked up in the radix-4 FFT diagram. Fig. 9 shows an example of reducing the number of necessary twiddle factors in the 16-pt radix-4 DIF FFT diagram.
The twiddle factors in the conventional 16-pt radix-4 DIF FFT diagram shown in Fig. 9(c) are $W_{16}^{1}$, $W_{16}^{2}$, $W_{16}^{3}$, $W_{16}^{4}$, $W_{16}^{6}$, and $W_{16}^{9}$. By taking advantage of the properties of the twiddle factors, the twiddle factors that need to be looked up in the 16-pt radix-4 DIF FFT diagram are reduced to $W_{16}^{1}$, $W_{16}^{2}$, and $W_{16}^{3}$, as shown in Fig. 9(d).
C. Application of Memory Reference Reduction Methods on Radix-2 DIT FFT
The new memory reference reduction methods can be applied to implement many existing FFT algorithms. As an example, we apply them to implement the radix-2 DIT FFT algorithm. Due to the difference between the DIT and DIF FFT diagrams, the methods described in Sections III-A and III-B cannot be applied to the radix-2 DIT FFT diagram directly. To apply the method described in Section III-A to an $N$-pt radix-2 DIT FFT diagram, the butterflies with twiddle factor $W_N^{0}$ are grouped and computed together in Step 1, before the butterflies with other twiddle factors. In each following Step $s$ ($2 \le s \le \log_2 N$), the butterflies with twiddle factors $W_N^{(2m+1)N/2^{s}}$, where $m = 0, 1, \ldots, 2^{s-2} - 1$, are computed. By grouping the butterflies with identical twiddle factors together, we can redraw the 16-pt radix-2 DIT FFT diagram as in Fig. 10.
To apply the method described in Section III-B to an $N$-pt radix-2 DIT FFT diagram, we first compute all the butterflies with twiddle factor $W_N^{0}$ in Stage 1. Then, we compute the remaining butterflies with twiddle factor $W_N^{0}$ together with the butterflies with twiddle factor $W_N^{N/4}$. Finally, the remaining butterflies in the FFT diagram are computed following the principle of Section III-B. After the number of twiddle factors to be looked up is reduced, the FFT diagram in Fig. 10 can be further redrawn as in Fig. 11. Only three twiddle factors need to be looked up in Fig. 11, compared with seven in the original 16-pt radix-2 DIT FFT diagram in Fig. 1.

Fig. 8. 16-pt radix-2 DIF FFT diagram after the proposed methods are applied.
Fig. 9. Reducing the number of twiddle factors needed to be looked up in the 16-pt radix-4 DIF FFT diagram. (a) Radix-4 DIF FFT butterfly. (b) Simplified representation of the radix-4 DIF FFT butterfly. (c) Conventional 16-pt radix-4 DIF FFT diagram. (d) 16-pt radix-4 DIF FFT diagram with the number of twiddle factors to be looked up reduced.
Fig. 10. Grouping butterflies with identical twiddle factors together in radix-2 DIT FFT diagram.
Fig. 11. 16-pt radix-2 DIT FFT diagram after the proposed methods are applied.
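The DIT ordering can be illustrated by listing which twiddle exponents each step would handle. The sketch assumes stages numbered from 1 and the standard DIT pattern, in which Stage $t$ uses the exponents $j\,N/2^{t}$; under that assumption a twiddle factor that first appears in Stage $s$ also appears in every later stage, so Step 1 takes $W_N^{0}$ and Step $s$ takes the exponents first used in Stage $s$. The function name is hypothetical.

```python
def dit_step_exponents(N):
    """Map step number to the twiddle exponents e (of W_N^e) handled in that step."""
    stages = N.bit_length() - 1
    steps = {1: [0]}                        # Step 1: W_N^0, present in every stage
    for s in range(2, stages + 1):
        # Odd multiples of N/2^s are the exponents first used in Stage s.
        steps[s] = [(2 * m + 1) * (N >> s) for m in range(1 << (s - 2))]
    return steps

print(dit_step_exponents(16))
# {1: [0], 2: [4], 3: [2, 6], 4: [1, 3, 5, 7]}
```

This is the mirror image of the DIF case, where the odd exponents are retired first and $W_N^{0}$ last.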
IV. PERFORMANCE EVALUATION FOR THE NOVEL MEMORY REFERENCE REDUCTION METHODS
The number of memory references due to twiddle factors in conventional implementations of $N$-pt radix-2 DIF or DIT FFT algorithms is $N - 1$, which equals the number of groups in the FFT diagram. Grouping the butterflies with identical twiddle factors together reduces the number of memory references due to twiddle factors from $N - 1$ to $N/2 - 1$. After the number of necessary twiddle factors is minimized using the properties of the twiddle factors, only $N/4 - 1$ memory references for twiddle factors are needed to implement $N$-pt radix-2 FFT algorithms.

Fig. 12. DIF FFT code 1 groups the butterflies with identical twiddle factors together only.
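These counts can be reproduced by enumeration. The sketch below assumes one load per group in the conventional order, no load for $W_N^{0} = 1$ once butterflies are grouped, and no load for any exponent of $N/4$ or above once the quarter-period relation is used; the function name is hypothetical, and the percentages it prints are computed from these counts rather than taken from Table I.

```python
def twiddle_reference_counts(N):
    """Return (conventional, grouped, reduced) twiddle reference counts for N points."""
    stages = N.bit_length() - 1
    # Conventional order: one load per group, N/2 + N/4 + ... + 1 groups.
    conventional = sum(N >> s for s in range(1, stages + 1))
    distinct = {(1 << (s - 1)) * k for s in range(1, stages + 1) for k in range(N >> s)}
    grouped = len(distinct - {0})                            # W_N^0 is never fetched
    reduced = len({e for e in distinct if 0 < e < N // 4})   # rest derived from these
    return conventional, grouped, reduced

for N in (16, 64, 1024):
    c, g, r = twiddle_reference_counts(N)
    print(N, c, g, r, "{:.1f}%".format(100 * (c - g) / c), "{:.1f}%".format(100 * (c - r) / c))
```

For $N = 16$ the three counts are 15, 7, and 3, and the relative savings approach 50% and 75% from above as $N$ grows.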
We have applied the memory reference reduction methods to implement the radix-2 DIF and DIT FFT on the TI TMS320C64x DSP, which is a fixed-point DSP with an enhanced very long instruction word (VLIW) architecture. The C64x DSP has eight functional units that can execute a maximum of eight operations in parallel, two register files with 32 32-bit registers each, and 32-bit internal communication bandwidth.
Four pieces of C code are compiled with the maximum compiler effort (-o3) and executed in TI Code Composer Studio (CCS) v2.1 [14], which is the software development and simulation environment for the TI TMS320C64x DSP. TI's DIF FFT code is the radix-2 DIF FFT code in Fig. 4, taken from TI's DSP library [15]. The DIF FFT code 1 in Fig. 12 only groups the butterflies with identical twiddle factors together in the radix-2 DIF FFT diagram, without reducing the number of twiddle factors that need to be looked up. The DIF FFT code 2 in Fig. 13 is written based on the radix-2 DIF FFT diagram in Fig. 8, where the memory reference reduction methods are applied. Besides the above three codes, the radix-2 DIT FFT code with the memory reference reduction methods is shown in Fig. 14, which is based on the radix-2 DIT FFT diagram in Fig. 11. The performance figures of the four codes are compared in Table I for FFTs of different sizes, including the number of memory references due to twiddle factors, the amount of memory storage for twiddle factors, and the number of clock cycles needed to compute the FFT. The number of clock cycles each code needs to compute the FFTs is measured precisely using the breakpoint function in CCS.
The experimental results show that the radix-2 DIF FFT implementation that only groups the butterflies with identical twiddle factors achieves an average 50.9% reduction in the number of memory references due to twiddle factors and an average 29.7% reduction in the number of clock cycles, compared with the conventional implementation taken from TI's library. Furthermore, when the number of twiddle factors to be looked up is also reduced, an average 76.4% reduction in the number of memory references due to twiddle factors, an average 53.5% saving in memory space for twiddle factors, and an average 36.5% reduction in the number of clock cycles are achieved, compared with the conventional implementation taken from TI's library. The performance of the radix-2 DIT FFT implementation with the memory reference reduction methods is slightly better than that of the radix-2 DIF FFT implementation.
V. CONCLUSION
In this paper, we propose novel memory reference reduction methods to minimize the number of memory references due to twiddle factors in FFT implementations on DSPs. The proposed methods first group the butterflies with identical twiddle factors from different stages in the FFT diagram and compute them together, and then reduce the total number of necessary twiddle factors by taking advantage of the properties of twiddle factors. Consequently, each twiddle factor is loaded only once and the number of memory references due to twiddle factors can be minimized. Experimental results show that the proposed methods achieve an average 76.4% reduction in the number of memory references, a 53.5% saving in the memory space for twiddle factors, and an average 36.5% reduction in the number of clock cycles needed to compute the radix-2 DIF FFT on the DSP, compared with the conventional implementation.

Fig. 13. DIF FFT code 2 with the memory reference reduction methods based on Fig. 8.
Fig. 14. DIT FFT code with the memory reference reduction methods based on Fig. 11.
TABLE I
PERFORMANCE COMPARISON OF THE IMPLEMENTATIONS
ACKNOWLEDGMENT
The authors would like to thank the anonymous reviewers for their careful reading and valuable comments that improved the quality of this paper. A reviewer has also brought to our attention that C. M. Rader of MIT, in 1965, wrote an FFT program which used the idea in Section III-A, but he did not publish it.
REFERENCES
[1] C. S. Burrus and T. W. Parks, DFT/FFT and Convolution Algorithms and Implementation. New York: Wiley, 1985.
[2] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 1999, 0137549202.
[3] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol. 19, pp. 297–301, 1965.
[4] G. D. Bergland, "A radix-eight fast-Fourier transform subroutine for real-valued series," IEEE Trans. Electroacoust., vol. AE-17, no. 2, pp. 138–144, Jun. 1969.
[5] R. C. Singleton, "An algorithm for computing the mixed radix fast Fourier transform," IEEE Trans. Audio Electroacoust., vol. AE-17, no. 2, pp. 93–103, Jun. 1969.
[6] D. P. Kolba and T. W. Parks, "A prime factor FFT algorithm using high-speed convolution," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-25, no. 4, pp. 281–294, Aug. 1977.
[7] S. Winograd, "On computing the discrete Fourier transform," Math. Comput., vol. 32, no. 141, pp. 175–199, Jan. 1978.
[8] P. Duhamel and H. Hollmann, "Split radix FFT algorithm," Electron. Lett., vol. 20, pp. 14–16, Jan. 5, 1984.
[9] D. Takahashi, "An extended split-radix FFT algorithm," IEEE Signal Process. Lett., vol. 8, no. 5, pp. 145–147, May 2001.
[10] A. R. Varkonyi-Koczy, "A recursive fast Fourier transform algorithm," IEEE Trans. Circuits Syst. II, vol. 42, no. 9, pp. 614–616, Sep. 1995.
[11] A. Saidi, "Decimation-in-time-frequency FFT algorithm," in Proc. ICASSP, Apr. 1994, pp. III:453–III:456.
[12] B. M. Baas, "A low-power, high-performance, 1024-point FFT processor," IEEE J. Solid-State Circuits, vol. 34, no. 3, pp. 380–387, Mar. 1999.
[13] Matlab Function Reference: FFT. Mathworks, Inc. [Online]. Available: http://www.mathworks.com/access/helpdesk/help/techdoc/ref/fft.shtml?BB=1
[14] TMS320C6000 Programmer's Guide (Rev. G), Texas Instruments, Aug. 1, 2002, SPRU198G.
[15] TMS320C64x DSP Library Programmer's Reference (Rev. B), Texas Instruments, Oct. 23, 2003, SPRU565A.
Yuke Wang received the B.Sc. degree from the University of Science and Technology of China, Hefei, China, in 1989, and the M.Sc. and Ph.D. degrees from the University of Saskatchewan, Saskatoon, Canada, in 1992 and 1996, respectively.
He has held faculty positions at Concordia University, Montreal, QC, Canada, and Florida Atlantic University, Boca Raton. Currently, he is an Associate Professor in the Computer Science Department, University of Texas at Dallas, Richardson. He has also held visiting assistant professor positions at the University of Minnesota, the University of Maryland, and the University of California at Berkeley. His research interests include VLSI design of circuits and systems for DSP and communication, computer-aided design, and computer architectures. He has published more than 20 papers in IEEE/ACM Transactions.
Dr. Wang served as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, PART II (2002–2003), as an Editor of the IEEE TRANSACTIONS ON VLSI SYSTEMS (2001–2002), as an Editor of Applied Signal Processing, and for a few other journals.
Yiyan (Felix) Tang received the B.Sc. degree in electrical engineering from South China University of Technology, Guangzhou, China, in 2000, and the M.Sc. degree in computer engineering and the Ph.D. degree in computer science from the University of Texas at Dallas, Richardson, in 2002 and 2005, respectively.
Since 2005, he has been with the 3DSP Corporation, Irvine, CA, where he works on the design and implementation of wireless communication systems on digital signal processors. His current research interests lie in the efficient and effective design and implementation of wireless communication and signal processing systems on digital signal processors.
Yingtao Jiang (M'01) received the B.Eng. degree in biomedical engineering and electronics from Chongqing University, Chongqing, China, the M.A.Sc. degree in electrical engineering from Concordia University, Montreal, QC, Canada, and the Ph.D. degree in computer science from the University of Texas at Dallas, Richardson, in 1993, 1997, and 2001, respectively.
He is currently an Assistant Professor in the Department of Electrical and Computer Engineering, University of Nevada, Las Vegas. His research interests include algorithms, VLSI architectures, and circuit-level techniques for the design of DSP, networking, and telecommunications systems, computer architectures, and biomedical signal processing, instrumentation, and medical informatics.
Jin-Gyun Chung (S'90–M'98) received the B.S. degree in electronic engineering from Chonbuk National University, Chonju, Korea, in 1985 and the M.S. and Ph.D. degrees in electrical engineering from the University of Minnesota, Minneapolis, in 1991 and 1994, respectively.
Since 1995, he has been with the Department of Electronic and Information Engineering, Chonbuk National University, where he is currently a Professor. His research interests are in the area of VLSI architectures and algorithms for signal processing and communication systems, which include the design of high-speed and low-power algorithms for arithmetic circuits, OFDM systems, and communication systems for automobiles.
Sang-Seob Song (S'78–M'81) received the B.S. degree in electrical engineering from Chonbuk National University in 1978 and the M.S. and Ph.D. degrees in electrical and computer engineering from the Korea Advanced Institute of Science and Technology, Daejeon, Korea, and the University of Manitoba, Winnipeg, MB, Canada, in 1980 and 1990, respectively.
Since 1981, he has been with the Department of Electronic and Information Engineering, Chonbuk National University, Jeonbuk, Korea, where he is currently a Professor. His research interests are in the area of high-speed modems, which includes channel coding and modulation.
Myoung-Seob Lim (S'85–M'90) received the B.S. degree in electronic engineering from Yonsei University, Seoul, Korea, in 1980 and the M.S. and Ph.D. degrees in electrical engineering from Yonsei University in 1982 and 1990, respectively.
He worked at the Electronic Telecommunication Research Institute from 1985 to 1996. Since 1996, he has been with the Department of Electronic and Information Engineering, Chonbuk National University, Jeonbuk, Korea, where he is currently a Professor. His research interests are in the area of the design of CDMA and OFDM communication systems, which include performance analysis, bandwidth-efficient modulation, and synchronization, and also CAN for in-vehicle networks.