ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE∗
Ren Chen, Hoang Le, and Viktor K. Prasanna
Ming Hsieh Department of Electrical Engineering
University of Southern California, Los Angeles, USA 90089
Email: {renchen, hoangle, prasanna}@usc.edu
ABSTRACT
In this paper, we revisit the classic Fast Fourier Trans-
form (FFT) for energy efficient designs on FPGAs. A pa-
rameterized FFT architecture is proposed to identify the de-
sign trade-offs in achieving energy efficiency. We first per-
form design space exploration by varying the algorithm map-
ping parameters, such as the degree of vertical and horizon-
tal parallelism, that characterize decomposition based FFT
algorithms. Then we explore an energy efficient design by
empirical selection on the values of the chosen architec-
ture parameters, including the type of memory elements, the
type of interconnection network and the number of pipeline
stages. The trade-offs between energy, area, and time are
analyzed using two performance metrics: the energy effi-
ciency (defined as the number of operations per Joule) and
the Energy×Area×Time (EAT) composite metric. From the
experimental results, a design space is generated to demon-
strate the effect of these parameters on the various perfor-
mance metrics. For N -point FFT (16 ≤ N ≤ 1024), our
designs achieve up to 28% and 38% improvement in the
energy efficiency and EAT, respectively, compared with a
state-of-the-art design.
1. INTRODUCTION
FPGA is a promising implementation technology for com-
putationally intensive applications such as signal, image, and
network processing tasks [1, 2]. State-of-the-art FPGAs of-
fer high operating frequency, unprecedented logic density
and a host of other features. As FPGAs are programmed
specifically for the problem to be solved, they can achieve
higher performance with lower power consumption than gen-
eral purpose processors.
Fast Fourier Transform (FFT) is one of the most fre-
quently used kernels in a wide variety of image and sig-
nal processing applications. Various derivative FFT algo-
rithms have been proposed and developed. Radix-x Cooley-
Tukey algorithm is one of the most popular algorithms for
hardware implementation [3, 4, 5, 6]. Most hardware solutions
for Radix-x FFT fall into the following categories: delay
feedback or delay commutator architectures [4], such as
Radix-2^2 single-path delay feedback FFT [4] and Radix-4
single-path delay commutator FFT [5]. By focusing on circuit
level optimizations, these solutions achieve improvements in
throughput, area, or power.

∗This work has been funded by DARPA under grant number
HR0011-12-2-0023.
Energy efficiency is a key design metric. To obtain an
energy efficient design for FFT, we analyze the trade-offs
between energy, area, and time for fixed-point FFT on a
parameterized architecture, using the Cooley-Tukey algorithm.
Energy efficiency can be achieved both at the algorithm map-
ping level and the architecture level [7, 8]. Optimizing at
these two levels allows power to be effectively traded off
with other performance parameters. For example, a design
consuming 2× power but achieving 3× system throughput is
actually 50% more energy efficient than the original design.
We present the architecture design space with respect to en-
ergy efficiency at the algorithm mapping level. By empirical
selection of the proposed architecture parameter values, we
explore an energy efficient design at the architecture level.
In this paper, we make the following contributions:
1. A parameterized FFT architecture using the Radix-4
Cooley-Tukey algorithm (Section 3.1).
2. A design space that demonstrates the effect of the pa-
rameters on the Energy×Area×Time (EAT) compos-
ite metric and the energy efficiency (Section 4.3.2).
3. Demonstration of improved energy efficiency of the pro-
posed design by identifying energy hot-spots and varying
the chosen architecture parameters (Section 4.3.2).
4. Optimized designs achieving significant improvement
in energy efficiency compared with a state-of-the-art
design (Section 4.4).
The rest of the paper is organized as follows. Section 2
covers the background and related work. Section 3 describes
the proposed parameterized architecture and its implemen-
tation on FPGA. Section 4 presents experimental results and
analysis. Section 5 concludes the paper.
978-1-4799-0004-6/13/$31.00 ©2013 IEEE
2. BACKGROUND AND RELATED WORK
2.1. Background
Given N complex numbers x0, ..., xN−1, the Discrete Fourier
Transform (DFT) is computed as Xk = Σ_{n=0}^{N−1} xn e^{−i2πkn/N},
k = 0, ..., N − 1. Radix-x Cooley-Tukey FFT is a well
known decomposition based algorithm for N-point DFT. In
this paper, we employ Radix-4 FFT for our design. The de-
scription of Radix-4 FFT is presented in Algorithm 1. In
terms of the number of real operations, the computational
complexity of N -point Radix-4 FFT is O(N log4 N). The
algorithm performs N -point FFT in N/m (m < N) cycles
using m Input/Output ports (I/Os) and log4 N radix blocks,
which are used for butterfly computations. The algorithm
iteratively decomposes the entire problem into four
subproblems. This feature enables us to map the algorithm
by folding the FFT architecture vertically or horizontally,
thus providing much freedom to implement various designs on
FPGAs. We propose our parameterized architecture in Section
3.2 based on this feature.

Algorithm 1 Radix-4 FFT Algorithm
1: q = N/4; d = N/4;
2: for p := 0 to log4 N − 1 do
3:   for k := 0 to 4^p − 1 do
4:     l = 4kq/4^p; r = l + q/4^p − 1;
5:     tw1 = w[k]; tw2 = w[2k]; tw3 = w[3k];
6:     for i := l to r do
7:       t0 = i; t1 = i + d/4^p; t2 = i + 2d/4^p; t3 = i + 3d/4^p;
8:       do parallel
9:         f_{p+1}[t0] = f_p[t0] + f_p[t1] + f_p[t2] + f_p[t3];
10:        f_{p+1}[t1] = f_p[t0] − j·f_p[t1] − f_p[t2] + j·f_p[t3];
11:        f_{p+1}[t2] = f_p[t0] − f_p[t1] + f_p[t2] − f_p[t3];
12:        f_{p+1}[t3] = f_p[t0] + j·f_p[t1] − f_p[t2] − j·f_p[t3];
13:      end parallel
14:      do parallel
15:        f_{p+1}[t0] = f_{p+1}[t0];
16:        f_{p+1}[t1] = tw1 × f_{p+1}[t1];
17:        f_{p+1}[t2] = tw2 × f_{p+1}[t2];
18:        f_{p+1}[t3] = tw3 × f_{p+1}[t3];
19:      end parallel
20:    end for
21:  end for
22: end for
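Algorithm 1 can be cross-checked in software against the DFT definition. The sketch below (a Python model of ours, not the paper's Verilog; function names are illustrative) implements the same radix-4 decimation-in-frequency butterflies and twiddle multiplications, recursing on the four quarter-size subproblems:

```python
import cmath

def dft_naive(x):
    """Reference O(N^2) DFT: X_k = sum_n x_n * e^{-i 2 pi k n / N}."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def fft_radix4(x):
    """Recursive radix-4 decimation-in-frequency FFT; len(x) must be a power of 4."""
    N = len(x)
    if N == 1:
        return list(x)
    q = N // 4
    w = [cmath.exp(-2j * cmath.pi * m / N) for m in range(3 * q)]
    y = [[0j] * q for _ in range(4)]
    for m in range(q):
        a, b, c, d = x[m], x[m + q], x[m + 2 * q], x[m + 3 * q]
        # Radix-4 butterfly (cf. the do-parallel blocks of Algorithm 1),
        # followed by the twiddle multiplications tw1..tw3:
        y[0][m] = a + b + c + d
        y[1][m] = (a - 1j * b - c + 1j * d) * w[m]
        y[2][m] = (a - b + c - d) * w[2 * m]
        y[3][m] = (a + 1j * b - c - 1j * d) * w[3 * m]
    Y = [fft_radix4(part) for part in y]   # four quarter-size subproblems
    X = [0j] * N
    for r in range(4):                     # digit-reversed interleave:
        for k in range(q):                 # X[4k + r] comes from subproblem r
            X[4 * k + r] = Y[r][k]
    return X
```

Each recursion level corresponds to one stage of radix-4 blocks in the architecture; the final interleaving is the data reordering that the hardware resolves with its permutation blocks.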
2.2. Related Work
To the best of our knowledge, there has been no previous
work targeted at exploring the design space for energy effi-
ciency of FFT at both the algorithm mapping level and the
architecture level on FPGAs. Existing work has mainly fo-
cused on optimizing the performance, power and area of the
design at the circuit level.
An energy-efficient 1024-point FFT processor was developed
in [9]. A cache-based FFT algorithm was proposed to achieve
low power and high performance, and an energy-time
performance metric was evaluated at various processor
operating points. In [10], a high-speed and low-power FFT
architecture was presented: a delay balanced pipeline
architecture based on the split-radix algorithm. Algorithms
for reducing computation complexity were explored, and the
architecture was evaluated in terms of area, power, and
timing performance.
Based on Radix-x FFT, various pipeline FFT architec-
tures have been proposed, such as Radix-2 single-path de-
lay feedback FFT [3], Radix-4 single-path delay commuta-
tor FFT [5], Radix-2 multi-path delay commutator FFT [6],
and Radix-2^2 single-path delay feedback FFT [4]. These
architectures can achieve high throughput per unit area with
single-path or multi-path pipelines, but energy efficiency has
not been explored in these works.
In [11], a parameterized soft core generator for high
throughput DFT was developed. This generator can auto-
matically produce an optimized design with user inputs for
performance and resource constraints. However, energy ef-
ficiency is not considered in this work. In [7], the authors
presented a parameterized energy efficient FFT architecture.
Their design is optimized to achieve high energy efficiency
by varying the architecture parameters. Some energy effi-
cient design techniques, such as clock gating and memory
binding, are also employed in their work.
Beyond FPGAs, techniques for energy efficient FFT have
also been presented for other platforms [12, 13]. However,
it is not clear how to apply these techniques on FPGAs. In
this work, we extend the work of [7] by design space
exploration at multiple levels, performed on current
state-of-the-art FPGAs. By exploring the energy-performance-area
trade-offs at multiple levels, we obtain an energy efficient
design for FFT.
3. ARCHITECTURE AND IMPLEMENTATIONS
3.1. Architecture building blocks
The proposed N -point FFT architecture is based on the Radix-
4 Cooley-Tukey FFT algorithm. Note that the choice of the
radix affects the energy efficiency of the design. Compared
with the Radix-2 algorithm, Radix-4 uses fewer multiplications.
The proposed architecture consists of five building blocks
(see Fig.1): Radix-4 block (R4), Data buffer, Data path per-
mutation (PER), Parallel-to-serial/serial-to-parallel (PS/SP)
multiplexer, and twiddle factor computation (TWC). A com-
plete design for N -point FFT can be obtained by a combi-
nation of the basic blocks.
A. Radix-4 block
In this module, 16 signed adder/subtractors are used to
complete butterfly computations. It takes four inputs and
generates four outputs in parallel. Each input data contains
real and imaginary components. The data outputs of R4 will
be used by the twiddle factor computation block except in
the last stage (see Fig. 1a).
B. Data buffer
Fig. 1: (a) Radix block, (b) Data buffer, (c) Data path permu-
tation (PER), (d) Parallel-to-serial/serial-to-parallel MUX
(PS/SP), (e) Twiddle factor computation (TWC)
Fig. 2: Data permutation in the data buffers for 16-point FFT
Each data buffer consists of a dual-port RAM with N/m
entries, where m equals the number of I/Os. Data is written
into one port and read from the other port simultaneously.
Fig. 2 shows the data buffers for a pipelined 16-point FFT.
In four cycles, 16 permuted data inputs are fed into the
data buffers. In each cycle, with alternating locations,
four data outputs are read in parallel. For different architec-
tural parameters, the read and write addresses are generated
with different strides. For example, in Fig. 2, four data in-
puts (X0, X4, X8, X12) are written in cycle 0, cycle 1, cycle
2, and cycle 3 respectively. Then they are output simultane-
ously in cycle 4.
C. Data permutation block
Parallel input data must be permuted before being
processed by the subsequent modules. Fig. 2 shows
the data permutation for 16-point FFT. In the first cycle, four
data inputs (X0, X1, X2, X3) are fed into the first entry of
each data buffer without permutation. In the second cycle,
another four data inputs are written into the second entry of
each data buffer, permuted by one location. The parallel
output data (Xi, Xi+4, Xi+8, X(i+12) mod 16), i = 0, 1, 2, 3,
are stored in different RAMs after four cycles. These
permutations are repeated every four cycles.
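The write/read pattern of Fig. 2 can be modeled in a few lines (an illustrative Python model of ours; the rotate-by-cycle formulation and the names are not from the paper): each write cycle shifts its four inputs by one more buffer location, so every read cycle can then fetch Xi, Xi+4, Xi+8, Xi+12 from four different buffers in parallel:

```python
def fill_buffers(N=16, m=4):
    """Model the m dual-port data buffers (N/m entries each) of Fig. 2.
    In write cycle c, inputs X_{mc}..X_{mc+m-1} are rotated across c buffers."""
    buf = [[None] * (N // m) for _ in range(m)]
    for c in range(N // m):
        for j in range(m):
            buf[(j + c) % m][c] = m * c + j  # one extra location shift per cycle
    return buf

def read_cycle(buf, i, m=4):
    """In read cycle i, buffer b is read at address (b - i) mod m, so the
    outputs X_i, X_{i+4}, X_{i+8}, X_{i+12} land on distinct buffers."""
    return [buf[b][(b - i) % m] for b in range(m)]
```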
D. PS/SP module
This module converts serial input data to parallel output,
and vice versa. As shown in
Fig. 3a, the number of I/Os is limited to one, but the radix-4
block still operates on four data inputs in parallel, thus the
PS/SP module is employed to match the data rate both be-
fore and after the radix-4 block.
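Functionally, this rate matching amounts to grouping serial samples into radix-4-wide words and flattening them back; a minimal software analogue (function names are ours):

```python
def serial_to_parallel(stream, width=4):
    """Group a serial sample stream into width-wide words for the radix-4 block."""
    assert len(stream) % width == 0, "stream length must be a multiple of width"
    return [stream[i:i + width] for i in range(0, len(stream), width)]

def parallel_to_serial(words):
    """Flatten width-wide words back into a serial stream."""
    return [sample for word in words for sample in word]
```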
(a) Hp = 1, Vp = 1
(b) Hp = 2, Vp = 4
Fig. 3: Parameterized Architectures for 16-point FFT
E. Twiddle factor computation
This module consists of two blocks: the twiddle factor
generation block and the complex number multiplier block.
The twiddle factor generation block includes several lookup
tables for storing twiddle factor coefficients, where the data
read addresses will be updated with the control signals. The
size of the lookup tables will increase with the problem size.
The complex number multiplier block consists of three mul-
tipliers and three adder/subtractors.
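One standard way to realize a complex multiplication with three real multipliers, matching the three-multiplier structure described above, is the Gauss/Karatsuba identity. Here we assume (our assumption, not stated in the paper) that the sums c + d and d − c of each twiddle factor c + jd are precomputed in the lookup tables, which leaves exactly three adder/subtractors in the datapath:

```python
def cmul3(a, b, c, d, c_plus_d, d_minus_c):
    """(a + jb) * (c + jd) using three real multipliers.
    Assumption (ours): c_plus_d = c + d and d_minus_c = d - c are stored
    alongside the twiddle factor in the lookup table."""
    k1 = c * (a + b)            # multiplier 1, adder 1
    k2 = a * d_minus_c          # multiplier 2
    k3 = b * c_plus_d           # multiplier 3
    return (k1 - k3, k1 + k2)   # subtractor and adder: (real, imag)
```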
3.2. Parameterized FFT Architecture
3.2.1. Algorithm Mapping Parameters
Decomposition based Radix-4 FFT offers much flexibility
to map various architectures. Folding the FFT architecture
enables the radix-4 blocks to be reused iteratively to save
area, while unfolding the FFT increases spatial parallelism.
Hence we use two algorithm mapping parameters that char-
acterize the decomposition-based N -point FFT algorithm in
our design:
1. Horizontal Parallelism (Hp): determines the number
of radix-4 blocks concatenated horizontally (1 ≤ Hp ≤ log4 N).
2. Vertical Parallelism (Vp): determines the number of
parallel I/Os (1 ≤ Vp ≤ N ). Vp is determined by
the number of I/Os per pipeline (Nc) and the number
of parallel pipelines (Np), and Vp = Nc × Np. Each
pipeline is a row of horizontally concatenated radix-4
blocks.
We adopt these two algorithm mapping parameters for an
energy efficient design. Two different architectures are
presented in Fig. 3. In Fig. 3a (Vp = Nc = Np = 1, Hp = 1,
N = 16), one radix-4 block is employed and iteratively used
by two stages, and one input is processed per cycle. This
architecture achieves higher resource efficiency and consumes
less I/O power, at the expense of lower throughput.

Fig. 4: (a) Crossbar network, (b) Complete binary tree, (c)
Dynamic network
In Fig. 3b (Vp = 4, Hp = 2, N = 16), two radix-4 blocks
are utilized. There is only one pipeline and Nc = 4, Np = 1.
Four inputs can be processed in parallel per cycle. Note
that there is no feedback path. The architecture achieves
high throughput by using more basic blocks and I/Os, while
resulting in higher power consumption.
We can also increase Vp by replicating the basic pipeline.
This replication allows several pipelines to work in parallel
to significantly increase the throughput at the cost of more
complex interconnections.
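As a back-of-the-envelope model of these trade-offs (a simplification of ours, not a formula from the paper; pipeline fill/drain and control overhead are ignored): with Vp parallel I/Os, one pass over the data takes N/Vp cycles, and folding the log4 N stages onto Hp physical stages requires ceil(log4 N / Hp) passes:

```python
from math import ceil, log

def latency_cycles(N, Hp, Vp):
    """Illustrative cycle count: each pass streams N samples at Vp per cycle;
    folding log4(N) stages onto Hp physical stages takes ceil(log4(N)/Hp) passes."""
    stages = round(log(N, 4))
    passes = ceil(stages / Hp)
    return passes * (N // Vp)

def radix4_blocks(Hp, Np):
    """Each pipeline is a row of Hp radix-4 blocks; Np pipelines run in parallel."""
    return Hp * Np
```

For the 16-point examples of Fig. 3, the model gives 32 cycles for the folded serial design and 4 cycles for the unrolled design with four I/Os, illustrating the throughput/area trade described above.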
3.2.2. Architecture Parameters
Three architecture parameters that significantly affect energy-
efficiency are employed in our design and applied to differ-
ent components:
1. Type of memory element: BRAM or distributed RAM
(dist. RAM) can be used as memories. In our design,
both data buffers and twiddle factor lookup tables can
be implemented using different memory elements.
2. Type of interconnection: three types of interconnection
(see Fig. 4) are used to implement the data permutation
blocks: crossbar network, complete binary tree, and
dynamic network.
3. Pipeline depth: Both adder/subtractors and DSP slices
in the FPGA can be deeply pipelined by inserting registers,
so we parameterize the arithmetic units and multipliers
by pipeline depth to balance performance and resource
usage.
According to the FPGA manufacturer's user guide [14],
BRAM consumes less power than dist. RAM when used for
large memories. This characteristic can be exploited to trade
off power against performance for various problem sizes.
As there are 2 ×m × (Hp + 1) (when Vp = 4, m = 1,
otherwise m = 0) permutation modules, using different in-
terconnection networks can significantly affect the energy
efficiency of the designs. The physical layout of the complete
binary tree is similar to that of a crossbar network, but more
pipeline registers can be inserted between the layers of the
tree. The dynamic network can be implemented using shift
registers. Among the three types of interconnection, the
dynamic network has high performance but greater power
consumption; the crossbar network consumes the least
resources and power but has a long wire delay; the complete
binary tree has simpler routing, which improves performance
at the expense of greater area.
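A crossbar used for data path permutation can be modeled as one multiplexer per output port; the rotating select pattern below (our illustration, not from the paper) reproduces the one-location-per-cycle shift used when filling the data buffers of Fig. 2:

```python
def crossbar(inputs, sel):
    """m-port crossbar modeled as one multiplexer per output:
    output port o forwards input port sel[o]."""
    return [inputs[s] for s in sel]

def rotate_sel(c, m=4):
    """Select pattern that routes input j to output (j + c) mod m,
    i.e. a rotation by c locations."""
    return [(o - c) % m for o in range(m)]
```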
4. EXPERIMENTAL RESULTS AND ANALYSIS
4.1. Experimental Setup
In this section, we present a detailed analysis of several im-
plementation experiments by varying the parameters. All
the designs were implemented in Verilog on Virtex-7 FPGA
(XC7VX980T, speed grade -2L) using Xilinx ISE 14.4. In-
puts are 16-bit fixed point complex numbers. The input test
vectors for simulation were randomly generated and had an
average toggle rate of 50%. We used the VCD file (value
change dump file) as input to Xilinx XPower Analyzer to
produce accurate power dissipation estimates [14]. For all
the evaluated designs, the operating frequency is set to 333
MHz.
4.2. Performance Metrics
Two metrics for performance evaluation are considered in
this paper:
1. Energy efficiency is defined as the number of operations
per unit energy (Energy efficiency = Number of operations
/ Energy). For N-point Radix-4 FFT, energy efficiency is
given by ((17/4) N log2 N) / Energy. Energy is the product
of the average power dissipation of the design and the
latency of the FFT computation.
2. Energy × Area × Time (EAT) is measured as the prod-
uct of three key metrics: energy, area, and time. When
given the same problem size, we use EAT ratio for
performance comparison between different designs.
Area is the area usage of the design, i.e., the number
of LUTs or flip-flops (whichever is larger) occupied by
the entire design. BRAM slices are converted into an
equivalent number of LUTs based on the memory size,
so the area of BRAMs can be accounted for. Time is
the latency of the pipelined N-point FFT.
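The two metrics can be written out directly, using the (17/4)·N·log2 N operation count for radix-4 (a sketch of ours; the power, latency, and area values to plug in come from the tools, and the helper names are illustrative):

```python
from math import log2

def num_ops(N):
    """Operation count used by the metric: (17/4) * N * log2(N) for radix-4 FFT."""
    return 17 * N * log2(N) / 4

def energy_efficiency(N, avg_power_watts, latency_seconds):
    """Operations per Joule; Energy = average power x FFT latency."""
    return num_ops(N) / (avg_power_watts * latency_seconds)

def eat(energy_joules, area_units, time_seconds):
    """Energy x Area x Time composite metric (smaller is better)."""
    return energy_joules * area_units * time_seconds
```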
4.3. Design space exploration
In this section, we first explore the design space by varying
algorithm mapping parameters. Then the parameter values
are chosen according to the experimental results. Based on
that, we explore the energy-efficient design (denoted the
empirically optimized design) by varying the architecture
parameters empirically. Both the dist. RAM based design and
the BRAM based design are used in this experiment. The
effects of the design parameters on energy efficiency are
demonstrated using the proposed performance metrics.

Fig. 5: Energy efficiency for various Hp with varying N for
the dist. RAM based design

Fig. 6: Energy efficiency for various Hp with varying N for
the BRAM based design
4.3.1. Algorithm mapping level exploration
A. Horizontal Parallelism
In this experiment, we explored energy efficiency while
varying horizontal parallelism, with Vp = 4, Nc = 4, Np = 1.
The range of Hp is [1, log4 N]. The energy efficiency for
various Hp is shown in Fig. 5 and Fig. 6 respectively.
Based on the experimental results, we have the following
observations:
• For the considered problem sizes, increasing Hp could
significantly improve energy efficiency for all designs.
Despite the required extra hardware to unfold the FFT
horizontally, the reduced latency of FFT computation
enables the design to outperform the original design.
• As N grows, the energy efficiency of the dist. RAM
based design declines, whereas that of the BRAM based
design increases. The reason is that dist. RAM power
increases significantly with memory size, while BRAM
power is mainly determined by the number of BRAM
slices used [14].

Fig. 7: Energy efficiency for various Vp with varying N for
the dist. RAM based design

Fig. 8: Energy efficiency for various Vp with varying N for
the BRAM based design
• For the dist. RAM based design, the improvement in
energy efficiency brought by increasing Hp is sensi-
tive to N . For example, when N = 1024, doubling
Hp only leads to a slight increase in energy efficiency.
Thus, reducing Hp to save area could be a feasible
alternative for larger problem sizes.
• The improvement in energy efficiency brought by in-
creasing Hp for BRAM based designs is not sensitive
to N . Reducing Hp to save area can lead to a sig-
nificant decline in energy efficiency for any problem
size.
B. Vertical Parallelism
Vertical parallelism is determined by three different val-
ues: radix value (fixed at 4), Nc, and Np. Hp was set as
log4 N . Nc and Np were varied for evaluation. Both dist.
RAM and BRAM based designs were evaluated. The energy
efficiency for various Vp is shown in Fig. 7 and Fig. 8. Note
that the maximum Vp is limited by the available number of
I/O pins. In this experiment, we have the following observa-
tions:
• BRAM based design is more scalable than dist. RAM
based design with respect to energy efficiency. When
N ≥ 64, energy efficiency starts to decline for dist. RAM
based designs due to high power consumption with
increasing memory size.

• Increasing Nc instead of Np can improve energy
efficiency with less hardware resource, since increasing
Nc only requires extra data buffers while increasing Np
requires replicating the full pipeline.

• Given a loose area constraint, we can improve energy
efficiency and throughput by increasing Np. Although
increasing Np leads to higher power and resource
consumption, the boosted throughput can offset these
disadvantages.

Table 1: Architecture parameters of designs for comparison

Design | Memory type | Interconnection: type / components | Pipeline stages: multiplier / adder
Design A | Dist. RAM | Dynamic network / Registers | 5 / 2
Empirically optimized design | Dist. RAM or BRAM | Crossbar network / LUTs | 3 / 2
Design C | BRAM | Complete binary tree / LUTs + Registers | 2 / 1

(a) Dist. RAM based design (b) BRAM based design
Fig. 9: Power profile of 1024-point FFT architecture
4.3.2. Architecture level exploration
In this section, we explore an energy efficient design (empir-
ically optimized design) at the architecture level. We choose
Vp = 4 and Hp = log4 N based on conclusions from the
previous experiments.
A. Energy hot spots
As shown in Fig. 9a, the dominant portion of the total
power is consumed by the data buffers for the 1024-point
FFT. This indicates that BRAM can be utilized to improve
energy efficiency for large values of N. Fig. 9b shows that
the share of I/O power and static power in the total power
increases significantly for BRAM-based designs. As I/O
power and static power are constant here, this indicates a
power reduction in the main design components when
BRAMs are used. It also suggests that I/Os consume a large
portion of the power in BRAM based designs.
B. Empirically optimized design
Fig. 10: Energy efficiency of the empirically optimized de-
sign and the baseline designs
We first analyze the effects of the architecture parameters
on energy, performance, and area. The analysis below was
applied to choose the architecture parameter values of the
empirically optimized design in our experiment:
• Energy: Reducing the number of registers can signif-
icantly reduce signal power, which is dominant in the
dynamic power. The crossbar network can be selected to
increase energy efficiency.
• Performance: Using BRAM can lead to a decline in
peak operating frequency. For large values of N, when
using BRAMs, extra pipeline stages can be added to
mitigate the performance degradation.
• Area: Area usage of pipeline registers is dominant in
the entire design area. Pipeline registers can be balanced
to trade off area against performance.
We compare our proposed empirically optimized design with
two baseline designs; the architecture parameters of all
three designs are shown in Table 1. The energy efficiency
comparison is shown in Fig. 10: the proposed empirically
optimized design improves energy efficiency by up to 27%
compared with the two baseline designs.
4.4. Performance comparison
We use the SPIRAL FFT IP core to compare with our pro-
posed empirically optimized design. The SPIRAL FFT IP
cores are high performance FFT designs based on streaming
architecture. The data permutation block in their designs has
been mathematically proved to be control-cost optimal [15].
By using their provided tools, customized FFT soft IP cores
can be automatically generated in synthesizable RTL Ver-
ilog with user inputs [11]. The available parameters of the
DFT core generator include transform size, data precision,
and streaming width. In this comparison, we use the dist.
RAM based design for N ≤ 64 and the BRAM based de-
sign for N > 64. For the design from SPIRAL, the code for
N-point (16-bit fixed point) FFT is automatically generated
by the SPIRAL core generator. The architecture is fully
streaming and the data are presented in their natural order.

Fig. 11: Comparison between the proposed empirically op-
timized design and the SPIRAL FFT IP Cores for EAT and
energy efficiency

As shown in Fig. 11, our proposed design improves
energy-efficiency by 8% to 28% and EAT by 23% to 38%,
respectively, compared with the SPIRAL FFT IP Cores.
5. CONCLUSION
We presented a parameterized architecture for energy effi-
cient implementation of the Radix-4 Cooley-Tukey FFT al-
gorithm. The effect of the parameters at both levels on
energy efficiency was demonstrated through design space
exploration. We studied the power consumption of the compo-
nents for various problem sizes, and proposed our empir-
ically optimized design by empirical selection of architec-
ture parameter values. Compared with the state-of-the-art
design, our optimized architectures achieve up to 28% and
38% improvement in energy efficiency and EAT, respectively.
In the future, we plan to develop an accurate high-
level performance model for energy-efficiency estimation,
which can be used to accelerate design space exploration to
obtain an energy efficient design.
6. REFERENCES
[1] N. Shirazi, P. M. Athanas, and A. L. Abbott, “Implemen-
tation of a 2-D Fast Fourier Transform on an FPGA-Based
Custom Computing Machine,” in Proceedings of Field-
Programmable Logic and Applications, 1995, pp. 282–292.
[2] D. Chen, G. Yao, C. Koc, and R. Cheung, “Low complex-
ity and hardware-friendly spectral modular multiplication,”
in Proceedings of Field-Programmable Technology (FPT),
2012, pp. 368–375.
[3] E. H. Wold and A. M. Despain, “Pipeline and parallel-
pipeline FFT processors for VLSI implementations,” IEEE
Transactions on Computers, vol. C-33, no. 5, pp. 414–426,
1984.
[4] S. He and M. Torkelson, “A new approach to pipeline FFT
processor,” in Proceedings of IPPS ’96, pp. 766–770.
[5] G. Bi and E. Jones, “A pipelined FFT processor for word-
sequential data,” IEEE Transactions on Acoustics, Speech
and Signal Processing, vol. 37, no. 12, pp. 1982–1985, 1989.
[6] L. R. Rabiner and B. Gold, Theory and Application of Digital
Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[7] S. Choi, R. Scrofano, V. K. Prasanna, and J.-W. Jang,
“Energy-efficient signal processing using FPGAs,” in Pro-
ceedings of FPGA ’03, 2003, pp. 225–234.
[8] D. Aravind and A. Sudarsanam, “High level - application
analysis techniques architectures - to explore design possi-
bilities for reduced reconfiguration area overheads in FPGAs
executing compute intensive applications,” in Proceedings of
IPDPS, 2005, pp. 158a–158a.
[9] B. Baas, “A low-power, high-performance, 1024-point FFT
processor,” IEEE Journal of Solid-State Circuits, vol. 34,
no. 3, pp. 380–387, 1999.
[10] W.-C. Yeh and C.-W. Jen, “High-speed and low-power split-
radix FFT,” IEEE Transactions on Signal Processing, vol. 51,
no. 3, pp. 864–874, 2003.
[11] G. Nordin, P. A. Milder, J. C. Hoe, and M. Puschel, “Au-
tomatic generation of customized Discrete Fourier Trans-
form IPs,” in Proceedings of Design Automation Conference
(DAC), 2005, pp. 471–474.
[12] T. Sugimura, H. Yamasaki, H. Noda, O. Yamamoto,
Y. Okuno, and K. Arimoto, “A high-performance and energy-
efficient FFT implementation on super parallel processor
(MX) for mobile multimedia applications,” in Proceedings of
Intelligent Signal Processing and Communications Systems,
2009, pp. 1–4.
[13] H. Kimura, H. Nakamura, S. Kimura, and N. Yoshimoto,
“Numerical analysis of dynamic SNR management by con-
trolling DSP calculation precision for energy-efficient OFDM-
PON,” IEEE Photonics Technology Letters, vol. 24, no. 23,
pp. 2132–2135, 2012.
[14] “XST User Guide for Virtex-6, Spartan-6, and 7 Series De-
vices,” http://www.xilinx.com/support/documentation.
[15] M. Puschel, P. A. Milder, and J. C. Hoe, “Permuting stream-
ing data using RAMs,” Journal of the ACM, vol. 56, no. 2, pp.
10:1–10:34, 2009.