
[IEEE 2013 23rd International Conference on Field Programmable Logic and Applications (FPL) - Porto, Portugal (2013.09.2-2013.09.4)] 2013 23rd International Conference on Field programmable


ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE∗

Ren Chen, Hoang Le, and Viktor K. Prasanna

Ming Hsieh Department of Electrical Engineering

University of Southern California, Los Angeles, USA 90089

Email: {renchen, hoangle, prasanna}@usc.edu

ABSTRACT

In this paper, we revisit the classic Fast Fourier Transform (FFT) for energy efficient designs on FPGAs. A parameterized FFT architecture is proposed to identify the design trade-offs in achieving energy efficiency. We first perform design space exploration by varying the algorithm mapping parameters, such as the degree of vertical and horizontal parallelism, that characterize decomposition based FFT algorithms. Then we explore an energy efficient design by empirically selecting the values of the chosen architecture parameters, including the type of memory elements, the type of interconnection network, and the number of pipeline stages. The trade-offs between energy, area, and time are analyzed using two performance metrics: the energy efficiency (defined as the number of operations per Joule) and the Energy×Area×Time (EAT) composite metric. From the experimental results, a design space is generated to demonstrate the effect of these parameters on the various performance metrics. For N-point FFT (16 ≤ N ≤ 1024), our designs achieve up to 28% and 38% improvement in the energy efficiency and EAT, respectively, compared with a state-of-the-art design.

1. INTRODUCTION

FPGAs are a promising implementation technology for computationally intensive applications such as signal, image, and network processing tasks [1, 2]. State-of-the-art FPGAs offer high operating frequency, unprecedented logic density, and a host of other features. As FPGAs are programmed specifically for the problem to be solved, they can achieve higher performance with lower power consumption than general purpose processors.

The Fast Fourier Transform (FFT) is one of the most frequently used kernels in a wide variety of image and signal processing applications. Various derivative FFT algorithms have been proposed and developed. The Radix-x Cooley-Tukey algorithm is one of the most popular algorithms for hardware implementation [3, 4, 5, 6]. Most hardware solutions for Radix-x FFT fall into one of two categories, delay feedback or delay commutator architectures [4], such as the Radix-$2^2$ single-path delay feedback FFT [4] and the Radix-4 single-path delay commutator FFT [5]. By focusing on circuit level optimizations, these solutions achieve improvement in throughput, area, or power.

∗This work has been funded by DARPA under grant number HR0011-12-2-0023.

Energy efficiency is a key design metric. To obtain an energy efficient design for FFT, we analyze the trade-offs between energy, area, and time for fixed-point FFT on a parameterized architecture, using the Cooley-Tukey algorithm. Energy efficiency can be achieved both at the algorithm mapping level and the architecture level [7, 8]. Optimizing at these two levels allows power to be effectively traded off against other performance parameters. For example, a design consuming 2× the power but achieving 3× the system throughput is actually 50% more energy efficient than the original design. We present the architecture design space with respect to energy efficiency at the algorithm mapping level. By empirical selection of the proposed architecture parameter values, we explore an energy efficient design at the architecture level.
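The arithmetic behind the 2× power, 3× throughput example can be checked directly. The short sketch below assumes, as that example does, that energy efficiency is proportional to throughput divided by power; the function name is illustrative and not part of the proposed design.

```python
def relative_energy_efficiency(power_ratio: float, throughput_ratio: float) -> float:
    """Energy efficiency (operations per Joule) of a new design relative to a baseline.

    Efficiency is throughput divided by power, so the ratio of efficiencies
    is throughput_ratio / power_ratio.
    """
    return throughput_ratio / power_ratio


# The example from the text: 2x the power, 3x the throughput.
gain = relative_energy_efficiency(power_ratio=2.0, throughput_ratio=3.0)
print(f"{gain:.2f}x the baseline efficiency ({100 * (gain - 1):.0f}% improvement)")
```

A design is therefore worth its extra power whenever its throughput ratio exceeds its power ratio.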

In this paper, we make the following contributions:

1. A parameterized FFT architecture using the Radix-4 Cooley-Tukey algorithm (Section 3.1).

2. A design space that demonstrates the effect of the parameters on the Energy×Area×Time (EAT) composite metric and the energy efficiency (Section 4.3.2).

3. A demonstration of improved energy efficiency of the proposed design, obtained by identifying energy hot spots and varying the chosen architecture parameters (Section 4.3.2).

4. Optimized designs achieving significant improvement in energy efficiency compared with a state-of-the-art design (Section 4.4).

The rest of the paper is organized as follows. Section 2 covers the background and related work. Section 3 describes the proposed parameterized architecture and its implementation on FPGA. Section 4 presents experimental results and analysis. Section 5 concludes the paper.


978-1-4799-0004-6/13/$31.00 ©2013 IEEE


2. BACKGROUND AND RELATED WORK

2.1. Background

Given N complex numbers $x_0, \ldots, x_{N-1}$, the Discrete Fourier Transform (DFT) is computed as

$$X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k n / N}, \qquad k = 0, \ldots, N-1.$$

Radix-x Cooley-Tukey FFT is a well known decomposition based algorithm for N-point DFT. In this paper, we employ Radix-4 FFT for our design. The description of Radix-4 FFT is presented in Algorithm 1. In terms of the number of real operations, the computational complexity of N-point Radix-4 FFT is $O(N \log_4 N)$. The algorithm performs N-point FFT in $N/m$ ($m < N$) cycles using $m$ Input/Output ports (I/Os) and $\log_4 N$ radix blocks, which are used for butterfly computations. The algorithm iteratively decomposes the entire problem into four subproblems. This feature enables us to map the algorithm by folding the FFT architecture vertically or horizontally, thus providing much freedom to implement various designs on FPGAs. We propose our parameterized architecture in Section 3.2 based on this feature.

Algorithm 1 Radix-4 FFT Algorithm

1: $q = N/4$; $d = N/4$;
2: for $p := 0$ to $\log_4 N - 1$ do
3:   for $k := 0$ to $4^p - 1$ do
4:     $l = 4kq/4^p$; $r = l + q/4^p - 1$;
5:     $tw_1 = w[k]$; $tw_2 = w[2k]$; $tw_3 = w[3k]$;
6:     for $i := l$ to $r$ do
7:       $t_0 = i$; $t_1 = i + d/4^p$; $t_2 = i + 2d/4^p$; $t_3 = i + 3d/4^p$;
8:       do parallel
9:         $f_{p+1}[t_0] = f_p[t_0] + f_p[t_1] + f_p[t_2] + f_p[t_3]$;
10:        $f_{p+1}[t_1] = f_p[t_0] - j f_p[t_1] - f_p[t_2] + j f_p[t_3]$;
11:        $f_{p+1}[t_2] = f_p[t_0] - f_p[t_1] + f_p[t_2] - f_p[t_3]$;
12:        $f_{p+1}[t_3] = f_p[t_0] + j f_p[t_1] - f_p[t_2] - j f_p[t_3]$;
13:      end parallel
14:      do parallel
15:        $f_{p+1}[t_0] = f_{p+1}[t_0]$;
16:        $f_{p+1}[t_1] = tw_1 \times f_{p+1}[t_1]$;
17:        $f_{p+1}[t_2] = tw_2 \times f_{p+1}[t_2]$;
18:        $f_{p+1}[t_3] = tw_3 \times f_{p+1}[t_3]$;
19:      end parallel
20:    end for
21:  end for
22: end for
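As an illustrative software reference for the radix-4 decomposition (not the hardware mapping proposed here), the following sketch computes a radix-4 FFT recursively: each call applies the four radix-4 butterfly equations, multiplies three of the results by twiddle factors, and recurses on the four subproblems. The recursive formulation, the function names, and the choice to write each subproblem's result to every fourth output position are assumptions made for brevity; a direct DFT is included only as a correctness check.

```python
import cmath


def dft(x):
    """Direct O(N^2) evaluation of the DFT definition, used as a reference."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft_radix4(x):
    """Recursive radix-4 FFT for input lengths that are powers of 4."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 4 != 0:
        raise ValueError("length must be a power of 4")
    q = n // 4
    y = [[0j] * q for _ in range(4)]
    for i in range(q):
        a, b, c, d = x[i], x[i + q], x[i + 2 * q], x[i + 3 * q]
        w1 = cmath.exp(-2j * cmath.pi * i / n)  # twiddle factor for this position
        # Radix-4 butterfly: only additions, subtractions, and multiplications by +/-j.
        y[0][i] = a + b + c + d
        y[1][i] = (a - 1j * b - c + 1j * d) * w1
        y[2][i] = (a - b + c - d) * w1 ** 2
        y[3][i] = (a + 1j * b - c - 1j * d) * w1 ** 3
    out = [0j] * n
    for m in range(4):
        # Subproblem m produces outputs X[m], X[m + 4], X[m + 8], ...
        out[m::4] = fft_radix4(y[m])
    return out


if __name__ == "__main__":
    samples = [complex(i % 5, -(i % 3)) for i in range(16)]
    error = max(abs(u - v) for u, v in zip(fft_radix4(samples), dft(samples)))
    print(f"max deviation from direct DFT: {error:.2e}")
```

The recursion depth equals $\log_4 N$, which corresponds to the number of radix blocks in a fully unfolded pipeline.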

2.2. Related Work

To the best of our knowledge, there has been no previous work targeted at exploring the design space for energy efficiency of FFT at both the algorithm mapping level and the architecture level on FPGAs. Existing work has mainly focused on optimizing the performance, power, and area of the design at the circuit level.

An energy-efficient 1024-point FFT processor was developed in [9]. A cache-based FFT algorithm was proposed to achieve low power and high performance, and an energy-time performance metric was evaluated at various processor operating points. In [10], a high-speed and low-power FFT architecture was presented: a delay balanced pipeline architecture based on the split-radix algorithm. Algorithms for reducing computation complexity were explored, and the architecture was evaluated in terms of area, power, and timing performance.

Based on Radix-x FFT, various pipeline FFT architectures have been proposed, such as Radix-2 single-path delay feedback FFT [3], Radix-4 single-path delay commutator FFT [5], Radix-2 multi-path delay commutator FFT [6], and Radix-$2^2$ single-path delay feedback FFT [4]. These architectures can achieve high throughput per unit area with single-path or multi-path pipelines, but energy efficiency has not been explored in these works.

In [11], a parameterized soft core generator for high throughput DFT was developed. This generator can automatically produce an optimized design from user inputs for performance and resource constraints. However, energy efficiency is not considered in this work. In [7], the authors presented a parameterized energy efficient FFT architecture. Their design is optimized to achieve high energy efficiency by varying the architecture parameters. Some energy efficient design techniques, such as clock gating and memory binding, are also employed in their work.

Beyond FPGAs, techniques for energy efficient FFT have also been presented for other platforms [12, 13]. However, it is not clear how to apply these techniques on FPGAs. In this work, we extend the work of [7] by design space exploration at multiple levels. The design space exploration is performed on current state-of-the-art FPGAs. By exploring the energy-performance-area trade-offs at multiple levels, we obtain an energy efficient design for FFT.

3. ARCHITECTURE AND IMPLEMENTATIONS

3.1. Architecture building blocks

The proposed N-point FFT architecture is based on the Radix-4 Cooley-Tukey FFT algorithm. Note that the choice of the radix affects the energy efficiency of the design. Compared with the Radix-2 algorithm, Radix-4 uses fewer multiply operations. The proposed architecture consists of five building blocks (see Fig. 1): Radix-4 block (R4), Data buffer, Data path permutation (PER), Parallel-to-serial/serial-to-parallel (PS/SP) multiplexer, and Twiddle factor computation (TWC). A complete design for N-point FFT can be obtained by a combination of these basic blocks.

A. Radix-4 block

In this module, 16 signed adder/subtractors are used to complete the butterfly computations. It takes four inputs and generates four outputs in parallel. Each input contains real and imaginary components. The data outputs of R4 are used by the twiddle factor computation block except in the last stage (see Fig. 1a).
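One way to account for the 16 real adder/subtractors is a two-stage butterfly in which multiplication by ±j reduces to swapping the real and imaginary parts, so no multipliers are needed. The sketch below follows that assumption; the two-stage decomposition and the function name are illustrative, since the text specifies only the adder count.

```python
def radix4_butterfly(a, b, c, d):
    """Radix-4 butterfly on four complex inputs given as (real, imag) tuples.

    Uses 16 real additions/subtractions and no multiplications.
    """
    ar, ai = a
    br, bi = b
    cr, ci = c
    dr, di = d
    # Stage 1: four complex sums/differences (8 real add/sub operations).
    s0r, s0i = ar + cr, ai + ci   # a + c
    s1r, s1i = ar - cr, ai - ci   # a - c
    s2r, s2i = br + dr, bi + di   # b + d
    s3r, s3i = br - dr, bi - di   # b - d
    # Stage 2: four more complex sums/differences (8 real add/sub operations).
    x0 = (s0r + s2r, s0i + s2i)   # a + b + c + d
    x1 = (s1r + s3i, s1i - s3r)   # (a - c) - j(b - d)
    x2 = (s0r - s2r, s0i - s2i)   # a - b + c - d
    x3 = (s1r - s3i, s1i + s3r)   # (a - c) + j(b - d)
    return x0, x1, x2, x3


if __name__ == "__main__":
    print(radix4_butterfly((1, 2), (3, -4), (-5, 6), (7, 8)))
```

Because the inputs are plain integers here, the same function works unchanged on fixed-point sample values.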

B. Data buffer


Fig. 1: (a) Radix block, (b) Data buffer, (c) Data path permutation (PER), (d) Parallel-to-serial/serial-to-parallel MUX (PS/SP), (e) Twiddle factor computation (TWC)

Fig. 2: Data permutation in the data buffers for 16-point FFT

Each data buffer consists of a dual-port RAM having N/m entries, where m is the number of I/Os. Data is written into one port and read from the other port simultaneously. Fig. 2 shows the data buffers for a 16-point pipelined FFT. In four cycles, 16 permuted data inputs are fed into the data buffers. In each cycle, four data outputs are read in parallel from alternating locations. For different architectural parameters, the read and write addresses are generated with different strides. For example, in Fig. 2, four data inputs ($X_0$, $X_4$, $X_8$, $X_{12}$) are written in cycle 0, cycle 1, cycle 2, and cycle 3, respectively. They are then output simultaneously in cycle 4.

C. Data permutation block

Parallel input data must be permuted before being processed by the subsequent modules. Fig. 2 shows the data permutation for 16-point FFT. In the first cycle, four data inputs ($X_0$, $X_1$, $X_2$, $X_3$) are fed into the first entry of each data buffer without permutation. In the second cycle, another four data inputs are written into the second entry of each data buffer, permuted by one location. After four cycles, the parallel output data ($X_i$, $X_{i+4}$, $X_{i+8}$, $X_{(i+12) \bmod 16}$, $i = 0, 1, 2, 3$) are stored in different RAMs. These permutations are repeated every four cycles.
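The effect of this permutation can be sketched in software. The model below assumes that the inputs of cycle c are rotated by c buffer positions before being written to entry c of each buffer; this rotation rule and the function names are illustrative assumptions inferred from the description, not taken from the actual design.

```python
def fill_buffers(n=16, m=4):
    """Write n samples, m per cycle, into m data buffers with a rotating offset.

    Returns rams, where rams[b][e] is the index of the sample stored in
    entry e of buffer b. The rotation rule is an assumption for illustration.
    """
    entries = n // m
    rams = [[None] * entries for _ in range(m)]
    for cycle in range(entries):
        for j in range(m):
            sample = m * cycle + j        # sample arriving on input port j in this cycle
            buffer_id = (j + cycle) % m   # rotate by one more location each cycle
            rams[buffer_id][cycle] = sample
    return rams


def read_group(rams, i):
    """Return (buffer, entry, sample) triples for X_i, X_{i+m}, X_{i+2m}, ..."""
    m = len(rams)
    return [((i + e) % m, e, rams[(i + e) % m][e]) for e in range(len(rams[0]))]


if __name__ == "__main__":
    rams = fill_buffers()
    for i in range(4):
        group = read_group(rams, i)
        buffers = [b for b, _, _ in group]
        samples = [s for _, _, s in group]
        # Every sample of a group sits in a different buffer,
        # so all four can be read in the same cycle.
        assert sorted(buffers) == [0, 1, 2, 3]
        print(f"group {i}: samples {samples} read from buffers {buffers}")
```

Without the rotation, the four samples of a group would all land in the same buffer and could not be read in parallel from a dual-port RAM.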

D. PS/SP module

This module converts serial input data to parallel output and parallel input data to serial output. As shown in Fig. 3a, when the number of I/Os is limited to one, the radix-4 block still operates on four data inputs in parallel, so the PS/SP module is employed to match the data rate both before and after the radix-4 block.

Fig. 3: Parameterized architectures for 16-point FFT: (a) $H_p = 1$, $V_p = 1$; (b) $H_p = 2$, $V_p = 4$

E. Twiddle factor computation

This module consists of two blocks: the twiddle factor generation block and the complex number multiplier block. The twiddle factor generation block includes several lookup tables for storing twiddle factor coefficients, whose read addresses are updated by the control signals. The size of the lookup tables increases with the problem size. The complex number multiplier block consists of three multipliers and three adder/subtractors.
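A complex multiplication with three multipliers and three adder/subtractors is possible when the twiddle factor is known in advance, because two of the sums can be precomputed and stored in the lookup table. The sketch below shows one well-known formulation of this idea; whether the design uses exactly this variant is an assumption, and the function names and the table layout are illustrative.

```python
import cmath


def twiddle_entry(k, n):
    """Precompute a lookup-table entry for the twiddle factor w = c + jd.

    Storing d - c and c + d alongside c removes two additions from the
    run-time data path.
    """
    w = cmath.exp(-2j * cmath.pi * k / n)
    c, d = w.real, w.imag
    return c, d - c, c + d


def multiply_by_twiddle(a, b, entry):
    """Compute (a + jb) * (c + jd) with 3 multiplications and 3 add/sub operations."""
    c, d_minus_c, c_plus_d = entry
    k1 = c * (a + b)          # multiplier 1, adder 1
    k2 = a * d_minus_c        # multiplier 2
    k3 = b * c_plus_d         # multiplier 3
    return k1 - k3, k1 + k2   # adders 2 and 3: real part, imaginary part


if __name__ == "__main__":
    real, imag = multiply_by_twiddle(3.0, -2.0, twiddle_entry(1, 16))
    print(real, imag)
```

Compared with the direct four-multiplier form, this trades one multiplier for a larger lookup table entry.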

3.2. Parameterized FFT Architecture

3.2.1. Algorithm Mapping Parameters

Decomposition based Radix-4 FFT offers much flexibility in mapping the algorithm onto various architectures. Folding the FFT architecture enables the radix-4 blocks to be reused iteratively to save area, while unfolding the FFT increases spatial parallelism. Hence we use two algorithm mapping parameters that characterize the decomposition-based N-point FFT algorithm in our design:

1. Horizontal Parallelism ($H_p$): determines the number of radix-4 blocks concatenated horizontally ($1 \le H_p \le \log_4 N$).

2. Vertical Parallelism ($V_p$): determines the number of parallel I/Os ($1 \le V_p \le N$). $V_p$ is determined by the number of I/Os per pipeline ($N_c$) and the number of parallel pipelines ($N_p$), with $V_p = N_c \times N_p$. Each pipeline is a row of horizontally concatenated radix-4 blocks.

We adapt these two architectural parameters for an energy efficient design. Two different architectures are presented in Fig. 3. In Fig. 3a, $V_p = N_c = N_p = 1$, $H_p = 1$, and $N = 16$: one radix-4 block is employed and iteratively used by two stages, and one input data item is processed per cycle. This architecture achieves higher resource efficiency and consumes less I/O power, at the expense of lower throughput.

Fig. 4: (a) Crossbar network, (b) Complete binary tree, (c) Dynamic network

In Fig. 3b, $V_p = 4$, $H_p = 2$, and $N = 16$, so two radix-4 blocks are utilized. There is only one pipeline, with $N_c = 4$ and $N_p = 1$. Four inputs can be processed in parallel per cycle. Note that there is no feedback path. The architecture achieves high throughput by using more basic blocks and I/Os, at the cost of higher power consumption.

We can also increase $V_p$ by replicating the basic pipeline. This replication allows several pipelines to work in parallel to significantly increase the throughput at the cost of more complex interconnections.
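The algorithm mapping parameters span a small design space that can be enumerated programmatically. The sketch below is a rough bookkeeping model under stated assumptions: the number of radix-4 blocks is taken as $H_p \times N_p$, a folded design is assumed to need $\lceil \log_4 N / H_p \rceil$ passes through its pipeline, and the candidate values for $N_c$ and $N_p$ are hypothetical defaults. It is not a performance model of the actual hardware.

```python
from dataclasses import dataclass
from math import ceil


def log4(n):
    """Return log4(n) for n that is a power of 4, else raise ValueError."""
    stages = 0
    while n > 1:
        if n % 4:
            raise ValueError("n must be a power of 4")
        n //= 4
        stages += 1
    return stages


@dataclass(frozen=True)
class DesignPoint:
    n: int        # FFT size N
    hp: int       # horizontal parallelism H_p
    nc: int       # I/Os per pipeline N_c
    npipes: int   # number of parallel pipelines N_p

    @property
    def vp(self):
        return self.nc * self.npipes            # V_p = N_c x N_p

    @property
    def radix4_blocks(self):
        return self.hp * self.npipes            # assumption: H_p blocks per pipeline

    @property
    def passes(self):
        return ceil(log4(self.n) / self.hp)     # assumption: reuse when folded


def enumerate_design_points(n, nc_options=(1, 4), npipes_options=(1, 2, 4)):
    """List all (H_p, N_c, N_p) combinations with 1 <= H_p <= log4(N) and V_p <= N."""
    return [
        DesignPoint(n, hp, nc, npipes)
        for hp in range(1, log4(n) + 1)
        for nc in nc_options
        for npipes in npipes_options
        if nc * npipes <= n
    ]


if __name__ == "__main__":
    for point in enumerate_design_points(16):
        print(point, "Vp =", point.vp, "blocks =", point.radix4_blocks,
              "passes =", point.passes)
```

Under these assumptions, the folded 16-point design with $H_p = 1$ reuses its single radix-4 block for two passes, while the design with $H_p = 2$ needs only one pass.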

3.2.2. Architecture Parameters

Three architecture parameters that significantly affect energy efficiency are employed in our design and applied to different components:

1. Type of memory element: BRAM or distributed RAM (dist. RAM) can be used as memory. In our design, both the data buffers and the twiddle factor lookup tables can be implemented using either memory element.

2. Type of interconnection: three different types of interconnection (see Fig. 4) are used to implement the data permutation blocks: crossbar network, complete binary tree, and dynamic network.

3. Pipeline depth: both adder/subtractors and DSP slices in the FPGA can be deeply pipelined by inserting registers, so we parameterize the arithmetic units and multipliers by pipeline depth to balance performance and resource usage.

According to the FPGA manufacturer's user guide [14], BRAM consumes less power than dist. RAM when used for large memories. This characteristic can be utilized to trade off power against performance for various problem sizes.

As there are $2 \times m \times (H_p + 1)$ permutation modules (where $m = 1$ when $V_p = 4$ and $m = 0$ otherwise), using different interconnection networks can significantly affect the energy efficiency of the designs. The physical layout of the complete binary tree is similar to that of a crossbar network, but more pipeline registers can be inserted between the layers of the tree. The dynamic network can be implemented using shift registers. Among the three types of interconnection, the dynamic network has high performance but greater power consumption, the crossbar network consumes the least resources and power but has a long wire delay, and the complete binary tree has simpler routing, which improves performance at the expense of greater area usage.

4. EXPERIMENTAL RESULTS AND ANALYSIS

4.1. Experimental Setup

In this section, we present a detailed analysis of several implementation experiments obtained by varying the parameters. All the designs were implemented in Verilog on a Virtex-7 FPGA (XC7VX980T, speed grade -2L) using Xilinx ISE 14.4. Inputs are 16-bit fixed point complex numbers. The input test vectors for simulation were randomly generated and had an average toggle rate of 50%. We used the VCD (value change dump) file as input to Xilinx XPower Analyzer to produce accurate power dissipation estimates [14]. For all the evaluated designs, the operating frequency is set to 333 MHz.

4.2. Performance Metrics

Two metrics for performance evaluation are considered in this paper:

1. Energy efficiency is defined as the number of operations per unit energy (Energy efficiency = Number of operations / Energy). For N-point Radix-4 FFT, energy efficiency is given by $\left(\frac{17}{4} N \log_2 N\right) / \text{Energy}$. Energy is the product of the average power dissipation of the design and the latency of the FFT computation.

2. Energy × Area × Time (EAT) is measured as the product of three key metrics: energy, area, and time. For a given problem size, we use the EAT ratio to compare the performance of different designs. Area is the area usage of the design, i.e., the number of LUTs or flip-flops (whichever is larger) occupied by the entire design. BRAMs are converted to an equivalent number of LUTs based on the memory size, which gives the area of the BRAMs. Time is the latency of the pipelined N-point FFT.
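The two metrics can be written as small helper functions. The sketch below follows the definitions above; the function names, the units, and the numbers in the usage example are hypothetical, and the area is passed in as a single precomputed value because the exact BRAM-to-LUT conversion is not specified here.

```python
from math import log2


def fft_operations(n):
    """Operation count used for N-point Radix-4 FFT: (17/4) * N * log2(N)."""
    return 17 / 4 * n * log2(n)


def energy_joules(avg_power_w, latency_s):
    """Energy = average power dissipation x latency of the FFT computation."""
    return avg_power_w * latency_s


def energy_efficiency(n, avg_power_w, latency_s):
    """Number of operations per Joule."""
    return fft_operations(n) / energy_joules(avg_power_w, latency_s)


def eat(avg_power_w, area, latency_s):
    """Energy x Area x Time composite metric (lower is better)."""
    return energy_joules(avg_power_w, latency_s) * area * latency_s


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the formulas.
    n, power_w, latency_s, area_luts = 1024, 1.0, 1e-6, 20000
    print("operations:", fft_operations(n))
    print("ops per Joule:", energy_efficiency(n, power_w, latency_s))
    print("EAT:", eat(power_w, area_luts, latency_s))
```

Since time enters both through the energy and as a separate factor, EAT penalizes latency quadratically for a fixed power and area.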

4.3. Design space exploration

In this section, we first explore the design space by varying the algorithm mapping parameters. The parameter values are then chosen according to the experimental results. Based on that, we explore the energy-efficient design (referred to as the empirically optimized design) by varying the architecture parameters empirically. Both the dist. RAM based design and the BRAM based design are used in this experiment. The effects of the design parameters on energy efficiency are demonstrated using the proposed performance metrics.

Fig. 5: Energy efficiency for various $H_p$ with varying N for the dist. RAM based design

Fig. 6: Energy efficiency for various $H_p$ with varying N for the BRAM based design

4.3.1. Algorithm mapping level exploration

A. Horizontal Parallelism

In this experiment, we explored energy efficiency while varying the horizontal parallelism, with $V_p = 4$, $N_c = 4$, and $N_p = 1$. The range of $H_p$ is $[1, \log_4 N]$. The energy efficiency for various $H_p$ is shown in Fig. 5 and Fig. 6 for the dist. RAM based and BRAM based designs, respectively. Based on the experimental results, we have the following observations:

• For the considered problem sizes, increasing $H_p$ can significantly improve energy efficiency for all designs. Despite the extra hardware required to unfold the FFT horizontally, the reduced latency of the FFT computation enables the unfolded design to outperform the original design.

• As N grows, the energy efficiency of the dist. RAM based design declines, whereas that of the BRAM based design increases. The reason is that dist. RAM power increases significantly with memory size, whereas BRAM power is mainly determined by the number of BRAMs used [14].

• For the dist. RAM based design, the improvement in energy efficiency brought by increasing $H_p$ is sensitive to N. For example, when N = 1024, doubling $H_p$ leads to only a slight increase in energy efficiency. Thus, reducing $H_p$ to save area could be a feasible alternative for larger problem sizes.

• For BRAM based designs, the improvement in energy efficiency brought by increasing $H_p$ is not sensitive to N. Reducing $H_p$ to save area can lead to a significant decline in energy efficiency for any problem size.

Fig. 7: Energy efficiency for various $V_p$ with varying N for the dist. RAM based design

Fig. 8: Energy efficiency for various $V_p$ with varying N for the BRAM based design

B. Vertical Parallelism

Vertical parallelism is determined by three values: the radix (fixed at 4), $N_c$, and $N_p$. $H_p$ was set to $\log_4 N$. $N_c$ and $N_p$ were varied for evaluation. Both dist. RAM and BRAM based designs were evaluated. The energy efficiency for various $V_p$ is shown in Fig. 7 and Fig. 8. Note that the maximum $V_p$ is limited by the available number of I/O pins. In this experiment, we have the following observations:

• The BRAM based design is more scalable than the dist. RAM based design with respect to energy efficiency. When N ≥ 64, energy efficiency starts to decline for dist. RAM based designs due to high power consumption with increasing memory size.

• Increasing $N_c$ instead of $N_p$ can improve energy efficiency with fewer hardware resources, since increasing $N_c$ only requires extra data buffers while increasing $N_p$ requires replicating the full pipeline.

• Given a loose area constraint, we can improve energy efficiency and throughput by increasing $N_p$. Although increasing $N_p$ leads to high power and resource consumption, the boosted throughput can offset these disadvantages.

Table 1: Architecture parameters of designs for comparison

Design                        Memory type        Interconnection network                 Pipeline stages
                                                 Type                  Components        Multiplier  Adder
Design A                      Dist. RAM          Dynamic network       Registers         5           2
Empirically optimized design  Dist. RAM or BRAM  Crossbar network      LUTs              3           2
Design C                      BRAM               Complete binary tree  LUTs + Registers  2           1

Fig. 9: Power profile of 1024-point FFT architecture: (a) Dist. RAM based design, (b) BRAM based design

4.3.2. Architecture level exploration

In this section, we explore an energy efficient design (the empirically optimized design) at the architecture level. We choose $V_p = 4$ and $H_p = \log_4 N$ based on the conclusions from the previous experiments.

A. Energy hot spots

As shown in Fig. 9a, the dominant portion of the total power is consumed by the data buffers for 1024-point FFT. This indicates that BRAM can be utilized to improve energy efficiency for large values of N. Fig. 9b shows that the share of I/O power and static power in the total power increases significantly for BRAM based designs. As I/O power and static power are constant here, this indicates that using BRAMs reduces the power of the main design components. It also suggests that I/Os consume a large portion of the power for the BRAM based design.

B. Empirically optimized design

Fig. 10: Energy efficiency of the empirically optimized design and the baseline designs

We first analyze the effects of the architecture parameters on energy, performance, and area. The analysis below was applied to choose the architecture parameter values for the empirically optimized design in our experiment:

• Energy: Reducing the number of registers can significantly reduce signal power, which dominates the dynamic power. A crossbar network can therefore be considered to increase energy efficiency.

• Performance: Using BRAM can lead to a decline in peak operating frequency. For large values of N, when using BRAMs, extra pipeline stages can be used to address the performance degradation.

• Area: The area of the pipeline registers dominates the total design area. The number of pipeline registers can be adjusted to trade off area against performance.

We compare our proposed empirically optimized design with two baseline designs, Design A and Design C. The architecture parameters of the designs are shown in Table 1, and the comparison of their energy efficiency is shown in Fig. 10. The proposed empirically optimized design improves energy efficiency by up to 27% compared with the two baseline designs.

4.4. Performance comparison

We compare our proposed empirically optimized design with the SPIRAL FFT IP core. The SPIRAL FFT IP cores are high performance FFT designs based on a streaming architecture. The data permutation block in their designs has been mathematically proven to be control-cost optimal [15]. Using the provided tools, customized FFT soft IP cores can be automatically generated in synthesizable RTL Verilog from user inputs [11]. The available parameters of the DFT core generator include transform size, data precision, and streaming width. In this comparison, we use the dist. RAM based design for N ≤ 64 and the BRAM based design for N > 64. For the design from SPIRAL, the code for N-point (16-bit fixed point) FFT is automatically generated by the SPIRAL core generator. The architecture is fully streaming and the data are presented in their natural order. As shown in Fig. 11, our proposed design improves energy efficiency by 8% to 28% and EAT by 23% to 38% compared with the SPIRAL FFT IP cores.

Fig. 11: Comparison between the proposed empirically optimized design and the SPIRAL FFT IP cores for EAT and energy efficiency

5. CONCLUSION

We presented a parameterized architecture for energy efficient implementation of the Radix-4 Cooley-Tukey FFT algorithm. The effect of the two-level parameters on energy efficiency was demonstrated through design space exploration. We studied the power consumption of the components for various problem sizes, and proposed our empirically optimized design by empirical selection of architecture parameter values. Compared with the state-of-the-art design, our optimized architectures achieve up to 28% and 38% improvement in energy efficiency and EAT, respectively. In the future, we plan to work on an accurate high-level performance model for energy efficiency estimation, which can be used to accelerate design space exploration to obtain an energy efficient design.

6. REFERENCES

[1] N. Shirazi, P. M. Athanas, and A. L. Abbott, “Implementation of a 2-D Fast Fourier Transform on an FPGA-Based Custom Computing Machine,” in Proceedings of Field-Programmable Logic and Applications, 1995, pp. 282–292.

[2] D. Chen, G. Yao, C. Koc, and R. Cheung, “Low complexity and hardware-friendly spectral modular multiplication,” in Proceedings of Field-Programmable Technology (FPT), 2012, pp. 368–375.

[3] E. H. Wold and A. M. Despain, “Pipeline and parallel-pipeline FFT processors for VLSI implementations,” IEEE Transactions on Computers, vol. C-33, no. 5, pp. 414–426, 1984.

[4] S. He and M. Torkelson, “A new approach to pipeline FFT processor,” in Proceedings of IPPS ’96, pp. 766–770.

[5] G. Bi and E. Jones, “A pipelined FFT processor for word-sequential data,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 12, pp. 1982–1985, 1989.

[6] L. R. Rabiner and B. Gold, “Theory and application of digital signal processing,” Englewood Cliffs, NJ, Prentice-Hall, Inc., 1975. 777 p., vol. 1.

[7] S. Choi, R. Scrofano, V. K. Prasanna, and J.-W. Jang, “Energy-efficient signal processing using FPGAs,” in Proceedings of FPGA ’03, 2003, pp. 225–234.

[8] D. Aravind and A. Sudarsanam, “High level - application analysis techniques architectures - to explore design possibilities for reduced reconfiguration area overheads in FPGAs executing compute intensive applications,” in Proceedings of IPDPS, 2005, pp. 158a–158a.

[9] B. Baas, “A low-power, high-performance, 1024-point FFT processor,” IEEE Journal of Solid-State Circuits, vol. 34, no. 3, pp. 380–387, 1999.

[10] W.-C. Yeh and C.-W. Jen, “High-speed and low-power split-radix FFT,” IEEE Transactions on Signal Processing, vol. 51, no. 3, pp. 864–874, 2003.

[11] G. Nordin, P. A. Milder, J. C. Hoe, and M. Püschel, “Automatic generation of customized Discrete Fourier Transform IPs,” in Proceedings of Design Automation Conference (DAC), 2005, pp. 471–474.

[12] T. Sugimura, H. Yamasaki, H. Noda, O. Yamamoto, Y. Okuno, and K. Arimoto, “A high-performance and energy-efficient FFT implementation on super parallel processor (MX) for mobile multimedia applications,” in Proceedings of Intelligent Signal Processing and Communications Systems, 2009, pp. 1–4.

[13] H. Kimura, H. Nakamura, S. Kimura, and N. Yoshimoto, “Numerical analysis of dynamic SNR management by controlling DSP calculation precision for energy-efficient OFDM-PON,” IEEE Photonics Technology Letters, vol. 24, no. 23, pp. 2132–2135, 2012.

[14] “XST User Guide for Virtex-6, Spartan-6, and 7 Series Devices,” http://www.xilinx.com/support/documentation.

[15] M. Püschel, P. A. Milder, and J. C. Hoe, “Permuting streaming data using RAMs,” Journal of the ACM, vol. 56, no. 2, pp. 10:1–10:34, 2009.
