

Circuits Syst Signal Process
DOI 10.1007/s00034-011-9332-7

Design and Comparison of FFT VLSI Architectures for SoC Telecom Applications with Different Flexibility, Speed and Complexity Trade-Offs

Sergio Saponara · Massimo Rovini · Luca Fanucci · Athanasios Karachalios · George Lentaris · Dionysios Reisis

Received: 5 July 2010 / Revised: 16 June 2011
© Springer Science+Business Media, LLC 2011

Abstract The design of Fast Fourier Transform (FFT) integrated architectures for System-on-Chip (SoC) telecom applications is addressed in this paper. After reviewing the FFT processing requirements of wireless and wired Orthogonal Frequency Division Multiplexing (OFDM) standards, including the emerging Multiple Input Multiple Output (MIMO) and OFDM Access (OFDMA) schemes, three FFT architectures are proposed: a fully parallel, a pipelined cascade and an in-place variable-size architecture, which offer different trade-offs among flexibility, processing speed and complexity. Silicon implementation results and comparisons with the state-of-the-art prove that each macrocell outperforms the known works for a target application. The fully parallel is optimized for throughput requirements up to several GSamples/s enabling Ultra-wideband (UWB) communications by using all channels foreseen in the standard. The pipelined cascade macrocell minimizes complexity for large size FFTs sustaining throughput up to 100 MSamples/s. The in-place variable-size FFT

S. Saponara (✉) · M. Rovini · L. Fanucci
Department of Information Engineering, University of Pisa, Via G. Caruso 16, 56122 Pisa, Italy
e-mail: [email protected]

M. Rovini
e-mail: [email protected]

L. Fanucci
e-mail: [email protected]

A. Karachalios · G. Lentaris · D. Reisis
Department of Physics, University of Athens, Panepistimiopolis, Zografou, 15784 Athens, Greece

A. Karachalios
e-mail: [email protected]

G. Lentaris
e-mail: [email protected]

D. Reisis
e-mail: [email protected]


macrocell stands out for its flexibility, allowing the run-time reconfigurability required in OFDMA schemes while attaining the throughput required to support MIMO communications. The three architectures are also compared on common case studies and a common target technology.

Keywords VLSI design · Fast Fourier Transform · System-on-Chip · OFDM telecom systems

1 Introduction

In the evolving telecommunication applications, dedicated FFT/IFFT architectures are required for the baseband processing. A plethora of such applications (see [1, 6, 11, 12, 17, 18, 20, 22, 38, 41]) suggests the design of configurable FFT architectures, capable of achieving high throughput while keeping the gate complexity and power consumption relatively low. Aiming at accommodating these types of applications, this paper proposes the design of different VLSI (Very Large Scale Integration) FFT/IFFT architectures targeting different trade-offs among the above performance metrics. Particularly, the design aspects allowing for optimized FPGA (Field Programmable Gate Array) implementations are considered. FPGAs provide an attractive implementation platform for telecom applications, because they can be reconfigured at compilation and/or run time and hence support different wireless standards. Moreover, today’s FPGA designs extend their application range from prototyping platforms to user products, from fixed to mobile terminals: indeed, FPGA families are available at the cost of a few dollars for large volume markets, while embedded FPGAs can be integrated as reconfigurable logic in Systems-on-Chip (SoCs).

The specifications of advanced OFDM-based standards for telecom systems lead to a wide configuration space to be faced by the FFT engine. The throughput may vary from a few MSamples/s in xDSL (Digital Subscriber Line) modems for residential Internet connections (see [12, 41]) up to GSamples/s in UWB terminals for short-range communication of multimedia contents [18]. The I/O data-width may vary from 4 or 5 bits in UWB up to 16 bits in VDSL (Very high-speed DSL) or BPL (Broadband on Power Lines) applications. Similarly, the FFT size (i.e. the FFT length) varies from 64 complex points in Wireless Local Area Network (WLAN) (see [20, 38]) or ADSL, to 8192 in DVB (Digital Video Broadcasting) [22]. Moreover, the FFT engine should be conceived as a parametric IP (Intellectual Property) macrocell and, once integrated, it should still be configurable at run time to support standards with multi-mode adaptive behavior. Examples of such standards are the Worldwide Interoperability for Microwave Access, WiMAX [2], and the 3rd Generation Partnership Project Long Term Evolution, 3GPP LTE [19], with FFT length ranging from 128 to 2048.

To achieve a unified view on the variety of possible design solutions meeting the above requirements, this paper proposes different architectural approaches with respect to the degree of parallelism, memory access strategy and machine arithmetic style. Furthermore, it shows their implementations and analyzes their advantages and disadvantages in terms of performance, complexity and flexibility, considering FPGA devices as target implementation technology. Exploiting different FFT processing


schemes, each of the proposed architectures introduces design features allowing for an efficient support of a specific group of the aforementioned standards.

The paper is organized as follows. Section 2 reviews the OFDM communication standards and the requirements for the FFT processing core. Section 3 proposes a massively parallel FFT architecture suitable for high-throughput applications (up to GSamples/s) such as UWB. Section 4 presents a configurable cascade FFT core which ensures an optimal trade-off between complexity and performance for applications requiring large size FFTs (1024 complex points), such as DVB, and large data-widths, but with throughput requirements lower than one hundred MSamples/s. Section 5 describes an in-place variable-length FFT core with parallel butterfly processors optimizing run-time reconfigurability and still supporting high-throughput applications. Such an architecture is suitable for emerging WiMAX terminals needing run-time FFT length configuration and a computational throughput up to hundreds of MSamples/s to support Multi-Input Multi-Output (MIMO) communications. Implementation results of the above architectures on the same target technology and comparisons between them are presented in Sect. 6. Results are also compared with the state-of-the-art of FFT VLSI designs for OFDM telecom applications. Conclusions are drawn in Sect. 7.

2 Overview on OFDM-Based Communication Standards

2.1 OFDM and MIMO-OFDM Architectures

The multi-carrier OFDM scheme has fostered the rise of several wireless and wired communication standards including: Digital Broadcasting of Audio and Video contents, in Terrestrial and Handheld scenarios (DAB, DVB-T/H) (see [6, 22]); 802.16d/e Wireless Metropolitan Area Network (WMAN), known respectively as fixed and mobile WiMAX [17], for wireless fast Internet access in metropolitan scenarios; xDSL [7, 12, 31, 41] and BPL [1, 3, 11] modems for fast Internet access through wired channels, the telephone line and the power line respectively; 802.11 a/n WLAN for medium range indoor networking [20, 26, 38]; UWB radio [18, 32, 36] for high data rate personal area network connectivity. The connectivity range covers short range using UWB radio, mid range based on WLAN, BPL and VDSL and wide range through DVB-T/H, DAB, WMAN and xDSL standards.

With respect to single-carrier modulation, OFDM-based systems offer enhanced robustness against cross-talk, fading channels and multi-path distortion [5]. In OFDM systems, channel equalization is simplified because the transmitted data are spread across orthogonal sub-carriers, hence OFDM can be viewed as the contribution of many narrow-band signals rather than a rapidly-modulated wideband signal. In 802.16e, OFDM is also deployed as a multi-user access technology (OFDMA), where carriers are clustered in subsets dynamically assigned to each user. Therefore, the channel capacity is shared among multiple users.

All the aforementioned standards exploit a similar baseband processing scheme whose core is an FFT processor, in charge of multi-carrier symbol demodulation at the receiver (rx), plus an IFFT processor in charge of symbol modulation at the transmitter (tx). FFT and IFFT require roughly half of the total circuit complexity of the


baseband processing in OFDM systems (see [6, 29]). During modulation (IFFT) the symbol is cyclically extended to insert a guard interval that handles time-spreading and eliminates inter-symbol interference. The cyclic prefix is removed at the receiver side (FFT).

Note that FFT and IFFT operations can be merged in a single FFT/IFFT processor in case the communication is based on a time division duplexing (TDD) scheme, since the transceiver is working either in rx mode (demodulation by FFT) or in tx mode (modulation by IFFT). In full-duplex transceivers adopting frequency division duplexing (FDD), with concurrent tx and rx, FFT and IFFT have to be implemented through different dedicated processors.

OFDM can be used in conjunction with MIMO techniques to increase the system capacity and/or the diversity gain (see [21, 25, 29]). The MIMO scheme, adopted in emerging standards such as 802.16 WMAN and 802.11n WLAN, uses multiple antennas at both the receiver and the transmitter side to exploit spatial diversity and/or spatial multiplexing.

Spatial multiplexing increases the capacity of a MIMO link by simultaneously transmitting, from each tx antenna, independent data streams in the same time slot and frequency band. Multiple data streams are then differentiated at the receiver side by using channel information about each propagation path. Because multiple data streams are transmitted in parallel from different antennas, there is a linear increase in throughput for every pair of antennas added to the system (see [21, 29]).

In contrast to spatial multiplexing, spatial diversity increases the diversity order of a MIMO link to mitigate fading by coding a signal across space and time using special space-time code techniques such as the Alamouti code [21]. At the rx side, the multiple replicas of the signal are combined constructively to achieve a diversity gain.

Fig. 1 WLAN 2 × 2 MIMO-OFDM scheme


Figure 1 shows the baseband processing architecture for a 2 × 2 WLAN OFDM system in [21], featuring 2 rx paths and 2 tx paths. It is clear that the number of FFT and IFFT processing units to be integrated in the transceiver depends on the number of rx and tx paths. In an M × M OFDM-MIMO system, up to M different streams can be transmitted concurrently over multiple antennas; to serve these streams, M FFT (and M IFFT) processing units working in parallel are required, thus increasing area by a factor M. Alternatively, a lower number of P processors, with P ∈ [1, M], can be used in a time-division way, but to sustain the same data throughput, the clock frequency of the P processors should be increased by a factor M/P. Typical values for M are 2 and 4; IEEE 802.11n and 802.16e standards require 2 × 2 MIMO schemes but 4 × 4 is allowed. Also in TDD-MIMO systems, FFT and IFFT functions can be merged in a single FFT/IFFT processor, while in FDD schemes different FFT and IFFT processors, working simultaneously, are required.
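The area/clock trade-off described above can be illustrated with a short sketch. It assumes, for illustration only, that each FFT unit processes one complex sample per clock cycle, so that the SISO clock in MHz equals the SISO throughput in MSamples/s; the function names are hypothetical.

```python
from fractions import Fraction

def clock_scale(m, p):
    """Clock multiplier when P time-shared FFT units serve M streams."""
    if not 1 <= p <= m:
        raise ValueError("P must lie in [1, M]")
    return Fraction(m, p)

def required_clock_mhz(siso_clock_mhz, m, p):
    """Clock needed by each of the P units to sustain the M-stream throughput."""
    return float(siso_clock_mhz * clock_scale(m, p))

# Example: a 4 x 4 scheme with a 40 MSamples/s SISO stream
# (assumption: one sample per clock cycle, i.e. a 40 MHz SISO clock).
for p in (1, 2, 4):
    print(f"P={p}: {required_clock_mhz(40, 4, p)} MHz per unit, area x{p}")
```

With P = M the clock is unchanged and the area grows M times; with P = 1 the area is unchanged and the clock grows M times.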

2.2 OFDM and MIMO-OFDM Processing Requirements

The different OFDM standards considered in this paper (UWB, WLAN 802.11 a/n, WMAN 802.16 d/e, DVB-T/H, DAB, VDSL, BPL) are characterized by different requirements for the FFT/IFFT processing in terms of I/O data-width, from 4 to 16 bits, transform length, from 64 to 8192 points, and throughput, from 1 MSamples/s to several¹ GSamples/s (see [1, 6, 11, 12, 17, 18, 20, 22, 38, 41]). Table 1 and Fig. 2 summarize the FFT/IFFT requirements, which range between two extremes: small FFT size and high-throughput applications, e.g. UWB, or large-size but low-throughput applications, e.g. DVB-T/H. Some standards support multiple modes, requiring different configurations in terms of FFT length and throughput, to face different scenarios in terms of connected users, channel conditions, communication bandwidth and latency. For such standards, the configuration with the highest size and throughput was considered in Fig. 2. The proposed macrocells have been designed according to a design-for-reuse approach, as in [14–16, 33–35], and are parametric in terms of

Table 1 FFT requirements of OFDM and M × M MIMO OFDM standards

| Standard | FFT size (complex) | I/O data-width (bits) | Throughput (MSamples/s) |
|---|---|---|---|
| DVB-T/H | 2048–8192 | 8 | 9 |
| DAB | 256–2048 | 8 | 8.26 |
| VDSL | 256–4096 | 16 | 1–35 |
| 802.11 a | 64 | 8 | 20 |
| 802.16d/e | 128–2048 | 10 | 1.25–20 (×M) |
| 802.11 n | 128 | 8 | 40 (×M) |
| UWB | 128 | 4 | 1584 (3 channels) |
| BPL | 512 or 1024 | 16 | 30 |

¹ UWB can reach a throughput of N × 528 MSamples/s, with N the number of sub-channels supported in parallel and 528 MSamples/s the requirement of a single channel; the maximum number of channels is 14, but the typical value is 3, so as to allow frequency hopping.


Fig. 2 FFT throughput and size for different OFDM and MIMO-OFDM transceivers

FFT length and I/O bit-width. The cascade and the in-place variable-size FFT architectures are also run-time configurable: the macrocell is synthesized for the maximum size (e.g., 8192 for DVB-T/H) and any smaller size specified by the standard (e.g., 2048 and 4096 in DVB-T/H) is supported.

For M × M MIMO-OFDM systems the throughput requirements should be increased M times with respect to the basic Single-Input Single-Output (SISO) scheme. The requirements on throughput and energy efficiency, particularly for mobile terminals, call for hardware implementation of the FFT and IFFT processors. The widespread diffusion of FFT/IFFT cores in different transceiver schemes suggests the design of parametric, reconfigurable IP hardware macrocells, which are addressed in the following sections considering different architectural approaches. Particularly, a massively parallel FFT architecture suitable for high-throughput applications (up to GSamples/s) such as UWB is proposed in Sect. 3. A configurable cascade FFT core optimized for large, high-throughput FFTs such as for DVB, and large data-widths is discussed in Sect. 4. Section 5 presents an in-place variable-length FFT core suitable for OFDMA MIMO schemes such as WiMAX, requiring run-time reconfigurability and sustaining throughputs up to hundreds of MSamples/s.

3 Fully Parallel Architecture

Although intrinsically more complex than other solutions based on hardware sharing, the parallel approach allows a memory-free design and is the natural choice to tackle very high data throughput up to GSamples/s [40]. This is particularly true for implementations on FPGA, whose system clock frequency is far lower than in custom ASIC designs.

The key facet of the parallel architecture, not allowed by other solutions, is the possibility of an ad hoc customization of the data flow, whose width can be controlled stage by stage.


Furthermore, all multiplications in the algorithm turn into multiplications by a constant factor (the so-called twiddles are constant roots of unity), which can be greatly optimized by the logic synthesis tool. The fixed-point representation of several multiplicands (real or imaginary part or both) reduces to trivial values (zero or ± powers of two) that are costless in terms of implementation, and their number becomes higher when reducing the required precision, which is the typical use case for a parallel FFT.
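As a rough illustration of this effect, the sketch below counts how many real and imaginary twiddle components become trivial (zero or ± a power of two) after fixed-point quantization. It assumes a plain round-to-nearest quantizer with scale 2^(bits−1) applied directly to the twiddles; the paper's actual design quantizes lifting coefficients instead, so the numbers are indicative only.

```python
import math

def is_trivial(code):
    """True if the integer code is 0 or +/- a power of two (shift-only multiply)."""
    a = abs(code)
    return a == 0 or (a & (a - 1)) == 0

def count_trivial(n, bits):
    """Return (trivial components, total components) for the N twiddles of an N-point FFT."""
    scale = 1 << (bits - 1)            # assumed fixed-point scale
    trivial = 0
    for k in range(n):
        theta = 2 * math.pi * k / n
        for comp in (math.cos(theta), -math.sin(theta)):
            if is_trivial(round(comp * scale)):
                trivial += 1
    return trivial, 2 * n

for bits in (8, 6, 4, 2):
    print(bits, count_trivial(128, bits))
```

For a small case such as N = 8, half of the components are trivial on 6 bits, while all of them are trivial on 2 bits.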

3.1 VLSI Architecture

The parallel architecture of a generic N-point (I)FFT, directly derived from the Cooley–Tukey algorithm [8], is shown in Fig. 3. The architecture is arranged in $\rho = \lceil \log_4 N \rceil$ radix-4 stages where the first one is a radix-2 stage if N is not a power of 4. As widely shown in the literature, radix-4 (used also for the cascade approach) and radix-2 (used for the in-place variable FFT core) are the most suitable factorizations for cost-effective implementations, since high-radix factorizations (i.e. 8, 16 and 32) require a basic computational unit (butterfly) with non-trivial multiplications, thus bearing an unacceptable increase of hardware complexity (see [13, 26, 30]).
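The stage count can be checked with a few lines of integer arithmetic. The sketch below is only an illustration (the function name is hypothetical): it returns the radix of each stage for a power-of-two N, placing the radix-2 stage first as described for the parallel architecture.

```python
def stage_plan(n):
    """Radix of each of the ceil(log4 N) stages of a mixed radix-4/2 FFT."""
    if n < 2 or n & (n - 1):
        raise ValueError("N must be a power of two")
    log2n = n.bit_length() - 1
    stages = (log2n + 1) // 2          # integer form of ceil(log4 N)
    radices = [4] * stages
    if log2n % 2:                      # N is not a power of 4
        radices[0] = 2
    return radices

for n in (64, 128, 1024, 2048, 4096, 8192):
    print(n, stage_plan(n))
```

For N = 128 this gives one radix-2 stage followed by three radix-4 stages (2 · 4³ = 128), i.e. four stages overall.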

Each stage of the parallel FFT is composed of a set of butterfly blocks, followed by N complex multiplications by the twiddle factors. As opposed to the case of time-sharing architectures, twiddle factors take (different) constant values over the stages of a parallel FFT; so, (real) multipliers do not need to resort to any particular architecture such as the parallel Booth architecture [9] to speed up the computation or reduce the complexity. Since twiddle factors are complex roots of unity, they can be treated with the lifting approach described in [28]: in this way, only three real multiplications and three additions are required instead of four multiplications and two additions as in a straightforward implementation. This approach helps to save complexity but, as a drawback, increases the delay of the critical path in the design, so it is only suitable for applications where complexity is crucial while throughput is easily met.
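A minimal floating-point sketch of a lifting-based rotation is given below. It uses the textbook factorization of a plane rotation into three shears, with L1 = (cos θ − 1)/sin θ and L2 = sin θ; this particular choice of coefficients is an assumption and may differ in detail from the scheme of [28]. The sketch also does not model the sign inversions and cross-bar of the hardware element, nor any fixed-point quantization.

```python
import cmath
import math

def lifting_coeffs(theta):
    """Lifting factors (L1, L2) for a rotation by theta; requires sin(theta) != 0."""
    s = math.sin(theta)
    if s == 0.0:
        raise ValueError("trivial rotation: no multiplication needed")
    return (math.cos(theta) - 1.0) / s, s

def lifting_rotate(x, y, l1, l2):
    """Rotate (x, y) with three real multiplications and three additions."""
    x = x + l1 * y      # first shear
    y = y + l2 * x      # second shear
    x = x + l1 * y      # third shear
    return x, y

# Multiply z = 3 - 2j by the twiddle W = exp(-2j*pi*5/128).
theta = -2 * math.pi * 5 / 128
l1, l2 = lifting_coeffs(theta)
x, y = lifting_rotate(3.0, -2.0, l1, l2)
print(complex(x, y), (3 - 2j) * cmath.exp(1j * theta))
```

The three steps are sequentially dependent, which is consistent with the longer critical path mentioned above.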

The generic architecture of a lifting element is reported in Fig. 4, where all the parts that are customized at the time of synthesis are enclosed in dashed lines. Actually, the real factors L1, L2 are optimized to the value of the particular twiddle factor, input/output sign inversions are implemented only when required, and the cross-bar turns into a hard-wired connection.

Fig. 3 Mixed-radix FFT processor architecture

Fig. 4 Architecture of the lifting element

3.2 Case study: Ultra-Wide Band (UWB)

As a case study, the parallel architecture described in Sect. 3.1 has been tailored for a very high speed application such as the UWB standard [18]. The 128-point parallel FFT is designed to compute 128 complex samples per clock cycle, so that it can meet the throughput requirement of 1584 MSamples/s at the clock frequency of only 12.375 MHz.
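The clock figure follows directly from the throughput and the number of samples processed per cycle, as the following small sketch shows (the function name is hypothetical; the 3 × 528 MSamples/s figure is the UWB requirement given in Table 1 and its footnote).

```python
def parallel_fft_clock_mhz(throughput_msps, samples_per_cycle):
    """Clock frequency (MHz) of a core that consumes a whole block per cycle."""
    return throughput_msps / samples_per_cycle

channels = 3                 # typical number of UWB sub-channels
per_channel_msps = 528       # requirement of a single channel
throughput = channels * per_channel_msps
print(throughput, parallel_fft_clock_mhz(throughput, 128))  # 1584 12.375
```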

As a distinguishing feature, the UWB signal is quantized on a limited number of bits, typically in the range from 4 to 6, and this is exploited to contain the complexity of the VLSI design.

As a case study, the input signal was then quantized on 5 bits, corresponding to about 28 dB of signal-to-quantization-noise ratio (SQNR). Then, the data flow was optimized so as to guarantee the same SQNR on the output of the transform, by means of truncation (rounding) and saturation after every step (butterfly and lifting multiplication).

Twiddle factors were quantized on $B_{twd} = 6$ bits, which resulted in a maximum output SQNR of 27.2 dB, attained with neither truncation nor saturation (free-growing data flow). Actually, since the parallel FFT adopts the lifting approach, the lifting coefficients L1 and L2 were indeed quantized on $B_{lift} = 6$ bits, and $B_{lift} - 1$ bits were discarded after each real multiplication in Fig. 4.

Following a cascade approach, truncation (rounding) and saturation of the butterfly outputs first, and saturation of the output of the lifting multiplier were applied to achieve the same output SQNR of 27 dB. The results of this procedure are graphically depicted in Fig. 5, where data-widths are represented with squares and for every stage the width on its input is shown, after butterfly combining and after twiddle multiplication (three columns for each stage of the FFT, overall). Bits saturated (rounded) are also shown as cross-hatched squares with top-left/bottom-right (bottom-left/top-right) diagonals. As shown in Fig. 5, 1 bit is saturated after twiddle multiplication at every stage and 1 more bit is saturated at the output of the butterfly of stages 2 and 3.


Fig. 5 Growth of the fixed-point data flow in a 128-point parallel FFT with input signal on 5 bits and twiddle quantized on 6 bits

4 Cascade Architecture

4.1 Radix Cascade Architecture

A cascade approach, alternative to the high-throughput parallel FFT IP core design described in Sect. 3, can offer a good trade-off between complexity and speed, with remarkable length flexibility for a variety of communication and multimedia applications. The architecture presented in this section adopts a cascade of radix-4 butterfly stages (the last stage of the cascade is mixed radix-4/radix-2 to also support transform lengths that are powers of two but not powers of four); such an approach is suitable for stream-oriented data processing systems found in communication and multimedia applications. In fact, owing to the inherent pipeline of the cascade architecture, input buffers can be removed since buffering capability is spread across the whole data-path.

Figure 6 illustrates the top-level data-path of the cascade architecture, which is fully parametric in terms of maximum FFT size ($N_{max}$), number of radix stages (S), word length of the I/O (IOWL), of the twiddle coefficients (TWL) and of the internal system data-path (SWL). Moreover, each radix stage supports different types of machine arithmetic: fixed-point, block floating point (BFP) and convergent block floating point (CBFP). For the applications considered in this paper the configuration parameters are in the ranges: N ∈ [64, 8192], k ∈ [3, 6] and S ∈ [3, 7].

In Fig. 6, the multiplexers at the input of each stage enable the FFT/IFFT size to be configured at run time by selecting the number of stages in the cascade through the length input signal. The flush and the freeze signals control the internal pipeline, while the FFT/IFFT signal is used to switch between FFT and IFFT computation. The ability to compute both the direct and inverse transforms with the same core is useful in transceivers based on TDD techniques, as discussed in Sect. 2, where the core works alternately as IFFT processor in the transmitter chain and as FFT processor in the receiver chain. The global control unit in Fig. 6 can also manage the insertion/removal of cyclic prefix and suffix to the sequence as foreseen in the OFDM scheme.

As shown in Fig. 7, each stage in the cascade architecture includes the butterfly/multiplier unit, a module for sequencing data and a ROM containing the twiddle


Fig. 6 Programmable cascade of radix-4/2 stages in the FFT/IFFT IP core

Fig. 7 Generic stage architecture

Fig. 8 Radix-4 butterfly with complex multiplier

factors. The data-path of the radix-4 butterfly is sketched in Fig. 8 where thick arrows are used for complex values and thin arrows for real ones. The complex multiplier is implemented with three Booth multipliers and six adders. Therefore, the basic butterfly in the cascade approach is more complex than the same unit in the parallel architecture in Sect. 3, which avoids complex multipliers. However, the parallel architecture includes multiple butterfly units at every radix stage, while a single butterfly per stage is used in the cascade architecture; so, for FFTs of considerable length, the number of butterflies and the relevant hardware complexity in the cascade is minimized vs. the parallel approach.


The data-sequencing module uses memory banks for reordering data as dictated by the algorithm [4]. These memory banks can be exploited to remove input buffers since buffering capabilities are embedded in the data-path. Equations (1) and (2) detail the memory requirements, RAM for data and ROM for twiddle coefficients, for the j-th stage in the cascade architecture in terms of number of banks × data-width × number of locations. IOWL is the data-width of the real/imaginary part of the processor input/output. SWL is the data-width of the internal data-path. TWL is the data-width for the real and imaginary parts of the twiddle factors.

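$$
\mathrm{RAM}(j) =
\begin{cases}
7 \cdot 2\,\mathrm{IOWL} \cdot \dfrac{N_{max}}{4 \cdot 2^{m}} & j = 1 \\[2mm]
7 \cdot 2\,\mathrm{SWL} \cdot \dfrac{N_{max}}{4^{j}} & j \in [2, S-1] \\[2mm]
2\,\mathrm{SWL} & j = S
\end{cases}
\qquad (1)
$$

$$
\mathrm{ROM}(j) = 2\,\mathrm{TWL} \cdot \frac{N_{max}}{8 \cdot 4^{j-1}}, \quad j \in [1, S-1] \qquad (2)
$$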

By exploiting the symmetry of the unit circle on the complex plane, the size of the twiddle ROM has been reduced by a factor 8 vs. a conventional implementation. This is achieved through a very simple circuit that exchanges the real and imaginary parts and/or complements the sign of the coefficients stored in ROM. The same circuit conjugates the twiddle factors when the core is configured for IFFT.
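The octant symmetry can be sketched in software as follows. This is an illustrative floating-point model, not the paper's circuit: the table layout (entries for k = 0 … N/8) and the function names are assumptions. Every twiddle W_N^k = exp(−2πjk/N) is rebuilt from the first-octant table using only swaps of the real and imaginary parts and sign changes, and the same path conjugates the result for the IFFT.

```python
import cmath
import math

def build_octant_rom(n):
    """Store exp(-2j*pi*k/N) only for the first octant, k = 0 .. N/8."""
    if n % 8:
        raise ValueError("N must be a multiple of 8")
    return [cmath.exp(-2j * math.pi * k / n) for k in range(n // 8 + 1)]

def twiddle(rom, n, k, inverse=False):
    """Rebuild W_N^k from the octant ROM with swaps and sign changes only."""
    k %= n
    negate = k >= n // 2               # W^(k + N/2) = -W^k
    if negate:
        k -= n // 2
    rotate = k >= n // 4               # W^(k + N/4) = -j * W^k
    if rotate:
        k -= n // 4
    if k > n // 8:                     # mirror about the octant boundary
        w = rom[n // 4 - k]
        re, im = -w.imag, -w.real
    else:
        w = rom[k]
        re, im = w.real, w.imag
    if rotate:
        re, im = im, -re
    if negate:
        re, im = -re, -im
    if inverse:                        # conjugate for the IFFT
        im = -im
    return complex(re, im)

n = 64
rom = build_octant_rom(n)
print(len(rom), "stored entries instead of", n)
```

Storing the boundary entry k = N/8 costs one extra location, so the reduction is by a factor of about 8 in this model.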

The data-sequencing stage is based on small memory banks and is designed so that the whole processor can sustain a throughput of one complex sample per clock cycle. Therefore, for standards such as UWB or MIMO systems with a large number of channels, a cascade architecture would require a clock frequency above 1 GHz, unfeasible on FPGA devices or with low-power ICs. The cascade approach is more suitable for mid-range throughput applications within one hundred MSamples/s such as DVB, DAB, WLAN, WMAN, DSL and BPL. In the proposed architecture, the internal pipeline can be frozen or flushed by an external control unit to match the input traffic rate or in case of run-time reconfiguration (FFT/IFFT mode, transform length).

The latency of the whole architecture varies with the actual transform length and is mainly determined by the sum of the latencies of the cascaded radix-4 stages (hence w.r.t. the parallel approach, a higher latency is paid in the cascade architecture for FFTs of considerable length). Figure 9 details the pipeline of a radix-4 stage where

$$
L_{dr}(N, j) = \frac{N}{4^{j-1}} \qquad (3)
$$

4.2 FFT/IFFT IP Core Configuration and DVB Case Study

By using a custom software tool, based on Monte Carlo simulations, we are able to profile the three arithmetic types supported by the FFT cascade macrocell in terms of SQNR and to provide the designer with information for selecting the arithmetic that suits the target application best. System-level requirements specify the FFT/IFFT length (N) and a SQNR budget, i.e. the desired bit-true IOWL, and the tool derives the


Fig. 9 Internal pipeline of a radix-4 stage

optimal values of internal width (SWL and TWL) and the most suitable arithmetic to achieve the desired output precision minimizing data-path and memory sizes. A 64-bit floating point FFT/IFFT processor is considered as a golden reference model and the tool has been applied to the cascade architecture to cover the OFDM standards presented in Sect. 2. The results of the parametric configuration are summarized in Table 2.
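The SQNR metric used in such a profiling flow can be sketched as below. This is not the authors' tool: it only illustrates how an SQNR figure is obtained by comparing a quantized signal with a double-precision reference over random data. The uniform test signal, the round-and-saturate quantizer and all names are assumptions, and the resulting figures depend on the signal statistics.

```python
import math
import random

def quantize(x, bits):
    """Round x in [-1, 1) to a signed fixed-point value on `bits` bits, with saturation."""
    scale = 1 << (bits - 1)
    code = max(-scale, min(scale - 1, round(x * scale)))
    return code / scale

def sqnr_db(reference, test):
    """Signal-to-quantization-noise ratio (dB) of `test` against `reference`."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - t) ** 2 for r, t in zip(reference, test))
    return 10 * math.log10(signal / noise)

def monte_carlo_sqnr(bits, samples=4096, seed=0):
    """SQNR of a random complex signal quantized on `bits` bits per component."""
    rng = random.Random(seed)
    ref = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(samples)]
    quant = [complex(quantize(z.real, bits), quantize(z.imag, bits)) for z in ref]
    return sqnr_db(ref, quant)

for bits in (5, 8, 10, 16):
    print(bits, round(monte_carlo_sqnr(bits), 1))
```

For this uniform test signal the result grows by roughly 6 dB per additional bit; a bit-true profiler would apply the same metric to the output of the quantized FFT data-path rather than to its input.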

Besides the configuration parameters, also the FFT macrocell data-base is generated including the RTL VHDL code, test-benches and test vectors for verification and performance estimation (e.g. dynamic power consumption).

As a rule of thumb, CBFP arithmetic is a good solution for FFTs of considerable length (N ≥ 1024) requiring very high accuracy (IOWL ≥ 14), as in the case of VDSL or BPL. For example, the use in BFP arithmetic of the same set of parameters (SWL, TWL) of CBFP would result in a performance loss of about 4 dB for VDSL. In other words, BFP arithmetic would require SWL = 23 to achieve the same SQNR that CBFP achieves with SWL = 18.

For the other standards (DVB-T/H, DAB, WLAN, WMAN, UWB) the SQNR budget is such that BFP and CBFP approaches exhibit the same performance with similar SWL and TWL values, thus BFP arithmetic is preferred to save the extra circuit complexity of the magnitude estimation unit.

Figure 10 shows the SQNR for different values of SWL and TWL for the 8K-mode of DVB-T/H. Selecting BFP instead of CBFP, the same configuration of SWL and TWL (11 and 3, respectively) results in a loss of 0.7 dB in SQNR, 43.7 dB for BFP instead of 44.4 dB for CBFP. Since the standard requirement is around 43 dB, the BFP is preferred; indeed, its complexity is lower than CBFP for the same SWL and TWL width.

Adopting an M × M MIMO scheme, such as for WLAN 802.11n or WMAN 802.16, requires the integration of P processors running at a clock frequency M/P times faster than the basic SISO scheme, independently from the SQNR budget. As a result, the arithmetic type and the word lengths are the same for a given standard both in MIMO (M = 2, 4) or SISO (M = 1) configurations.

5 In-place Variable Length FFT with Parallel Butterfly Processors

This section presents a reconfigurable FFT architecture suitable for 4 × 4 MIMO OFDMA wireless systems that processes up to 4 streams with variable symbol lengths, ranging from 128 to 2048 complex points.


Fig. 10 SQNR vs. SWL and TWL, 8K mode DVB-T/H, 43 dB SQNR target

Fig. 11 Overview of the in-place, variable length, FFT architecture

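Table 2 Cascade IP configuration in different OFDM standards

| Standard | Arithmetic type | IOWL | SWL | TWL | SQNR (dB) | Stages S |
|---|---|---|---|---|---|---|
| DVB-T/H | BFP | 8 | 11 | 3 | 43.6 | 7 |
| DAB | BFP | 8 | 11 | 5 | 43 | 6 |
| VDSL | CBFP | 16 | 18 | 12 | 94.2 | 6 |
| 802.11a/n | BFP | 8 | 11 | 6 | 44.7 | 4 |
| 802.16d/e | BFP | 10 | 13 | 7 | 54.9 | 6 |
| UWB | BFP | 5 | 7 | 4 | 25.4 | 4 |
| BPL | CBFP | 16 | 18 | 10 | 93.7 | 5 |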

5.1 Overview of the FFT Organization

The engine computes decimation in time (DIT) FFT algorithms of variable length by using an in-place technique with radix-2 factorization. It consists of 16 butterfly processors, 32 memory banks and an interconnection network used to group the processors in sets of 2, 4, 8 or 16 (Fig. 11). Each processor can be reconfigured at run time to compute FFTs of 32, 64 or 128 complex points. The execution of the radix-2 FFT algorithm

Page 14: Design and Comparison of FFT VLSI Architectures for SoC Telecom

Circuits Syst Signal Process

for these input sizes requires 5, 6 or 7 stages, respectively. At these stages, the pro-cessor Pi updates two points per cycle and uses only two memory banks for storingall the intermediate results: its private bank Bi,0 and its auxiliary bank Bi,1 (64 ad-dresses each). FFTs of length 256, 512, 1024 or 2048 are computed by allowing theprocessors to cooperate in groups and execute the remaining stages of the algorithm.During these higher stages of the computation, the processor Pi uses the interconnec-tion network to access the auxiliary bank Bk,1 of another processor Pk . The indicesi and k are defined by the computation flow of the FFT algorithm (the required but-terfly calculation) and by a specific conflict-free in-place technique described in thefollowing subsection. The application of this novel technique leads to the reduction ofthe interconnection network and to the minimization of the total computation cycles:only half of the 32 banks need to be shared among the processors (i.e. the auxiliarybanks) and, moreover, no conflicts stall the execution of the algorithm.
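The split between stages computed locally by one processor and stages that need cooperation within a group can be illustrated with a small sketch. The function name `stage_split` and the treatment of a lone processor as a "group of 1" are assumptions made for illustration; the stage counts themselves follow from log2 of the local FFT size and of the group size.

```python
LOCAL_FFT_SIZES = (32, 64, 128)   # per-processor FFT sizes named in the text
GROUP_SIZES = (1, 2, 4, 8, 16)    # 1 = a processor working alone (assumption)


def stage_split(n_points: int, group_size: int) -> tuple[int, int]:
    """Return (local_stages, cooperative_stages) for a radix-2 FFT.

    local_stages use only the processor's own two banks; cooperative_stages
    are the remaining stages, which need a remote auxiliary bank.
    """
    if group_size not in GROUP_SIZES:
        raise ValueError(f"unsupported group size: {group_size}")
    if n_points % group_size:
        raise ValueError("FFT length must be a multiple of the group size")
    local_size = n_points // group_size
    if local_size not in LOCAL_FFT_SIZES:
        raise ValueError(f"unsupported per-processor FFT size: {local_size}")
    local_stages = local_size.bit_length() - 1        # log2 of a power of two
    cooperative_stages = group_size.bit_length() - 1  # log2 of the group size
    return local_stages, cooperative_stages


if __name__ == "__main__":
    for n, g in [(128, 1), (512, 4), (2048, 16)]:
        local, coop = stage_split(n, g)
        print(f"N={n}, group={g}: {local} local + {coop} cooperative stages")
```

For a 2048-point FFT on all 16 processors this gives 7 local and 4 cooperative stages, i.e. the 11 stages of a radix-2 transform of that length.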

The reconfiguration and the grouping of the processors are determined by an internal scheduler depending on the length of the 4 FFTs (input symbols) at each time. The scheduler design has focused on sustaining the required throughput rate of each stream. To simplify the process, we have included an input buffer and two distinct operation modes for the scheduler.

We use the first operation mode when there is at least one stream with symbol length greater than or equal to 1024. For such cases the scheduler uses the input buffer to collect symbols of the same length whose total is 2048 points, i.e. 16 symbols of length 128, 8 symbols of length 256, etc. Each collection of 2048 data points belongs to the same stream and constitutes the input to the 16-processor engine. The collections are processed sequentially in a round-robin fashion, sustaining an average throughput for each stream equal to its input rate. Each processor is configured to perform a 128-point FFT.

The second mode handles the remaining cases (where all symbols have length ≤ 512). The scheduler assigns the symbol of each stream to a dedicated group of four processors. The input streams are processed in parallel, as the processor groups operate independently of one another. Each processor is configured to perform an FFT with length equal to 1/4 of the symbol length assigned to its group. Note that, in order to sustain the required throughput even in the worst case (4 FFTs of 2048 points each), the operating clock frequency of the design is set to f_op = 1.375 · f_in, where f_in denotes the input data rate.
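The two scheduler modes can be summarized in a short sketch. This is not the authors' scheduler: the function names, the dictionary layout and the restriction to power-of-two lengths are illustrative assumptions. In particular, the number of processors per symbol in the first mode (symbol length divided by 128) is inferred from the statement that each processor performs a 128-point FFT there.

```python
VALID_LENGTHS = (128, 256, 512, 1024, 2048)  # assumed power-of-two lengths
COLLECTION_POINTS = 2048                     # size of one mode-1 collection
CLOCK_RATIO = 1.375                          # f_op / f_in, from the text


def schedule(symbol_lengths):
    """Pick the scheduler mode and per-stream configuration for 4 streams."""
    if len(symbol_lengths) != 4:
        raise ValueError("the engine handles exactly 4 streams")
    if any(n not in VALID_LENGTHS for n in symbol_lengths):
        raise ValueError("unsupported symbol length")

    if max(symbol_lengths) >= 1024:
        # Mode 1: buffer each stream into 2048-point collections that are
        # processed one after another by all 16 processors.
        streams = [
            {
                "symbols_per_collection": COLLECTION_POINTS // n,
                "processors_per_symbol": n // 128,  # inferred, see lead-in
                "processor_fft": 128,
            }
            for n in symbol_lengths
        ]
        return {"mode": 1, "streams": streams}

    # Mode 2: every stream owns a dedicated group of four processors.
    streams = [
        {
            "symbols_per_collection": 1,
            "processors_per_symbol": 4,
            "processor_fft": n // 4,
        }
        for n in symbol_lengths
    ]
    return {"mode": 2, "streams": streams}


def operating_clock(f_in):
    """Operating clock for a given input data rate (same unit as f_in)."""
    return CLOCK_RATIO * f_in
```

With an input rate of 25 MHz, `operating_clock` returns 34.375 MHz, which is the operating frequency reported for the FPGA implementation in Sect. 6.1.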

5.2 In-place Technique

The proposed organization uses the in-place technique of [27], suitably modified to produce a sorted FFT output by using as a key the indices of the elements (the initial address of the elements). The input elements are stored in the banks B_{i,0}, B_{i,1} such that the LSB of each element's index specifies its storing bank. Specifically, at the output of each butterfly computation, the following permutation is performed. Consider the elements x_s, x_r forming a transformation couple at stage (pass) j. The indices of x_s and x_r differ only at the jth bit, and the results exchange memory locations if bit (j + 1) of these indices is 1. The output elements are sorted, with indices 0, ..., N/2 − 1 stored in bank B_{i,0} in increasing order and indices N/2, ..., N − 1 stored in B_{i,1} in decreasing order. Besides the output permutations at each butterfly, we perform an input permutation: we exchange x_s, x_r at the butterfly input if x_s is stored in B_{i,1}. Note that the aforementioned scheme allows an efficient interconnection of the processors because each processor needs to access only four auxiliary banks (besides its own).
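The pairing rule stated above (the two indices of a transformation couple differ only at bit j) and the exchange condition on bit j + 1 can be written down directly. The sketch below is a literal reading of those two sentences and not the full addressing scheme of [27]; the function names and the 0-based bit numbering are assumptions.

```python
def butterfly_pairs(n_points, stage):
    """List the (s, r) index pairs of one stage; r is s with bit `stage` set."""
    bit = 1 << stage
    return [(s, s | bit) for s in range(n_points) if not s & bit]


def outputs_exchange(s, stage):
    """True if the butterfly results swap memory locations at this stage.

    Bit (stage + 1) is identical for both indices of a pair, because they
    differ only at bit `stage`, so testing s alone is sufficient.
    """
    return bool((s >> (stage + 1)) & 1)


if __name__ == "__main__":
    for s, r in butterfly_pairs(8, 1):
        print(s, r, outputs_exchange(s, 1))
```

For an 8-point example at stage 1 the pairs are (0, 2), (1, 3), (4, 6) and (5, 7), and only the last two satisfy the exchange condition.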

5.3 Butterfly Processor Architecture

Each radix-2 butterfly processor P_i has two inputs (I_R, I_S) and two outputs (O_R, O_S), and comprises the two dual-port memory banks B_{i,0}, B_{i,1}, the FFT control, the interconnection between the processor and the banks, the data address generation circuit and the twiddle address generation (Fig. 12).

Focusing on a 128-point processor as a reference case (the 32- and 64-point cases are similar), the FFT control includes a 10-bit up counter to handle 1024 pairs of data (worst-case scenario) and a 4-bit down counter to handle the FFT stages (passes). During an FFT, the first 64 pairs are in B_{i,0}, B_{i,1}. With more than 64 pairs (2^j pairs, 7 ≤ j ≤ 10), the processor performs the same operations on data stored in B_{i,0} and B_{x,1}, with x defined by the interconnection. The address generation circuit (Fig. 13) uses the two control counters to generate the data addresses at each stage and to control the I/O multiplexers (Fig. 12). During the jth stage the circuit addresses N/2 pairs belonging to N/2^(j+1) FFT sub-blocks. The circuit generates the addresses of the pairs by forming a word, which consists of the 6 least significant bits of the up counter. The addresses for B_{i,0} are generated by resetting bit j − 1 of this word. The addresses for B_{i,1} are generated by inverting the j − 1 least significant bits of the B_{i,0} address.

The multiplexers at the butterfly outputs (O_R, O_S) realize the in-place technique permutations and are controlled by the jth bit of the up counter (at pass j): if the jth bit is 1, then the multiplexers exchange the outputs of the butterfly (swapout signal of Fig. 13). The multiplexers at the I_R, I_S inputs of the butterfly are controlled by the (j − 1)th bit of the up counter (swapin signal of Fig. 13). The multiplexers at the output of the address generator (read addresses) are also controlled by the swapin signal.

Fig. 12 Architecture of the radix-2 processor

Fig. 13 The Address Generator of the radix-2 processor

The radix-2 processor is organized to compute an FFT by performing DIT and producing 32, 64 or 128 sorted outputs: after the FFT completion, the elements with (max) indices d_0, ..., d_63 will be in the addresses 0, ..., 63 of B_{i,0} (increasing order) and the elements d_64, ..., d_127 will be in the addresses 63, ..., 0 of B_{i,1}, respectively (in decreasing order).
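The sorted output layout can be expressed as a mapping from an element index to a (bank, address) pair. The sketch below assumes that the same rule applies to the 32- and 64-point configurations, as suggested by the general N/2 formulation in Sect. 5.2; the function name is illustrative.

```python
def output_location(index, n_points=128):
    """Map an output index to (bank, address) after FFT completion.

    Bank 0 is the private bank, bank 1 the auxiliary bank.
    """
    if not 0 <= index < n_points:
        raise ValueError("index out of range")
    half = n_points // 2
    if index < half:
        return 0, index                 # first half: increasing order
    return 1, n_points - 1 - index      # second half: decreasing order


if __name__ == "__main__":
    for d in (0, 63, 64, 127):
        print(d, output_location(d))
```

For the 128-point case, index 64 lands at address 63 of the auxiliary bank and index 127 at address 0, as described above.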

A twiddle generator circuit is used with a ROM of N_max/8 coefficients, i.e. 256, controlled by the vInv unit (Fig. 13). Specifically, assume that we execute the last stage of a sub-FFT of length 2^(j+1) on the two sub-FFTs of length 2^j (the two sub-FFTs have their results sorted as above). We read the twiddles of the first 2^(j−1) pairs by increasing a counter and the twiddles of the remaining 2^(j−1) pairs by decreasing a counter as follows. At pass j we use the j + 1 least significant bits to create a 10-bit word: these j + 1 bits are used as MSBs followed by 0's (input V of the vInv of Fig. 13). If the MSB of the above word is equal to 0, then this word is used as a twiddle address; otherwise, the remaining j MSBs (apart from the MSB itself) are inverted to create the twiddle address.

5.4 Interconnection Scheme

In the proposed architecture, certain stages of the FFT require that each processor P_i accesses data from the auxiliary bank of a remote processor. To optimize the interconnection network, we have designed the data flow such that only the lower input/output of P_i is connected to a remote bank (the upper is always connected to B_{i,0}). Moreover, each P_i connects only to 4 auxiliary remote banks (besides B_{i,1}).

To determine the set of banks connected to each P_i, we must take into account the flow-graph of the FFT and the aforementioned in-place technique. Assume that j is an FFT stage where each P_i will access data belonging to processor P_k (recall that not all stages require remote accesses).

Let i = [i_3 i_2 i_1 i_0]; the index k = [k_3 k_2 k_1 k_0] is obtained in the following two steps. First, we consider the effect of the output permutation, which is a bitwise exclusive-or (XOR) operation on the index i with a 4-bit number containing j′ ones in the j′ LSBs and zeros elsewhere. Second, data exchanges at the input occur at processors whose index has the (j′ − 1)th bit set. For these processors, the index k is computed by the first-step calculation and corrected by performing another bitwise exclusive-or operation with a number containing j′ − 1 ones in the j′ − 1 LSBs (and zeros elsewhere). Therefore, k is produced by superimposing the two permutations (input and output) during stages j ≥ 8. More specifically,

\[
k = [k_3 k_2 k_1 k_0] = [i_3 i_2 i_1 i_0] \oplus [\,0 \ldots \underbrace{1 \ldots 1}_{j'}\,] \oplus [\,0 \ldots \underbrace{i_{j'-1} \ldots i_{j'-1}}_{j'-1}\,]
\]

Therefore, the interconnection network for each processor consists of a 5-to-1 multiplexer at the processor's lower input I_S and a 1-to-5 demultiplexer at its lower output O_S. The connections to each (de)multiplexer can be computed from the equation above. Note that the depth of the address calculation circuit by using the proposed technique is constant, irrespective of the size N of the FFT transform.
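The equation for k can be checked numerically. The sketch below evaluates it for all 16 processors and confirms that each one reaches exactly four remote processors, none of them itself, which is consistent with the 5-to-1 multiplexer. The range j′ = 1, ..., 4 (one value per cooperative stage of a 16-processor group) and the function names are assumptions, since the text does not spell out how j′ relates to j.

```python
def low_ones(count):
    """Integer with `count` ones in its least significant bits."""
    return (1 << count) - 1


def remote_processor(i, j_prime):
    """Index k of the processor whose auxiliary bank P_i accesses."""
    if not 0 <= i < 16:
        raise ValueError("processor index must be in 0..15")
    if not 1 <= j_prime <= 4:
        raise ValueError("j' assumed to range over 1..4")
    k = i ^ low_ones(j_prime)            # output permutation
    if (i >> (j_prime - 1)) & 1:         # input exchange applies
        k ^= low_ones(j_prime - 1)       # correction term
    return k


def remote_banks(i):
    """Set of remote processors reached by P_i over all assumed j'."""
    return {remote_processor(i, jp) for jp in range(1, 5)}


if __name__ == "__main__":
    for i in range(16):
        print(i, sorted(remote_banks(i)))
```

Under these assumptions processor 0 connects to processors 1, 3, 7 and 15, and every other processor likewise has four distinct remote partners.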

6 Implementation Results and Comparison with the State-of-the-Art

This section compares the three architectures described in this paper with state-of-the-art FFT VLSI designs, considering the most suitable target application for each solution. Comparing IP cores of different architectures and implementation technologies can be ambiguous; so, for the sake of fairness, we selected state-of-the-art designs whose system-level requirements, expressed in terms of throughput and numerical accuracy, are similar to those of each of the three proposed architectures for any target application. Section 6.1 reports the results in terms of complexity and throughput for the three architectures. Then, Sect. 6.2 compares the three proposed implementations to each other, assuming common applications and FPGA technology. Finally, the most suited architecture solution for each of the telecommunication standards of Sect. 2 is presented, along with the relevant implementation complexity results.

6.1 Comparison vs. State-of-the-Art VLSI FFT Designs

Table 3 compares the cascade architecture presented in Sect. 4 with other state-of-the-art FFT cores targeting DVB-T/H applications. Complexity results refer to gate and memory complexity in silicon for a 90 nm CMOS technology and standard-cells library. The comparison includes an application-specific instruction set processor (ASIP) [23], two macrocells specifically designed for DVB-T (see [24, 42]) and a macrocell obtained by an automatic IP generator [10]. Note that the proposed cascade macrocell stands out for its low complexity while maintaining similar application performance: throughput of 9 MSamples/s, SQNR greater than 40 dB, variable transform length between 2048 and 8192. It is worth noting that when implemented in FPGA technology, the cascade FFT core, configured for DVB-T/H, fits in a low-cost device family such as the Spartan3 from Xilinx (a XC3S200 device is sufficient); if the more powerful Virtex4 family is adopted, then the cascade FFT core needs fewer than 10000 slices. Both Spartan3 and Virtex4 device families are SRAM-based FPGAs realized in 90 nm silicon CMOS technology.

Table 3 Comparison with state-of-the-art DVB-T FFT cores (SQNR budget ≥40 dB)

| Implementation | IP type | Arithmetic type | fclk (MHz) | Complexity: Kgates | Complexity: RAM bits | Complexity: ROM bits |
|---|---|---|---|---|---|---|
| Cascade [this paper] | generator | BFP | 9 | 37 | 219738 | 8190 |
| Lee et al. [23] | ASIP | fixed-point | 280 | 80 | 1572864 | |
| Wang et al. [42] | custom | fixed-point | 16 | 139 | 211008 | 165120 |
| Cortés et al. [10] | generator | fixed-point | 9 | 48.7 | 262112 | 305760 |
| Li et al. [24] | custom | BFP | 8 | 91 | n.a. | n.a. |

As far as the parallel architecture is concerned, it can be fairly compared with the UWB FFT core proposed by Sherrat et al. in [37], whose target is the real-time implementation of a 128-point 528-MSamples/s FFT. To achieve this performance, [37] exploits four 128-point radix-2 pipelined processors working concurrently, each clocked at 132 MHz with a clock phase delay of 0, 90, 180 and 270 degrees generated by a digital clock management (DCM) unit. Fitted on FPGA technology, this state-of-the-art UWB FFT core requires roughly 5000 slices of a Virtex4 device. The fully parallel architecture proposed in Sect. 3 instead requires roughly 20,000 slices on a similar Virtex4 FPGA technology but, owing to its high parallelism, it can reach a throughput higher than 7 GSamples/s with a clock frequency of only 55 MHz. Therefore, our architecture occupies 4 times as many slices as [37], but it is also 14 times faster; as a result, while the macrocell in [37] can sustain only 1 UWB channel out of the 14 available channels in the ECMA standard [18], the proposed architecture allows the real-time realization of a UWB communication with full capabilities (14 channels).
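The throughput figures quoted above can be reproduced as samples per clock cycle times clock frequency. The sketch rests on two assumptions that the text does not state explicitly: each pipelined processor of [37] accepts one sample per clock, and the fully parallel macrocell consumes a complete 128-point block per clock.

```python
def throughput_msamples(samples_per_cycle, clock_mhz):
    """Throughput in MSamples/s for a given parallelism and clock in MHz."""
    return samples_per_cycle * clock_mhz


# Four phase-shifted pipelines at 132 MHz, one sample per clock each (assumed).
core_37 = throughput_msamples(4, 132)

# One 128-point block per 55 MHz clock cycle (assumed).
fully_parallel = throughput_msamples(128, 55)

if __name__ == "__main__":
    print(core_37, fully_parallel, fully_parallel / core_37)
```

Under these assumptions the two designs reach 528 MSamples/s and 7040 MSamples/s, respectively, a ratio of about 13, which is of the same order as the speed-up quoted in the text.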

Finally, we examine the in-place, reconfigurable architecture presented in Sect. 5. To estimate its cost, we implemented a 10-bit I/O FFT on a Xilinx Virtex 4 FPGA (XC4VLX200). The FFT core occupies 8614 slices, 64 DSP blocks, and 80 RAM blocks. The input buffer occupies 3000 slices and 256 RAM blocks. Overall, the design operates at 34.375 MHz and achieves 56.8 dB SQNR by using 13-bit data-paths. Recalling that the implemented design computes 128–2048 complex points FFT on 4 independent streams (targeting 4 × 4 MIMO applications), it can be fairly compared to a straightforward solution made of 4 independent FFT modules. For this reason, we also implemented the well-known SDF (Single Delay Feedback) architecture of [39] (radix-2, 10-bit I/O, variable length). Overall, four instances of the SDF architecture occupy 16,124 slices, 152 DSP blocks, and 152 RAM blocks, and operate at 25 MHz. Clearly, the solution presented in Sect. 5 requires fewer (almost half) processing resources than the straightforward solution. As a drawback of the proposed solution, we point out the extra memory resources required to buffer the data, as well as the increased operating frequency. We have also compared the proposed architecture to SDF architectures with higher radices, radix-2^2 and radix-2^3. Note that in these cases reconfiguration becomes quite involved since it requires extra paths to bypass the stage processors, which remain idle in the various configurations for different FFT lengths. Table 4 compares the complexity of the proposed architecture against three variations of the straightforward solution, which is based on four parallel SDF architectures: the first includes only radix-2 butterflies, the second realizes the radix-2^2 algorithm and the third the radix-2^3.

Table 4 FPGA implementation comparison: in-place variable-length against SDF architecture

| Architecture | Xilinx slices | Operating frequency | DSP blocks | RAM blocks |
|---|---|---|---|---|
| In-place | 8614 | 34 MHz | 64 | 80 |
| R-2 | 16124 | 25 MHz | 152 | 152 |
| R-2^2 | 13745 | 25 MHz | 96 | 174 |
| R-2^3 | 13467 | 25 MHz | 88 | 174 |
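The "almost half" claim and the memory drawback can both be quantified from the figures quoted in the text and in Table 4. The sketch below only performs that arithmetic; the dictionary layout and function names are illustrative, and adding the input buffer to the in-place core is a way of looking at the data chosen here, not a figure reported by the authors.

```python
TABLE_4 = {
    "In-place": {"slices": 8614, "dsp": 64, "ram": 80},
    "R-2": {"slices": 16124, "dsp": 152, "ram": 152},
    "R-2^2": {"slices": 13745, "dsp": 96, "ram": 174},
    "R-2^3": {"slices": 13467, "dsp": 88, "ram": 174},
}
INPUT_BUFFER = {"slices": 3000, "ram": 256}  # in-place design only


def in_place_ratio(reference, resource):
    """In-place core usage divided by the usage of a reference SDF design."""
    return TABLE_4["In-place"][resource] / TABLE_4[reference][resource]


def in_place_with_buffer():
    """In-place core plus its input buffer (slices and RAM blocks)."""
    core = TABLE_4["In-place"]
    return {
        "slices": core["slices"] + INPUT_BUFFER["slices"],
        "ram": core["ram"] + INPUT_BUFFER["ram"],
    }


if __name__ == "__main__":
    for name in ("R-2", "R-2^2", "R-2^3"):
        print(name, round(in_place_ratio(name, "slices"), 2))
    print(in_place_with_buffer())
```

Against the radix-2 SDF solution the in-place core uses about 0.53 of the slices. Once the input buffer is included, the design totals 11614 slices and 336 RAM blocks, which is the memory overhead mentioned as a drawback.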

6.2 Comparisons Between the Proposed FFT Architectures

To compare the three architectures with each other, we implemented distinct FFT modules on the same Xilinx Virtex 4 FPGA (XC4VLX200) technology. Table 5 reports the implementation results assuming three distinct applications. The first two cases assume a fixed length for the FFT input (128 and 1024 points, respectively), while the third case is for a variable length FFT (128–2048 points). Note that, for the sake of a fair comparison, each implementation only resorted to FPGA slices, with no DSP block. As expected, the fully parallel FFT can sustain the highest throughput rate at the expense of an increased hardware cost. The cost difference between the parallel and the cascaded FFT cores increases significantly with the length of the FFT input. However, as Table 5 shows, the fully parallel architecture is the most prominent solution to tackle throughput rates as high as 50 GSamples/s. In another direction, the cascade FFT offers higher throughput rates with fewer hardware resources than the in-place reconfigurable FFT. Nonetheless, we must consider the fact that the reconfigurable architecture supports variable length FFT of up to 2048 points. Similar modifications in the cascaded architecture would increase its hardware cost above 12,400 slices (due to the extra FFT stages and the bypass circuits). Moreover, consider that the reconfigurable architecture is tailored for MIMO applications. A direct use of the cascaded approach in such applications requires either the use of multiple FFT modules (with a significant cost increase) or the use of one module operating at a multiple of the input rate (with higher power consumption).

The results of Table 5 show the advantages of each architecture. On the one hand, the in-place variable FFT is suitable for WiMAX mobile and 3GPP LTE applications (MIMO-OFDMA systems) with up to 4 streams at 100 MSamples/s. On the other hand, the cascaded architecture is the most economic solution in SISO systems with FFT of fixed size, e.g., for fixed WiMAX, DVB, DAB, VDSL, BPL, and WLAN.


Table 5 Implementation of the proposed FFT architectures on Virtex 4 FPGA (XC4VLX200)

| Architecture | Case 1 (128 points, 5-bit I/O): XC4V slices | Case 1: throughput | Case 2 (1024 points, 8-bit I/O): XC4V slices | Case 2: throughput | Case 3 (variable length, 10-bit I/O): XC4V slices | Case 3: throughput |
|---|---|---|---|---|---|---|
| Fully parallel | 22270 | 7 GSamples/s | 134523 | 50 GSamples/s | – | – |
| Cascade | 6236 | 132 MSamples/s | 12400 | 132 MSamples/s | – | – |
| In-place, reconfigurable (4 streams) | – | – | – | – | 18981 | 100 MSamples/s |


Table 6 Most suited FFT architecture for different telecommunication standards in Virtex4 FPGA technology

| Standard | FFT arch. | XC4V slices | MSamples/s |
|---|---|---|---|
| DAB and DVB-T/H | cascade | 15195 | 10 |
| xDSL and BPL | cascade | 25507 | 35 |
| 802.16d/e | in-place var. | 18981 | 100 |
| 802.11n (MIMO) | in-place var. | 15790 | 160 |
| UWB (14 chan.) | parallel | 22270 | 7000 |

Clearly, the in-place variable FFT implements very effective techniques to support multiple streams and/or different FFT lengths with small complexity overhead, while the cascaded FFT avoids any unnecessary overhead to efficiently support a single stream. Finally, for the case of UWB, the two above solutions fall short in terms of throughput rate, making the fully parallel architecture the most suitable approach.

Table 6 summarizes the most suitable solution for each of the aforementioned standards (fully parallel, cascade, or in-place variable) and the relevant complexity on Virtex 4 FPGA technology (XC4VLX200). Again, each implementation is based here only on FPGA slices (no DSP blocks) for the sake of comparison. Note that in Table 6 the same FFT engine can support different standards with similar requirements: for instance, from Tables 1 and 2, DAB and DVB-T/H have similar throughput (around 10 MSamples/s) and arithmetic accuracy requirements (8-bit I/O), while the maximum FFT length is set by DVB; BPL and xDSL have similar throughput (around 30 MSamples/s) and arithmetic accuracy requirements (16-bit I/O), while the maximum FFT length is set by VDSL.

7 Conclusion

This paper has presented three distinct FFT/IFFT architectures aiming to support multi-carrier OFDM-based telecommunication systems. Introducing design features to enhance a fully parallel, a cascade and a reconfigurable architectural approach led to the design of FFT/IFFT modules that suit the most widely used protocols and improve the performance-to-cost ratio. The fully parallel architecture employs fine-grained techniques to achieve high throughput rates, tens of GSamples/s, such as those required in UWB when all channels are used. The cascade architecture leads to an efficient pipeline for SISO systems with large FFT length but moderate throughput (e.g. DSL, BPL, DVB-T/H, DAB). Finally, combining reconfiguration and in-place techniques results in a low-cost architecture fulfilling the requirements of MIMO systems with FFTs of variable size (WiMAX, MIMO WLAN). The design trade-offs are shown through the implementation results of each architecture. The comparison to the corresponding literature solutions favors the three architectures proposed in this paper.

Acknowledgement This work was supported by the European Commission in the framework of the FP7 Network of Excellence in Wireless COMmunications NEWCOM++ (contract n. 216715).


References

1. P. Amirshahi, M. Navidpour, M. Kavehrad, Performance analysis of uncoded and coded OFDM broadband transmission over low voltage power-line channels with impulsive noise. IEEE Trans. Power Deliv. 21(4), 1927–1934 (2006)

2. J.G. Andrews, A. Ghosh, R. Muhamed, Fundamentals of WiMAX, Understanding Broadband Wireless Networking. Prentice Hall Communications Engineering and Emerging Technologies Series (Prentice Hall, New York, 2007)

3. F. Baronti et al., Design and verification of hardware building blocks for high-speed and fault-tolerant in-vehicle networks. IEEE Trans. Ind. Electron. 58(3), 792–801 (2011)

4. G. Bi, E. Jones, A pipelined FFT processor for word-sequential data. IEEE Trans. Acoust. Speech Signal Process. 37(12), 1982–1985 (1988)

5. J. Bingham, Multicarrier modulation for data transmission: an idea whose time has come. IEEE Commun. Mag. 28(5) (1990)

6. R. Cabral, S. Escarigo, H. Neto, H. Sarmento, Implementation of a DAB receiver with FPGA technology, in Proc. IEEE ICCE, Jan 2006, pp. 397–398

7. A. Chimenti et al., VLSI architecture for a low-power video codec system. Microelectron. J. 33(5–6), 417–427 (2002)

8. J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series. IEEE Trans. Electron. Comput. EC-15(4), 680–681 (1966)

9. A.R. Cooper, Parallel architecture modified booth multiplier. IEE Proc. G, Electron. Circuits Syst. 135, 125–128 (1988)

10. A. Cortes, I. Velez, J. Sevillano, A. Irizar, An FFT core for DVB-T/DVB-H receivers, in Proc. Third IEEE International Conference on Electronics, Circuits, and Systems (ICECS), Dec 2006, pp. 102–105

11. M. Deinzer, M. Stoger, Integrated PLC-modem based on OFDM, in Int. Sym. On Power-line Communications and its Applications (ISPLC'99) (1999)

12. C. Del-Toso, M. Nava, A short overview of the VDSL system requirements. IEEE Commun. Mag. 40(12), 82–90 (2002)

13. P. Duhamel, M. Vetterli, Fast Fourier transforms: a tutorial review and a state of the art. Signal Process. 19(4), 259–299 (1990)

14. L. Fanucci et al., A parametric VLSI architecture for video motion estimation. Integration 31(1), 79–100 (2001)

15. L. Fanucci et al., Parametrized and reusable VLSI macrocells for the low-power realization of 2-D discrete-cosine-transform. Microelectron. J. 32(12), 1035–1045 (2001)

16. L. Fanucci et al., Power optimization of an 8051-compliant microcontroller. IEICE Trans. Electron. 88(4), 597–600 (2005)

17. B. Farahani, M. Ismail, WiMAX/WLAN radio receiver architecture for convergence in WMANS, in IEEE 48th Midwest Symposium on Circuits and Systems, Aug 2005, pp. 1621–1624

18. High rate ultra wideband PHY and MAC standard, Dec 2005, standard ECMA-368

19. H. Holma, A. Toskala, LTE for UMTS, OFDMA and SC-FDMA Based Radio Access (Wiley, New York, 2009)

20. IEEE 802.11-05/1102r4, IEEE P802.11 Wireless LANs Joint Proposal: High throughput extension to the 802.11 Standard: PHY, Jan 2006

21. Y. Jung, J. Kim, S. Lee, H. Yoon, J. Kim, Design and implementation of MIMO-OFDM baseband processor for high-speed wireless LANs. IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process. 54(7), 631–635 (2007)

22. M. Kornfeld, DVB-H - the emerging standard for mobile data communication, in IEEE International Symposium on Consumer Electronics, Sept. 2004, pp. 193–198

23. J. Lee, J. Moon, K. Heo, M. Sunwoo, S. Oh, I. Kim, Implementation of application-specific DSP for OFDM systems, in Proc. IEEE International Conference on Circuits and Systems (ISCAS), May 2004, vol. 3, pp. 665–668

24. X. Li, Z. Lai, J. Cui, A low power and small area FFT processor for OFDM. IEEE Trans. Consum. Electron. 53(2), 274–277 (2007)

25. Y.-W. Lin, C.-Y. Lee, Design of an FFT/IFFT processor for MIMO OFDM systems. IEEE Trans. Circuits Syst. I 54(4), 807–815 (2007)

26. N. L'insalata et al., Automatic synthesis of cost effective FFT/IFFT cores for VLSI OFDM systems. IEICE Trans. Electron. E91-C(4), 487–496 (2008)


27. K. Nakos, D. Reisis, N. Vlassopoulos, Addressing technique for parallel memory accessing in Radix-2 FFT Processors, in IEEE Int. Conference on Electronics, Circuits and Systems (ICECS), Sep 2008, pp. 52–56

28. S. Oraintara, Y.J. Chen, T.Q. Nguyen, Integer Fast Fourier Transform. IEEE Trans. Signal Process. 50(3), 607–618 (2002)

29. S. Perels, D. Haene, P. Luethi, A. Burg, N. Felber, W. Fichtner, H. Bolcskei, ASIC implementation of a MIMO OFDM transceiver for 192 Mbps WLAN, in Proc. IEEE ESSCIRC2005 (2005)

30. K. Prakash, M.M. Rao, Fixed-point error analysis of radix-4 fht algorithm with optimised scaling schemes. IEE Proc., Vis. Image Signal Process. 142, 65–70 (1995)

31. S. Saponara, L. Fanucci, VLSI design investigation for low-cost, low-power FFT/IFFT processing in advanced VDSL transceivers. Microelectron. J. 34(2), 133–148 (2003)

32. S. Saponara, K. Denolf, G. Lafruit, C. Blanch, J. Bormans, Performance and complexity co-evaluation of the advanced video coding standard for cost-effective multimedia communications. EURASIP J. Appl. Signal Process. 2004(2), 220–235 (2004)

33. S. Saponara, L. Fanucci, S. Marsi, G. Ramponi, Algorithmic and architectural design for real-time and power-efficient Retinex image/video processing. J. Real-Time Image Process. 1(4), 267–283 (2007)

34. S. Saponara, L. Fanucci, S. Marsi, G. Ramponi, D. Kammler, E. Witte, Application-specific instruction-set processor for retinex-like image and video processing. IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process. 54(7), 596–600 (2007)

35. S. Saponara, L. Fanucci, P. Terreni, Architectural-level power optimization of microcontroller cores in embedded systems. IEEE Trans. Ind. Electron. 54(1), 680–683 (2007)

36. S. Saponara, P. Nuzzo, C. Nani, G. Van der Plas, L. Fanucci, Architectural exploration and design of time-interleaved SAR arrays for low-power and high speed A/D converters. IEICE Trans. Electron. 92-C(6), 843–851 (2009)

37. R.S. Sherrat, O. Cadenas, N. Goswami, A low clock frequency FFT core implementation for multiband full-rate ultra-wideband (UWB) receivers. IEEE Trans. Consum. Electron. 51(3), 798–802 (2005)

38. D. Skellern, A high-speed wireless LAN. IEEE MICRO 17(1), 40–47 (1997)

39. C.D. Thompson, Fourier transform in VLSI. IEEE Trans. Comput. C-32(11), 1047–1057 (1983)

40. F. Vitullo et al., Low-complexity link microarchitecture for mesochronous communication in Networks-on-Chip. IEEE Trans. Comput. 57(9), 1196–1201 (2008)

41. J. Walko, Click here for VDSL2. Commun. Eng. 3(4), 9–12 (2005)

42. C.-C. Wang, J.-M. Huang, H.-C. Cheng, A 2K/8K mode small-area FFT processor for OFDM demodulation of DVB-T receivers. IEEE Trans. Consum. Electron. 51(1), 28–32 (2005)