FPGA Implementation of Low Power 64-Point Radix-4 FFT Processor for OFDM System
Ishak Suleiman
TM Research & Development Sdn. Bhd.
Idea Tower, UPM-MTDC, Technology Incubation Centre One, Lebuh Silikon, 43400 Serdang, Selangor, Malaysia
E-mail: [email protected]
Abstract - The FFT processor is a crucial block in multi-carrier systems such as OFDM (Orthogonal Frequency Division Multiplexing) based Wireless LAN (IEEE 802.11). Portable applications of these systems require a low-power FFT processor. This paper proposes a radix-4 butterfly architecture that uses a recursive technique and multipliers to reduce hardware complexity and power consumption. A fully pipelined architecture is proposed to give constant data throughput on every clock cycle. The FFT processor has been implemented on Xilinx FPGA devices (XCV1000E-8HQ240, X2V3000-6FF1152, X2V6000-6FF1152 and XC2VP30-7FF1152) with device utilization of around 35% of the chip, running at an estimated clock frequency of 20 MHz with an estimated power of 400 mW.
Keywords - FFT; OFDM; ADSL; WLAN; 4G; radix-4; low power
I. INTRODUCTION
The Fast Fourier Transform (FFT) and its inverse (IFFT) are essential in the field of digital signal processing (DSP), providing a parallel mapping of data symbols between the time domain and the frequency domain in modem design (broadband data transmission) [1-5]. Signal transformation from the time domain to the frequency domain using the FFT, and vice versa, is shown in Fig. 1. The popularity of orthogonal frequency division multiplexing (OFDM) systems has increased the demand for high-speed and low-power FFTs for various broadband applications such as Asymmetric Digital Subscriber Line (ADSL), Wireless Local Area Network (IEEE 802.11a/b and 802.16), HIPERLAN/2 and fourth generation (4G) systems [1-5]. Among the various FFT algorithms, the Cooley-Tukey algorithm [6] is the most popular because it reduces computational complexity and because its regularity makes it suitable for hardware implementation. To further reduce the computational complexity, radix-4 is proposed [6].
The FFT enables broadband data transmission, but it also requires more processing power for high-data-rate applications [1-5]. The key aim of the proposed architecture is to enable low-power implementation without losing performance. A novel commutator architecture for the FFT processor is proposed, using a three-stage 64-point radix-4 FFT. A FIFO-based commutator can be implemented in two ways, using an SR (Shift Register) or a DM (Dual-port RAM). This paper proposes an SR-based implementation.
This research work was supported by Telekom Malaysia Bhd, known as TM. Project No. R03-0568 "OFDM Based Wireless LAN Processor".
1-4244-0011-2/05/$20.00 ©2005 IEEE.
Figure 1. Signal transformation using FFT/IFFT.
Figure 2. Transmitter and receiver block diagram for the OFDM PHY (the figure adopted from IEEE 802.11a Standard [5], pg. 24).
II. ALGORITHM
The 64-point radix-4 FFT of a finite-duration sequence is given in [6] as:

$$
\begin{aligned}
X(k) &= \sum_{n=0}^{63} x(n)\,W_{64}^{kn} \\
&= \sum_{n=0}^{15} x(n)\,W_{64}^{kn} + \sum_{n=16}^{31} x(n)\,W_{64}^{kn} + \sum_{n=32}^{47} x(n)\,W_{64}^{kn} + \sum_{n=48}^{63} x(n)\,W_{64}^{kn} \\
&= \sum_{n=0}^{15} x(n)\,W_{64}^{kn} + \sum_{n=0}^{15} x(n+16)\,W_{64}^{k(n+16)} + \sum_{n=0}^{15} x(n+32)\,W_{64}^{k(n+32)} + \sum_{n=0}^{15} x(n+48)\,W_{64}^{k(n+48)} \\
&= \sum_{n=0}^{15} \left[ x(n) + x(n+16)\,W_{64}^{16k} + x(n+32)\,W_{64}^{32k} + x(n+48)\,W_{64}^{48k} \right] W_{64}^{kn} \\
&= \sum_{n=0}^{15} \left[ x(n) + (-j)^{k}\,x(n+16) + (-1)^{k}\,x(n+32) + (j)^{k}\,x(n+48) \right] W_{64}^{kn}
\end{aligned}
\tag{1}
$$
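The regrouping in (1) can be checked numerically. The following is a minimal Python sketch, not the paper's Verilog design: it evaluates the direct 64-point DFT and the regrouped radix-4 form of (1) on an arbitrary test sequence and compares them. The function names and the test sequence are illustrative assumptions.

```python
import cmath

N = 64  # transform length used throughout the paper


def twiddle(k, n, size=N):
    """W_size^{kn} = exp(-j*2*pi*k*n/size)."""
    return cmath.exp(-2j * cmath.pi * k * n / size)


def dft_direct(x):
    """Direct evaluation of X(k) = sum_{n=0}^{63} x(n) W_64^{kn}."""
    return [sum(x[n] * twiddle(k, n) for n in range(N)) for k in range(N)]


def dft_radix4_split(x):
    """Last line of (1): four quarter-length terms combined per index n."""
    out = []
    for k in range(N):
        acc = 0j
        for n in range(N // 4):
            # W_64^{16k} = (-j)^k, W_64^{32k} = (-1)^k, W_64^{48k} = (j)^k
            combined = (x[n]
                        + (-1j) ** k * x[n + 16]
                        + (-1) ** k * x[n + 32]
                        + (1j) ** k * x[n + 48])
            acc += combined * twiddle(k, n)
        out.append(acc)
    return out


def max_error(x):
    """Largest magnitude difference between the two evaluations."""
    a, b = dft_direct(x), dft_radix4_split(x)
    return max(abs(p - q) for p, q in zip(a, b))


# Hypothetical test sequence (any 64 complex samples would do).
samples = [complex(n % 7, -(n % 5)) for n in range(N)]
print(max_error(samples))
```

The bracketed term needs only multiplications by ±1 and ±j, which is why a radix-4 butterfly requires no general multipliers before the twiddle factor is applied.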
In (1), $W_{64}^{kn} = e^{-j2\pi kn/64} = \cos(2\pi kn/64) - j\sin(2\pi kn/64)$ denotes the twiddle factor for the indices k and n; n is the time index, k is the frequency index and $j = \sqrt{-1}$.
In this algorithm there are three ($\log_4 64$) stages, each a uniform 64-point radix-4 process with 16 radix-4 butterfly elements. The signal flow graph of (1) is shown in Fig. 3.
Figure 3. Signal flow graph of 64-point radix-4 FFT.
In Fig. 3, the first stage operates on the 64 input samples; the second stage operates on the 64 values output by the first stage; the same process is applied to the third (last) stage, whose results are the output samples. The dotted lines represent the boundaries of each stage. The decomposition corresponds to the decimation-in-frequency computation.
III. ARCHITECTURE
A pipelined 64-point radix-4 processor based on the above algorithm is shown in Fig. 4. Each stage produces four butterfly-element outputs on each cycle. Each stage contains a commutator and a butterfly element.
As shown in Fig. 4, the first and second stages of the 64-point radix-4 FFT algorithm (shown in Fig. 3) are designed using the butterfly element and commutator2 accordingly. The last stage of the FFT algorithm is designed using the butterfly element alone. Commutator1 and commutator2 operate in serial-to-parallel order and parallel-to-serial order, respectively.
Figure 4. 64-point radix-4 FFT processor.
IV. RESULT
The 64-point radix-4 FFT processor is implemented in Verilog HDL (as an RTL-level model) and synthesized for Xilinx FPGA devices (XCV1000E-8HQ240, X2V3000-6FF1152, X2V6000-6FF1152 and XC2VP30-7FF1152). The processor is verified for 16-bit input and output data and is compared against results obtained from a MATLAB simulation of the OFDM system.
The results obtained from the synthesis and simulations are summarized in Table 1. Different target devices gave different results. For instance, the XCV1000E-8HQ240, X2V3000-6FF1152, X2V6000-6FF1152 and XC2VP30-7FF1152 require 4,352 out of 12,288, 3,304 out of 14,336, 3,304 out of 33,792 and 3,328 out of 13,696 SLICEs, respectively. The area utilization of each device is shown in Fig. 5, Fig. 6, Fig. 7 and Fig. 8, respectively. The XC2VP30-7FF1152 gives the best processing speed, with a data throughput of 32.74 million samples per second, and the XCV1000E-8HQ240 gives the worst, at 20.12 million samples per second. In addition, the estimated power consumption of each device ranges from 359 mW to 432 mW. The small differences in power consumption are due to the different architectures of the Xilinx device families (RAM and routing utilization).
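The SLICE counts quoted in Section IV can be converted into utilization percentages with a few lines of arithmetic. The sketch below is only a cross-check of the published figures. It assumes the percentages in Table 1 are truncated to whole numbers rather than rounded, since that is what reproduces the 9% listed for the X2V6000. The dictionary and function names are illustrative.

```python
# (used, available) SLICE counts as quoted in the results.
SLICES = {
    "XCV1000E-8HQ240": (4352, 12288),
    "X2V3000-6FF1152": (3304, 14336),
    "X2V6000-6FF1152": (3304, 33792),
    "XC2VP30-7FF1152": (3328, 13696),
}


def utilization_percent(used, available):
    """Utilization truncated to a whole percent (assumed reporting style)."""
    return used * 100 // available


for device, (used, available) in SLICES.items():
    print(f"{device}: {utilization_percent(used, available)}% of SLICEs")
```

Under this assumption the largest figure, 35% on the XCV1000E, matches the utilization quoted in the abstract.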
TABLE 1. SUMMARIZED RESULTS

General specifications (all devices): 64-point radix-4 FFT processor (RTL level); constant data throughput for every clock cycle; data latency 96 cycles; 16-bit complex word length.

Item | VirtexE (XCV1000E-8HQ240) [1] | VirtexII (X2V3000-6FF1152) [1] | VirtexII (X2V6000-6FF1152) [1] | VirtexII-Pro (XC2VP30-7FF1152) [1]
--- | --- | --- | --- | ---
Voltage (Vccint / Vcco / Vccaux) | 1.8 V / 3.3 V / n.a. | 1.5 V / 3.3 V / 3.3 V | 1.5 V / 3.3 V / 3.3 V | 1.5 V / 2.5 V / 2.5 V
Number of SLICEs | 4352 out of 12288 (35%) | 3304 out of 14336 (23%) | 3304 out of 33792 (9%) | 3328 out of 13696 (24%)
Number of MULT18X18s | n.a. | 12 out of 96 (12%) | 12 out of 144 (8%) | 12 out of 136 (8%)
Maximum system clock (fmax) | 20.12 MHz (49.700 ns) | 24.57 MHz (40.694 ns) | 27.787 MHz (35.988 ns) | 32.74 MHz (30.545 ns)
Data throughput | 20.12 million samples/s | 24.57 million samples/s | 27.787 million samples/s | 32.74 million samples/s
Estimated power consumption at fmax | 414 mW | 359 mW | 360 mW | 432 mW

[1] Available in the lab
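The clock and power rows of Table 1 can be cross-checked against each other. The sketch below is plain arithmetic on the published values, with hypothetical helper names: it derives fmax from each reported clock period, converts the 96-cycle latency into microseconds, and averages the four power estimates.

```python
# Minimum clock periods (ns) and power estimates (mW) as listed in Table 1.
PERIOD_NS = {
    "XCV1000E-8HQ240": 49.700,
    "X2V3000-6FF1152": 40.694,
    "X2V6000-6FF1152": 35.988,
    "XC2VP30-7FF1152": 30.545,
}
POWER_MW = [414, 359, 360, 432]
LATENCY_CYCLES = 96  # data latency from the general specifications


def fmax_mhz(period_ns):
    """Maximum clock frequency in MHz for a given minimum period in ns."""
    return 1000.0 / period_ns


def latency_us(period_ns, cycles=LATENCY_CYCLES):
    """Pipeline latency in microseconds when clocked at fmax."""
    return cycles * period_ns / 1000.0


def mean_power_mw(values=POWER_MW):
    """Average of the four estimated power figures."""
    return sum(values) / len(values)


for device, period in PERIOD_NS.items():
    print(f"{device}: fmax = {fmax_mhz(period):.3f} MHz, "
          f"latency = {latency_us(period):.2f} us")
print(f"mean estimated power = {mean_power_mw():.2f} mW")
```

Because the design delivers one sample per clock cycle, the throughput in samples per second equals fmax, which is why the two rows of Table 1 carry the same numbers.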
V. CONCLUSION
This paper has presented a novel architecture for a pipelined 64-point radix-4 FFT processor suitable for OFDM systems. Realizing the radix-4 butterfly element with a reuse technique significantly reduces hardware complexity. Table 1 shows an estimated power consumption of about 400 mW (359 mW to 432 mW, depending on the device), which is suitable for low-power broadband system requirements. In the future, the results and performance of the processor could be improved further by targeting ASICs.
ACKNOWLEDGMENT
The author would like to thank Dr. Zulkalnain Mohd. Yusof for discussion and support, and Mazlaini Yahya for reviewing this paper.
REFERENCES
[1] R. Van Nee and R. Prasad, "OFDM for Wireless Multimedia Communications", Norwell, MA: Artech House, 2000.
[2] W. C. Yeh and C. W. Jen, "High-Speed and Low-Power Split-Radix FFT", IEEE Trans. Signal Processing, Vol. 51, No. 3, Mar. 2003.
[3] P. S. Chow, J. C. Tu and J. M. Cioffi, "Performance Evaluation of a Multichannel Transceiver System for ADSL and VHDSL Services", IEEE J. Selected Areas, Vol. SAC-9, No. 6, pp. 909-919, Aug. 1991.
[4] M. Yoshida, E. Ishizu, N. Yamashita and Y. Amezawa, "OFDM Transmission For ISI Channels Using Variable-Length Pilot Symbols And Pre-FFT Equalizer With Enhanced MRC Diversity Reception", GLOBECOM '03, IEEE, Vol. 4, 1-5 Dec. 2003, pp. 2290-2294.
[5] IEEE 802.11a, "High Speed Physical Layer in the 5 GHz Band", 1999.
[6] J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series", Math. Comput., Vol. 19, pp. 297-301, April 1965.
[7] M. Hassan, T. Arslan and J. S. Thompson, "A Novel Coefficient Ordering based Low Power Pipelined Radix-4 FFT Processor for Wireless LAN Applications", IEEE Transactions on Consumer Electronics, Vol. 49, No. 1, February 2003.
Figure 5. FFT processor physical placement for the XCV1000E-8HQ240 device.
Figure 6. FFT processor physical placement for the X2V3000-6FF1152 device.
Figure 7. FFT processor physical placement for the X2V6000-6FF1152 device.
Figure 8. FFT processor physical placement for the XC2VP30-7FF1152 device.