OFDM Transmitter Chain

OFDM baseband transceiver data chain – Implementation of a 64-point IFFT module for the 802.11a OFDM specification

Team: Hemanth Bettachar (4388014), Saurabh Patodia (4450683)

Abstract: OFDM is the preferred multi-carrier transmission scheme in most communication systems today. In the OFDM scheme, QAM, QPSK and similar symbols in the frequency domain are mapped onto orthogonal sub-carriers in the time domain. This enables efficient spectrum utilization and high data throughput with low inter-symbol interference. The orthogonality of the sub-carriers is achieved by the IFFT/FFT module in the transmitter/receiver data chain. The primary goal of this project is to explore architectures for an efficient implementation of the 64-point IFFT core for the 802.11 wireless standard specifications. We aim to compare the performance, in terms of area, power and speed, of the flat 6-stage 64-point IFFT core architecture with the 1-stage 64-point folded IFFT core architecture. Our implementation scope covers the architectural conception of the IFFT core, SystemVerilog implementation and verification of the fundamental blocks, integration of the fundamental blocks into the top-level IFFT core, functional verification of the top-level IFFT core using MATLAB co-simulation, and synthesis and testability analysis of the IFFT core.

INTRODUCTION

OFDM Transmitter Chain

The figure above represents the OFDM architecture used in applications such as 802.11 WLAN, GPS and other communication systems. The main focus of the project is to develop the radix-2 64-point 16-bit IFFT/FFT core, which forms the heart of the OFDM transmitter chain. The 64-point IFFT/FFT core developed in this project can be used in the OFDM transmitter chain for all of these applications; we target the 802.11a wireless standard as its primary use case. The core receives 64 complex frequency-domain samples of 16-bit data from the system data bus in parallel, performs the 64-point IFFT (Inverse Fast Fourier Transform), and passes the 64 16-bit complex time-domain outputs, in parallel, to the next block in the OFDM transmitter data path.

OFDM is a multi-carrier transmission scheme that divides the available spectrum into multiple carriers, each modulated by a low-data-rate encoding scheme. The primary advantage of using OFDM for data transmission at the physical layer is that it is highly efficient in its use of the spectrum. The sub-carriers are synthesized to be orthogonal to each other, so they can be placed very close together in the spectrum, boosting the throughput of the system while avoiding interference. The orthogonality of the carriers is guaranteed by the IFFT module in the transmitter chain. With the introduction of powerful and cheap DSP hardware, it is now feasible to implement the IFFT algorithm efficiently in hardware. The IFFT is computationally far more efficient than a direct IDFT (on the order of N log2 N operations instead of N^2 for N points), and its regular butterfly structure allows the incoming data stream to be processed in parallel, which makes it the preferred implementation scheme.
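The orthogonality property described above can be checked numerically. The following is a minimal Python sketch, not part of the project's RTL or MATLAB flow; the function names `subcarrier` and `inner_product` are illustrative. It assumes the usual complex-exponential sub-carriers exp(j*2*pi*k*n/N) with N = 64.

```python
import cmath

N = 64  # number of sub-carriers, matching the 64-point IFFT


def subcarrier(k, n_points=N):
    """Samples of sub-carrier k over one OFDM symbol: exp(j*2*pi*k*n/N)."""
    return [cmath.exp(2j * cmath.pi * k * n / n_points) for n in range(n_points)]


def inner_product(k, l, n_points=N):
    """Inner product <s_k, s_l> = sum over n of s_k[n] * conj(s_l[n])."""
    a = subcarrier(k, n_points)
    b = subcarrier(l, n_points)
    return sum(x * y.conjugate() for x, y in zip(a, b))


if __name__ == "__main__":
    # Same sub-carrier: the magnitude equals N (up to floating-point error).
    print(abs(inner_product(3, 3)))
    # Different sub-carriers: the magnitude is zero (up to floating-point error).
    print(abs(inner_product(3, 5)))
```

Because the cross terms vanish, each sub-carrier can be recovered at the receiver without interference from its neighbours, even though their spectra overlap.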

This project aims to develop an efficient complex 64-point 16-bit IFFT core. This core could be used in any OFDM application to generate the 64-sample time-domain output from a 64-sample frequency-domain input. It uses the butterfly structure (shown below) as its fundamental building block to compute the 64-point DIT-IFFT efficiently.

Butterfly structure

We have streamlined our core for the 802.11a standard; the specifications are as below:

Features                 Our FFT/IFFT core                                  802.11a Spec
64-point FFT             Supported                                          Supported
Forward/Inverse FFT      Supported                                          Supported
Data Rates (Mbits/sec)   89600 (for flat 6-stage DIT-FFT architecture)      6-54
Input Word Length        16 bits supported                                  12 bits
Output Word Length       16 bits supported                                  14 bits
Twiddle Word Length      16 bits supported                                  10 bits
Self-Sorted Outputs      Supported                                          Supported

The comparison above shows that our FFT/IFFT core can be used for 802.11a WLAN applications. The block receives 64 complex samples in parallel from the serial-to-parallel converter, processes these samples in parallel through the six 16-bit fixed-point butterfly stages, and generates the time-domain output. This output is then fed to the parallel-to-serial converter in the OFDM transmitter data path.

Our 64-point IFFT core architecture:

6-stage 64-point FFT/IFFT core flat architecture

Previous Work:

Previous work in this domain has been aimed primarily at architectural exploration of efficient FFT/IFFT cores for various use cases and applications. It ranges from low-power architectures for 64-point IFFT/FFT cores to highly efficient parallel-pipelined architectures for high-speed, high-performance applications. Mixed-radix and multi-path delay FFT architectures have also been explored in some detail to demonstrate the trade-off between area, hardware reuse and processing time while meeting the system specifications. Considerable research effort has also gone into developing prototypes for FFT architectures that employ folding transformation and register minimization techniques.

References:

i. An approach to simplify the Design of IFFT/FFT cores of OFDM systems, Ainhoa Cortés, Igone Vélez, Juan F. Sevillano, and Andoni Irizar, IEEE Transactions on Consumer Electronics, Vol. 52, No. 1, February 2006

ii. A Low-Power 64-Point FFT/IFFT Design for IEEE 802.11a WLAN Application, Chin-Teng Lin, Yuan-Chu Yu, and Lan-Da Van, ISCAS 2006

iii. A Novel 64-point FFT/IFFT Processor For IEEE 802.11(A) standard, K. Maharatna, E. Grass, U. Jagdhold, ICASSP 2003

iv. An efficient OFDM Transceiver Design suitable to IEEE 802.11a WLAN Standard, T. Suresh, K. L. Shunmuganathan, IJCSIS 2010

v. Design of an FFT/IFFT Processor for MIMO OFDM Systems, Yu-Wei Lin and Chen-Yi Lee, IEEE Transactions on Circuits and Systems, 2007

vi. Pipelined Parallel FFT architectures via Folding Transformation, Manohar Ayinala, Michael Brown, Keshab K. Parhi, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2011

vii. http://www.ie.u-ryukyu.ac.jp/~wada/design07/spec_e.html

viii. Keshab K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation

ix. Milos D. Ercegovac, Tomas Lang, Digital Arithmetic

x. John G. Proakis, Dimitris G. Manolakis, Digital Signal Processing: Principles, Algorithms and Applications

PROPOSED ARCHITECTURE

Overall system block diagram of 802.11 OFDM:

Our Proposed FFT/IFFT Core:

Interface of our FFT/IFFT core to overall OFDM Transmitter chain is shown below:

[Figure: 64-pt IFFT block. Inputs: clk, rst_n, real_input_ (64x16 bit), imag_input_ (64x16 bit). Internals: Stage 1 to Stage 6, each with 32 butterfly modules; 16x16 multiplier with rounding to 16 bit; twiddle register array (32x2x16 bit); controller. Outputs: real_output_ (64x16 bit), imag_output_ (64x16 bit).]

FFT/IFFT CORE BLOCK DESCRIPTION

TABLE OF PINS:

Signal name     I/O                       Description
clk             Input                     Clock signal
rst_n           Input                     Asynchronous reset signal to reset the state machine and FFT computation
real_input_     Input (64x16-bit bus)     Real values for complex input samples
imag_input_     Input (64x16-bit bus)     Imaginary values for complex input samples
real_output_    Output (64x16-bit bus)    Real values for complex output samples
imag_output_    Output (64x16-bit bus)    Imaginary values for complex output samples

FUNCTIONAL DESCRIPTION OF MAJOR BLOCKS:

64-POINT IFFT BLOCK (TOP MODULE):

This is the top-level module of our design. This module computes the 64-point inverse FFT on complex inputs and provides complex outputs. The twiddle coefficients are read in the reset state from the Twiddle-ROM file. The 64-point IFFT is computed in parallel using 192 butterfly structures connected in 6 stages of 32 butterflies each. The final output is computed by multiplying the 6th-stage outputs by the value 1/64 stored in an internal register. Once the IFFT computation is done, the 64 complex IFFT values are sent out in parallel in the same clock cycle as Y[0].real, Y[0].imaginary, Y[1].real, Y[1].imaginary, ..., Y[63].real and Y[63].imaginary.
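The six-stage structure described above can be mirrored in a short floating-point reference model. This is a hedged Python sketch of the textbook radix-2 DIT algorithm (bit-reversed input order, log2(N) stages of N/2 butterflies, final scaling by 1/N); it is not the project's MATLAB model, and the exact wiring order of the RTL is not claimed to be identical. The names `bit_reverse`, `ifft_radix2_dit` and `idft_direct` are illustrative.

```python
import cmath


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (e.g. 1 -> 32 for 6 bits)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def ifft_radix2_dit(freq):
    """Radix-2 decimation-in-time IFFT; len(freq) must be a power of two."""
    n = len(freq)
    stages = n.bit_length() - 1
    assert 1 << stages == n, "length must be a power of two"
    # Inputs are consumed in bit-reversed order so outputs come out sorted.
    data = [freq[bit_reverse(i, stages)] for i in range(n)]
    for s in range(1, stages + 1):          # 6 stages for n = 64
        span = 1 << s                        # butterfly group size: 2, 4, ..., n
        half = span // 2
        for start in range(0, n, span):      # n/2 butterflies per stage in total
            for k in range(half):
                # Inverse transform: positive exponent in the twiddle factor.
                w = cmath.exp(2j * cmath.pi * k / span)
                a = data[start + k]
                b = data[start + k + half] * w
                data[start + k] = a + b          # Y1 = X1 + X2 * W
                data[start + k + half] = a - b   # Y2 = X1 - X2 * W
    return [v / n for v in data]             # final normalization by 1/N


def idft_direct(freq):
    """Direct O(N^2) inverse DFT, used only as a cross-check."""
    n = len(freq)
    return [
        sum(freq[k] * cmath.exp(2j * cmath.pi * k * m / n) for k in range(n)) / n
        for m in range(n)
    ]


if __name__ == "__main__":
    x = [complex(2 * k + 1, 2 * k + 2) for k in range(64)]
    fast, slow = ifft_radix2_dit(x), idft_direct(x)
    print(max(abs(a - b) for a, b in zip(fast, slow)))
```

For N = 64 the loop executes 6 stages of 32 butterflies, that is 192 butterfly evaluations, which matches the instance count of the flat architecture.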

SUBMODULES OF 64-POINT IFFT BLOCK (TOP MODULE):

Controller: The controller block controls the two states of the IFFT block: 'reset_state'

and 'processing_state'. In the 'reset_state', active low rst_n is asserted and the IFFT

block is idle. In the 'processing_state', the rst_n is de-asserted and the IFFT block

computes the 64-point inverse Fourier transform every clock cycle.

Multiplier16x16: This is the final stage of the IFFT block. It normalizes the 64-point 16-bit complex output samples of stage 6 by a factor of 1/64, using a 16x16 signed Booth multiplier.
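The normalization constant can be sanity-checked with a few lines of Python. The sketch below assumes a Q1.15 fixed-point interpretation of the 16-bit constant 16'h0200 used in the RTL, and assumes round-half-up when dropping the 15 fractional bits; the rounding behaviour of the actual multiplier module is not shown in this report, so that part is a guess.

```python
Q15_ONE = 1 << 15    # 1.0 in Q1.15 fixed point
NORM_Q15 = 0x0200    # constant from the RTL: 512 / 32768 = 1/64


def normalize_q15(sample):
    """Multiply a signed integer sample by 1/64 in Q1.15.

    Rounding is round-half-up (assumed, not taken from the RTL).
    """
    product = sample * NORM_Q15          # full-precision product
    return (product + (1 << 14)) >> 15   # add half an LSB, drop 15 fraction bits


if __name__ == "__main__":
    print(NORM_Q15 / Q15_ONE)    # 0.015625, i.e. 1/64
    print(normalize_q15(2048))   # 32
    print(normalize_q15(-2048))  # -32
```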

Stage 1 to Stages 6: Each stage employs 32 instances of 2-point complex butterfly

computation units.

2-point_butterfly structure : The butterfly structure computes the complex outputs

Y1 and Y2 ( Y1 = X1 + X2 * Wx and Y2 = X1 – X2 * Wx ). This structure is instantiated

192 times to compute the 64-point 16-bit complex IFFT values.
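As a software illustration of the butterfly equations, the following Python sketch models one butterfly on Q1.15 integers. It follows the idea used in the RTL of aligning X1 with the 32-bit product (X1 shifted left by 15 bits) and rounding only once at the end. The round-half-up rule and the absence of overflow handling are simplifying assumptions, so this is not a bit-exact model of the hardware; `butterfly_q15` is an illustrative name.

```python
FRAC_BITS = 15  # Q1.15: 15 fractional bits


def _round_to_16(acc):
    """Drop 15 fractional bits with round-half-up (assumed rounding rule)."""
    return (acc + (1 << (FRAC_BITS - 1))) >> FRAC_BITS


def butterfly_q15(x1, x2, w):
    """Compute Y1 = X1 + X2*W and Y2 = X1 - X2*W.

    x1, x2, w are (real, imag) tuples of signed integers in Q1.15.
    Overflow/saturation is not modelled.
    """
    x1r, x1i = x1
    x2r, x2i = x2
    wr, wi = w
    # Complex product A = X2 * W, kept at full precision (30 fractional bits).
    a_r = x2r * wr - x2i * wi
    a_i = x2r * wi + x2i * wr
    # Align X1 with the product before adding; round only at the very end.
    x1r_wide = x1r << FRAC_BITS
    x1i_wide = x1i << FRAC_BITS
    y1 = (_round_to_16(x1r_wide + a_r), _round_to_16(x1i_wide + a_i))
    y2 = (_round_to_16(x1r_wide - a_r), _round_to_16(x1i_wide - a_i))
    return y1, y2


if __name__ == "__main__":
    half = 1 << 14  # 0.5 in Q1.15
    print(butterfly_q15((100, 200), (40, -60), (half, 0)))
```

Postponing the rounding keeps a single rounding error per butterfly output instead of one per multiplication.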

[Figure: 2-point butterfly datapath. Four 16-bit multipliers take X2.r, X2.i, W.r and W.i and produce the sum/carry pairs {S1,C1}, {S2,C2}, {S3,C3} and {S4,C4}; two 32-bit 4:2 compressors produce {S11,C11} and {S21,C21}; four 32-bit 3:2 compressors combine these with X1.r and X1.i; 32-bit Kogge-Stone adders with rounding to 16 bits produce Y1.r, Y1.i (Y1) and Y2.r, Y2.i (Y2).]

SUB-MODULES OF BUTTERFLY STRUCTURE:

16bit Multiplier block: The block gets the complex inputs X1.r, X1.i, X2.r and X2.i from the complex input buffer and the complex twiddle values W.r and W.i from the twiddle register array.

This block computes the signed multiplications X2.r * W.r, X2.i * W.i, X2.r * W.i and X2.i * W.r. The intermediate 32-bit output of each multiplier is preserved in 32-bit Sum and Carry format {S1,C1}, {S2,C2}, {S3,C3}, {S4,C4} to reduce overflow and rounding errors.
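The multiplier relies on radix-4 Booth recoding, which turns a 16-bit multiplier operand into eight signed digits in {-2, -1, 0, 1, 2}, one per partial product. The Python sketch below illustrates the recoding arithmetic only (digit = -2*b(2i+1) + b(2i) + b(2i-1)); it does not model the Wallace tree or the sign-extension trick of the RTL, and the function names are illustrative.

```python
def booth_radix4_digits(y, width=16):
    """Radix-4 Booth digits of a signed `width`-bit integer, least significant first.

    Each digit is -2*b[2i+1] + b[2i] + b[2i-1], with b[-1] = 0.
    """
    y_bits = y & ((1 << width) - 1)   # two's-complement bit pattern
    digits = []
    prev = 0                          # b[-1]
    for i in range(0, width, 2):
        b0 = (y_bits >> i) & 1
        b1 = (y_bits >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_multiply(x, y, width=16):
    """Multiply by summing the Booth-encoded partial products d_i * x * 4**i."""
    return sum(d * x * (4 ** i) for i, d in enumerate(booth_radix4_digits(y, width)))


if __name__ == "__main__":
    print(booth_radix4_digits(5))    # [1, 1, 0, 0, 0, 0, 0, 0]
    print(booth_multiply(-7, 9))     # -63
```

Halving the number of partial products (8 instead of 16) is what allows the compression tree that follows to be only four levels deep.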

4:2 Compressor 32bit: This block compresses the 32-bit Sum and Carry outputs of the multipliers to compute X2.r * W.r - X2.i * W.i and X2.r * W.i + X2.i * W.r in 32-bit Sum and Carry format, {S11,C11} and {S21,C21} respectively.

3:2 Compressor 32bit: This block compresses the 32-bit Sum and Carry outputs of the 4:2 compressors together with the X1 real and imaginary inputs, to compute X1.r + (X2.r * W.r - X2.i * W.i), X1.i + (X2.r * W.i + X2.i * W.r), X1.r - (X2.r * W.r - X2.i * W.i) and X1.i - (X2.r * W.i + X2.i * W.r) in 32-bit Sum and Carry format, {S31,C31}, {S33,C33}, {S32,C32} and {S34,C34} respectively.
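Carry-save compression can be illustrated with bitwise operations on 32-bit words. The Python sketch below assumes the carry word is already shifted to its correct weight (so that sum + carry equals the total modulo 2^32) and shows how a subtraction of one sum/carry pair from another can be folded into a 4:2 compression by inverting the operands and inserting the two "+1" corrections into free LSB slots. It is an illustration of the technique, not a gate-level model of the project's `wallace4to2` or `wallace3to2` modules.

```python
MASK32 = 0xFFFFFFFF


def compress_3to2(a, b, c):
    """Full-adder array: returns (s, carry) with s + carry == a + b + c mod 2**32."""
    s = (a ^ b ^ c) & MASK32
    carry = (((a & b) | (b & c) | (a & c)) << 1) & MASK32  # carry LSB is always 0
    return s, carry


def compress_4to2(a, b, c, d):
    """Two cascaded 3:2 stages: s + carry == a + b + c + d mod 2**32."""
    s, carry = compress_3to2(a, b, c)
    return compress_3to2(s, carry, d)


def carry_save_subtract(s1, c1, s2, c2):
    """Return (s, c) with s + c == (s1 + c1) - (s2 + c2) mod 2**32.

    -(s2 + c2) == ~s2 + ~c2 + 2. One '+1' goes into the empty LSB of c1,
    the other into the empty LSB of the output carry word.
    """
    assert c1 & 1 == 0, "c1 must have a free LSB slot"
    s, c = compress_4to2(s1, ~s2 & MASK32, c1 | 1, ~c2 & MASK32)
    return s, c | 1


if __name__ == "__main__":
    s1, c1 = compress_3to2(1000, 234, 0)   # represents 1234
    s2, c2 = compress_3to2(5000, 678, 0)   # represents 5678
    s, c = carry_save_subtract(s1, c1, s2, c2)
    print(((s + c) & MASK32) == ((1234 - 5678) & MASK32))  # True
```

Keeping intermediate values in sum/carry form means no carry has to propagate across the word until the final vector-merging adder.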

KoggeStone Adder 32bit: This high-speed prefix adder computes the final real and imaginary outputs by vector merging of the 3:2 compressor outputs {S31,C31}, {S32,C32}, {S33,C33} and {S34,C34}. This block also performs 32-bit to 16-bit rounding and outputs the 16-bit butterfly values Y1.r, Y1.i, Y2.r and Y2.i.
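The parallel-prefix idea behind the Kogge-Stone adder can be sketched in Python using whole-word bit operations. The model below computes generate and propagate signals and doubles the look-back distance at each level (1, 2, 4, 8, 16 for 32 bits). It returns a (width+1)-bit result including the carry-out; rounding to 16 bits is not included. This is a behavioural illustration, not the project's `kogge_stone_adder` module.

```python
def kogge_stone_add(a, b, width=32):
    """Add two unsigned `width`-bit words with a Kogge-Stone prefix network.

    Returns a (width + 1)-bit result: sum bits plus carry-out in bit `width`.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    g = a & b            # bit generate
    p = a ^ b            # bit propagate
    p0 = p               # keep the original propagate for the final sum
    dist = 1
    while dist < width:  # log2(width) prefix levels
        g = (g | (p & (g << dist))) & mask   # uses the old p on purpose
        p = (p & (p << dist)) & mask
        dist <<= 1
    # After the prefix network, g[i] is the carry out of bits [i:0].
    carries = (g << 1) & mask                # carry into bit i
    carry_out = (g >> (width - 1)) & 1
    return ((p0 ^ carries) & mask) | (carry_out << width)


if __name__ == "__main__":
    print(hex(kogge_stone_add(0xFFFFFFFF, 1)))   # 0x100000000
    print(kogge_stone_add(123456789, 987654321)) # 1111111110
```

The five prefix levels for a 32-bit word are what give this adder its logarithmic delay, at the cost of more wiring than a ripple-carry adder.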

VERILOG CODES

/**********************************************************************
Module butterfly: This module computes the 16-bit signed complex outputs
Y1 = X1 + X2 * Wx and Y2 = X1 - X2 * Wx from 16-bit signed complex inputs
X1 and X2. As instantiated below, the block uses four 16-bit signed
multiplier units, two 4:2 32-bit compressors, four 3:2 32-bit compressors
and four Kogge-Stone adders. The multiplier outputs are 32-bit sum and
carry outputs, which are combined using 4:2 32-bit compression followed by
3:2 32-bit compression. The final 16-bit signed outputs Y1.r, Y1.i, Y2.r
and Y2.i are then computed by 32-bit vector merging using a high-speed
Kogge-Stone adder and a 32-bit to 16-bit rounding stage. Note that, to
maintain precision, rounding to 16 bits is postponed to the very last
stage. Subtractions are optimized to use the same hardware by applying
complemented inputs and adding 1'b1 at the LSB positions in the available
empty slots of the Wallace tree structure or the vector merging structure.
inputs  : 16-bit signed complex X1.r + jX1.i, X2.r + jX2.i, W.r + jW.i (twiddle coefficient)
outputs : 16-bit signed complex Y1.r + jY1.i, Y2.r + jY2.i
***********************************************************************/

typedef struct {               // complex data type structure definition
    reg signed [15:0] r, i;    // real and imaginary parts
} complex;

module butterfly(input complex X1, X2, W, output complex Y1, Y2);

wire [31:0] S1, C1, S2, C2, S3, C3, S4, C4, S11, C11, S21, C21,
            S31, C31, S32, C32, S33, C33, S34, C34;
wire [32:0] Y1_r_temp1, Y1_i_temp1, Y2_r_temp1, Y2_i_temp1;
reg  [16:0] Y1_r_temp2, Y1_i_temp2, Y2_r_temp2, Y2_i_temp2;
wire [32:0] Y2_i_temp1_r;
reg  [32:0] Y2_i_temp2_r;

// Computing real part A.r = X2.r * W.r - X2.i * W.i
// Subtraction is done by inverting S2 and C2 and adding 1'b1 to the LSB position;
// the other 1'b1 to be added is postponed to the next Wallace tree compression
// as there is no slot left.
booth16f mult1(X2.r, W.r, S1, C1); // 16-bit multiplication generates 32-bit Sum and Carry
booth16f mult2(X2.i, W.i, S2, C2); // 16-bit multiplication generates 32-bit Sum and Carry
wallace4to2 wallace11(S1, ~S2, {C1[31:1],1'b1}, ~C2, S11, C11); // 32-bit Sum and Carry for 4 input 32-bit PPs

// Computing real output Y1.r = X1.r + A.r
// The 1'b1 postponed from the previous stage is also added at the LSB position.
// Input X1.r is added at bit positions [30:15] and its sign extension
// is added at bit position 31.
wallace3to2 wallace31(S11, {C11[31:1],1'b1}, {X1.r[15],X1.r,15'h0000}, S31, C31); // 3:2 32-bit compression
kogge_stone_adder KS31(Y1_r_temp1, S31, C31); // Vector merging of 32-bit inputs

// Rounding scheme: 32-bit output to 16-bit output (future scope).
assign Y1.r = (Y1_r_temp1[14:0] >= (15'h4000) ?
                  ((Y1_r_temp1[30] == 1'b0) ? (Y1_r_temp1[30:15] + 1) : (Y1_r_temp1[30:15])) :
                  Y1_r_temp1[30:15]);

// Computing real output Y2.r = X1.r - A.r
// Subtraction is done by inverting S11 and C11 and adding 1'b1 to the LSB positions twice.
// Input X1.r is added at bit positions [30:15] and its sign extension
// is added at bit position 31.
wallace3to2 wallace32(~S11, {~C11[31:1],1'b1}, {X1.r[15],X1.r,15'h0000}, S32, C32); // 3:2 32-bit compression
kogge_stone_adder KS32(Y2_r_temp1, S32, {C32[31:1],1'b1}); // Vector merging of 32-bit inputs

// Rounding scheme: 32-bit output to 16-bit output (future scope).
assign Y2.r = (Y2_r_temp1[14:0] >= (15'h4000) ?
                  ((Y2_r_temp1[30] == 1'b0) ? (Y2_r_temp1[30:15] + 1) : (Y2_r_temp1[30:15])) :
                  Y2_r_temp1[30:15]);

// Computation of imaginary part A.i = X2.r * W.i + X2.i * W.r
booth16f mult3(X2.r, W.i, S3, C3); // 16-bit multiplication generates 32-bit Sum and Carry
booth16f mult4(X2.i, W.r, S4, C4); // 16-bit multiplication generates 32-bit Sum and Carry
wallace4to2 wallace21(S3, S4, C3, C4, S21, C21); // 32-bit Sum and Carry for 4 input 32-bit PPs

// Computing imaginary output Y1.i = X1.i + A.i
// Input X1.i is added at bit positions [30:15] and its sign extension
// is added at bit position 31.
wallace3to2 wallace33(S21, C21, {X1.i[15],X1.i,15'h0000}, S33, C33); // 3:2 32-bit compression
kogge_stone_adder KS33(Y1_i_temp1, S33, C33); // Vector merging of 32-bit inputs

// Rounding scheme: 32-bit output to 16-bit output (future scope).
assign Y1.i = (Y1_i_temp1[14:0] >= (15'h4000) ?
                  ((Y1_i_temp1[30] == 1'b0) ? (Y1_i_temp1[30:15] + 1) : (Y1_i_temp1[30:15])) :
                  Y1_i_temp1[30:15]);

// Computing imaginary output Y2.i = X1.i - A.i
// Input X1.i is added at bit positions [30:15] and its sign extension
// is added at bit position 31.
wallace3to2 wallace34(~S21, ~C21, {X1.i[15],X1.i,15'h0001}, S34, C34); // 3:2 32-bit compression
kogge_stone_adder KS34(Y2_i_temp1, S34, {C34[31:1],1'b1}); // Vector merging of 32-bit inputs

// Rounding scheme: 32-bit output to 16-bit output (future scope).
assign Y2.i = (Y2_i_temp1[14:0] >= (15'h4000) ?
                  ((Y2_i_temp1[30] == 1'b0) ? (Y2_i_temp1[30:15] + 1) : (Y2_i_temp1[30:15])) :
                  Y2_i_temp1[30:15]);

endmodule

/**********************************************************************
Module booth16f: This module computes the product of two 16-bit signed
numbers and outputs two 32-bit sum and carry values. This is required in
the butterfly module for doing intermediate 32-bit computation, both to
preserve precision and to improve the computation speed of the butterfly,
since vector merging is postponed to the last stage of the butterfly
computation. This module uses the Booth encoding scheme followed by a
4-stage Wallace tree structure. The Wallace tree structure is optimized by
avoiding sign extension of the partial products, using the concept that
'a4a4a4a4a4a3a2a1 = 0000a4'a3a2a1 + 11110000' and accounting for the
11110000 addition in the last row along with partial product u.
Signed inputs : x[15:0], y[15:0]
Signed output : sum_op[31:0], carry_op[31:0]
**********************************************************************/
module booth16f(x, y, sum_op, carry_op);

input  [15:0] x, y;
output [31:0] sum_op, carry_op;
reg  [16:0] a, b, c, d, e, f, g, h;
reg  [31:0] u;
wire [14:0] s11; wire [15:1] c11;
wire [14:0] s12; wire [15:1] c12;
wire [11:0] s13; wire [12:1] c13;
wire [14:0] s21; wire [15:1] c21;
wire [15:0] s22; wire [16:1] c22;
wire [18:0] s31; wire [19:1] c31;
wire [31:0] s41; wire [31:0] c41;

// BOOTH ENCODING ALGO
// b(2i+1) b(2i) b(2i-1)   bi' (recoded value)
//    0      0      0         0
//    0      0      1         1
//    0      1      0         1
//    0      1      1         2
//    1      0      0        -2
//    1      0      1        -1
//    1      1      0        -1
//    1      1      1        -0

always @(x or y) begin
    u = 32'hAAAB0000;
    case (y[1:0]) // Generation of encoded PP1
        2'b00 : begin a = 17'b00000000000000000; u[0] = 0; end // 0
        2'b01 : begin a = {x[15], x[15:0]};      u[0] = 0; end // 1
        2'b10 : begin a = {~x[15:0], 1'b1};      u[0] = 1; end // -2
        2'b11 : begin a = {~x[15], ~x[15:0]};    u[0] = 1; end // -1
    endcase
    case (y[3:1]) // Generation of encoded PP2
        3'b000 : begin b = 17'b00000000000000000; u[2] = 0; end // 0
        3'b001 : begin b = {x[15], x[15:0]};      u[2] = 0; end // 1
        3'b010 : begin b = {x[15], x[15:0]};      u[2] = 0; end // 1
        3'b011 : begin b = {x[15:0], 1'b0};       u[2] = 0; end // 2
        3'b100 : begin b = {~x[15:0], 1'b1};      u[2] = 1; end // -2
        3'b101 : begin b = {~x[15], ~x[15:0]};    u[2] = 1; end // -1
        3'b110 : begin b = {~x[15], ~x[15:0]};    u[2] = 1; end // -1
        3'b111 : begin b = 17'b00000000000000000; u[2] = 0; end // 0
    endcase
    case (y[5:3]) // Generation of encoded PP3
        3'b000 : begin c = 17'b00000000000000000; u[4] = 0; end // 0
        3'b001 : begin c = {x[15], x[15:0]};      u[4] = 0; end // 1
        3'b010 : begin c = {x[15], x[15:0]};      u[4] = 0; end // 1
        3'b011 : begin c = {x[15:0], 1'b0};       u[4] = 0; end // 2
        3'b100 : begin c = {~x[15:0], 1'b1};      u[4] = 1; end // -2
        3'b101 : begin c = {~x[15], ~x[15:0]};    u[4] = 1; end // -1
        3'b110 : begin c = {~x[15], ~x[15:0]};    u[4] = 1; end // -1
        3'b111 : begin c = 17'b00000000000000000; u[4] = 0; end // 0
    endcase
    case (y[7:5]) // Generation of encoded PP4
        3'b000 : begin d = 17'b00000000000000000; u[6] = 0; end // 0
        3'b001 : begin d = {x[15], x[15:0]};      u[6] = 0; end // 1
        3'b010 : begin d = {x[15], x[15:0]};      u[6] = 0; end // 1
        3'b011 : begin d = {x[15:0], 1'b0};       u[6] = 0; end // 2
        3'b100 : begin d = {~x[15:0], 1'b1};      u[6] = 1; end // -2
        3'b101 : begin d = {~x[15], ~x[15:0]};    u[6] = 1; end // -1
        3'b110 : begin d = {~x[15], ~x[15:0]};    u[6] = 1; end // -1
        3'b111 : begin d = 17'b00000000000000000; u[6] = 0; end // 0
    endcase
    case (y[9:7]) // Generation of encoded PP5
        3'b000 : begin e = 17'b00000000000000000; u[8] = 0; end // 0
        3'b001 : begin e = {x[15], x[15:0]};      u[8] = 0; end // 1
        3'b010 : begin e = {x[15], x[15:0]};      u[8] = 0; end // 1
        3'b011 : begin e = {x[15:0], 1'b0};       u[8] = 0; end // 2
        3'b100 : begin e = {~x[15:0], 1'b1};      u[8] = 1; end // -2
        3'b101 : begin e = {~x[15], ~x[15:0]};    u[8] = 1; end // -1
        3'b110 : begin e = {~x[15], ~x[15:0]};    u[8] = 1; end // -1
        3'b111 : begin e = 17'b00000000000000000; u[8] = 0; end // 0
    endcase
    case (y[11:9]) // Generation of encoded PP6
        3'b000 : begin f = 17'b00000000000000000; u[10] = 0; end // 0
        3'b001 : begin f = {x[15], x[15:0]};      u[10] = 0; end // 1
        3'b010 : begin f = {x[15], x[15:0]};      u[10] = 0; end // 1
        3'b011 : begin f = {x[15:0], 1'b0};       u[10] = 0; end // 2
        3'b100 : begin f = {~x[15:0], 1'b1};      u[10] = 1; end // -2
        3'b101 : begin f = {~x[15], ~x[15:0]};    u[10] = 1; end // -1
        3'b110 : begin f = {~x[15], ~x[15:0]};    u[10] = 1; end // -1
        3'b111 : begin f = 17'b00000000000000000; u[10] = 0; end // 0
    endcase
    case (y[13:11]) // Generation of encoded PP7
        3'b000 : begin g = 17'b00000000000000000; u[12] = 0; end // 0
        3'b001 : begin g = {x[15], x[15:0]};      u[12] = 0; end // 1
        3'b010 : begin g = {x[15], x[15:0]};      u[12] = 0; end // 1
        3'b011 : begin g = {x[15:0], 1'b0};       u[12] = 0; end // 2
        3'b100 : begin g = {~x[15:0], 1'b1};      u[12] = 1; end // -2
        3'b101 : begin g = {~x[15], ~x[15:0]};    u[12] = 1; end // -1
        3'b110 : begin g = {~x[15], ~x[15:0]};    u[12] = 1; end // -1
        3'b111 : begin g = 17'b00000000000000000; u[12] = 0; end // 0
    endcase
    case (y[15:13]) // Generation of encoded PP8
        3'b000 : begin h = 17'b00000000000000000; u[14] = 0; end // 0
        3'b001 : begin h = {x[15], x[15:0]};      u[14] = 0; end // 1
        3'b010 : begin h = {x[15], x[15:0]};      u[14] = 0; end // 1
        3'b011 : begin h = {x[15:0], 1'b0};       u[14] = 0; end // 2
        3'b100 : begin h = {~x[15:0], 1'b1};      u[14] = 1; end // -2
        3'b101 : begin h = {~x[15], ~x[15:0]};    u[14] = 1; end // -1
        3'b110 : begin h = {~x[15], ~x[15:0]};    u[14] = 1; end // -1
        3'b111 : begin h = 17'b00000000000000000; u[14] = 0; end // 0
    endcase
end

// WALLACE tree structure: sign extension is avoided by using the concept
// signed number a4a4a4a4a4a3a2a1 = 0000a4'a3a2a1 + 11110000

// stage 1: generating the full-adder array for stage-1 (9:6 compression)
// iteration 1: compression of rows 1, 2 and 3 of PPs
genvar i;
generate
    for (i = 0; i <= 11; i = i + 1) begin: fulladd11
        fulladd fa11_(s11[i], c11[i + 1], a[i + 4], b[i + 2], c[i]);
    end
endgenerate
fulladd fa11_12(s11[12], c11[13], ~a[16], b[14], c[12]);
halfadd ha11_13(s11[13], c11[14], b[15], c[13]);
halfadd ha11_14(s11[14], c11[15], ~b[16], c[14]);

// iteration 2: compression of rows 4, 5 and 6 of PPs
generate
    for (i = 0; i <= 11; i = i + 1) begin: fulladd12
        fulladd fa12_(s12[i], c12[i + 1], d[i + 4], e[i + 2], f[i]);
    end
endgenerate
fulladd fa12_12(s12[12], c12[13], ~d[16], e[14], f[12]);
halfadd ha12_13(s12[13], c12[14], e[15], f[13]);
halfadd ha12_14(s12[14], c12[15], ~e[16], f[14]);

// iteration 3: compression of rows 7, 8 and 9 of PPs
fulladd fa13_0(s13[0], c13[1], g[2], h[0], u[14]);
halfadd ha13_1(s13[1], c13[2], g[3], h[1]);
fulladd fa13_2(s13[2], c13[3], g[4], h[2], 1'b1);
generate
    for (i = 3; i <= 11; i = i + 2) begin: fulladd13
        fulladd fa13_(s13[i], c13[i + 1], g[i + 2], h[i], 1'b1);
    end
endgenerate
generate
    for (i = 4; i <= 10; i = i + 2) begin: halfadd13
        halfadd ha13_(s13[i], c13[i + 1], g[i + 2], h[i]);
    end
endgenerate

// stage 2: generating the full-adder array for stage-2 (6:4 compression)
// iteration 1: compression of rows 1, 2 and 3 of stage-1 outputs
generate
    for (i = 0; i <= 3; i = i + 1) begin: fulladd21_a
        fulladd fa21_(s21[i], c21[i + 1], s11[i + 2], c11[i + 2], d[i]);
    end
endgenerate
generate
    for (i = 4; i <= 12; i = i + 1) begin: fulladd21_b
        fulladd fa21_(s21[i], c21[i + 1], s11[i + 2], c11[i + 2], s12[i - 4]);
    end
endgenerate
fulladd fa21_13(s21[13], c21[14], c[15], c11[15], s12[9]);
halfadd ha21_14(s21[14], c21[15], ~c[16], s12[10]);

// iteration 2: compression of rows 4, 5 and 6 of stage-1 outputs
fulladd fa22_0(s22[0], c22[1], c12[2], g[0], u[12]);
halfadd ha22_1(s22[1], c22[2], c12[3], g[1]);
halfadd ha22_2(s22[2], c22[3], c12[4], s13[0]);
generate
    for (i = 3; i <= 13; i = i + 1) begin: fulladd22
        fulladd fa22_(s22[i], c22[i + 1], c12[i + 2], s13[i - 2], c13[i - 2]);
    end
endgenerate
fulladd fa22_14(s22[14], c22[15], g[14], h[12], c13[12]);
fulladd fa22_15(s22[15], c22[16], g[15], h[13], 1'b1);

// stage 3: generating the full-adder array for stage-3 (4:3 compression)
// iteration 1: compression of rows 1, 2 and 3 of stage-2 outputs
fulladd fa31_0(s31[0], c31[1], s21[2], c21[2], e[0]);
fulladd fa31_1(s31[1], c31[2], s21[3], c21[3], e[1]);
fulladd fa31_2(s31[2], c31[3], s21[4], c21[4], u[10]);
fulladd fa31_3(s31[3], c31[4], s21[5], c21[5], c12[1]);
generate
    for (i = 4; i <= 12; i = i + 1) begin: fulladd31
        fulladd fa31_(s31[i], c31[i + 1], s21[i + 2], c21[i + 2], s22[i - 4]);
    end
endgenerate
fulladd fa31_13(s31[13], c31[14], s12[11], c21[15], s22[9]);
halfadd ha31_14(s31[14], c31[15], s12[12], s22[10]);
halfadd ha31_15(s31[15], c31[16], s12[13], s22[11]);
halfadd ha31_16(s31[16], c31[17], s12[14], s22[12]);
halfadd ha31_17(s31[17], c31[18], f[15], s22[13]);
halfadd ha31_18(s31[18], c31[19], ~f[16], s22[14]);

// stage 4: generating the full-adder array for stage-4 (3:2 compression)
// compression of rows 1, 2 and 3 of stage-3 outputs
halfadd ha41_0(s41[0], c41[1], a[0], u[0]);
assign s41[1] = a[1];
assign c41[2] = 1'b0;
fulladd ha41_2(s41[2], c41[3], a[2], b[0], u[2]);
halfadd ha41_3(s41[3], c41[4], a[3], b[1]);
halfadd ha41_4(s41[4], c41[5], s11[0], u[4]);
halfadd ha41_5(s41[5], c41[6], s11[1], c11[1]);
halfadd ha41_6(s41[6], c41[7], s21[0], u[6]);
halfadd ha41_7(s41[7], c41[8], s21[1], c21[1]);
halfadd ha41_8(s41[8], c41[9], s31[0], u[8]);
halfadd ha41_9(s41[9], c41[10], s31[1], c31[1]);
halfadd ha41_10(s41[10], c41[11], s31[2], c31[2]);
halfadd ha41_11(s41[11], c41[12], s31[3], c31[3]);
halfadd ha41_12(s41[12], c41[13], s31[4], c31[4]);
generate
    for (i = 13; i <= 26; i = i + 1) begin: fulladd41
        fulladd fa41_(s41[i], c41[i + 1], s31[i - 8], c31[i - 8], c22[i - 12]);
    end
endgenerate
fulladd fa41_27(s41[27], c41[28], c31[19], s22[15], c22[15]);
fulladd fa41_28(s41[28], c41[29], ~g[16], h[14], c22[16]);
halfadd ha41_29(s41[29], c41[30], h[15], 1'b1);
assign s41[30] = ~h[16];
assign c41[31] = 1'b0;
assign s41[31] = 1'b1;
assign c41[0] = 1'b0;
assign carry_op = {c41[31:1], 1'b0}; // Final signed 32-bit SUM and CARRY generated
assign sum_op = s41;

endmodule

/**********************************************************************
Module ifftcore_64pt: This module computes the 64-point complex DIT-IFFT.
This is the top-level module. It implements a flat architecture of the
radix-2 DIT-IFFT Cooley-Tukey algorithm, computed with 6 stages of
butterfly structures, each stage containing 32 butterfly instances, for a
total of 192 butterfly instances. The output of the final butterfly stage
is normalized by 1/64 using 16-bit Booth multipliers (one per real and
imaginary output). The input to the module is a 128x16-bit parallel input
(64 real and 64 imaginary samples of 16 bits). The 128x16-bit parallel
output (64 real and 64 imaginary samples of 16 bits) is computed in a
single clock cycle.
Complex signed inputs : 128x16 bit X[63:0], X[k] = X[k].r + j X[k].i
Complex signed output : 128x16 bit ifft_out[63:0], ifft_out[k] = ifft_out[k].r + j ifft_out[k].i
**********************************************************************/
module ifftcore_64pt(input complex ifft_in[64], input clk, input rst_n, output complex ifft_out[64]);

complex W[64];
reg [15:0] norm = 16'h0200; // Final IFFT normalization (1/64 in hex)

// intermediate nodes
complex X[64], X1[64], X2[64], X3[64], X4[64], X5[64], Y[64];

// Stage 1: 32 butterfly instances in stage 1 of the DIT-IFFT
// (inputs are taken in bit-reversed order)
genvar i;
generate
    for (i = 0; i <= 1; i = i + 1) begin: butt1_
        butterfly butt1_1 (X[0+i],  X[32+i], W[0], X1[0+32*i],  X1[1+32*i]);
        butterfly butt1_2 (X[16+i], X[48+i], W[0], X1[2+32*i],  X1[3+32*i]);
        butterfly butt1_3 (X[8+i],  X[40+i], W[0], X1[4+32*i],  X1[5+32*i]);
        butterfly butt1_4 (X[24+i], X[56+i], W[0], X1[6+32*i],  X1[7+32*i]);
        butterfly butt1_5 (X[4+i],  X[36+i], W[0], X1[8+32*i],  X1[9+32*i]);
        butterfly butt1_6 (X[20+i], X[52+i], W[0], X1[10+32*i], X1[11+32*i]);
        butterfly butt1_7 (X[12+i], X[44+i], W[0], X1[12+32*i], X1[13+32*i]);
        butterfly butt1_8 (X[28+i], X[60+i], W[0], X1[14+32*i], X1[15+32*i]);
        butterfly butt1_9 (X[2+i],  X[34+i], W[0], X1[16+32*i], X1[17+32*i]);
        butterfly butt1_10(X[18+i], X[50+i], W[0], X1[18+32*i], X1[19+32*i]);
        butterfly butt1_11(X[10+i], X[42+i], W[0], X1[20+32*i], X1[21+32*i]);
        butterfly butt1_12(X[26+i], X[58+i], W[0], X1[22+32*i], X1[23+32*i]);
        butterfly butt1_13(X[6+i],  X[38+i], W[0], X1[24+32*i], X1[25+32*i]);
        butterfly butt1_14(X[22+i], X[54+i], W[0], X1[26+32*i], X1[27+32*i]);
        butterfly butt1_15(X[14+i], X[46+i], W[0], X1[28+32*i], X1[29+32*i]);
        butterfly butt1_16(X[30+i], X[62+i], W[0], X1[30+32*i], X1[31+32*i]);
    end
endgenerate

// BUTTERFLY STAGE 2
generate
    for (i = 0; i <= 15; i = i + 1) begin: butt2_
        butterfly butt2_1_(X1[4*i],   X1[4*i+2], W[0],  X2[4*i+0], X2[4*i+2]);
        butterfly butt2_2_(X1[4*i+1], X1[4*i+3], W[16], X2[4*i+1], X2[4*i+3]);
    end
endgenerate

// BUTTERFLY STAGE 3
generate
    for (i = 0; i <= 7; i = i + 1) begin: butt3_
        butterfly butt3_1_(X2[8*i],   X2[8*i+4], W[0],  X3[8*i+0], X3[8*i+4]);
        butterfly butt3_2_(X2[8*i+1], X2[8*i+5], W[8],  X3[8*i+1], X3[8*i+5]);
        butterfly butt3_3_(X2[8*i+2], X2[8*i+6], W[16], X3[8*i+2], X3[8*i+6]);
        butterfly butt3_4_(X2[8*i+3], X2[8*i+7], W[24], X3[8*i+3], X3[8*i+7]);
    end
endgenerate

// BUTTERFLY STAGE 4
generate
    for (i = 0; i <= 7; i = i + 1) begin: butt41_
        butterfly butt41_(X3[i],    X3[i+8],  W[i*4], X4[i],    X4[i+8]);
        butterfly butt42_(X3[i+16], X3[i+24], W[i*4], X4[i+16], X4[i+24]);
        butterfly butt43_(X3[i+32], X3[i+40], W[i*4], X4[i+32], X4[i+40]);
        butterfly butt44_(X3[i+48], X3[i+56], W[i*4], X4[i+48], X4[i+56]);
    end
endgenerate

// BUTTERFLY STAGE 5
generate
    for (i = 0; i <= 15; i = i + 1) begin: butt51_
        butterfly butt51_(X4[i],    X4[i+16], W[i*2], X5[i],    X5[i+16]);
        butterfly butt52_(X4[i+32], X4[i+48], W[i*2], X5[i+32], X5[i+48]);
    end
endgenerate

// BUTTERFLY STAGE 6
generate
    for (i = 0; i <= 31; i = i + 1) begin: butt6_
        butterfly butt6_(X5[i], X5[i+32], W[i], Y[i], Y[i+32]);
    end
endgenerate

// multiplication by 1/N
// (genvar i is already declared above; declaring it a second time is an error)
generate
    for (i = 0; i <= 63; i = i + 1) begin: mult16bit
        multiplier_16bit mult16bit_ (norm, Y[i].r, ifft_out[i].r);
        multiplier_16bit mult16bitx_(norm, Y[i].i, ifft_out[i].i);
    end
endgenerate

integer j;
parameter size = 128;
// Initializes the twiddle coefficient ROM file
parameter filein = "twiddle_ifft_64pt_fixed.txt"; // W[0].r, W[0].i, W[1].r, W[1].i ...
reg [15:0] read_mem[0:size-1]; // ROM implementation (future scope)

always @(posedge clk, negedge rst_n) begin
    if (~rst_n) begin
        // Reset state: reads the coefficients from the ROM file into the internal memory
        // initialize twiddle coefficients
        $readmemh(filein, read_mem, 0, size-1); // read coefficients from file
        for (j = 0; j < size/2; j = j + 1) begin
            W[j].r = read_mem[(2*j)];
            W[j].i = read_mem[(2*j+1)];
        end
    end
    else // compute IFFT
        X = ifft_in;
end

endmodule

VERIFICATION STRATEGY

Verification followed a bottom-up strategy. The fundamental building blocks of the 64-point IFFT top module (1-bit half-adder/full-adder, wallace4to2, wallace3to2, booth_multiplier_16bit, mult_prod_output, koggestoneadder_32bit) were tested stand-alone using exhaustive linear test benches. These blocks were then integrated to build the butterfly module (butterfly2), which was again tested stand-alone using random vectors fed from a file-I/O based test bench (file generated from MATLAB). This butterfly structure was then instantiated 192 times in 6 stages to form the wrapper for the 64-point IFFT core. Top-level testing was again done with the file-I/O based test bench. The input stimulus is dumped to a file by MATLAB, and this file is read by the IFFT_TOP test bench. The real and imaginary parts of the complex samples are picked up by the test bench and sent to the DUT (IFFT core). The output of the core is dumped to a file, and this dump is compared with the MATLAB dump for the same set of inputs.
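The file exchange described above can be sketched in Python as follows. The project itself uses MATLAB for this step; the sketch only illustrates the data format that $readmemh and $writememh work with (one 16-bit two's-complement hex word per line, real and imaginary parts interleaved) and a tolerance-based comparison. The helper names are illustrative, and the handling of comment and address lines in the simulator's output file is an assumption, since that format varies between tools.

```python
def to_hex16(value):
    """Signed integer -> 4-digit two's-complement hex word for $readmemh."""
    if not -32768 <= value <= 32767:
        raise ValueError("value does not fit in 16 bits")
    return format(value & 0xFFFF, "04x")


def from_hex16(text):
    """4-digit hex word -> signed 16-bit integer."""
    raw = int(text.strip(), 16)
    return raw - 0x10000 if raw & 0x8000 else raw


def stimulus_lines(samples):
    """List of (real, imag) pairs -> interleaved hex lines: X[0].r, X[0].i, X[1].r, ..."""
    lines = []
    for re, im in samples:
        lines.append(to_hex16(re))
        lines.append(to_hex16(im))
    return lines


def parse_output(lines):
    """Hex lines from the simulator dump -> list of (real, imag) pairs.

    Blank lines, '//' comments and '@address' lines are skipped (assumed format).
    """
    words = [
        from_hex16(line)
        for line in lines
        if line.strip() and not line.lstrip().startswith(("//", "@"))
    ]
    return list(zip(words[0::2], words[1::2]))


def max_deviation(dut, reference):
    """Largest absolute difference between DUT and reference, per component."""
    return max(
        max(abs(dr - rr), abs(di - ri))
        for (dr, di), (rr, ri) in zip(dut, reference)
    )


if __name__ == "__main__":
    lines = stimulus_lines([(1, 2), (3, -4)])
    print(lines)                # ['0001', '0002', '0003', 'fffc']
    print(parse_output(lines))  # [(1, 2), (3, -4)]
```

A comparison with a small tolerance (for example, accepting a deviation of one or two LSBs) is one way to automate the check despite the fixed-point rounding in the core.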

File-I/O Based Verification Strategy for the IFFT Core

// A testbench for the 64-point IFFT module
/********************************************************************************
This testbench module tests the top-level ifftcore_64pt module. It uses a
file-I/O based strategy for testing. The frequency-domain complex input
stimulus 'fft_input.txt' is generated from MATLAB and is used in the
testbench to provide complex data-type test inputs to the ifftcore_64pt
module. The resulting complex data-type output samples from the
ifftcore_64pt module are dumped to an output file 'fft_output.txt', which is
compared with the corresponding MATLAB IFFT output in the co-simulation
environment.
*********************************************************************************/
module tb_ifft64;

complex X[64], Y[64]; // complex inputs and outputs

// file I/O related
integer i;
parameter size = 128;                 // 64 * 2 samples: real and imaginary samples
parameter filein = "fft_input.txt";   // X[0].r, X[0].i, X[1].r, X[1].i ...
parameter fileout = "fft_output.txt"; // Y[0].r, Y[0].i, Y[1].r, Y[1].i ...
reg [15:0] read_mem[0:size-1];        // Memory for complex input samples
reg [15:0] write_out_mem[0:size-1];   // Memory for complex output samples
reg clk;
reg rst_n;

// instantiation of 64-point IFFT core
ifftcore_64pt ifft64pt(X, clk, rst_n, Y);

initial begin
    #0 clk = 1'b0;
    #0 rst_n = 1'b0;
    #0 // assign input values
    $readmemh(filein, read_mem, 0, size-1); // read complex samples from file dumped by MATLAB
    for (i = 0; i < size/2; i = i + 1) begin
        X[i].r = read_mem[(2*i)];   // Reading real and imaginary values into the input array
        X[i].i = read_mem[(2*i+1)];
    end
    #7 rst_n = 1'b1;
    // write output samples computed by the IFFT core into a file for verification
    #10
    for (i = 0; i < size/2; i = i + 1) begin
        // Writing real and imaginary values into the output array
        write_out_mem[(2*i)]   = Y[i].r;
        write_out_mem[(2*i+1)] = Y[i].i;
    end
    $writememh(fileout, write_out_mem, 0, size-1); // write output to file from the output array
    $finish;
end

always #2 // Driving clock with period 4 ns
    clk = ~clk;

endmodule

RTL SIMULATION SNAPSHOT

The final comparison is currently done manually, because the 16-bit fixed-point arithmetic in our System Verilog IFFT core introduces truncation and rounding errors, whereas MATLAB computes in double precision. However, the deviation is minimal, as shown below:

The following table compares the output of our System Verilog 64-point IFFT module with that of the MATLAB 64-point IFFT model:

INPUT   INPUT DATA         MATLAB IFFT MODEL OUTPUT   SYSTEM VERILOG IFFT CORE OUTPUT
PINS    Real  Imaginary    Real  Imaginary            Real  Imaginary
X[0]       1          2      32          0              32          0
X[1]       3          4     -22        -12             -22        -13
X[2]       5          6      -1        -10              -2        -11
X[3]       7          8      -7          0              -8         -1
X[4]       9         10      -1         -5              -2         -6
X[5]      11         12      -4          1              -5          1
X[6]      13         14      -1         -3              -2         -4
X[7]      15         16      -3          1              -3          1
X[8]      17         18      -1         -2              -2         -2
X[9]      19         20      -2          1              -2          1
X[10]     21         22      -1         -2              -2         -2
X[11]     23         24      -2          1              -2          1
X[12]     25         26      -1         -1              -2         -2
X[13]     27         28      -1          1              -2          1
X[14]     29         30      -1         -1              -2         -2
X[15]     31         32      -1          1              -2          1
X[16]     33         34      -1         -1              -2         -1
X[17]     35         36      -1          1              -1          1
X[18]     37         38      -1         -1              -2         -1
X[19]     39         40      -1          1              -1          1
X[20]     41         42      -1         -1              -2         -1
X[21]     43         44      -1          1              -1          1
X[22]     45         46      -1         -1              -2         -1
X[23]     47         48      -1          1              -1          1
X[24]     49         50      -1          0              -2         -1
X[25]     51         52       0          1              -1          1
X[26]     53         54      -1          0              -2         -1
X[27]     55         56       0          1              -1          1
X[28]     57         58      -1          0              -2         -1
X[29]     59         60       0          1              -1          1
X[30]     61         62      -1          0              -2         -1
X[31]     63         64       0          1              -1          1
X[32]      1         -2      -1          0              -1          0
X[33]      3         -4       0          1               0          1
X[34]      5         -6      -1          0              -2          0
X[35]      7         -8       0          1               0          1
X[36]      9        -10      -1          0              -2          0
X[37]     11        -12       0          1               0          1
X[38]     13        -14      -1          0              -2          0
X[39]     15        -16       0          1               0          1
X[40]     17        -18      -1          0              -2          0
X[41]     19        -20       1          1               0          1
X[42]     21        -22      -1          1              -2          1
X[43]     23        -24       1          1               1          1
X[44]     25        -26      -1          1              -2          1
X[45]     27        -28       1          1               1          1
X[46]     29        -30      -1          1              -2          1
X[47]     31        -32       1          1               1          1
X[48]     33        -34      -1          1              -1          1
X[49]     35        -36       1          1               1          1
X[50]     37        -38      -1          1              -2          1
X[51]     39        -40       1          1               1          1
X[52]     41        -42      -1          1              -2          2
X[53]     43        -44       2          1               2          1
X[54]     45        -46      -1          2              -2          2
X[55]     47        -48       2          1               2          1
X[56]     49        -50      -1          2              -1          2
X[57]     51        -52       3          1               3          1
X[58]     53        -54      -1          3              -2          3
X[59]     55        -56       4          1               4          1
X[60]     57        -58      -1          5              -1          5
X[61]     59        -60       7          0               7         -1
X[62]     61        -62      -1         10              -1         10
X[63]     63        -64      22        -12              22        -13
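To put a number on the deviation, the two output columns can be compared sample by sample. The Python sketch below does this for a handful of rows copied from the table; the function names are hypothetical and the row selection is only illustrative.

```python
# (pin, MATLAB (real, imag), System Verilog (real, imag)), copied from the table
ROWS = [
    ("X[0]",  (32, 0),    (32, 0)),
    ("X[1]",  (-22, -12), (-22, -13)),
    ("X[2]",  (-1, -10),  (-2, -11)),
    ("X[3]",  (-7, 0),    (-8, -1)),
    ("X[25]", (0, 1),     (-1, 1)),
    ("X[41]", (1, 1),     (0, 1)),
    ("X[52]", (-1, 1),    (-2, 2)),
    ("X[63]", (22, -12),  (22, -13)),
]


def max_deviation(rows):
    """Largest absolute difference over the real and imaginary parts."""
    return max(
        max(abs(ref[0] - dut[0]), abs(ref[1] - dut[1]))
        for _, ref, dut in rows
    )


def within_tolerance(rows, tol):
    """True when every sample of the core is within tol of the reference."""
    return max_deviation(rows) <= tol


print("max deviation:", max_deviation(ROWS))
print("within +/-1:", within_tolerance(ROWS, 1))
```

For the rows selected here, the core output never differs from the MATLAB value by more than 1 in either component.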

The line coverage for the top level was found to be 19.43%. (Testing is yet to be automated; this is planned for the next phase.)

Convergence between MATLAB and System Verilog Models:

MATLAB-SYSTEM VERILOG IFFT CO-SIMULATION

SYNTHESIS REPORTS

The IFFT top was synthesized using the following script (compile_ifft.tcl):

#/* Compile Script for Synopsys */

set my_verilog_files [list \
    ~/labs/ofdm/synthesis/full_adder_half_adder_1bit.v \
    ~/labs/ofdm/synthesis/koggestoneadder_32bit.v \
    ~/labs/ofdm/synthesis/booth_multiplier_16bit.v \
    ~/labs/ofdm/synthesis/mult_Product_output.v \
    ~/labs/ofdm/synthesis/wallace4to2.v \
    ~/labs/ofdm/synthesis/wallace3to2.v \
    ~/labs/ofdm/synthesis/butterfly_latest.v \
    ~/labs/ofdm/synthesis/ifft_64pt.v]

set my_toplevel ifftcore_64pt
set my_clock_pin clk
set my_clk_freq_MHz 10
set my_input_delay_ns 0.1
set my_output_delay_ns 0.1
set verilogout_show_unconnected_pins "true"
set hdlin_sv_ieee_assignment_patterns 2

analyze -f sverilog $my_verilog_files
elaborate $my_toplevel

set my_period [expr 1000 / $my_clk_freq_MHz]
set find_clock [ find port [list $my_clock_pin] ]
if { $find_clock != [list] } {
    set clk_name $my_clock_pin
    create_clock -period $my_period $clk_name
} else {
    set clk_name vclk
    create_clock -period $my_period -name $clk_name
}
set_input_delay $my_input_delay_ns -clock $clk_name [remove_from_collection [all_inputs] $my_clock_pin]
set_output_delay $my_output_delay_ns -clock $clk_name [all_outputs]

current_design wallace4to2
link
uniquify
compile

current_design wallace3to2
link
uniquify
compile

current_design kogge_stone_adder
link
uniquify
compile

set_dont_touch [get_designs wallace4to2]
set_dont_touch [get_designs wallace3to2]
set_dont_touch [get_designs kogge_stone_adder]

current_design butterfly
link
uniquify
compile
set_dont_touch [get_designs butterfly]

current_design multiplier_16bit
link
uniquify
compile
set_dont_touch [get_designs multiplier_16bit]

current_design $my_toplevel
set_scan_configuration -style multiplexed_flip_flop
compile -scan -exact_map -map_effort medium

check_design
report_constraint -all_violators

set_scan_configuration -replace false
create_test_protocol -infer_async -infer_clock
dft_drc
insert_dft
set_dft_signal -view spec -type ScanEnable -port test_se
set_dft_signal -view spec -type ScanDataIn -port test_si
set_drive 2 test_se
set_drive 2 test_si

report_constraint -all_violators

set filename [format "%s%s" $my_toplevel ".vh"]
write -format verilog -hierarchy -output $filename

set filename [format "%s%s" $my_toplevel ".sdc"]
write_sdc $filename

report_area > area.rpt
report_timing -significant_digits 4 -tran -delay max -max_path 10 -input -net > timing.rpt
report_power > power.rpt
quit

The following were the reports generated:

a. Area Report:

****************************************
Report : area
Design : ifftcore_64pt
****************************************

Number of ports:            4098
Number of nets:            33335
Number of cells:           15269
Number of references:          8

Combinational area:       1849277.203472
Noncombinational area:       9261.055664
Net Interconnect area:     642175.925509

Total cell area:          1858538.259136
Total area:               2500714.184646

b. Power Report

****************************************
Report : power
         -analysis_effort low
Design : ifftcore_64pt
****************************************

Global Operating Voltage = 0.95
Power-specific unit information :
    Voltage Units       = 1V
    Capacitance Units   = 1.000000pf
    Time Units          = 1ns
    Dynamic Power Units = 1mW    (derived from V,C,T units)
    Leakage Power Units = 1pW

  Cell Internal Power  = 620.4071 uW   (62%)
  Net Switching Power  = 372.7429 uW   (38%)
                         ---------
Total Dynamic Power    = 993.1500 uW  (100%)

Cell Leakage Power     =  10.3207 mW

c. Timing Report

The most timing critical path in the IFFT core:

****************************************
Report : timing
         -path full
         -delay max
Design : ifftcore_64pt
****************************************

Startpoint: ifft_in[63][r][11] (rising edge-triggered flip-flop clocked by clk)
Endpoint:   ifft_out[0][r][15] (output port clocked by clk)
Path Group: clk
Path Type:  max

clock clk (rise edge)                100.0000   100.0000
clock network delay (ideal)            0.0000   100.0000
output external delay                 -0.1000    99.9000
data required time                               99.9000
data arrival time                                89.7406
--------------------------------------------------------------------
slack (MET)                                      10.1594

Timing closed at 100ns (10 MHz)
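The figures in the timing report can be reproduced from the script constraints. The Python sketch below recomputes the clock period, the required time and the slack, and derives a rough upper bound on the clock frequency. That bound is our own back-of-the-envelope estimate, assuming the reported critical path and ideal clocking; it is not a number taken from the synthesis run.

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in ns for a frequency in MHz (1000 / f, as in the script)."""
    return 1000.0 / freq_mhz


def slack_ns(period: float, output_delay: float, arrival: float) -> float:
    """Slack = data required time - data arrival time."""
    required = period - output_delay
    return required - arrival


PERIOD = clock_period_ns(10)      # my_clk_freq_MHz = 10
OUTPUT_DELAY = 0.1                # my_output_delay_ns
ARRIVAL = 89.7406                 # data arrival time from the report

slack = slack_ns(PERIOD, OUTPUT_DELAY, ARRIVAL)

# Smallest period for which this path would still meet timing (slack = 0).
min_period = ARRIVAL + OUTPUT_DELAY
max_freq_mhz = 1000.0 / min_period

print(f"period     = {PERIOD:.4f} ns")
print(f"slack      = {slack:.4f} ns")
print(f"min period = {min_period:.4f} ns (about {max_freq_mhz:.2f} MHz)")
```

Under these assumptions the flat core would stop meeting timing slightly above 11 MHz, which is one reason pipelining the butterfly stages is of interest.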

CONCLUSION

Our IFFT core accepts 64 complex samples of 16-bit frequency-domain data from the system bus in parallel and generates the corresponding 64 complex samples of 16-bit time-domain data in parallel. The sections above have described the architecture of the core and its implementation in terms of its fundamental blocks (half adders, full adders, Wallace4to2, Wallace3to2, Kogge-Stone adder, Booth-encoded multiplier and butterfly structure). Verification is based on a File-I/O testbench scheme. We have explored the flat 6-stage architecture for the DIT-IFFT algorithm and observe that it synthesizes to a large area, but is power-efficient compared to recent implementations for 802.11 WLAN, and its timing results meet the 802.11 WLAN specifications by a wide margin. We therefore see scope to reduce the area of the IFFT core with other architectures, by trading off some of the power and speed gains.

NEXT STEPS FOR THE PROJECT:

- Fold the 6 butterfly stages into 1 stage, which will enable us to re-use the hardware. This can reduce the area of the IFFT core considerably compared to the present implementation, but it will also affect the timing of the critical path, as there will be extra muxing and inter-stage glue logic on that path.

- Pipeline the butterfly stages, so that we can further increase the operating frequency of the IFFT core.

- Automate the testing environment for rigorous testing of the IFFT core using random stimuli dumped by MATLAB. The co-simulation can also be automated: we plan to mask the 3 LSBs of the rounded MATLAB output and of the System Verilog core output before comparing them. The testbench can then produce a PASS/FAIL assertion without any human intervention.
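The planned PASS/FAIL check can be sketched in a few lines of Python. This is only an illustration of the masking idea on 16-bit two's-complement words; the function names are hypothetical and the sample values are taken from the comparison table.

```python
def mask_lsbs(value: int, nbits: int = 3) -> int:
    """Return the 16-bit two's-complement word with its nbits LSBs cleared."""
    word = value & 0xFFFF
    return word & ~((1 << nbits) - 1) & 0xFFFF


def masked_match(ref: int, dut: int, nbits: int = 3) -> bool:
    """True when both values agree after masking the nbits LSBs."""
    return mask_lsbs(ref, nbits) == mask_lsbs(dut, nbits)


def verdict(ref_samples, dut_samples, nbits: int = 3) -> str:
    """PASS when every (real, imag) pair matches after masking, else FAIL."""
    for (rr, ri), (dr, di) in zip(ref_samples, dut_samples):
        if not (masked_match(rr, dr, nbits) and masked_match(ri, di, nbits)):
            return "FAIL"
    return "PASS"


# X[2] and X[63] from the table: MATLAB vs. System Verilog
print(verdict([(-1, -10), (22, -12)], [(-2, -11), (22, -13)]))
# X[25] from the table: real part 0 vs. -1
print(verdict([(0, 1)], [(-1, 1)]))
```

One caveat this sketch exposes: masking compares bit patterns rather than distances, so two values that differ by a single LSB but straddle a multiple of 8 (for example 0 and -1, as in row X[25]) still mismatch. A tolerance check on the absolute difference is a possible alternative.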

POTENTIAL DESIGN ISSUES:

- Our 64-point IFFT core uses 16-bit fixed-point arithmetic. After multiplication and accumulation at every butterfly stage, the 32-bit results are therefore rounded to 16 bits before being passed to the next stage. This introduces rounding and truncation errors, which can accumulate through the 6 butterfly stages in the data path. After the final normalization of the IFFT samples, the rounded sample values can deviate from the expected output in the MATLAB dump. We have implemented a rounding scheme that compares the 15 LSBs of the 32-bit result with 0.5 (0x4000) and rounds non-negative samples up to the next integer. We need to explore better rounding/truncation schemes to ensure we don't lose precision in the final IFFT outputs.

- The testing environment currently reads a single stimulus file and has no mechanism for automated regressions. As a result our coverage is low, and we plan to automate the verification environment to boost the line coverage.

- The twiddle memory also needs to be tested exhaustively.
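The rounding scheme described in the first item can be modelled at the bit level. The Python sketch below mirrors our reading of the `assign p = ...` expression in the multiplier_16bit listing in the appendix; the function name round_to_16 is hypothetical.

```python
def round_to_16(p_temp: int) -> int:
    """Model of the 32-bit to 16-bit rounding in multiplier_16bit.

    p_temp is the 33-bit adder output. Bits [30:15] are kept, and 1 is added
    when the discarded bits [14:0] are >= 0x4000 and bit 30 (the sign of the
    kept field) is 0. The result is returned as a raw 16-bit word.
    """
    p_temp &= (1 << 33) - 1
    frac = p_temp & 0x7FFF            # p_temp[14:0]
    kept = (p_temp >> 15) & 0xFFFF    # p_temp[30:15]
    sign = (p_temp >> 30) & 1         # p_temp[30]
    if frac >= 0x4000 and sign == 0:
        kept = (kept + 1) & 0xFFFF    # wrap to 16 bits, as the assignment does
    return kept


print(hex(round_to_16(0x00003FFF)))   # fraction just below 0.5: truncated
print(hex(round_to_16(0x00004000)))   # fraction exactly 0.5: rounded up
print(hex(round_to_16(0x7FFFC000)))   # negative field: left truncated
```

The model makes the asymmetry visible: non-negative values round to nearest, while negative values are always truncated, which biases the result and is one candidate for the better scheme mentioned above.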

APPENDIX

/**********************************************************************
Module wallace3to2: This module computes the sum of three 32-bit vectors and
generates two 32-bit vectors, sum and carry.
inputs  : sum_ip1[31:0], sum_ip2[31:0], carry_ip1[31:0]
outputs : sum_out[31:0], carry_out[31:0]
***********************************************************************/

module wallace3to2(input  [31:0] sum_ip1, sum_ip2, carry_ip1,
                   output [31:0] sum_out, carry_out);
  genvar i;
  wire [32:0] carry_out_wr;

  // Computes the 3:2 compression of the inputs with 32 full-adder modules.
  generate
    for (i = 0; i <= 31; i = i+1) begin: fulladdsc32
      fulladd fasc32_(sum_out[i], carry_out_wr[i+1], sum_ip1[i], carry_ip1[i], sum_ip2[i]);
    end
  endgenerate

  assign carry_out_wr[0] = 0;
  assign carry_out[31:0] = carry_out_wr[31:0];
endmodule

/**********************************************************************
Module wallace4to2: This module computes the sum of four 32-bit vectors and
generates two 32-bit vectors, sum and carry.
inputs  : sum_ip1[31:0], sum_ip2[31:0], carry_ip1[31:0], carry_ip2[31:0]
outputs : sum_out[31:0], carry_out[31:0]
**********************************************************************/
module wallace4to2(input  [31:0] sum_ip1, sum_ip2, carry_ip1, carry_ip2,
                   output [31:0] sum_out, carry_out);
  wire [32:0] carry_out_wr;
  wire [32:0] carry1;
  wire [31:0] sum1;
  genvar i;   // declared once and shared by both generate loops

  // iteration 1
  generate
    for (i = 0; i <= 31; i = i+1) begin: fulladdsc32
      fulladd fasc32_(sum1[i], carry1[i+1], sum_ip1[i], carry_ip1[i], sum_ip2[i]);
    end
  endgenerate

  // iteration 2
  generate
    for (i = 0; i <= 31; i = i+1) begin: fulladdsc32_iter_2
      fulladd fasc32_iter_2_(sum_out[i], carry_out_wr[i+1], sum1[i], carry1[i], carry_ip2[i]);
    end
  endgenerate

  assign carry1[0]       = 0;
  assign carry_out_wr[0] = 0;
  assign carry_out[31:0] = carry_out_wr[31:0];
endmodule
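The two compressor modules above can be checked against a small bit-level model. The Python sketch below is our own reference model rather than part of the design: each compressor applies a full adder to every bit position and shifts the carries up by one, so the sum and carry outputs together preserve the total modulo 2^32.

```python
import random

MASK32 = 0xFFFFFFFF


def compress_3to2(a: int, b: int, c: int):
    """Carry-save 3:2 compression: per-bit sum, and majority carries shifted by 1."""
    s = (a ^ b ^ c) & MASK32
    carry = (((a & b) | (a & c) | (b & c)) << 1) & MASK32
    return s, carry


def compress_4to2(a: int, b: int, c: int, d: int):
    """4:2 compression built from two 3:2 steps, as in wallace4to2."""
    s1, c1 = compress_3to2(a, b, c)
    return compress_3to2(s1, c1, d)


# Check the invariant sum_out + carry_out == (sum of the inputs) mod 2^32.
random.seed(0)
for _ in range(1000):
    a, b, c, d = (random.getrandbits(32) for _ in range(4))
    s3, c3 = compress_3to2(a, b, c)
    assert (s3 + c3) & MASK32 == (a + b + c) & MASK32
    s4, c4 = compress_4to2(a, b, c, d)
    assert (s4 + c4) & MASK32 == (a + b + c + d) & MASK32
```

A model like this could supply expected values for a stand-alone testbench of the compressor blocks.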

/**********************************************************************
1-bit full-adder module fulladd : inputs : a, b, c_in    outputs : sum, c_out
1-bit half-adder module halfadd : inputs : a, b          outputs : sum, c_out
**********************************************************************/
// define 1-bit full adder
module fulladd(sum, c_out, a, b, c_in);
  output sum, c_out;   // I/O declaration, wire type by default
  input  a, b, c_in;
  wire   s1, c1, c2;   // internal nets
  xor(s1, a, b);
  and(c1, a, b);
  xor(sum, s1, c_in);
  and(c2, s1, c_in);
  or(c_out, c2, c1);
endmodule

// define 1-bit half adder
module halfadd(sum, c_out, a, b);
  output sum, c_out;   // I/O declaration, wire type by default
  input  a, b;
  xor(sum, a, b);
  and(c_out, a, b);
endmodule

/**********************************************************************
Module kogge_stone_adder: This module computes a fast 32-bit addition using the
high-speed Kogge-Stone prefix adder architecture. In the first stage, the
propagate and generate outputs are computed from the inputs a[31:0] and b[31:0].
In the following five stages, the group-propagate and group-generate outputs are
computed. In the last XOR stage, the final output is computed. The design is
optimized for CMOS implementation by using bubbled logic in alternate stages in
order to reduce the logic on the critical path.
Signed inputs : ai[31:0], bi[31:0]
Signed output : si[32:0]
**********************************************************************/
// 32-bit Kogge-Stone prefix adder
module xor_2(y, a, b);    // xor gate
  input a, b;
  output y;
  xor(y, a, b);
endmodule

module xnor_2(y, a, b);   // xnor gate
  input a, b;
  output y;
  xnor(y, a, b);
endmodule

// bubbled-output PG generate cell
module pg_generate_bubbled(_pi, _gi, ai, bi);
  output _pi, _gi;
  input ai, bi;
  xnor(_pi, ai, bi);
  nand(_gi, ai, bi);
endmodule

// Group PG generator
// Propagate and Generate for a group of signals (i:j)
module group_pg_generator(pij, gij, _pik, _pk_1j, _gik, _gk_1j);
  output pij, gij;
  input _pik, _pk_1j, _gik, _gk_1j;
  wire t1, t2;
  nor(t1, _pik, _gk_1j);
  not(t2, t1);
  nand(gij, t2, _gik);     // Gi:j = Gi:k + Pi:k . Gk-1:j
  nor(pij, _pik, _pk_1j);  // Pi:j = Pi:k and Pk-1:j
endmodule

// Group G generator
// Generate for a group of signals (i:j)
module group_g_generator(gij, _pik, _gik, _gk_1j);
  output gij;
  input _pik, _gik, _gk_1j;
  wire t1, t2;
  nor(t1, _pik, _gk_1j);
  not(t2, t1);
  nand(gij, t2, _gik);     // Gi:j = Gi:k + Pi:k . Gk-1:j
endmodule

// Group PG generator, bubbled
// Inverted Propagate and Generate for a group of signals (i:j)
module group_pg_generator_bubbled(_pij, _gij, pik, pk_1j, gik, gk_1j);
  output _pij, _gij;
  input pik, pk_1j, gik, gk_1j;
  wire t1, t2;
  nand(t1, pik, gk_1j);
  not(t2, t1);
  nor(_gij, t2, gik);      // _Gi:j = (Gi:k + Pi:k and Gk-1:j)'
  nand(_pij, pik, pk_1j);  // _Pi:j = (Pi:k and Pk-1:j)'
endmodule

// Group G generator, bubbled
// Inverted Generate for a group of signals (i:j)
module group_g_generator_bubbled(_gij, pik, gik, gk_1j);
  output _gij;
  input pik, gik, gk_1j;
  wire t1, t2;
  nand(t1, pik, gk_1j);
  not(t2, t1);
  nor(_gij, t2, gik);      // _Gi:j = (Gi:k + Pi:k and Gk-1:j)'
endmodule

// KOGGE STONE ADDER tree
module kogge_stone_adder(si, ai, bi);  // generates the 32-bit signed sum from 32-bit signed operands
  input  [31:0] ai, bi;
  output [32:0] si;
  wire   [31:0] _pi, _gi, p1_ij, g1_ij, _p2_ij, _g2_ij, p3_ij, g3_ij, _p4_ij, _g4_ij, p5_ij, g5_ij;
  genvar i;   // declared once and shared by all generate loops

  // bubbled PG generate stage
  // Inverted Propagate and Generate for each bit
  generate
    for (i = 0; i < 32; i = i+1) begin: pg_generate
      pg_generate_bubbled pg_gen_bubbled(_pi[i], _gi[i], ai[i], bi[i]);
    end
  endgenerate

  // Group PG generator stage 1
  // Propagate and Generate for a group of signals (i:j)
  assign g1_ij[0] = ~_gi[0];
  assign p1_ij[0] = 1'b0;
  group_g_generator group1_g_gen1(g1_ij[1], _pi[1], _gi[1], _gi[0]);
  assign p1_ij[1] = 1'b0;

  generate
    for (i = 2; i < 32; i = i+1) begin: group1_pg_generate
      group_pg_generator group1_pg_gen(p1_ij[i], g1_ij[i], _pi[i], _pi[i-1], _gi[i], _gi[i-1]);
    end
  endgenerate

  // Group PG generator stage 2
  // Inverted Propagate and Generate for a group of signals (i:j)
  assign _g2_ij[0] = ~g1_ij[0];
  assign _p2_ij[0] = 1'b0;
  assign _g2_ij[1] = ~g1_ij[1];
  assign _p2_ij[1] = 1'b0;

  group_g_generator_bubbled group2_g_gen_bubbled2(_g2_ij[2], p1_ij[2], g1_ij[2], g1_ij[0]);
  assign _p2_ij[2] = 1'b0;
  group_g_generator_bubbled group2_g_gen_bubbled3(_g2_ij[3], p1_ij[3], g1_ij[3], g1_ij[1]);
  assign _p2_ij[3] = 1'b0;

  generate
    for (i = 4; i < 32; i = i+1) begin: group2_pg_generate_bubbled
      group_pg_generator_bubbled group2_pg_gen_bubbled(_p2_ij[i], _g2_ij[i], p1_ij[i], p1_ij[i-2], g1_ij[i], g1_ij[i-2]);
    end
  endgenerate

  // Group PG generator stage 3
  // Propagate and Generate for a group of signals (i:j)
  assign g3_ij[0] = g1_ij[0];
  assign p3_ij[0] = 1'b0;
  assign g3_ij[1] = g1_ij[1];
  assign p3_ij[1] = 1'b0;
  assign g3_ij[2] = ~_g2_ij[2];
  assign p3_ij[2] = 1'b0;
  assign g3_ij[3] = ~_g2_ij[3];
  assign p3_ij[3] = 1'b0;

  generate
    for (i = 4; i < 8; i = i+1) begin: group3_g_generate
      group_g_generator group3_g_gen(g3_ij[i], _p2_ij[i], _g2_ij[i], _g2_ij[i-4]);
    end
  endgenerate

  assign p3_ij[4] = 1'b0;
  assign p3_ij[5] = 1'b0;
  assign p3_ij[6] = 1'b0;
  assign p3_ij[7] = 1'b0;

  generate
    for (i = 8; i < 32; i = i+1) begin: group3_pg_generate
      group_pg_generator group3_pg_gen(p3_ij[i], g3_ij[i], _p2_ij[i], _p2_ij[i-4], _g2_ij[i], _g2_ij[i-4]);
    end
  endgenerate

  // Group PG generator stage 4
  // Inverted Propagate and Generate for a group of signals (i:j)
  assign _g4_ij[0] = _g2_ij[0];
  assign _p4_ij[0] = 1'b0;
  assign _g4_ij[1] = _g2_ij[1];
  assign _p4_ij[1] = 1'b0;
  assign _g4_ij[2] = _g2_ij[2];
  assign _p4_ij[2] = 1'b0;
  assign _g4_ij[3] = _g2_ij[3];
  assign _p4_ij[3] = 1'b0;
  assign _g4_ij[4] = ~g3_ij[4];
  assign _p4_ij[4] = 1'b0;
  assign _g4_ij[5] = ~g3_ij[5];
  assign _p4_ij[5] = 1'b0;
  assign _g4_ij[6] = ~g3_ij[6];
  assign _p4_ij[6] = 1'b0;
  assign _g4_ij[7] = ~g3_ij[7];
  assign _p4_ij[7] = 1'b0;

  generate
    for (i = 8; i < 16; i = i+1) begin: group4_g_generate_bubbled
      group_g_generator_bubbled group4_g_gen_bubbled(_g4_ij[i], p3_ij[i], g3_ij[i], g3_ij[i-8]);
    end
  endgenerate

  assign _p4_ij[8]  = 1'b0;
  assign _p4_ij[9]  = 1'b0;
  assign _p4_ij[10] = 1'b0;
  assign _p4_ij[11] = 1'b0;
  assign _p4_ij[12] = 1'b0;
  assign _p4_ij[13] = 1'b0;
  assign _p4_ij[14] = 1'b0;
  assign _p4_ij[15] = 1'b0;

  generate
    for (i = 16; i < 32; i = i+1) begin: group4_pg_generate_bubbled
      group_pg_generator_bubbled group4_pg_gen_bubbled(_p4_ij[i], _g4_ij[i], p3_ij[i], p3_ij[i-8], g3_ij[i], g3_ij[i-8]);
    end
  endgenerate

  // stage 5
  // Generate for a group of signals (i:j)
  generate
    for (i = 16; i < 32; i = i+1) begin: group5_g_generate
      group_g_generator group5_g_gen(g5_ij[i], _p4_ij[i], _g4_ij[i], _g4_ij[i-16]);
    end
  endgenerate

  // stage 6
  // Final SUM GENERATION (SUM.i = P.i xor C.i-1)
  assign si[0] = ~_pi[0];

  generate
    for (i = 1; i < 17; i = i+1) begin: xor_2input
      xor_2 xor_2ip(si[i], _g4_ij[i-1], _pi[i]);
    end
  endgenerate

  generate
    for (i = 17; i < 32; i = i+1) begin: xnor_2input2
      xnor_2 xnor_2ip(si[i], g5_ij[i-1], _pi[i]);
    end
  endgenerate

  assign si[32] = g5_ij[31];
endmodule
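The prefix structure of the adder above can be summarized with a compact Python reference model. It is a sketch of the Kogge-Stone recurrence in plain (non-bubbled) logic, so it does not reproduce the inverted signals of the listing; the function name is hypothetical.

```python
import random


def kogge_stone_add(a: int, b: int, width: int = 32) -> int:
    """Add two unsigned width-bit values with a Kogge-Stone prefix network.

    Returns the (width+1)-bit result; the top bit is the carry out.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    p = a ^ b                 # bit-wise propagate
    g = a & b                 # bit-wise generate
    group_p, group_g = p, g
    dist = 1
    while dist < width:       # distances 1, 2, 4, 8, 16 for width = 32
        # G[i:j] = G[i:k] + P[i:k] . G[k-1:j], applied to all bits at once
        new_g = group_g | (group_p & (group_g << dist))
        # P[i:j] = P[i:k] . P[k-1:j]
        new_p = group_p & (group_p << dist)
        group_g, group_p = new_g & mask, new_p & mask
        dist *= 2
    carries = group_g << 1    # the carry into bit i is G[i-1:0]
    return p ^ carries        # sum.i = p.i xor c.i-1, carry out in the top bit


random.seed(0)
for _ in range(1000):
    x, y = random.getrandbits(32), random.getrandbits(32)
    assert kogge_stone_add(x, y) == x + y
```

For a 32-bit width the loop runs five times, matching the five group stages between the bit-wise PG stage and the final XOR stage of the listing.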

/**********************************************************************
Module multiplier_16bit: This module computes the 16-bit product and outputs the
32-bit result rounded to 16 bits.
Signed inputs : x[15:0], y[15:0]
Signed output : p[15:0]
**********************************************************************/
module multiplier_16bit(x, y, p);
  input  [15:0] x, y;
  output [15:0] p;
  reg [16:0] a, b, c, d, e, f, g, h;
  reg [31:0] u;
  wire [31:0] carry, sum;
  wire [32:0] p_temp;
  wire [14:0] s11;  wire [15:1] c11;
  wire [14:0] s12;  wire [15:1] c12;
  wire [11:0] s13;  wire [12:1] c13;
  wire [14:0] s21;  wire [15:1] c21;
  wire [15:0] s22;  wire [16:1] c22;
  wire [18:0] s31;  wire [19:1] c31;
  wire [31:0] s41;  wire [31:0] c41;

  ///// Code here is the same as in module boot16f, from line 6 to the last but 3 lines /////

  assign carry = {c41[31:1], 1'b0};
  assign sum   = s41;

  // vector merging
  kogge_stone_adder KSSSSS(p_temp, carry, sum);  // 32-bit prefix adder

  // rounding scheme for 32-bit to 16-bit conversion (future scope)
  assign p = (p_temp[14:0] >= 15'h4000)
               ? ((p_temp[30] == 1'b0) ? (p_temp[30:15] + 1) : p_temp[30:15])
               : p_temp[30:15];
endmodule
