FPGA implementation of Discrete Fractional Fourier Transform
M.V.N.V.Prasad†1, K. C. Ray†2 and A. S. Dhar‡
†Department of Electronics and Communication Engineering,
Indian Institute of Information Technology, Allahabad, Uttar Pradesh, India.
Email: [email protected], [email protected]
‡Department of Electronics and Electrical Communication Engineering,
Indian Institute of Technology, Kharagpur, West Bengal, India.
Email: [email protected]
Abstract– For decades, the fractional Fourier transform has received considerable attention for applications in signal and image processing. As the fractional Fourier transform and its discrete form have matured, real-time computation of the discrete fractional Fourier transform has become essential in those applications. In this context, we propose a new hardware architecture for implementing the Discrete Fractional Fourier Transform (DFrFT), which requires a hardware complexity of O(4N), where N is the transform order. The proposed architecture has been simulated and synthesized using Verilog HDL, targeting an FPGA device (XLV5LX110T). The simulation results are very close to the results obtained using MATLAB. The results show that this architecture can be operated at a maximum frequency of 217 MHz.
Keywords– Discrete Fractional Fourier Transform, Hardware Architecture, CORDIC and FPGA.
I. INTRODUCTION
The fractional Fourier transform [1], [2], [3] has emerged as a mathematical tool with a wide range of signal [4] and image processing applications, such as biomedical signal detection [6], image registration [7], image encryption [5], security of fingerprint image registration data [8], broadband beamforming of LFM signals [9], and moving target detection and location in spaceborne SAR.
Unlike the Discrete Fourier Transform (DFT), the Discrete Fractional Fourier Transform (DFrFT) has many definitions, such as the direct form, improved sampling type, linear combination type, eigenvector decomposition type [10], group theory type and impulse train type. Among these definitions, the eigenvector decomposition type is regarded as a legitimate definition [11] because it satisfies all the desired properties: unitarity, index additivity, reduction to the DFT when the fractional value is one, and approximation of the continuous fractional Fourier transform.
To the knowledge of the authors, no hardware architecture for real-time implementation of the DFrFT is available except [12]. In this paper, a new hardware architecture for implementing the DFrFT based on eigenvector decomposition is proposed and implemented on an FPGA device for real-time applications.
The rest of this paper is organized as follows. Section II presents a brief review of the fractional Fourier transform and its discrete form. Section III describes the proposed hardware architecture. Results and discussion of the proposed implementation are given in Section IV. Finally, Section V concludes the paper with the future scope of this work.
II. FRACTIONAL FOURIER TRANSFORM
A. Continuous Fractional Fourier Transform
The generalized Fourier transform rotates the signal f(u) in the time-frequency plane [1] by the rotation angle $\alpha = a\pi/2$ ('a' is the fractional value) and is given in the following equation:
$$f_\alpha(v) = \int_{-\infty}^{\infty} f(u)\, K_\alpha(u,v)\, du \qquad (1)$$
where
$$K_\alpha(u,v) = \begin{cases} \sqrt{\dfrac{1 - j\cot\alpha}{2\pi}}\; e^{\,j\frac{u^2+v^2}{2}\cot\alpha \,-\, j u v \csc\alpha} & \text{if } \alpha \text{ is not a multiple of } \pi \\ \delta(u-v) & \text{if } \alpha \text{ is a multiple of } 2\pi \\ \delta(u+v) & \text{if } \alpha+\pi \text{ is a multiple of } 2\pi \end{cases}$$
Here 'v' is the variable in the ath-order fractional domain and 'u' is the variable in the zeroth-order fractional domain. The kernel $K_\alpha(u,v)$ is decomposed as given in equation (2) in terms of Hermite-Gaussian functions [2], which are eigenfunctions of the Fourier transform. The decomposed kernel is
$$K_\alpha(u,v) = \sum_{k=0}^{\infty} \psi_k(v)\, e^{-j\alpha k}\, \psi_k(u) \qquad (2)$$
and
$$\psi_k(u) = \frac{2^{1/4}}{\sqrt{2^k k!}}\, H_k(\sqrt{2\pi}\,u)\, e^{-\pi u^2}$$
where $\psi_k(u)$ is the kth-order Hermite-Gaussian function and $H_k$ is the kth-order Hermite polynomial.
B. Discrete Fractional Fourier Transform
The discrete fractional Fourier transform has been proposed in [10] using discrete Hermite-Gaussian functions; for N points it is given in equation (3):
$$F^\alpha[m,n] = \sum_{k=0}^{N-1} u_k[m]\, e^{-j\alpha k}\, u_k[n] \qquad (3)$$
where $u_k[n]$ is the kth discrete Hermite-Gaussian function. The discrete values of the continuous Hermite-Gaussian function $\psi_k(v)$ are approximated using the eigenvectors of the commuting matrix S in [10]. The N-point DFrFT matrix for rotation angle α is defined [3] as $F^\alpha = U E U^T$, written out in equation (4), where U is the discrete Hermite-Gaussian matrix, which consists of the discrete Hermite-Gaussian functions as given in equation (5), and 'E' is a diagonal matrix whose diagonal elements are the eigenvalues $e^{-j0\alpha}, e^{-j1\alpha}, e^{-j2\alpha}, \ldots, e^{-j(N-2)\alpha}, e^{-jM\alpha}$ of the DFrFT matrix $F^\alpha$.
The response of an N-point DFrFT, $f_\alpha[n]$, to N input samples f[n] with rotation angle α can be calculated as $f_\alpha = F^\alpha f$, i.e. $(f_\alpha)_{N\times 1} = U_{N\times N} * (E_{N\times N} * (U^T_{N\times N} * f_{N\times 1}))$. Here * indicates matrix multiplication. For the proposed architecture, the matrix E is replaced with a column matrix C that contains the eigenvalues of the DFrFT for the given input angle α, and the middle matrix multiplication is replaced by an array (element-wise) multiplication. The modified expression is $(f_\alpha)_{N\times 1} = U_{N\times N} * (C_{N\times 1} \times (U^T_{N\times N} * f_{N\times 1}))$, where '×' indicates array multiplication.
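The replacement of the diagonal matrix E by a column of eigenvalues can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the 2×2 rotation matrix used for U is only a stand-in orthogonal matrix (an assumption for illustration, not a real discrete Hermite-Gaussian matrix), and the function names are hypothetical.

```python
import cmath
import math

def matvec(a, x):
    """Multiply matrix a (list of rows) by column vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def dfrft_diag(u, eig, f):
    """U * (E * (U^T * f)) with E built as a full diagonal matrix."""
    n = len(eig)
    e = [[eig[i] if i == j else 0 for j in range(n)] for i in range(n)]
    return matvec(u, matvec(e, matvec(transpose(u), f)))

def dfrft_array(u, eig, f):
    """U * (C x (U^T * f)) with C applied element by element."""
    g = matvec(transpose(u), f)
    h = [c * x for c, x in zip(eig, g)]
    return matvec(u, h)

# Stand-in orthogonal U (assumption): a plane rotation, NOT Hermite-Gaussian.
t = 0.3
U = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]
alpha = 0.7
# N = 2 is even, so the eigenvalue indices are 0 .. N-2 followed by M = N.
C = [cmath.exp(-1j * k * alpha) for k in (0, 2)]
f = [11 + 3j, 9 + 2j]

y_diag = dfrft_diag(U, C, f)
y_array = dfrft_array(U, C, f)
```

Both functions return the same vector, while the array form needs only N complex multiplications in the middle stage instead of an N×N matrix product.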
III. PROPOSAL OF DFRFT ARCHITECTURE
The proposed architecture is composed of three levels. The input data to be processed flows through all three serially connected levels, as shown in Fig. 1.
Level-I performs two mathematical operations: calculation of the eigenvalues for the given input rotation angle, and calculation of the response of the matrix UT to the input samples f. These two operations are carried out by two blocks of Level-I, named C and U1. This level passes the two computed results, the matrix C and UT*f, to Level-II, which multiplies the eigenvalues with the response of the U1 block and feeds the product C×(UT*f) to Level-III. Level-III produces the rotated input samples fα = U*(C×(UT*f)) as its output by a matrix multiplication between its input and the Hermite-Gaussian matrix U.
If the input samples are complex (f = a+jb), the response of the U1 block has to be calculated separately for the real and imaginary parts, so two U1 blocks are needed. Similarly, for any type of input samples f, two U2 blocks are required to process the real and imaginary inputs of Level-III separately, because the multiplication by the eigenvalues in Level-II produces complex values in general. For this reason, the blocks U1 and U2 are denoted as multi-blocks in Fig. 1. The data flow between these blocks is given in detail in Fig. 4. The time period between two successive input samples f is the same as the time period between two successive output results fα. The rest of this section presents a detailed description of each level of the proposed architecture.
Level-I:
In an N-point DFrFT, Level-I is partitioned into two parts. The first part calculates the eigenvalues for the given rotation angle (α) using a block named C, as shown in Fig. 2. This block receives an angle every N clock cycles and computes the corresponding N complex-conjugated eigenvalues. The results of block C for a given angle α are $e^{j0\alpha}, e^{j1\alpha}, e^{j2\alpha}, \ldots, e^{j(N-2)\alpha}, e^{jM\alpha}$, where M=N-1 for N odd and M=N for N even.
The architecture for calculating the eigenvalues requires two clocks: clock1 (Clk), whose frequency equals the sampling frequency, and clock2 (Clkn), whose frequency is 1/N of that of clock1. With an active-high enable signal, the counter counts in the sequence …0, 1, 2, …, N-2, M, 0, 1, 2…. The counter output is connected to a multiplier that takes the rotation angle as its other input through a register 'R1' clocked by clock2. The multiplier results 0, α, 2α, …, (N-2)α, Mα (M=N-1 for N odd, M=N for N even) are fed to the pipelined CORDIC (CO-ordinate Rotation DIgital Computer) through another register 'R2'.
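The counter sequence and the angles it produces can be illustrated with a short Python sketch. It is only a software model of the behaviour described above; the function names are hypothetical and `cmath.exp` stands in for the CORDIC.

```python
import cmath

def eigen_indices(n):
    """Counter sequence 0, 1, ..., N-2, M with M = N-1 (N odd) or M = N (N even)."""
    m = n - 1 if n % 2 == 1 else n
    return list(range(n - 1)) + [m]

def cordic_input_angles(n, alpha):
    """Multiplier outputs 0, alpha, 2*alpha, ..., (N-2)*alpha, M*alpha."""
    return [k * alpha for k in eigen_indices(n)]

def conjugated_eigenvalues(n, alpha):
    """Block C outputs e^{+j k alpha}; cmath.exp replaces the hardware CORDIC here."""
    return [cmath.exp(1j * theta) for theta in cordic_input_angles(n, alpha)]
```

For N=4 the index sequence skips 3 and ends with 4, whereas for N=5 it is the plain run 0 to 4.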
The CORDIC [15] calculates the cosine and sine values of its input angles, which are the real and imaginary parts of the complex-conjugated eigenvalues for the given rotation angle. The real and imaginary parts of the computed results pass to the real-part and imaginary-part output ports, respectively, through a set of registers, as shown in Fig. 2. The need for these registers is explained at the end of the Level-I description.
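As an illustration of how a CORDIC obtains cosine and sine from shift-and-add style iterations, the sketch below implements the textbook rotation-mode algorithm in floating-point Python. It is an assumption-laden model, not the paper's pipelined fixed-point design: the iteration count, the quadrant folding and the function name are all chosen here for illustration.

```python
import math

def cordic_cos_sin(theta, iterations=16):
    """Rotation-mode CORDIC returning (cos(theta), sin(theta)) approximately."""
    # Fold the angle into [-pi/2, pi/2], where the iterations converge.
    theta = math.remainder(theta, 2.0 * math.pi)  # now in [-pi, pi]
    sign = 1.0
    if theta > math.pi / 2:
        theta -= math.pi
        sign = -1.0
    elif theta < -math.pi / 2:
        theta += math.pi
        sign = -1.0
    # Elementary rotation angles atan(2^-i) and the accumulated gain correction.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        # In hardware the multiplications by 2^-i are wired shifts.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return sign * x, sign * y
```

Each loop iteration corresponds to one pipeline stage in a pipelined realization, which is why the latency of block C grows with the number of stages.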
The block 'U1' in the second part of Level-I multiplies the input values f by the matrix UT. This part consists of a mod-N counter, N ROMs with N address locations each, N multipliers, N accumulators, one N-to-1 multiplexer and a set of buffers. The data flow in this part is shown in Fig. 3. Like block 'C', the 'U1' block operates with two clocks, clock1 (Clk) and clock2 (Clkn). The N rows of the matrix UT are stored in the N ROMs. The arrangement of the rows of UT in the ROMs is shown in Table-I.
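$$F^\alpha = \sum_{\substack{k=0 \\ k \neq N \text{ for } N \text{ odd} \\ k \neq N-1 \text{ for } N \text{ even}}}^{N} u_k[n]\, e^{-j\alpha k}\, u_k^T[n] = U E U^T \qquad (4)$$
$$U = \begin{bmatrix} u_0[1] & u_1[1] & \cdots & u_{N-2}[1] & u_M[1] \\ u_0[2] & u_1[2] & \cdots & u_{N-2}[2] & u_M[2] \\ \vdots & \vdots & & \vdots & \vdots \\ u_0[N] & u_1[N] & \cdots & u_{N-2}[N] & u_M[N] \end{bmatrix} \qquad (5)$$
Here M=N-1 for N odd and M=N for N even.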
Fig. 1: Block diagram of the DFrFT. [Level-I contains blocks C and U1, Level-II contains block E and Level-III contains block U2; the inputs are 'f' and the rotation angle (α), and the output is the rotated input 'fα'.]
Fig. 2: Calculation of eigen values. [Counter (counts 0 to N-1 if N is odd; 0 to N-2, N if N is even), registers R1 and R2, multiplier, pipelined CORDIC calculating sine and cosine values, and output registers for the real and imaginary parts; clocks Clk and Clkn.]
TABLE-I
ARRANGEMENT OF THE ELEMENTS OF MATRIX "UT" IN ROMS
Address Location    ROM 1        ROM 2        . .    ROM N-1      ROM N
0                   UT[R+1,1]    UT[R+2,1]    . .    UT[R-1,1]    UT[R,1]
1                   UT[R+1,2]    UT[R+2,2]    . .    UT[R-1,2]    UT[R,2]
:                   :            :            . .    :            :
N-1                 UT[R+1,N]    UT[R+2,N]    . .    UT[R-1,N]    UT[R,N]
UT[k,l] indicates the element in the kth row and lth column of the UT matrix; R is the value of N/2 rounded towards zero.
The ROMs are accessed with a ring counter with an active-high enable signal, as shown in Fig. 3.
The data at the corresponding address location of all N ROMs proceed to the N multipliers, each of which takes the sampled input f[n] as its other input. At every clock1 cycle, the N multipliers multiply the sampled input with the output values of the corresponding N ROMs and forward the results to their N accumulators through registers, as shown in Fig. 3. Each accumulator adds its input to its stored output value on every clock1 cycle. Once the accumulators have each added N inputs, they send the resulting data to the next stage and are cleared so that they can add N fresh inputs. The N accumulator outputs pass through the N-to-1 multiplexer to a set of registers that operate with clock1 (Clk). The multiplexer selection line is connected to the counter output. The multiplexer inputs are connected to the N accumulator outputs in such a way that the 1st, 2nd, 3rd, …, Nth valid output values of the N-to-1 multiplexer are the 1st, 2nd, 3rd, …, Nth accumulator output values.
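The multiply-accumulate behaviour of block U1 can be modelled in a few lines of Python. This is a behavioural sketch only: it ignores the row rotation of Table-I, the pipeline registers and the clock domains, and the function name and the tiny example matrix are hypothetical.

```python
def u1_block(ut_rows, samples):
    """Stream one sample per clock1 cycle through N multiply-accumulate units.

    ut_rows[r] plays the role of the ROM feeding accumulator r, and the
    clock index is used as the ROM address (simplifying assumption).
    """
    n = len(samples)
    acc = [0] * n                               # N cleared accumulators
    for clk, f_n in enumerate(samples):         # one input sample per cycle
        for r in range(n):                      # the N multipliers work in parallel
            acc[r] += ut_rows[r][clk] * f_n
    # After N cycles the multiplexer reads the accumulators out one by one.
    return acc

# Tiny illustrative matrix (not a real Hermite-Gaussian matrix).
example = u1_block([[1, 2], [3, 4]], [5, 6])
```

After N cycles each accumulator holds one element of UT*f, so the block accepts a new sample on every clock1 cycle without stalling.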
Level-II has to execute a mathematical operation between the outputs of block C and block U1, so the computed results of both blocks must be forwarded to Level-II at the same time. However, the latency of block C varies with the number of pipeline stages used in the CORDIC, and the latency of block U1 depends on the value of N. To maintain the same latency for both blocks, a set of registers has to be inserted in either block C or block U1. The number of registers depends on the values of N and Ci, where Ci is the number of pipeline stages used in the CORDIC. If N>Ci–1, then N+1–Ci registers have to be added in block C, no additional register set is required in block U1, and the latency is L=N+3. If N<Ci–1, then Ci–(N+1) registers have to be added at the output of the multiplexer in block U1, no additional register set is required in block C, and the latency is L=Ci+2. If N=Ci–1, then the latency of both blocks is the same, no register set is required in either block, and the latency is L=N+3=Ci+2. The data flow from this block to the next blocks is shown in Fig. 4.
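The register-balancing rule can be written as a small helper. The sketch below follows the three cases stated above, taking the equal-latency case as N = Ci − 1 (the value at which N+3 equals Ci+2); the function and key names are hypothetical.

```python
def balance_latency(n, ci):
    """Return the extra registers per block and the resulting Level-I latency.

    n  : transform order N
    ci : number of CORDIC pipeline stages (Ci)
    """
    if n > ci - 1:
        # Block U1 is slower: delay block C by N+1-Ci registers.
        return {"regs_in_C": n + 1 - ci, "regs_in_U1": 0, "latency": n + 3}
    if n < ci - 1:
        # Block C is slower: delay block U1 by Ci-(N+1) registers.
        return {"regs_in_C": 0, "regs_in_U1": ci - (n + 1), "latency": ci + 2}
    # N == Ci-1: both blocks already have the same latency.
    return {"regs_in_C": 0, "regs_in_U1": 0, "latency": n + 3}
```

In every case the number of added registers equals |N+1−Ci|, which is the term that appears in the register count of Table-II.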
Level-II:
Level-II has a complex multiplier followed by two serial-in parallel-out shift registers and a set of 2N registers. This level receives the real and imaginary parts of the complex-conjugated eigenvalues from block C through its Real2 (R2) and Imag.2 (I2) ports, respectively, and the response of block U1 to the input samples 'f' through its other two input ports, Real1 (R1) and Imag.1 (I1). The block diagram is shown in Fig. 5.
Fig. 3: The data flow for part II of level-I. [Counter, ROM 1 to ROM N with addresses 0 to N-1, N multipliers, registers, Accumulator 1 to Accumulator N with clear inputs, and an N-to-1 MUX; input f(n); clocks Clk and Clkn.]
Fig. 4: Data flow diagram of DFrFT. [Block 'C' with the rotation angle α as input; two 'U1' blocks for Real(f) and Imag.(f); block E with ports Real1, Imag.1, Real2 and Imag.2; 'U2' for the real part and 'U2' for the imaginary part producing Real(fα[n]) and Imag.(fα[n]).]
Fig. 5: Complex multiplier with shifting operation. [Inputs Real1 (R1), Real2 (R2), Imag.2 (I2) and Imag.1 (I1); four multipliers, one adder and one subtractor producing Out1 and Out2; two serial-in parallel-out shift registers; 2N registers delivering the real-part and imaginary-part outputs.]
The complex multiplier of this block differs from an ordinary complex multiplier. It multiplies the eigenvalues with the results of block U1 by taking the complex conjugates of the eigenvalues and the results of block U1 as inputs. In every clock cycle, the complex multiplier multiplies a new pair of complex numbers. The outputs out1 and out2 of the complex multiplier deliver the results of the computations (R1×R2) + (I1×I2) and (R2×I1) – (R1×I2), respectively.
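The two output expressions amount to multiplying the U1 result by the conjugate of the value on the (R2, I2) ports. A minimal Python check, with a hypothetical function name and arbitrarily chosen test values, is sketched below.

```python
import cmath

def level2_multiplier(r1, i1, r2, i2):
    """Four real multiplications, one addition and one subtraction, as in Level-II."""
    out1 = r1 * r2 + i1 * i2   # real part
    out2 = r2 * i1 - r1 * i2   # imaginary part
    return out1, out2

# Arbitrary illustrative values: g models one U1 result, c one output of block C.
g = 3.0 + 4.0j
k, alpha = 2, 0.7
c = cmath.exp(1j * k * alpha)            # conjugated eigenvalue e^{+jk*alpha}
out1, out2 = level2_multiplier(g.real, g.imag, c.real, c.imag)
expected = g * cmath.exp(-1j * k * alpha)  # product with the true eigenvalue
```

Because block C delivers e^{+jkα}, the conjugating multiplier yields the product with the actual eigenvalue e^{−jkα} without a separate negation stage.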
The two resulting outputs, one real part and one imaginary part, are connected to two serial-in parallel-out shift registers. Each shift register requires N-1 registers. Every N-1 clock cycles, the complex multiplier passes N-1 results to each serial-in parallel-out shift register. The shift registers are followed by a set of 2N registers. The first and (N+1)th registers are connected to the real and imaginary outputs of the complex multiplier, respectively. The remaining registers 2 to N and N+2 to 2N are connected to the N-1 outputs of the first shift register (corresponding to out1) and the N-1 outputs of the second shift register (corresponding to out2), respectively, as shown in Fig. 5. These registers operate with clock2, unlike the shift registers, which operate with clock1.
Level-III:
Level-III performs another matrix multiplication on the outputs of Level-II. The signal flow graph is shown in Fig. 6.
This level has N ROMs, each of which stores a column of the N×N matrix U. Because all ROMs are accessed using the counter output of block U1, the matrix elements are arranged in the ROMs as follows to keep the ROM output values synchronized with the multiplier input values: address 0 of all ROMs holds the rth row of matrix U, where r is the remainder of (N+L)/N. Address 1 of the N ROMs stores the next row of the matrix, and the remaining locations follow the same sequence. Following this sequence, the (r-1)th memory location stores the first row of the matrix. When the counter counts k, the data in the kth memory locations of the N ROMs are multiplied with the output values of Level-II, as shown in Fig. 6. The N resulting multiplier outputs are added using N-1 adders and sent out as the input samples rotated in the time-frequency plane by the given angle α.
IV. RESULT AND DISCUSSION
The proposed architecture discussed in the previous section was designed in Verilog HDL for order N equal to four. The design was simulated using the Xilinx simulator with random input samples f(n) = [11+3i, 9+2i, 7+4i, 8+2i] as the test vector. For simplicity and to make the outputs of the design easy to check, integer input values were chosen, represented with five bits (one sign bit and four integer bits). The internal precision of each block was chosen to avoid truncation error as far as possible. Finally, the outputs are given in 16-bit format (one sign bit, four integer bits and eleven fractional bits). Similarly, for the rotation angle α, the format uses the binary weights $[-\pi,\ \pi/2^1,\ \pi/2^2,\ \pi/2^3,\ \ldots,\ \pi/2^{b-1}]$; in this case b=16. The hardware complexity of the proposed design for an Nth-order DFrFT is summarized in Table-II. The design is pipelined; hence it has a latency of L+N+1, where L is the latency of the CORDIC.
TABLE-II
HARDWARE REQUIREMENT FOR N-POINT DFRFT
Component Name            Number of Components
N×16N-bit ROM             2
Multipliers               4N+5
Adders/Subtractors        4N + adders in CORDIC
N-to-1 Multiplexers       2
Counters                  2
Registers                 10N+6+Ci+2×(|N+1-Ci|)
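The component counts of Table-II are simple functions of N and can be evaluated with a short helper. In the sketch below, the function name and dictionary keys are hypothetical, and the number of CORDIC stages `ci` and the number of CORDIC adders `cordic_adders` are free parameters because the paper does not state their values.

```python
def hardware_counts(n, ci, cordic_adders):
    """Evaluate the Table-II expressions for an N-point DFrFT."""
    return {
        "roms": 2,
        "multipliers": 4 * n + 5,
        "adders_subtractors": 4 * n + cordic_adders,
        "multiplexers": 2,
        "counters": 2,
        "registers": 10 * n + 6 + ci + 2 * abs(n + 1 - ci),
    }

# ci=16 and cordic_adders=48 are assumed values, chosen only because they
# reproduce the 1024-point figures listed in Table-V.
counts_1024 = hardware_counts(1024, ci=16, cordic_adders=48)
```

The multiplier count depends on N alone, giving 21 for N=4 and 4101 for N=1024, in agreement with Table-IV and Table-V.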
The simulation output of ISE 10.1i is presented in Table-III, which shows that the Verilog HDL simulation results of the proposed design are close to the MATLAB simulation outputs. The simulated output with timing is shown in Fig. 7. It shows that the proposed architecture has a latency of 19 clock cycles (14 clock cycles for Level-I and 5 clock cycles for Level-II and Level-III together, as discussed in the previous section). Finally, the proposed design was synthesized using the Xilinx XST tool, targeting an FPGA device (XLV5LX110T) [15]. The synthesis results for the hardware are presented in Table-IV.
Fig. 6: Signal flow graph for level-III. [Block U2: ROM 1 to ROM N with addresses 0 to N-1 driven by the count input, N multipliers taking Input 1 to Input N and Data 1 to Data N, and an adder producing fα.]
Fig.7: The Simulation Results of proposed DFRFT architecture using ‘Xilinx ISE’ Simulator
TABLE-III
COMPARISON OF MATLAB AND XILINX-ISE SIMULATION RESULTS
MATLAB Simulation Results      Xilinx-ISE Simulation Results of Proposed Architecture
Decimal values                 Decimal (Hexadecimal) values
10.5406+4.2159i                10.7929+3.6176i (5658+1CF1i)
9.0500+2.2435i                 9.0234+2.1074i (4830+10DCi)
7.0371+3.6100i                 7.0151+3.8052i (381F+1E71i)
8.0513+2.1945i                 8.0234+2.1074i (4030+10DCi)
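The hexadecimal and decimal columns of Table-III are related by the 16-bit output format with eleven fractional bits. The Python sketch below converts between the two; treating negative values as two's complement is an assumption (the paper only states that one bit is used for the sign), and the function names are hypothetical.

```python
def q4_11_to_float(word):
    """Interpret a 16-bit word as 1 sign bit, 4 integer bits and 11 fractional bits.

    Two's-complement encoding of negative values is assumed here.
    """
    if word & 0x8000:
        word -= 0x10000
    return word / 2048.0          # 2**11 = 2048 steps per unit

def float_to_q4_11(value):
    """Quantize a real value to the same 16-bit format (round to nearest)."""
    return round(value * 2048) & 0xFFFF

first_real = q4_11_to_float(0x5658)   # real part of the first ISE output
first_imag = q4_11_to_float(0x1CF1)   # imaginary part of the first ISE output
```

With eleven fractional bits the resolution is 1/2048, about 0.0005, which bounds the quantization error of each output word.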
TABLE-IV
HDL SYNTHESIS REPORT- MACRO STATISTICS
Component Name            Number of Components
4×64-bit ROM              2
Multipliers               21
Adders/Subtractors        56
4-to-1 Multiplexers       2
Counters                  2
Registers                 71
Accumulators              9
The synthesis report in this table shows that the synthesized hardware requirement is approximately the same as the theoretical one. The timing report of this implementation shows that the proposed design can be operated at a maximum frequency of 217 MHz. The proposed architecture has been compared with the architecture presented in [12] for N=1024. The comparison of hardware and timing is given in Table-V.
TABLE-V
COMPARISON OF PROPOSED ARCHITECTURE WITH [12] FOR 1024-POINT DFRFT
Hardware requirement (number of components)
Component Name            Architecture in [12]                   Proposed Architecture
Multipliers               1048576                                4101
Adders/Subtractors        1048576                                4144
Registers                 5242880                                12280
Multiplexers              3072 (2:1 Mux), 3072 (4:1 Mux)         2 (1024:1 Mux)
Counters                  Not mentioned                          2
Timing details
Maximum speed             99.58 MHz                              217.39 MHz
Sampling frequency        33.00 MHz                              217.39 MHz
This shows that the proposed design is better in terms of hardware complexity and timing than the architecture presented in [12].
V. CONCLUSION
In this paper, a new hardware architecture for computing the DFrFT has been proposed. The architecture has been described in Verilog HDL, synthesized and implemented on the targeted FPGA device (XLV5LX110T). The simulation results have been verified against MATLAB, and the implementation results have been compared with the theoretical results and with the existing architecture presented in [12]. The implementation results show that the proposed design is suitable for most signal processing, image processing and communication systems. The proposed architecture and its implementation are fixed in terms of the transform length N, which restricts them to specific applications. A flexible architecture is required to meet the demands of all applications. In this context, the authors are working on a unified architecture suitable for all applications.
REFERENCES
[1] L. B. Almeida, "The Fractional Fourier Transform and Time-Frequency Representations", IEEE Trans. on Sig. Process., vol. 42, pp. 3084-3090, November 1994.
[2] V. Namias, "The Fractional Order Fourier Transform and its Application to Quantum Mechanics", J. Inst. Math. Appl., vol. 25, pp. 241-265, August 1980.
[3] S. C. Pei, C. C. Tseng, M. H. Yeh, and J. J. Shyu, "Discrete fractional Hartley and Fourier transforms", IEEE Trans. Circuits Syst. II, vol. 45, pp. 665-675, 1998.
[4] H. M. Ozaktas, B. Barshan, D. Mendlovic, L. Onural, "Convolution, filtering, and multiplexing in fractional Fourier domains and their relation to chirp and wavelet transform", J. Opt. Soc. Am. A, vol. 11, pp. 547-559, February 1994.
[5] N. Zhou, T. Dong, "Optical image encryption scheme based on multiple parameter random fractional Fourier transform", 2009 Second Int. Symposium on Electronic Commerce and Security, pp. 48-51, 2009.
[6] Y. Zhang, Q. Zhang, S. Wu, "Biomedical signal detection based on Fractional Fourier Transform", IEEE, ITAB 2008, pp. 349-352, May 2008.
[7] W. Pan, K. Qin, Y. Chen, "An Adaptable-Multilayer Fractional Fourier Transform Approach for Image Registration", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 31, March 2009.
[8] R. Iwai, H. Yoshimura, "Security of registration data of fingerprint image with a server by use of the fractional Fourier transform", IEEE, ICSP2008 Proceedings, pp. 2070-2073, 2008.
[9] Wu Hai-zhai, Tao Ran, "Broadband Beamforming of LFM signal based on Fractional Fourier Transform", ICSP2008 Proceedings, pp. 296-298, 2008.
[10] C. Candan, M. A. Kutay, H. M. Ozaktas, "The Discrete Fractional Fourier Transform", IEEE Trans. on Sig. Process., vol. 48, pp. 1329-1337, May 2000.
[11] T. Ran, Z. Feng, W. Yue, "Research progress on discretization of fractional Fourier transform", Springer, Sci. China Ser. F-Inf. Sci., pp. 859-880, July 2008.
[12] P. Sinha, S. Sarkar, A. Sinha, D. Basu, "Architecture of a configurable Centered Discrete Fractional Fourier Transform Processor", IEEE Circuits and Systems, MWSCAS 2007, 50th Midwest Symposium, pp. 329-332, 2007.
[13] S. C. Pei, W. L. Hsue, J. J. Ding, "Discrete Fractional Fourier Transform Based on New Nearly Tridiagonal Commuting Matrices", IEEE Trans. on Signal Processing, vol. 54, pp. 3815-3828, October 2006.
[14] K. C. Ray and A. S. Dhar, "CORDIC-based unified VLSI architecture for implementing window functions for real time spectral analysis", IEE Proc.-Circuits Devices Syst., vol. 153, pp. 539-544, December 2006.
[15] Xilinx, "Virtex-5 FPGA User Guide", UG190 (v4.7), May 1, 2009.