VLSI Implementation of a High-Speed Iterative …luethi/study/publications/ISCAS07.pdfVLSI Implementation of a High-Speed Iterative Sorted MMSE QR Decomposition P. Luethi, A. Burg,

VLSI Implementation of a High-SpeedIterative Sorted MMSE QR Decomposition

P. Luethi, A. Burg, S. Haene, D. Perels, N. Felber and W. FichtnerIntegrated Systems Laboratory, ETH Zurich, Switzerland{luethi,apburg,haene,perels,felber,fw}@iis.ee.ethz.ch

Abstract— The QR decomposition is an important, but oftenunderestimated prerequisite for pseudo- or non-linear detectionmethods such as successive interference cancellation or sphere de-coding for multiple-input multiple-output (MIMO) systems. Theability of concurrent iterative sorting during the QR decomposi-tion introduces a moderate overall latency, but provides the basefor an improved layered stream decoding. This paper describesthe architecture and results of the first VLSI implementation ofan iterative sorted QR decomposition preprocessor for MIMOreceivers. The presented architecture performs MIMO channelpreprocessing using Givens rotations in order to compute theminimum mean squared error QR decomposition.

I. INTRODUCTION

Multiple-input multiple-output (MIMO) is considered tobe one of the key technologies for enabling next-generationhigh-speed wireless communication. MIMO systems employmultiple antennas at both the transmitter and the receiverin order to increase data rate and link reliability. In spatialmultiplexing mode, MIMO systems reach higher throughputsby transmitting multiple data streams in parallel in the samefrequency band without additional expense in bandwidth ortransmit power. Consequently, numerous upcoming wirelesscommunication standards such as IEEE 802.11n or IEEE802.16 will employ MIMO technology. Unfortunately, the con-siderable throughput improvements enabled by MIMO systemsentail a significant increase in signal processing complexity,especially for separating the multiple, parallel data streams.

QR decomposition (QRD) is one of the key algorithmsemployed in MIMO systems, since numerous MIMO detectionmethods require QR decomposition of the channel matrix asstarting point. The application of QRD ranges from linearMIMO detection methods to ordered successive interferencecancellation (OSIC) and to tree-search algorithms [1], [2]with close-to maximum likelihood bit error rate (BER) perfor-mance. Sorted QR decomposition (SQRD) in conjunction withsuccessive interference cancellation (SIC) provides a consid-erable improved BER performance compared to unsorted SIC,while consuming only little additional hardware resources.

Contribution: In this work, we present - to the best of ourknowledge - the first VLSI implementation of a 4×4 ma-trix preprocessor performing iterative sorted minimum meansquared error (MMSE) QR decomposition for OSIC or fortree-search algorithms. The described VLSI architecture in-corporates optimized fixed-point arithmetic and shows how acombination of CORDIC circuits and complex-valued multi-pliers allows to achieve a very high throughput with low silicon

area. The implemented reference design in a 0.25 µm tech-nology processes 1.56 million complex-valued 4×4 channelmatrices per second maintaining close-to floating-point BERperformance for OSIC detection up to a signal to noise ratio(SNR) of 40 dB.

Outline: The remainder of this section introduces the sys-tem model and motivates the development of a high-speed QRdecomposition circuit. Sec. II introduces the sorted MMSEQR decomposition algorithm under consideration. Sec. IIIdescribes the high-level VLSI architecture of the SQRD unitand the corresponding micro-architectural choices for the im-plementation of the arithmetic units. Complexity/performancetrade-offs associated with different sorting strategies and fixed-point considerations are discussed in Sec. IV and implemen-tation results are presented in Sec. V.

Notation: Bold uppercase and lowercase letters representmatrices and column vectors, respectively. The ith columnvector of H is denoted by hi, and Hi,j stands for the elementin row i and column j of H.

A. System ModelThe system under consideration is a MIMO system with MT

transmit and MR receive antennas. The matrix H describes theMIMO channel, the MT × 1 transmit signal vector is denotedby s = [s1 . . . sMT

]T , and the vector n represents the additivezero-mean i.i.d. Gaussian noise with variance σ2

n per complexdimension. The transmitted vector symbol is normalized suchthat E{ssH} = IMT

. The corresponding receive signal vectory = [y1 . . . yMR

]T is given by

y = Hs + n (1)

and the signal to noise ratio is defined as SNR = 1/σ2n.

B. MIMO Detection based on QR DecompositionThe task of the MIMO receiver is to recover s from y,

assuming knowledge of the channel H. To this end, manycomputationally efficient MIMO detection algorithms startby decomposing H into a unitary matrix Q and an upper-triangular matrix R using QR decomposition such that H =QR. The transformation y = QHy then transforms theproblem in (1) into

y = Rs + n, (2)

which is fully equivalent, but can be solved with significantlyreduced computational complexity (for example through back-substitution or sphere decoding).

The MMSE QR decomposition is a slight modification tothe QR decomposition of H required for example for zero-forcing linear or OSIC detection. The basic idea is to takethe additive noise into account by considering an augmentedchannel matrix H =

[HT σnI

]Tto obtain Q = Qa and

R such that [Qa Qc

Qb Qd

] [R0

]=

[HσnI

]. (3)

The QR decomposition of the augmented channel matrixhas a computational complexity that is roughly 50% highercompared to the QR decomposition of H [3]. However, thealgorithm achieves an improvement in BER performance withlinear detection and a significant complexity reduction withtree-search based MIMO detection algorithms.

II. SORTED MMSE QR DECOMPOSITION ALGORITHM

The MMSE QR decomposition of H can be performedthrough the Gram-Schmidt [4] orthogonalization or througha sequence of unitary transformations. The main advantageof unitary transformations resides in the fact that one canemploy vector rotations as atomic operations which preservethe total power of the operands. Hence, the dynamic rangeof all variables is tightly bounded and the algorithm is wellsuited for fixed-point arithmetic.

To simplify the notation, the composite matrix

Z(0) =[

H IMR

σnIMT0

](4)

is introduced. Each Givens rotation, described by the matrixΘi, is designed to selectively zero a single entry of Z(i). Inorder to upper-triangularize the left half of Z(0), the iterationZ(i) = ΘiZ(i−1) is applied, which ultimately yields

Z(N) = ΘN . . .Θ1Z(0) =[

R QHa

0 QHc

], (5)

where the nulling proceeds first row-by-row and then column-by-column as illustrated in Fig. 1 for i = 1, 2, . . . , N . Therelevant parts of Z(N) for further MIMO processing are Rand QH

a . The sorting follows the original SQRD algorithmproposed in in [5]. However, while the original description wasbased on a modified Gram-Schmidt procedure, it is adaptedin Alg. 1 to be used with Givens rotations.

III. VLSI ARCHITECTURE

The application field of the circuit developed in this paperare MIMO-OFDM systems where SQRD must be performedon a large number of channel matrices in a short time [6].The subsequent architectural considerations target thereforethe high-speed region of the design-space.

A. Resource Allocation and SchedulingFor the design of a suitable high-level VLSI architecture,

we start by identifying the two types of atomic operationsrequired for Givens rotations, in order to finally constitute theQR decomposition algorithm described in the previous section:Vectoring subsumes the computation of Θi and the associatednulling of the corresponding entry in Z(i). Rotation refers to

Algorithm 1 MMSE-SQRD based on Givens rotations

1: Z = Z(0), p = [1 . . .MT ]2: for j = 1, . . . ,MT do3: ξj = ‖hj‖2 initial column norms4: end for5: for i = 1, . . . ,MT do6: k = arg minj=i,...,MT

ξj

7: exchange columns i and k in order array p and in thefirst MR + i− 1 rows of Z

8: compute a series of Givens rotations Θu such that rowsi + MR, . . . , i + 1 of column zi become zero:Z =

(∏iMR

u=(i−1)MR+1 Θu

)Z

9: for j = i, . . . , MT do10: ξj = ξj − ‖Zi,j‖2 update column norms11: end for12: end for

the application of Θi to an individual column of Z(i), in whichonly two entries are affected by the transformation.

As can be seen from the illustration in Fig. 1a, each vec-toring operation is followed by multiple rotation operations.Hence, the number of vector rotations exceeds by far thenumber of vectoring operations, even if explicit rotation isavoided when the affected entries of Z(i) are a-priori knownto be zero. Since a VLSI architecture for high-speed ma-trix processing needs super-scalar execution units, dedicatedhardware resources are allocated for vectoring and rotation,allowing both operations to be carried out concurrently asillustrated in Fig. 1b. The rotation circuit is optimized forspeed since the number of operations to be carried out is large.The vectoring circuit needs to perform fewer operations, it cantherefore be designed to require less silicon area by applyingiterative decomposition. These dedicated VLSI optimizationsdo not affect the overall throughput of the QR decompositionunit, because the total processing time for the area-optimizedvectoring operations can be hidden behind the large numberof rotation operations.

C C C 0 0 1

C C C 0 1 0

C C C 1 0 0

R 0 0 0 0 0

0 R 0 0 0 0

0 0 R 0 0 0

R C C C C C

0 R C C C C

0 0 C C C C

0 0 R C C 0

0 0 0 C C 0

0 0 0 C 0 0

cycles

i = 1 i = N

Vectoring

Rotation

i = 1

C C C 0 0 1

C C C 0 1 0

R C C C 0 0

0 C C C 0 0

0 R 0 0 0 0

0 0 R 0 0 0

· · ·

· · ·

· · ·

R C C C C C

0 R C C C C

0 0 R C C C

0 0 0 C C C

0 0 0 C C 0

0 0 0 C 0 0

completei = 2a)

b)

i = 2 i = 3 i = N

Fig. 1. a) Illustration of the MMSE QR decomposition sequence. The initialmatrix Z(0) is shown at the left, the final result Z(N) of the decompositionat the right. b) Parallel processing of vectoring and rotation in the proposedVLSI architecture.

B. Implementation of Givens Rotations

The two basic operations for Givens rotations, vectoringand rotation, can both be implemented using conventionalarithmetic or dedicated CORDIC circuits [7]. CORDICs are

a well-established method to implement Givens rotations inhardware. In short, the concept of the CORDIC algorithm isto decompose the rotation of a vector into a series of microrotations by applying shift and add operations. This sequenceof shift and add operations is first determined by the vector-ing block, and afterwards executed similarly by the rotationblock. A more detailed analysis shows that CORDICs areparticularly well suited for the area-efficient implementationof vectoring using iterative decomposition, while fast rotationcan be realized more efficient by using conventional complex-valued multipliers [3], but this implies the availability of thecorresponding complex-valued rotation coefficients. A solutionto this problem is to attach an area-optimized slave CORDICin rotation mode to the vectoring CORDIC as shown in Fig. 2.The input to this slave CORDIC is a unit vector, prescaled bythe CORDICs constant scaling factor κ. The correspondingoutput values are the coefficients required for the multiplierswhich carry out the vector rotation.

MasterCORDIC

SlaveCORDIC

0

vectoring only rotation only

directionlog

cos(x) sin(x)

κx y

x’

Fig. 2. Enhanced vectoring CORDIC, computes directly cos(x) and sin(x)for subsequent vector rotations using standard multiplications.

C. High-level ArchitectureThe overall VLSI architecture of the QR decomposition

unit is shown in Fig. 3. The dedicated vectoring and ro-tation circuits (using CORDIC and conventional arithmetic,respectively) are extended to handle complex-valued matrixentries. The memory which stores the original and intermediatematrices Z(i) is shown as QR Cache and is realized usingRAMs with a dedicated read and write port. To satisfy thehigh memory bandwidth requirements of the rotation block(two read and two write accesses per cycle), the cache issplit into two independent memory banks. One bank holdsthe even rows, the other holds the odd rows of Z(i). Since therotation block always requires the full memory bandwidth, thevectoring block is fed by a separate FIFO and an additionalshadow memory. This solution prevents the rotation blockfrom being stalled by memory access conflicts.

D. Iterative SortingThe iterative sorting described in Alg. 1 is a key feature of

this circuit. The sort metrics are the column norms of Z(i),which are completely calculated for Z(0) at the beginningof Alg. 1 in line 3, and which are then iteratively updatedfor Z(i), i ≥ 1 in line 10. The recursive column-norm updateprocedure requires few additional hardware resources1 and re-ordering of the columns of Z(i) can be implemented efficientlywith simple address remappers shown in Fig. 3. However,the sorting occasionally hinders the parallel processing of

1In order to reduce hardware complexity, the squared `2-norm in Alg. 1can be approximated for example by the `1-norm or by the `∞-norm.

Read Address Remapper Write Address Remapper

ComplexVectoringCORDIC

σn

readaddress

C C C

C CR

0

0

Load/QRControl

WritebackControl

writeaddress

Iterative Ordering(column remapping)

diagonalelements

ComplexVector

Rotation

memorybypass

NormComputation

e−jφ

e−jθ

QR Cache (H,I / R,Q)(separate read and write ports for even and odd rows)

read port write port

ThetaVectoring

PhiVectoring

ShadowMemory

ThetaRotation

PhiRotation

PhiRotation

Fig. 3. VLSI architecture of the low latency iterative sorted MMSE QRdecomposition with super-scalar vectoring and vector rotation.

vectoring and rotation. The problem arises when the update ofone column norm needs to wait for the completion of outstand-ing rotation operations. In this case, also the next vectoringoperation needs to be delayed, until the next column to beprocessed can be identified based on the updated norms. Asa consequence, iterative sorting introduces an additional delaycompared to QR decomposition with a fixed column order.Speculative vectoring of the first element in a new columnof Z(i) helps to reduce the associated performance penalty.However, as can be seen from the comparison in Tbl. I, asmall increase in the number of cycles compared to unsortedor one-time (a-priori) sorted QR decomposition remains.

TABLE IPROCESSING TIME OF DIFFERENT MODES

Mode Processing timeper 4×4 matrix

unsorted MMSE-QRD 67 cycles / 536 nsone-time sorted MMSE-QRD 67 cycles / 536 nsiterative sorted MMSE-QRD 80 cycles / 640 ns

IV. IMPLEMENTATION TRADE-OFFS

Implementation trade-offs comprise the sorting strategy andthe choice of the fixed-point parameters. For the subsequentanalysis, consider a MIMO system with MT = MR = 4,16-QAM modulation and MMSE-(O)SIC detection.

A. Impact of the Sorting Strategy

The BER simulation results for different sorting strategiesare shown in Fig. 4. Clearly, a simple one-time sorting asused in [6] already provides a significant performance gaincompared to unordered SIC. The iteratively sorted QR de-composition closes the gap between the highly complex, butoptimal V-BLAST [8] ordering and the very simple, but lesseffective one-time sorting.

24 26 28 30 32 34 36 38 4010-4

10-3

10-2

10-1

SNR [dB]

BE

R

MMSE-SIC unsortedMMSE-SIC one-time sortedMMSE-VBLASTMMSE-SQRD SIC floating-pointMMSE-SQRD SIC HW [9 6]MMSE-SQRD SIC HW [10 7]MMSE-SQRD SIC HW [11 8]MMSE-SQRD SIC HW [12 9]MMSE-SQRD SIC HW [13 10]

Fig. 4. BER performance of different MIMO detection methods for uncoded16-QAM, MT = MR = 4, and perfectly known H and σn ([x y] denotestotal number of bits including the sign bit, and fractional bits, respectively)

B. Fixed-Point Considerations

A critical aspect for the efficient implementation of theQR decomposition unit is a conservative choice of the fixed-point parameters. The ultimate performance measure is theimplementation loss which relates the BER performance ofa fixed-point receiver implementation to the BER achievedwith a corresponding floating-point receiver. Unfortunately,analytical expressions for this implementation loss as a func-tion of the number of integer and fractional bits used forthe intermediate variables and for the number of CORDICiterations used for the vectoring are not available. Hence, wemust resort to Monte-Carlo simulations. Corresponding resultsare shown in Fig. 4, where only the QR decomposition hasbeen implemented in fixed-point, while the detection stageperforming SIC has been implemented using floating-pointarithmetic to clearly separate the two units. Moreover, theinput matrix is assumed to be scaled such that the maximumabsolute value of real and imaginary parts does not exceedone (block floating-point).

V. IMPLEMENTATION RESULTS

The presented circuit has been implemented in a 0.25µm1P/5M CMOS technology. It supports all possible configura-tions deducted from MT ≤ MR ≤ 4. The sort mode caneither be disabled, or set to one-time or iterative sorting. Toachieve close to floating-point performance up to an SNRof 40 dB, an internal quantization setting of 3 integer and10 fractional bits has been chosen together with 9 CORDICiterations for vectoring. With iterative sorting based on the`2-norm, the corresponding design requires only 54k gates(2.1mm2, 0.25µm) and a suitable detection unit [6] wouldoccupy an additional 23k gates (0.9mm2, 0.25µm).

In comparison, the MMSE V-BLAST described in [9] hasa footprint of 9.0mm2 in a 0.35µm technology which corre-sponds to roughly 190k gates. A first reason for this significantarea penalty is the fact that the V-BLAST algorithm employstwo sequential sets of unitary transformations instead of one,which translates either into a twofold area or into a twofoldincrease in processing time. A second reason lies in the higher

Fig. 5. Layout of the iterative sorted QR decomposition ASIC in 0.25µm1P/5M CMOS technology: 125 MHz maximum clock frequency and finalcore area of 2.61 mm2 at 83% core utilization.

sensitivity of the V-BLAST algorithm to quantization effects,which ultimately calls for more complex arithmetic units.

VI. CONCLUSIONS

MMSE-SQRD is a key preprocessing step for many relevantMIMO detection algorithms, including successive interferencecancellation and sphere decoding. The key to an area ef-ficient, high-throughput VLSI architecture is the joint con-sideration of algorithmic and VLSI implementation aspects.The implemented MMSE-SQRD algorithm employs Givensrotations. The corresponding rotation matrices are obtainedwith CORDIC circuits and are applied through complex-valued multipliers. The iterative sorting adds only a smallpenalty in terms of silicon area and throughput, but providesa considerable BER performance improvement with OSICand potential for complexity reductions in numerous advancedMIMO detection schemes.

ACKNOWLEDGMENT

This work is supported by the STREP project No. IST-026905 (MASCOT) within the sixth framework program ofthe European Commission.

REFERENCES

[1] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, andH. Bolcskei, “VLSI implementation of MIMO detection using the spheredecoder algorithm,” IEEE Journal of Solid-State Circuits, 2005.

[2] L. G. Barbero and J. S. Thompson, “Performance analysis of a fixed-complexity sphere decoder in high-dimensional mimo systems,” in Proc.IEEE ICASSP, vol. 4, May 2006, pp. 557–560.

[3] A. Burg, VLSI Circuits for MIMO Communication Systems, Feb. 2006,Ph.D. dissertation, IIS/ETH-Zurich.

[4] G. H. Golub and C. F. Van Loan, Matrix Computations. John HopkinsUniv. Press, 1996.

[5] D. Wubben, J. Rinas, R. Bohnke, V. Kuhn, and K. Kammeyer, “Efficientalgorithm for detecting layered space-time codes,” in Proc. ITG Confer-ence on Source and Channel Coding, Jan. 2002, pp. 399–405.

[6] D. Perels, S. Haene, P. Luethi, A. Burg, N. Felber, W. Fichtner, andH. Bolcskei, “ASIC implementation of a MIMO-OFDM transceiver for192 mbps WLANs,” in Proc. IEEE ESSCIRC, 2005, to appear.

[7] J. Volder, “The CORDIC trigonometric computing technique,” IRE Trans.Electronic Computers, vol. EC-8, no. 3, pp. 330–334, Sept. 1959.

[8] P. Wolniansky, G. Foschini, G. Golden, and R. Valenzuela, “VBLAST:An architecture for realizing very high data rates over the rich-scatteringwireless channel,” in Proc. IEEE ISSSE, Oct. 1998, pp. 295–300.

[9] Z. Guo and P. Nilsson, “A VLSI implementation of MIMO detection forfuture wireless communications,” in Proc. IEEE PIMRC, vol. 3, 2003,pp. 2852–2856.

Documents

VLSI Implementation of a High-Speed Iterative …luethi/study/publications/ISCAS07.pdfVLSI Implementation of a High-Speed Iterative Sorted MMSE QR Decomposition P. Luethi, A. Burg,