
272 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 60, NO. 5, MAY 2013

VLSI Implementation of a High-Throughput Iterative Fixed-Complexity Sphere Decoder

Xi Chen, Guanghui He, Member, IEEE, and Jun Ma

Abstract: By exchanging soft information between the multiple-input multiple-output (MIMO) detector and the channel decoder, an iterative receiver can significantly improve performance compared with a noniterative receiver. In this brief, a soft-input soft-output fixed-complexity sphere-decoding algorithm and its very large scale integration architecture are proposed for the iterative MIMO receiver. The deeply pipelined architecture employs an optimized hybrid enumeration to search for the best child node estimate efficiently. By adding counterhypotheses in parallel with other candidates, the proposed iterative MIMO detector improves the detection performance significantly with low detection latency. An iterative detector for a 4 × 4 64-quadrature amplitude modulation (QAM) MIMO system based on the proposed architecture is designed and implemented in 90-nm CMOS technology. The detector achieves a maximum throughput of 2.2 Gbit/s with an area efficiency of 3.96 Mbit/s/kGE, which is more efficient than other iterative MIMO detectors.

Index Terms: Fixed-complexity sphere decoding (SD) (FSD), multiple-input multiple-output (MIMO), soft-input soft-output (SISO) MIMO detection, very large scale integration (VLSI).

I. INTRODUCTION

Multiple-input and multiple-output (MIMO) technology has been widely applied in wireless communications since it offers significant increases in data throughput and link range without additional bandwidth or increased transmit power. By incorporating MIMO with bit-interleaved coded modulation with iterative detection and decoding (BICM-IDD), the channel capacity can be approached [1] at the cost of much higher complexity and lower throughput compared with noniterative schemes. Thus, it is very important to develop a high-speed iterative detector to meet the increasing demand for gigabit-per-second wireless systems such as the IEEE 802.11ac wireless local area network (WLAN) and 3GPP LTE-Advanced.

Due to its practical importance, the very large scale integration (VLSI) design of soft-input soft-output (SISO) detectors has recently received a lot of attention. The first reported implementation of a SISO MIMO detector is based on the minimum mean square error parallel interference cancellation (MMSE-PIC) algorithm [2], but it cannot fully exploit the spatial diversity provided by MIMO. To overcome this limitation, implementations of SISO single tree search (STS) sphere decoding (SD) [3], [4] have been presented, which achieve max-log maximum a posteriori (MAP) performance. However, like other depth-first tree-search algorithms, STS-SD suffers from variable throughput and complexity depending on the signal-to-noise ratio (SNR). More recently, a novel SISO detection algorithm based on trellis search and its VLSI architecture has been proposed in [5]. It provides a peak throughput of 1.7 Gbit/s, but it consumes a large silicon area and cannot easily support high-order modulation [e.g., 64-quadrature amplitude modulation (QAM)].

Manuscript received August 26, 2012; revised November 19, 2012; accepted February 2, 2013. Date of publication March 27, 2013; date of current version May 13, 2013. This work was supported in part by the Research Fund for the Doctoral Program of Higher Education of China under Grant 20110073110055 and in part by the Shanghai Natural Science Foundation under Grant 10ZR1416500. This brief was recommended by Associate Editor Z. Wang.

The authors are with the School of Microelectronics, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this brief are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSII.2013.2251954

Fixed-complexity SD (FSD) is a breadth-first tree-search algorithm previously proposed for hard-output MIMO detection. It is capable of providing near maximum likelihood (ML) detection performance with fixed and low complexity [6]. A highly efficient silicon implementation of FSD is reported in [7], which achieves a 1.98-Gbit/s detection throughput with a parallel multistage VLSI architecture. It is therefore very attractive to extend this hard-output base architecture to support iterative MIMO detection.

In this brief, we present, to the best of our knowledge, the first VLSI architecture of list-based SISO FSD. Based on the architecture presented in [7], we propose an optimized hybrid enumeration (HE) for iterative SISO FSD to find the best child estimate with low complexity. Meanwhile, candidates with counterhypotheses are added by bit flipping the best child of the MAP estimate to improve the quality of the generated soft information. Implemented in a 90-nm CMOS technology, our proposed architecture for a 4 × 4 64-QAM spatial multiplexing iterative MIMO detector achieves a constant throughput of 2.2 Gbit/s per iteration independent of the SNR while maintaining near max-log-MAP detection performance.

II. SYSTEM MODEL

Consider a MIMO system based on the BICM-IDD scheme [1] with $N_t$ transmit antennas and $N_r$ receive antennas ($N_r \ge N_t$). Assuming transmission over a flat fading channel, the received symbol vector $r$ can be written as $r = Hs + n$, where $H$ is an $N_r \times N_t$ channel matrix and $s$ is an $N_t \times 1$ transmit symbol vector whose entries are taken from a set $\mathcal{C}$ of $M$-QAM Gray-mapped constellation points with $M = 2^{M_c}$. The vector $n$ contains zero-mean independent and identically distributed Gaussian noise samples with variance $N_0$ per complex entry.

1549-7747/$31.00 © 2013 IEEE



CHEN et al.: VLSI IMPLEMENTATION OF HIGH-THROUGHPUT ITERATIVE FIXED-COMPLEXITY SPHERE DECODER 273

In general, to avoid hardware-consuming operations in the complex domain, the orthogonal version of real-value decomposition (ORVD) [7] can be adopted to transform the $N_r \times N_t$ complex system model into its equivalent $2N_r \times 2N_t$ real system, again represented as $r = Hs + n$. The ORVD also transforms the complex constellation $\mathcal{C}$ of $M$ points into its equivalent real constellation $\mathcal{P}$ of $\sqrt{M}$ points. For the tree-search algorithm, $H$ is typically QR decomposed as $H = QR$, where $Q$ is a unitary matrix and $R$ is an upper triangular matrix. The system model can then be rewritten as $y = Q^H r = Rs + Q^H n$. As given in (1) and (2), the metric increments $\mathcal{M}_C(s_i)$ and $\mathcal{M}_A(s_i)$ for channel-based and a priori-based information, respectively, are summed to a total increment $\mathcal{M}_P(s_i) = \mathcal{M}_C(s_i) + \mathcal{M}_A(s_i)$

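$$\mathcal{M}_C(s_i) = \Bigl| y_i - \sum_{j=i}^{2N_t} R_{ij} s_j \Bigr|^2 = |\tilde{y}_i - R_{ii} s_i|^2 \tag{1}$$

$$\mathcal{M}_A(s_i) = -\frac{1}{2} \sum_{b=1}^{M_c/2} x_{i,b} N_0 L^A_{i,b} \tag{2}$$

where $\tilde{y}_i = y_i - \sum_{j=i+1}^{2N_t} R_{ij} s_j$ is the received value after the contributions of the already detected symbols are removed, $x_{i,b} \in \{+1, -1\}$ denotes the $b$th bit of the bit-level vector $x_i$ associated with the $i$th level, and $L^A_{i,b}$ denotes the a priori log-likelihood ratio (LLR) of $x_{i,b}$. The sum of the metric increments along a path from the root node to node $s_i$ yields the partial metric $\mathcal{M}_P(s^{(i)})$ for a partial symbol vector $s^{(i)} = [s_i, \ldots, s_{2N_t}]^T$. The extrinsic LLR $L^E_{i,b}$ of bit $x_{i,b}$ is computed as

$$L^E_{i,b} \approx \frac{1}{N_0} \Bigl( \min_{s \in \mathcal{P}^{2N_t},\, x_{i,b} = -1} \mathcal{M}_P(s) - \min_{s \in \mathcal{P}^{2N_t},\, x_{i,b} = +1} \mathcal{M}_P(s) - N_0 L^A_{i,b} \Bigr). \tag{3}$$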

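Making an exhaustive search for the two minima in (3) is impractical. A typical modification is to generate a list $\mathcal{L}$ of $N_{\mathrm{cand}}$ candidates. Then, the extrinsic LLR can be computed as

$$L^E_{i,b} \approx \frac{1}{N_0} \Bigl( \min_{s \in \mathcal{L},\, x_{i,b} = -1} \mathcal{M}_P(s) - \min_{s \in \mathcal{L},\, x_{i,b} = +1} \mathcal{M}_P(s) - N_0 L^A_{i,b} \Bigr). \tag{4}$$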

III. PROPOSED SISO FSD ALGORITHM

The proposed SISO FSD algorithm extends the hard-output FSD algorithm in [7], where an imbalanced-expansion scheme is applied to avoid inefficient full expansion at the $(2N_t - 1)$th level. The method introduces a polygon-shaped admissible region to reduce unnecessary visits to some nodes by imposing an extension number limitation $L^m_{2N_t-1}$, with which only the $L^m_{2N_t-1}$ best nodes are extended from the $m$th father node at the $(2N_t - 1)$th level.

To extend the hard-output FSD to a SISO FSD with near max-log-MAP performance, three methods are proposed: the optimized HE (OHE), parallel candidate adding (PCA) using a bit-flipping strategy, and incorporation of the compensation of self-interference into the tree search.

Fig. 1. Illustration of the proposed OHE method in the equivalent real-value system for Gray mapped 64-QAM modulation.

A. OHE

Since soft inputs prevent the use of simplified methods relying on the geometry of the constellation $\mathcal{P}$, finding the exact Schnorr-Euchner (SE) order requires exhaustively computing and sorting the $\{\mathcal{M}_P\}$ of all the $\sqrt{M}$ children, which is computationally very expensive. In [3], an efficient solution called HE is proposed for SISO STS-SD, where the two best nodes based on $\mathcal{M}_C$ and $\mathcal{M}_A$ are enumerated concurrently, and then, the one with the minimum $\mathcal{M}_P$ is selected for the next tree-search step. Unfortunately, when HE is applied to FSD to find the best child node of a certain parent node, some performance degradation is introduced, because HE cannot guarantee that it finds the child node with the minimum $\mathcal{M}_P$ among all the children.

The proposed OHE adds an additional step after enumerating the two best nodes $s^{(1)}_{C,i}$ and $s^{(1)}_{A,i}$ based on $\mathcal{M}_C$ and $\mathcal{M}_A$, respectively. It replaces $s^{(1)}_{A,i}$ with another appropriate node $s_{CA,i}$ that is more likely to have a smaller $\mathcal{M}_P$ than $s^{(1)}_{A,i}$ does. The steps of OHE can be described as follows.

1) Enumerate $s^{(1)}_{C,i}$ by quantizing $\tilde{y}_i / R_{ii}$.

2) Enumerate $s^{(1)}_{A,i}$, whose bit vector $x_i = [x_{i,1}, \ldots, x_{i,M_c/2}]$ satisfies $x_{i,b} = \mathrm{sign}(L^A_{i,b})$ for $1 \le b \le M_c/2$.

3) Obtain $M_c/2$ sibling nodes $\{ s^{(1)}_{A,i,b} \mid 1 \le b \le M_c/2 \}$ by flipping each bit of $s^{(1)}_{A,i}$ in turn, and choose $s_{CA,i}$, which is nearest to $s^{(1)}_{C,i}$ in geometry but not equal to $s^{(1)}_{C,i}$, among the candidate set $\{ s^{(1)}_{A,i,b}, s^{(1)}_{A,i} \mid 1 \le b \le M_c/2 \}$.

4) Expand $s^{(1)}_{C,i}$ and $s_{CA,i}$, and select the node with the smaller $\mathcal{M}_P$ as the best child estimate.

Note that $s^{(1)}_{A,i,b}$ denotes the sibling node obtained by flipping the $b$th bit of $s^{(1)}_{A,i}$. As shown in Fig. 1, since at most one bit differs between $s_{CA,i}$ and $s^{(1)}_{A,i}$ and $s_{CA,i}$ is very close to $s^{(1)}_{C,i}$ in geometry, both $\mathcal{M}_A$ and $\mathcal{M}_C$ of $s_{CA,i}$ are small. Therefore, $s_{CA,i}$ is more likely to have a smaller $\mathcal{M}_P$ than $s^{(1)}_{A,i}$, particularly in the first few iterations.
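The four OHE steps can be sketched as follows for one real-valued level of 64-QAM (an 8-PAM constellation with $M_c/2 = 3$ bits). This is a minimal sketch under stated assumptions: the reflected-Gray labeling, the mapping of a set bit to $+1$, and all function names are hypothetical choices for illustration, since the brief does not specify the exact bit mapping. The self-interference term of Section III-C is left out.

```python
LEVELS = [-7, -5, -3, -1, 1, 3, 5, 7]  # real 8-PAM levels of 64-QAM


def gray_bits(idx):
    # Assumed labeling: reflected Gray code, a set bit maps to +1.
    g = idx ^ (idx >> 1)
    return tuple(+1 if (g >> (2 - b)) & 1 else -1 for b in range(3))


BITS = {LEVELS[i]: gray_bits(i) for i in range(8)}
SYMBOL = {bits: s for s, bits in BITS.items()}


def metric_p(s, y_tilde, r_ii, la, n0):
    m_c = (y_tilde - r_ii * s) ** 2                               # eq. (1)
    m_a = -0.5 * sum(x * n0 * l for x, l in zip(BITS[s], la))     # eq. (2)
    return m_c + m_a


def ohe_best_child(y_tilde, r_ii, la, n0):
    # Step 1: best node for M_C, found by quantizing y_tilde / R_ii.
    s_c = min(LEVELS, key=lambda s: abs(s - y_tilde / r_ii))
    # Step 2: best node for M_A, with bits equal to the signs of the LLRs.
    x_a = tuple(+1 if l >= 0 else -1 for l in la)
    pool = [SYMBOL[x_a]]
    # Step 3: add the single-bit-flip siblings, then take the node nearest
    # to s_c that is not s_c itself.
    for b in range(len(x_a)):
        pool.append(SYMBOL[x_a[:b] + (-x_a[b],) + x_a[b + 1:]])
    s_ca = min((s for s in pool if s != s_c), key=lambda s: abs(s - s_c))
    # Step 4: expand both nodes and keep the one with the smaller M_P.
    return min((s_c, s_ca), key=lambda s: metric_p(s, y_tilde, r_ii, la, n0))


def exhaustive_best_child(y_tilde, r_ii, la, n0):
    # Reference: full search over all sqrt(M) children.
    return min(LEVELS, key=lambda s: metric_p(s, y_tilde, r_ii, la, n0))


print(ohe_best_child(1.9, 1.0, [1.0, 1.0, 1.0], 1.0))         # 3
print(exhaustive_best_child(1.9, 1.0, [1.0, 1.0, 1.0], 1.0))  # 3
```

Only two metric evaluations are needed per parent instead of eight. The price is that the result is an estimate: under the assumed labeling, `ohe_best_child(-6.6, 1.0, [4.0, 4.0, 4.0], 1.0)` returns -7 while the exhaustive search returns -5, a case with strong priors that contradict the channel observation.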

B. Parallel Candidate Adding Scheme

Although FSD can achieve near-ML detection performance for hard-output MIMO systems, it cannot provide accurate soft information [8] because counterhypotheses are missing. To solve this problem, our proposed PCA scheme introduces another candidate list $\mathcal{L}^+$, which contains the counterhypotheses of the best child estimates of the partial MAP nodes. In our PCA scheme, after the upper two levels of the real-value tree are expanded, only the best child estimates are extended for the rest of the levels. In addition, from the $(2N_t - 2)$th level, the PCA locates the partial MAP parent node $s^{\mathrm{PMAP}}$ by searching for the node with the minimum $\mathcal{M}_P$ in the original candidate list $\mathcal{L}$ and then, using the bit-flipping operation, adds the $M_c/2$ sibling nodes of the best child estimate of $s^{\mathrm{PMAP}}$ as counterhypotheses to the expanded list $\mathcal{L}^+$ before proceeding to the next level. Finally, the extrinsic LLR is computed based on $\mathcal{L} \cup \mathcal{L}^+$.

Fig. 2. BER performance of various algorithms for a 4 × 4 64-QAM MIMO system with a turbo code rate of 1/2.
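One level of the PCA scheme can be sketched as below. The sketch assumes a reflected-Gray labeling of the 8-PAM levels and represents a partial path as a tuple of symbols in detection order; both choices, and the names `bit_flip_siblings` and `pca_step`, are hypothetical and serve illustration only. The best child of a parent is passed in as a function, for example an OHE routine.

```python
LEVELS = [-7, -5, -3, -1, 1, 3, 5, 7]  # real 8-PAM levels of 64-QAM


def gray_label(idx):
    # Assumed labeling: reflected Gray code, three bits per real symbol.
    g = idx ^ (idx >> 1)
    return tuple((g >> (2 - b)) & 1 for b in range(3))


LABEL = {LEVELS[i]: gray_label(i) for i in range(8)}
SYMBOL = {bits: s for s, bits in LABEL.items()}


def bit_flip_siblings(best_child):
    """Return the Mc/2 siblings that each differ from best_child in one bit."""
    bits = LABEL[best_child]
    return [SYMBOL[bits[:b] + (1 - bits[b],) + bits[b + 1:]]
            for b in range(len(bits))]


def pca_step(candidates, best_child_of):
    """Counterhypothesis paths to add to L+ at one tree level.

    candidates:    list of (partial_path, metric) pairs from the list L.
    best_child_of: function mapping a parent path to its best child symbol.
    """
    parent, _ = min(candidates, key=lambda c: c[1])   # partial MAP parent node
    child = best_child_of(parent)
    return [parent + (sib,) for sib in bit_flip_siblings(child)]


print(bit_flip_siblings(3))                                     # [-3, 5, 1]
print(pca_step([((7,), 1.2), ((5,), 0.4)], lambda path: 3))
```

Because each sibling differs from the best child in exactly one bit, every bit of the newly detected symbol on the partial MAP path receives a candidate with the opposite value. The metrics of the added paths still have to be computed, which this sketch omits.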

C. Compensation of Self-Interference

Like other tree-search algorithms, the SISO FSD benefits significantly from column sorting and regularization of the channel matrix [9]. However, the self-interference caused by channel-matrix regularization degrades performance. To recover this loss, we adopt the method developed in [9], where the compensation of self-interference is incorporated into the tree search. The self-interference term $\mathcal{M}_{SI}(s_i)$ is subtracted from the metric increment $\mathcal{M}_P(s_i)$ as follows:

$$\mathcal{M}_P(s_i) = \mathcal{M}_C(s_i) + \mathcal{M}_A(s_i) - \mathcal{M}_{SI}(s_i) \tag{5}$$

where $\mathcal{M}_{SI}(s_i) = \alpha^2 |s_i|^2$ and $\alpha$ is the regularization parameter.
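The origin of the correction term can be made explicit with a short derivation. It is a sketch under an assumption not spelled out in the brief: that regularization is performed by QR-decomposing the channel matrix augmented with a scaled identity, with the regularization parameter written as $\alpha$ and $\tilde{R}$, $Q_a$, $Q_b$, $c$ introduced here only for the derivation.

```latex
\begin{aligned}
\bar{H} &= \begin{bmatrix} H \\ \alpha I \end{bmatrix}
         = \begin{bmatrix} Q_a \\ Q_b \end{bmatrix} \tilde{R},
\qquad \tilde{y} = Q_a^H r, \\
\| r - Hs \|^2 + \alpha^2 \| s \|^2
        &= \left\| \begin{bmatrix} r \\ 0 \end{bmatrix} - \bar{H} s \right\|^2
         = \| \tilde{y} - \tilde{R} s \|^2 + c, \\
\| r - Hs \|^2
        &= \| \tilde{y} - \tilde{R} s \|^2 - \sum_{i} \alpha^2 |s_i|^2 + c,
\end{aligned}
```

where $c$ does not depend on $s$. Under this assumption, a tree search on the regularized triangular system measures the true distance plus $\alpha^2 |s_i|^2$ per level, so subtracting that amount at each level restores the original metric up to a constant.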

IV. SIMULATION RESULTS AND COMPLEXITY ANALYSIS

In this section, the proposed SISO FSD algorithm is evaluated and compared with other algorithms. We considered a coded 4 × 4 MIMO system utilizing 64-QAM modulation over a spatially uncorrelated Rayleigh MIMO channel with additive white Gaussian noise. The 3GPP-LTE turbo code was used, with constraint length = 4, polynomial (feedback, redundancy) = $(13, 15)_{\mathrm{octal}}$, block size = 1024 bits, code rate = 1/2, and eight internal iterations of log-MAP decoding.

Fig. 2 shows the detection performance of the proposed SISO FSD with $L^m_{2N_t-1} = [7, 7, 5, 5, 3, 3, 1, 1]$, the K-best detector with $K = 50$, the list FSD (LFSD) [8] with $n_S = [1, 1, 1, 1, 2, 2, 8, 8]$, and the STS-SD with $L_{\max} = 8$. The performance of STS-SD is given as the baseline reference since it has been demonstrated to be capable of achieving max-log-MAP optimality if the LLR clipping value $L_{\max}$ is sufficiently large. As shown in Fig. 2, for the noniterative (iteration number $I = 1$) detection, our proposed SISO FSD outperforms MMSE-PIC, the K-best detector, and LFSD, with only a 0.1-dB performance degradation compared with STS-SD at a bit error rate (BER) of $10^{-4}$. For the iterative ($I = 4$) detection, the proposed SISO FSD shows only a very small performance loss compared with the iterative K-best detector, with a 0.35-dB degradation against STS-SD. The performance gap grows slightly as the iteration number increases because, once nonzero a priori information is present, the OHE can no longer find the exact best child in the later iterations as it does in the first iteration.

TABLE I. NUMBER OF VISITED NODES

Fig. 3. Proposed VLSI architecture of SISO FSD.

The computational complexity in terms of the number of visited nodes per vector detection pertaining to a single receiver iteration for a 4 × 4 64-QAM MIMO system is given in Table I. The total detection complexity is proportional to the number of iterations in the detector/decoder loop. The proposed SISO FSD algorithm visits the fewest nodes among all listed tree-search algorithms. By employing the efficient OHE method, SISO FSD avoids a brute-force search for the best child and thus significantly reduces the number of visited nodes.

V. VLSI ARCHITECTURE FOR PROPOSED SISO FSD

Our proposed VLSI architecture for the list-based SISO FSD in a 4 × 4 64-QAM MIMO system is illustrated in Fig. 3. The architecture is based on the multistage architecture of the hard-output FSD [7] and is extended by the OHE strategy and PCA scheme described in Section III to support SISO processing.

A. High-Level Architecture

By employing the ORVD, the $\mathcal{M}_P$ computations in two adjacent levels, i.e., $\mathcal{M}_P(s_i)$ and $\mathcal{M}_P(s_{i+1})$, can be conducted in parallel, since $R_{i,i+1}$ is zero for $i = 1, 3, \ldots, 2N_t - 1$ [7]. As a consequence, the number of processing element (PE) stages is halved compared with pipelined detectors using traditional real-value decomposition [10]. The architecture supports both hard outputs and soft outputs. The hard-output module generates the original hard-output FSD candidate list $\mathcal{L}$, in which the best path with the minimum $\mathcal{M}_P$ is found. The soft-output module generates an expanded list $\mathcal{L}^+$ by employing the PCA scheme and calculates the LLRs based on the union of the two lists, $\mathcal{L} \cup \mathcal{L}^+$.

The PEs in our design are divided into three types: PE-A, PE-B, and PE-C. PE-A is located in the first stage, where multiple child nodes are expanded. PE-B performs the single expansion in the remaining three stages. PE-C in the soft-output module adopts the bit-flipping strategy to add the counterhypotheses to the expanded list $\mathcal{L}^+$. To identify the partial MAP node in $\mathcal{L}$, a minimum (MIN) search block in the soft-output module selects the node with the smallest $\mathcal{M}_P$. With $L^m_{2N_t-1} = [7, 7, 5, 5, 3, 3, 1, 1]$, the number of candidates in $\mathcal{L}$ is $N^{\mathcal{L}}_{\mathrm{cand}} = 32$. In the hard-output module, we instantiate eight PE-Bs at each stage so that eight nodes can be processed simultaneously, and thus, four cycles are needed to process all the candidates in $\mathcal{L}$.

The candidate generation unit (CGU) generates all possible values of $|R_{i,j} s_j|$, which are shared by the $\mathcal{M}_P(s_i)$ calculations at the same level. Additionally, the $\mathcal{M}_A(s_i)$ and $\mathcal{M}_{SI}(s_i)$ of all possible symbols are precomputed to further enhance the hardware sharing. Moreover, the best node $s^{(1)}_{A,i}$ with the minimum $\mathcal{M}_A$ at each level is identified according to the sign of $L^A_{i,b}$ and buffered in the CGU, which avoids full sorting of the set $\{\mathcal{M}_A(s_i)\}$. The LLR calculation unit (LCU) in the last stage calculates the LLRs of each transmitted bit according to (4) based on the candidate lists $\mathcal{L}$ and $\mathcal{L}^+$.

Fig. 4 shows the timing schedule of the proposed VLSI architecture. Detecting one symbol vector has a latency of 36 cycles. The whole architecture works in a deeply pipelined fashion and, after this latency, outputs a detected symbol vector every four cycles.

Fig. 4. Timing schedule of the proposed VLSI architecture.
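The property $R_{i,i+1} = 0$ for odd $i$, which lets two adjacent levels be processed in parallel, can be checked numerically. The sketch below assumes one particular ORVD column ordering (the real and imaginary parts of each complex symbol occupy adjacent columns) and uses a plain Gram-Schmidt QR; the function names, the ordering, and the random test channel are assumptions for illustration and are not taken from [7].

```python
import random


def orvd(h):
    """Real-valued decomposition with paired columns (assumed ORVD ordering).

    h is an Nr x Nt complex matrix given as nested lists. Column 2k of the
    result is [Re h_k; Im h_k] and column 2k+1 is [-Im h_k; Re h_k], so the
    two real columns belonging to one complex symbol are adjacent.
    """
    nr, nt = len(h), len(h[0])
    out = [[0.0] * (2 * nt) for _ in range(2 * nr)]
    for i in range(nr):
        for k in range(nt):
            re, im = h[i][k].real, h[i][k].imag
            out[i][2 * k], out[i][2 * k + 1] = re, -im
            out[nr + i][2 * k], out[nr + i][2 * k + 1] = im, re
    return out


def qr_r(a):
    """Upper-triangular factor R from modified Gram-Schmidt on the columns of a."""
    rows, cols = len(a), len(a[0])
    v = [[a[i][j] for i in range(rows)] for j in range(cols)]  # list of columns
    r = [[0.0] * cols for _ in range(cols)]
    for j in range(cols):
        r[j][j] = sum(x * x for x in v[j]) ** 0.5
        q = [x / r[j][j] for x in v[j]]
        for k in range(j + 1, cols):
            r[j][k] = sum(qi * x for qi, x in zip(q, v[k]))
            v[k] = [x - r[j][k] * qi for x, qi in zip(v[k], q)]
    return r


random.seed(0)
h = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(4)]
     for _ in range(4)]
r = qr_r(orvd(h))
# R_{i,i+1} for i = 1, 3, 5, 7 in the 1-based indexing of the text.
pair_entries = [r[i][i + 1] for i in range(0, 8, 2)]
print(max(abs(x) for x in pair_entries))  # zero up to rounding error
```

With these entries equal to zero, the interference-cancelled value at an odd level does not depend on the symbol chosen at the level directly above it, which is what allows the two levels to share one pipeline stage.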

B. PE-A

The imbalanced-expansion scheme needs to determine the SE order in the upper two levels (i.e., levels 8 and 7). Because the presence of a priori information prevents the use of the well-known zigzag enumeration employed in hard-output FSD, finding the exact SE order requires the full computation and sorting of $\{\mathcal{M}_P(s_i)\}$. However, by utilizing the property that $R_{7,8} = 0$ and

$$\mathcal{M}_P(s_i) = |y_i - R_{ii} s_i|^2 + \mathcal{M}_A(s_i) - \mathcal{M}_{SI}(s_i)$$

for $7 \le i \le 8$, the computation and sorting of $\{\mathcal{M}_P(s_i)\}$ in levels 8 and 7 can be carried out independently and simultaneously. Thus, the number of $\mathcal{M}_P(s_i)$ computations is reduced to 16, a saving of 77.8% compared with the straightforward approach of computing $\mathcal{M}_P(s_i)$ for 8 + 64 = 72 nodes in the upper two levels. The number of sorters is also reduced to two. Moreover, the complexity of PE-A can be further reduced by time-multiplexed hardware sharing: given the number of cycles per symbol vector $N_{\mathrm{cycle}} = 4$, only two $\mathcal{M}_P(s_i)$ computation blocks are instantiated in each level to compute the metric increments of the eight candidates serially. Fig. 5 gives the architecture of PE-A using the techniques described above. To save more area, two folded bubble sorters are used, which can sort the eight candidates in four cycles. The path selecting and combining unit (PSCU) receives the sorted candidates and then selects and combines them to form $\mathcal{M}_P(s^{(7)}) = \mathcal{M}_P(s_8) + \mathcal{M}_P(s_7)$ for the next stage according to $L^m_{2N_t-1} = [7, 7, 5, 5, 3, 3, 1, 1]$.

Fig. 5. Architecture of PE-A.

Fig. 6. Architecture of PE-B in stage 2.
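The selection and combination performed by the PSCU can be sketched as follows. The sketch assumes that the $m$th father node means the $m$th best level-8 node after sorting, and the function name and the toy metric values are hypothetical. Because $R_{7,8} = 0$, the level-7 ordering is the same for every father, so one sorted list per level suffices.

```python
def imbalanced_expand(level8, level7, limits):
    """Combine independently sorted level-8 and level-7 candidates.

    level8, level7: lists of (symbol, metric_increment), eight entries each.
    limits:         number of children kept for the m-th best father node
                    (assumed meaning of the extension number limitation).
    Returns a list of ((s8, s7), partial_metric) pairs.
    """
    fathers = sorted(level8, key=lambda c: c[1])
    children = sorted(level7, key=lambda c: c[1])
    paths = []
    for (s8, m8), limit in zip(fathers, limits):
        for s7, m7 in children[:limit]:
            paths.append(((s8, s7), m8 + m7))  # M_P(s^(7)) = M_P(s8) + M_P(s7)
    return paths


LEVELS = [-7, -5, -3, -1, 1, 3, 5, 7]
# Hypothetical metric increments, used only to exercise the function.
level8 = [(s, (s - 0.3) ** 2) for s in LEVELS]
level7 = [(s, (s + 1.2) ** 2) for s in LEVELS]
paths = imbalanced_expand(level8, level7, [7, 7, 5, 5, 3, 3, 1, 1])

print(len(paths))                        # 32 candidates enter the next stage
print(round(100 * (1 - 16 / 72), 1))     # 77.8 percent fewer metric computations
```

The two printed numbers match the figures quoted in the text: 32 candidates in the list and a 77.8% reduction from 72 to 16 metric computations.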

C. PE-B

PE-B implements the single expansion, where only the best node estimate is selected and preserved using the proposed OHE. As shown in Fig. 6, the interference cancellation unit (ICU) in PE-B computes $\tilde{y}_i$ in (1) to eliminate the interantenna interference introduced by previously detected symbols. To enumerate the best child node $s^{(1)}_{C,i}$ with the minimum $\mathcal{M}_C$, a quantization step $Q$ is required to find the symbol nearest to $\tilde{y}_i / R_{i,i}$. The HE unit (HEU) chooses $s_{CA,i}$ according to step 3) of the OHE method. The MIN block compares the $\mathcal{M}_P$ of $s^{(1)}_{C,i}$ and $s_{CA,i}$ and then selects the node with the smaller $\mathcal{M}_P$.
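The ICU and quantization steps can be sketched for a small real-valued system. The function names, the 0-based indexing, and the 2 × 2 toy values below are assumptions for illustration; the slicer relies on the 8-PAM levels being the odd integers from -7 to 7.

```python
import math


def cancel_interference(y, r, detected, i):
    """ICU: y_tilde_i = y_i - sum over j > i of R_ij * s_j (0-based indices).

    detected maps a level j to the symbol s_j already chosen on this path.
    """
    return y[i] - sum(r[i][j] * s for j, s in detected.items() if j > i)


def slice_pam(z, lowest=-7, highest=7):
    """Quantizer Q: nearest odd-integer PAM level, clipped to the constellation."""
    k = 2 * math.floor(z / 2) + 1
    return max(lowest, min(highest, k))


# Hypothetical 2 x 2 upper-triangular example.
r = [[2.0, 0.5],
     [0.0, 1.5]]
y = [4.1, 4.4]

s1 = slice_pam(y[1] / r[1][1])                     # top level: nothing to cancel
y_tilde0 = cancel_interference(y, r, {1: s1}, 0)   # remove the level-1 contribution
s0 = slice_pam(y_tilde0 / r[0][0])
print(s1, s0)                                      # 3 1
```

Computing the slicer arithmetically avoids comparing against every constellation point, which is the reason the channel-based best child is cheap to find.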




Fig. 7. Architecture of PE-C in stage 2.

D. PE-C

Fig. 7 shows the architecture of PE-C, which implements the PCA scheme. The best child estimate selection unit (BCSU) receives the partial MAP node $s^{\mathrm{PMAP}}_{i+1}$ and finds its best child estimate $s^{(1)}_i$ by employing the OHE, exactly as in PE-B. The candidate adding unit (CAU) uses the bit-flipping strategy to add three sibling nodes of $s^{(1)}_i$, which are fed forward to a multiplexer so that only one of them is selected per cycle for the $\mathcal{M}_P$ computation. Compared with the parallel method, this serial computation reduces the number of $\mathcal{M}_P$ computation blocks in the CAU by 66.7% and reduces the number of PE-Bs in the subsequent stages, without affecting the throughput of the whole architecture.

VI. IMPLEMENTATION RESULTS

The proposed SISO FSD architecture has been implemented in a 90-nm CMOS technology with a standard-performance standard-cell library. As shown in Fig. 2, the fixed-point detector shows about 0.1-dB performance loss compared with the floating-point detector. The core area of the chip is 2.61 mm². At the nominal 1.0-V supply voltage, the detector can work at a maximum frequency $f_{\max}$ of 370 MHz, achieving a 2.2-Gbit/s peak throughput per iteration. The throughput is given by

$$\Theta = \frac{M_c N_t}{N_{\mathrm{cycle}}} f_{\mathrm{clk}}. \tag{6}$$
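As a worked check of (6), the quoted peak throughput follows from the parameters given in the text: $M_c = 6$ bits per 64-QAM symbol, $N_t = 4$, and a 370-MHz clock, with $N_{\mathrm{cycle}} = 4$ resulting from 32 candidates processed by eight PE-Bs per stage. The function and variable names are illustrative only.

```python
def throughput_bps(bits_per_symbol, n_tx, cycles_per_vector, f_clk_hz):
    """Throughput per iteration according to (6), in bit/s."""
    return bits_per_symbol * n_tx / cycles_per_vector * f_clk_hz


limits = [7, 7, 5, 5, 3, 3, 1, 1]     # extension number limitation from the text
n_cand = sum(limits)                  # 32 candidates in the list
pe_b_per_stage = 8                    # eight PE-Bs instantiated per stage
n_cycle = n_cand // pe_b_per_stage    # 4 cycles per symbol vector

rate = throughput_bps(bits_per_symbol=6, n_tx=4,
                      cycles_per_vector=n_cycle, f_clk_hz=370e6)
print(n_cycle, rate / 1e9)            # 4 2.22
```

The result, 2.22 Gbit/s, agrees with the 2.2-Gbit/s figure and does not depend on the SNR, since $N_{\mathrm{cycle}}$ is fixed by the list size.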

We compare the proposed SISO FSD with recently reported MIMO detectors in Table II. The proposed SISO FSD MIMO detector achieves a significant increase in data throughput and much lower latency compared with the other detectors. Additionally, the SISO FSD achieves a 3.96-Mbit/s/kGE area efficiency, which is the highest among all the reported iterative detectors. Unlike depth-first tree-search algorithms, whose throughput and area efficiency degrade substantially in the low-SNR regime, the proposed SISO FSD has fixed throughput and area efficiency per iteration while preserving near max-log-MAP detection performance.

VII. CONCLUSION

This brief presents the algorithm optimization and VLSI implementation of a SISO FSD. Based on the hard-output imbalanced FSD in [7], the proposed SISO FSD algorithm employs the efficient OHE to avoid the exhaustive search for the best child in the soft-input scenario and adopts the simple PCA scheme to improve the quality of the output LLRs. In addition, the compensation of the self-interference caused by channel-matrix regularization is incorporated in the tree search, leading to further performance gain. These proposed techniques can reduce the complexity significantly and provide near max-log-MAP performance. At the architecture level, the proposed multistage architecture with time-multiplexed hardware sharing further reduces the area cost. Implementation results show that our SISO FSD outperforms other reported iterative MIMO detectors in terms of throughput and area efficiency.

TABLE II. IMPLEMENTATION RESULTS AND COMPARISON

REFERENCES

[1] B. M. Hochwald and S. ten Brink, "Achieving near-capacity on a multiple-antenna channel," IEEE Trans. Commun., vol. 51, no. 3, pp. 389-399, Mar. 2003.

[2] C. Studer, S. Fateh, and D. Seethaler, "ASIC implementation of soft-input soft-output MIMO detection using MMSE parallel interference cancellation," IEEE J. Solid-State Circuits, vol. 46, no. 7, pp. 1754-1765, Jul. 2011.

[3] E. M. Witte, F. Borlenghi, G. Ascheid, R. Leupers, and H. Meyr, "A scalable VLSI architecture for soft-input soft-output single tree-search sphere decoding," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 9, pp. 706-710, Sep. 2010.

[4] F. Borlenghi, E. M. Witte, G. Ascheid, H. Meyr, and A. Burg, "A 772 Mbit/s 8.81 bit/nJ 90 nm CMOS soft-input soft-output sphere decoder," in Proc. ASSCC, Jeju, Korea, 2011, pp. 297-300.

[5] Y. Sun and J. R. Cavallaro, "Trellis-search based soft-input soft-output MIMO detector: Algorithm and VLSI architecture," IEEE Trans. Signal Process., vol. 60, no. 5, pp. 2617-2627, May 2012.

[6] L. G. Barbero and J. S. Thompson, "Fixing the complexity of the sphere decoder for MIMO detection," IEEE Trans. Wireless Commun., vol. 7, no. 6, pp. 2131-2142, Jun. 2008.

[7] L. Liu, J. Löfgren, and P. Nilsson, "Area-efficient configurable high-throughput signal detector supporting multiple MIMO modes," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 9, pp. 2085-2096, Sep. 2012.

[8] L. G. Barbero and J. S. Thompson, "Extending a fixed-complexity sphere decoder to obtain likelihood information for turbo-MIMO systems," IEEE Trans. Veh. Technol., vol. 57, no. 5, pp. 2804-2814, Sep. 2008.

[9] C. Studer and H. Bölcskei, "Soft-input soft-output single tree-search sphere decoding," IEEE Trans. Inf. Theory, vol. 56, no. 10, pp. 4827-4842, Oct. 2010.

[10] D. Patel, V. Smolyakov, M. Shabany, and P. G. Gulak, "VLSI implementation of a WiMAX/LTE compliant low-complexity high-throughput soft-output K-best MIMO detector," in Proc. IEEE ISCAS, May 2010, pp. 593-596.
