
  • J Sign Process Syst (2011) 64:75–92, DOI 10.1007/s11265-010-0499-0

    Exploration of Soft-Output MIMO Detector Implementations on Massive Parallel Processors

    Robert Fasthuber, Min Li, David Novo, Praveen Raghavan, Liesbet Van Der Perre, Francky Catthoor

    Received: 13 November 2009 / Revised: 11 May 2010 / Accepted: 12 May 2010 / Published online: 8 June 2010. © Springer Science+Business Media, LLC 2010

    Abstract Emerging Software Defined Radio (SDR) baseband platforms are based on multiple processors with massive parallelism. Although the computational power of these platforms would theoretically enable SDR solutions with advanced wireless signal processing, existing work still implements rather basic algorithms. For instance, current Multiple-Input Multiple-Output (MIMO) detector implementations are typically based on simple linear hard-output detection rather than on advanced near-Maximum Likelihood (ML) soft-output detection. However, only the latter makes it possible to exploit the full potential of MIMO technology. In this work, we explore the feasibility of advanced soft-output near-ML MIMO detectors on massive parallel processors. Although such detectors are considered very challenging due to their high computational complexity, we combine architecture-friendly algorithm design, application specific instructions and instruction-level/data-level parallelism explorations to make SDR solutions feasible. We show that, by applying the proposed combination of techniques, it is possible to obtain SDR implementations which can deliver data rates that are sufficient for future wireless systems. For example, a 2×4 Coarse Grain Array (CGA) processor with 16-way Single Instruction Multiple Data (SIMD) can deliver 192/368 Mbps throughput for 2×2 64/16-QAM transmissions. Finally, we estimate the area and power consumption of the programmable solution and compare it against a traditional Application Specific Integrated Circuit (ASIC) approach. This enables us to draw conclusions from the cost perspective.

    R. Fasthuber (corresponding author), M. Li, D. Novo, P. Raghavan, L. Van Der Perre, F. Catthoor
    IMEC, Kapeldreef 75, 3001 Leuven, Belgium

    Keywords MIMO, SDR, SSFE, LLR, CGA, ASIC

    1 Introduction

    With the exploding design and processing cost in the deep sub-micron era, programmable or reconfigurable baseband solutions are becoming popular. The Software Defined Radio (SDR) paradigm, which was mainly successful in the base-station and military segments, is emerging in the handset market. Parallel instruction set architectures, especially those which combine Instruction Level Parallel (ILP) and Data Level Parallel (DLP) features [4, 21, 23, 26, 29], are becoming prevalent. Most of these published architectures offer massive parallelism, i.e. they include multiple independent computational processing units and offer a data parallelism of 100. For instance, the NXP EVP processor includes ten Functional Units (FUs) and six of them support 16-way Single Instruction Multiple Data (SIMD) [29]. The SODA processor includes four Processing Elements (PEs), each supporting 32-way SIMD instructions [21]. Theoretically, these massive parallel processors would enable SDR implementations of advanced wireless signal processing algorithms. However, only simple SDR systems and algorithms have been demonstrated and reported in literature.

    Multiple-Input Multiple-Output (MIMO) technology offers increased spectral efficiency compared to single antenna systems. For this reason, it has become the basis of all upcoming wireless communication standards, such as IEEE 802.11n, WiMAX, 3GPP LTE and 3GPP2 UMB. Supporting advanced MIMO technology is therefore a necessity for future SDR systems. However, the implementations in [21, 23, 29] do not support MIMO technology. The references [4, 31] demonstrate MIMO processing, but based on simple linear detection, which cannot fully exploit the potential of MIMO technology [13]. The implementation of MIMO processing on a Sandblaster processor in [16] does not include the computationally dominant soft-output computation. Wu et al. [33] demonstrate advanced MIMO processing on a floating-point Graphics Processing Unit (GPU). However, the energy efficiency of such a solution is typically insufficient for wireless devices.

    In a MIMO Space Division Multiplexing (SDM) receiver, the MIMO detector recovers the multiple transmitted data streams. For the implementation of the detector, a wide range of different detection algorithms is available [2]. Linear detection has a low complexity, but suffers from poor Bit-Error-Rate (BER) performance. In contrast, soft-output Maximum Likelihood (ML) detection offers maximal performance but at the cost of very high complexity. Near-ML detection typically provides the best trade-off. Recently, a near-ML Selective Spanning with Fast Enumeration (SSFE) detector has been proposed and implemented for SDR systems [18, 20]. The proposed implementation is based on hard-output detection. However, with hard-output detection, a large part of the remarkable potential of MIMO technology is still not exploited. The key reason is that modern Forward Error Correction (FEC) decoders, such as Turbo and Low Density Parity Check (LDPC) decoders, require soft information as input to deliver the best possible BER performance. In fact, soft-output near-ML MIMO detectors bring 2–4 dB Signal-to-Noise-Ratio (SNR) gain compared to their hard-output counterparts and 6–12 dB SNR gain compared to linear detectors. Efficient implementations of soft-output near-ML MIMO detectors, which have the capability of approaching the limit of Shannon bounds [13], are therefore in high demand.

    Our work explores the feasibility of advanced soft-output MIMO detector implementations on processors with massive parallelism. We specifically consider the TI TMS320C6416 Very Long Instruction Word (VLIW) processor [28] and the ADRES Coarse Grain Array (CGA) processor [22] in our explorations.

    First, we design an architecture-friendly algorithm with low complexity. The resulting algorithm, which is mostly based on area and energy-efficient operators, makes it possible to fully exploit the abundant parallelism of SDR platforms. Second, we combine Application Specific Instruction (ASI) design and code transformations to significantly reduce the number of required computations and required memory accesses. Then, we perform the dimensioning of ILP/DLP for a given throughput requirement. We show that, by applying the proposed combination of techniques, it is feasible to obtain SDR implementations which can deliver data rates that are sufficient for future wireless systems. For instance, a 2×4 CGA processor with 16-way SIMD can deliver 192/368 Mbps throughput for 2×2 64/16-Quadrature Amplitude Modulation (QAM) transmissions. To advance the feasibility study further, we estimate the area and power consumption of the programmable solution and compare it against a traditional Application Specific Integrated Circuit (ASIC) design. For drawing conclusions, we take existing work on Application Specific Instruction Set Processors (ASIPs) into account.

    This paper builds on the previous work presented in [19]. The main extensions of [19] are: 1) design of different ASIs, 2) mapping and ILP/DLP explorations, 3) comparison with an ASIC approach. The latter builds on ASIC design results previously published in [9].

    The remaining part of this paper is structured as follows: Section 2 explains the MIMO system model and reviews the algorithmic background of soft-output MIMO detection. In Section 3 the architecture-friendly algorithm design of the Log-Likelihood-Ratio (LLR) generator is explained. Section 4 provides an overview of subsequent implementation and exploration experiments. In Section 5 the mapping results for the TI TMS320C6416 processor are given. In Section 6 application specific instructions are proposed, code transformations and ILP/DLP explorations are shown and implementation results for an ADRES based solution are provided. Section 7 presents the design of an ASIC reference. In Section 8 the examined implementations and existing work are compared. Finally, Section 9 concludes the work.


    2 Background

    This section reviews the MIMO system model and explains the algorithmic background of MIMO signal detection. This background is essential for Section 3 in particular.

    2.1 MIMO System Model

    The MIMO system model, which was utilized for this paper, is illustrated in Fig. 1. For the sake of completeness, the Forward Error Correction (FEC) blocks are also shown. The number of transmit and receive antennas are denoted as $N_t$ and $N_r$ respectively. For a C-QAM modulation, a symbol represents one out of $C = 2^q$ constellation points. Note that for 16-QAM a symbol consists of 4 bits and for 64-QAM of 6 bits. At each time instant, the transmitter maps one $qN_t \times 1$ binary vector x to an $N_t \times 1$ symbol vector s. The transmission of a vector s over a flat-fading MIMO channel can be modeled as $y = Hs + n$. Here y denotes an $N_r \times 1$ symbol vector, H is the $N_r \times N_t$ channel matrix and n is a noise vector whose entries are independent complex Gaussian random variables with mean zero and variance $N_0/2$.
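The system model can be sketched in a few lines of standard-library Python. This is only an illustration and not the simulation code used for the paper: the helper names (`qam_levels`, `random_symbol_vector`, `noise_vector`, `channel`), the unnormalized odd-integer QAM grid and the reading of the variance $N_0/2$ as "per real dimension" are assumptions made for this example.

```python
import random

def qam_levels(q):
    """Per-axis amplitudes of an unnormalized square 2^q-QAM grid (q even)."""
    side = 2 ** (q // 2)
    return [2 * k - (side - 1) for k in range(side)]  # q = 4 gives [-3, -1, 1, 3]

def random_symbol_vector(n_t, q, rng):
    """Draw one N_t x 1 symbol vector s with independent QAM symbols."""
    levels = qam_levels(q)
    return [complex(rng.choice(levels), rng.choice(levels)) for _ in range(n_t)]

def noise_vector(n_r, n0, rng):
    """Complex Gaussian noise; variance N0/2 per real dimension (assumption)."""
    sigma = (n0 / 2) ** 0.5
    return [complex(rng.gauss(0, sigma), rng.gauss(0, sigma)) for _ in range(n_r)]

def channel(H, s, n):
    """y = H s + n, with H given as N_r rows of N_t complex gains."""
    return [sum(h * x for h, x in zip(row, s)) + w for row, w in zip(H, n)]

rng = random.Random(0)
# Hypothetical 2x2 flat-fading channel with Gaussian entries
H = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2)] for _ in range(2)]
s = random_symbol_vector(2, 6, rng)            # 64-QAM, q = 6
y = channel(H, s, noise_vector(2, 0.1, rng))   # received vector, N_r = 2 entries
```

With an identity channel and zero noise, `channel` simply returns the transmitted vector, which is a convenient sanity check before adding fading and noise.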

    2.2 MIMO Signal Detection

    The task of a MIMO detector is to recover the symbol vector s that was sent by the transmitter. Soft-output MIMO detectors do not only provide the most likely symbol vector s (like hard-output detectors do), but also the Log-Likelihood-Ratio (LLR) for each bit in s, i.e. the logarithm of the ratio between the probabilities that the bit is logical 0 or 1. Modern FEC decoders, such as Turbo and LDPC decoders, which are an essential part of emerging standards, require soft input to achieve the best BER performance. Most soft-output MIMO detectors can be decomposed into two main parts: list generator and LLR generator.

    2.2.1 List Generator

    The list generator computes a list $\mathcal{L}$ of the most likely symbol vectors s. Popular schemes for this calculation include linear detection, Successive Interference Cancellation (SIC) and Maximum-Likelihood (ML)/Near-ML detection. Linear detection has a low implementation complexity, but suffers from poor Bit-Error-Rate (BER) performance. In contrast, Maximum Likelihood (ML) detection offers maximal performance but at the cost of high complexity. Recently, near-ML detection algorithms, which offer almost ML performance at a significantly lower implementation cost, have become popular. Extensive surveys about MIMO detection schemes can be found in [2] and [25].

    Figure 1 MIMO system model including FEC blocks (signal chain: source, FEC encoder, modulators, $N_t$ transmit and $N_r$ receive antennas, demodulators, MIMO detector with estimated H, FEC decoder, sink).

    In this paper, we exploit the near-ML Selective Spanning with Fast Enumeration (SSFE) algorithm for list generation [20]. The SSFE algorithm, which is the result of our previous work, was explicitly optimized for parallel architectures. Contrary to other near-ML algorithms, such as the traditionally utilized K-Best algorithm [6, 7, 12, 27], the SSFE algorithm results in a completely regular and deterministic dataflow structure. This is important for enabling an efficient mapping on parallel architectures. In addition, the SSFE does not require expensive memory operations. Moreover, the SSFE algorithm is based on very simple and architecture-friendly operations such as additions, subtractions and shifts, which clearly reduces the implementation complexity. Besides, the SSFE algorithm is well-suited for scalable implementations, because it offers a parameter which determines the complexity-performance trade-off of an algorithm instance.

    For ML detection, the MIMO detector is designed to solve

        \hat{s} = \arg\min_{s \in \Omega^{N_t}} \|y - Hs\|^2    (1)

    where $\Omega^{N_t}$ is the set containing all possible $N_t \times 1$ vector signals s. Solving (1) corresponds to an exhaustive search. For near-ML detection, not all, but only a limited number of vector signals s are considered in the search.
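The exhaustive search behind (1) can be illustrated with a brute-force sketch. The function names `sq_dist` and `ml_detect`, the QPSK alphabet and the example channel are assumptions chosen for readability; the paper itself does not prescribe this code.

```python
from itertools import product

def sq_dist(y, H, s):
    """Squared Euclidean distance ||y - H s||^2."""
    return sum(abs(y_r - sum(h * x for h, x in zip(row, s))) ** 2
               for y_r, row in zip(y, H))

def ml_detect(y, H, constellation):
    """Exhaustive search over all C^N_t symbol vectors, as in (1)."""
    n_t = len(H[0])
    return min(product(constellation, repeat=n_t),
               key=lambda s: sq_dist(y, H, s))

QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
H = [[1 + 0j, 0.5j], [0.2 + 0j, 1 + 0j]]      # hypothetical 2x2 channel
s_tx = (1 - 1j, -1 + 1j)
y = [sum(h * x for h, x in zip(row, s_tx)) for row in H]   # noise-free reception
s_hat = ml_detect(y, H, QPSK)                 # recovers s_tx in this noise-free case
```

The search visits $C^{N_t}$ candidates, so a 4×4 64-QAM system would already require 64^4 = 16,777,216 distance evaluations per received vector, which is why near-ML schemes restrict the candidate set.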

    An SSFE algorithm instance is uniquely characterized by a scalar vector $m = [m_1, \ldots, m_{N_t}]$, $m_i \le C$. The entries in this vector specify the number of scalar symbols $s_i$ that are considered at antenna i. With the parameter m, the complexity-performance trade-off point of an algorithm instance is selected. The computation of s can be visualized with a spanning tree (Fig. 2). In this tree each node at level $i \in \{1, 2, \ldots, N_t\}$ is uniquely described by a partial symbol vector $s^i = [s_i, s_{i+1}, \ldots, s_{N_t}]$. Starting from level $i = N_t$, SSFE spans each node at level i + 1 to $m_i$ nodes at level i. An example of a tree for m = [1, 2, 2, 4] is shown in Fig. 2b.
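Because every node at level i + 1 is spanned to exactly $m_i$ nodes, the size of the SSFE tree follows directly from m. The small sketch below (the function name `nodes_per_level` is an assumption for this example) counts the nodes per level.

```python
def nodes_per_level(m):
    """Map level i (1-based) to its node count, for m = [m_1, ..., m_Nt]."""
    counts = {}
    nodes = 1                      # the root node above level N_t
    for i in range(len(m), 0, -1):
        nodes *= m[i - 1]          # each node at level i+1 spans to m_i nodes
        counts[i] = nodes
    return counts

tree = nodes_per_level([1, 2, 2, 4])   # {4: 4, 3: 8, 2: 16, 1: 16}
list_size = tree[1]                    # 16 candidate symbol vectors
```

The number of nodes at level 1 is the product of all $m_i$ and equals the number of candidate vectors handed to the LLR generator. Since no node is ever deleted, these counts are fixed at design time.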

    Figure 2 K-Best (a) and SSFE (b) search-tree topologies for 4×4 Quadrature Phase Shift Keying (QPSK) modulation (labels in panel b: $m_1 = 1$, $m_2 = 2$, $m_3 = 2$, $m_4 = 4$, one fixed path). K-Best first spans the K nodes at level i + 1 to KC nodes. After spanning, K-Best sorts the KC nodes, selects the K best nodes and deletes the rest. This approach results in a non-deterministic dataflow. In contrast, the spanned nodes in SSFE are never deleted. Therefore the dataflow in SSFE is completely regular and deterministic.

    Initiate the root node with $T_{N_t+1} = 0$. Starting from level $i = N_t$, the Partial Euclidean Distance (PED) of a symbol vector $s^i = [s_i, s_{i+1}, \ldots, s_{N_t}]$ is given by

        T_i(s^i) = T_{i+1}(s^{i+1}) + \|e_i(s^i)\|^2    (2)

    where $\|e_i(s^i)\|^2$ describes the PED increment. The SSFE algorithm has to select a set of $s^i = [s_i, s_{i+1}, \ldots, s_{N_t}]$ so that the PED increment $\|e_i(s^i)\|^2$ from (2) is minimized. By assuming a previous QR decomposition of H (H = QR, Q is an orthogonal matrix and R is an upper triangular matrix) and with $\hat{y} = Q^H y$, the PED increment $\|e_i(s^i)\|^2$ can be computed as

        \|e_i(s^i)\|^2 = \Big\| \hat{y}_i - \sum_{j=i}^{N_t} R_{ij} s_j \Big\|^2.    (3)

    Equation 3 can be rewritten to

        \|e_i(s^i)\|^2 = \Big\| \underbrace{\hat{y}_i - \sum_{j=i+1}^{N_t} R_{ij} s_j}_{b_{i+1}(s^{i+1})} - R_{ii} s_i \Big\|^2.    (4)

    Since the minimization of $\|e_i(s^i)\|^2$ is equivalent to the minimization of $\|e_i(s^i)/R_{ii}\|^2$, (4) can be transformed to

        \|e_i(s^i)/R_{ii}\|^2 = \Big\| \underbrace{b_{i+1}(s^{i+1})/R_{ii}}_{\xi_i} - s_i \Big\|^2 = \|\xi_i - s_i\|^2.    (5)

    The task of the SSFE is to select a set of the closest constellation points around $\xi_i$. This is essentially done by minimizing $\|e_i(s^i)/R_{ii}\|^2$ in (5). When $m_i = 1$, the closest constellation point to $\xi_i$ is $p_1 = Q(\xi_i)$, where $Q(\cdot)$ is the slicing operator. When $m_i > 1$, more constellation points can be enumerated based on the vector $d = \xi_i - Q(\xi_i)$. Fundamentally, the technique applied here is to incrementally grow the set around $\xi_i$ by applying heuristic-based approximations. The heuristic in SSFE is called Fast Enumeration (FE). Figure 3 shows an example. Compared to other schemes [5, 10], the FE is independent of the constellation size, so that handling 64-QAM is as efficient as handling QPSK. Moreover, the FE can be implemented with simple and architecture-friendly operators, such as additions, subtractions, bit-negations and shifts. More information about the SSFE algorithm and BER performance comparisons with other schemes can be found in [18, 20].

    Figure 3 A fast enumeration of eight constellation points shown on an example (original point and sliced point, with the enumerated points numbered 1 to 8).
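The first step of the enumeration, i.e. computing $\xi_i$ from (4) and (5), slicing it to $p_1 = Q(\xi_i)$ and forming the residual d, can be sketched as follows. The sketch assumes an unnormalized square QAM grid with odd-integer coordinates; the names `slice_axis`, `slice_qam` and `first_candidate` are hypothetical, and the further FE steps for $m_i > 1$ are described in [18, 20] and are not reproduced here.

```python
import math

def slice_axis(x, side):
    """Nearest level of the odd-integer grid {-(side-1), ..., -1, 1, ..., side-1}."""
    k = math.floor((x + side - 1) / 2 + 0.5)
    k = max(0, min(side - 1, k))           # clip to the constellation boundary
    return 2 * k - (side - 1)

def slice_qam(xi, q):
    """Slicing operator Q(xi) for an unnormalized square 2^q-QAM constellation."""
    side = 2 ** (q // 2)
    return complex(slice_axis(xi.real, side), slice_axis(xi.imag, side))

def first_candidate(y_hat_i, R_row, s_upper, i, q):
    """Return (xi_i, p1, d) at 0-based level i, given the decided s_{i+1..Nt}."""
    # b_{i+1}(s^{i+1}) = y_hat_i - sum_{j > i} R_ij s_j, see (4)
    b = y_hat_i - sum(R_row[j] * s_upper[j - i - 1]
                      for j in range(i + 1, len(R_row)))
    xi = b / R_row[i]                      # see (5)
    p1 = slice_qam(xi, q)
    return xi, p1, xi - p1

# Hypothetical numbers for a 2x2 16-QAM system, level i = 0
xi, p1, d = first_candidate(8.2 - 1.9j, [2 + 0j, 1 - 1j], [1 + 1j], 0, 4)
```

In this example $\xi_i = 3.1 - 0.95j$ is sliced to the constellation point 3 - 1j. The residual d indicates on which side of the sliced point the unconstrained estimate lies, which is the information a fast enumeration needs to pick the next-closest points.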

    2.2.2 LLR Generator

    The list generator provides a list of most likely candidate symbol vectors s, denoted by $\mathcal{L}$. The task of the LLR generator is to compute the LLR(j, b) for each b-th bit of the j-th scalar symbol in s. This is done for all candidate symbol vectors s in $\mathcal{L}$.

    For the calculation of LLR(j, b) the max-log approximation can be used [13]. It is formulated as

        LLR(j, b) = \frac{1}{2\sigma^2} \Big( \min_{s \in \mathcal{X}^0_{j,b}} \|y - Hs\|^2 - \min_{s \in \mathcal{X}^1_{j,b}} \|y - Hs\|^2 \Big).    (6)

    $\mathcal{X}^0_{j,b}$ and $\mathcal{X}^1_{j,b}$ are the disjoint sets of symbol vectors that have the b-th bit of their j-th scalar symbol set to 0 and 1 respectively. $\sigma^2$ is the variance of the noise.
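A brute-force evaluation of the max-log formula (6) may help to make the two minimizations concrete. The bit labelling `QPSK_BITS`, the function names and the example channel are assumptions for this sketch; it enumerates all symbol vectors and is therefore only practical for very small systems.

```python
from itertools import product

# Hypothetical QPSK labelling: symbol -> (bit 0, bit 1).
# Bit 0 selects the Q-axis position, bit 1 the I-axis position.
QPSK_BITS = {1 + 1j: (0, 0), 1 - 1j: (1, 0), -1 + 1j: (0, 1), -1 - 1j: (1, 1)}

def sq_dist(y, H, s):
    """Squared Euclidean distance ||y - H s||^2."""
    return sum(abs(y_r - sum(h * x for h, x in zip(row, s))) ** 2
               for y_r, row in zip(y, H))

def max_log_llr(y, H, bits, sigma2, j, b):
    """Evaluate (6) by brute force for bit b of symbol j (both 0-based)."""
    best = {0: float("inf"), 1: float("inf")}
    for s in product(bits, repeat=len(H[0])):
        value = bits[s[j]][b]                  # bit value of this candidate
        best[value] = min(best[value], sq_dist(y, H, s))
    return (best[0] - best[1]) / (2 * sigma2)

H = [[1 + 0j, 0.5j], [0.2 + 0j, 1 + 0j]]       # hypothetical 2x2 channel
s_tx = (1 - 1j, -1 + 1j)                       # transmitted bits: (1, 0) and (0, 1)
y = [sum(h * x for h, x in zip(row, s_tx)) for row in H]
llrs = [max_log_llr(y, H, QPSK_BITS, 0.5, j, b) for j in range(2) for b in range(2)]
```

With the sign convention of (6), a positive value means that the closest candidate has the bit set to 1. For the noise-free example the signs of `llrs` therefore follow the transmitted bits.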


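    Considering that the LLR generator needs to compute the LLR only for entries in $\mathcal{L}$, (6) can be transformed to

        LLR(j, b) = \frac{1}{2\sigma^2} \Big( \min_{s \in \mathcal{L} \cap \mathcal{X}^0_{j,b}} \|y - Hs\|^2 - \min_{s \in \mathcal{L} \cap \mathcal{X}^1_{j,b}} \|y - Hs\|^2 \Big).    (7)

    In (7), $\mathcal{X}^0_{j,b}$ and $\mathcal{X}^1_{j,b}$ have been replaced by the joint sets $\mathcal{L} \cap \mathcal{X}^0_{j,b}$ and $\mathcal{L} \cap \mathcal{X}^1_{j,b}$ respectively.

    To simplify the computation, the bit-flipping strategy can be applied [30]. When flipping bits in symbol vectors in $\mathcal{L}$ to 0 or 1 respectively, two new sets $\mathcal{L}^0_{j,b}$ and $\mathcal{L}^1_{j,b}$ are obtained. Considering these new sets, (7) can be modified to

        LLR(j, b) = \frac{1}{2\sigma^2} \Big( \min_{s \in \mathcal{L}^0_{j,b}} \|y - Hs\|^2 - \min_{s \in \mathcal{L}^1_{j,b}} \|y - Hs\|^2 \Big).    (8)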

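    By applying the QR decomposition to the channel matrix H, it can be shown that

        \|y - Hs\|^2 = c + \|\hat{y} - Rs\|^2    (9)

    where $\hat{y} = Q^H y$ and c is a constant. The equation above enables us to transform (8) to

        LLR(j, b) = \frac{1}{2\sigma^2} \Big( \min_{s \in \mathcal{L}^0_{j,b}} \|\hat{y} - Rs\|^2 - \min_{s \in \mathcal{L}^1_{j,b}} \|\hat{y} - Rs\|^2 \Big).    (10)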
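Identity (9) can be checked numerically. The sketch below uses a small classical Gram-Schmidt routine (`thin_qr`, a helper written for this example and not part of the paper) on a hypothetical 3×2 channel, and verifies that the two distances differ only by the constant $c = \|y\|^2 - \|\hat{y}\|^2$, which does not depend on s.

```python
def thin_qr(H):
    """Thin QR by classical Gram-Schmidt; returns (columns of Q, R)."""
    n_r, n_t = len(H), len(H[0])
    q_cols, R = [], [[0j] * n_t for _ in range(n_t)]
    for c in range(n_t):
        v = [H[r][c] for r in range(n_r)]
        w = list(v)
        for k, q in enumerate(q_cols):
            R[k][c] = sum(a.conjugate() * b for a, b in zip(q, v))
            w = [wk - R[k][c] * qk for wk, qk in zip(w, q)]
        norm = sum(abs(x) ** 2 for x in w) ** 0.5   # full column rank assumed
        R[c][c] = complex(norm)
        q_cols.append([x / norm for x in w])
    return q_cols, R

def norm_sq(v):
    return sum(abs(x) ** 2 for x in v)

H = [[1 + 1j, 0.5 + 0j], [0.2j, 1 - 0.5j], [0.3 + 0j, -0.7j]]   # N_r = 3, N_t = 2
y = [0.4 - 1.2j, 2.0 + 0.3j, -0.8 + 0.1j]
s = [1 - 1j, -1 + 1j]

q_cols, R = thin_qr(H)
y_hat = [sum(a.conjugate() * b for a, b in zip(q, y)) for q in q_cols]  # Q^H y
c = norm_sq(y) - norm_sq(y_hat)                                          # constant in (9)
lhs = norm_sq([y_r - sum(h * x for h, x in zip(row, s)) for y_r, row in zip(y, H)])
rhs = c + norm_sq([yh - sum(r * x for r, x in zip(R_row, s))
                   for yh, R_row in zip(y_hat, R)])
```

Because c is the same for every candidate s, it cancels in the difference of the two minima in (10). For a square channel matrix ($N_r = N_t$) the constant is zero.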

    The squared Euclidean-Norm $\|\cdot\|^2$ of a complex number is calculated as $\Re(\cdot)^2 + \Im(\cdot)^2$. To avoid multiplications, the squared Euclidean-Norm can be approximated by the Manhattan-Norm

        LLR(j, b) = \frac{1}{2\sigma^2} \Big( \min_{s \in \mathcal{L}^0_{j,b}} \|\hat{y} - Rs\|_1 - \min_{s \in \mathcal{L}^1_{j,b}} \|\hat{y} - Rs\|_1 \Big).    (11)

    The Manhattan-Norm $\|\cdot\|_1$ of a complex number is calculated as $|\Re(\cdot)| + |\Im(\cdot)|$. Note that this approximation causes a BER performance degradation. However, this degradation is typically below 1 dB [5, 17].
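The difference between the two norms is easy to see on individual complex numbers. The following sketch (function names are assumptions for this example) also shows a pair of error terms whose ranking changes under the Manhattan-Norm, which illustrates where the BER degradation comes from.

```python
def sq_euclidean(z):
    """Squared Euclidean norm: two multiplications and one addition."""
    return z.real * z.real + z.imag * z.imag

def manhattan(z):
    """Manhattan norm: two absolute values and one addition, no multiplication."""
    return abs(z.real) + abs(z.imag)

e_a, e_b = 3 + 0j, 2 + 2j    # two hypothetical error terms
closer_euclidean = min((e_a, e_b), key=sq_euclidean)   # 2+2j  (8 < 9)
closer_manhattan = min((e_a, e_b), key=manhattan)      # 3+0j  (3 < 4)
```

For most candidate pairs both norms agree on which one is closer; disagreements such as the one above occur for diagonal error terms and lead to the small degradation mentioned in the text.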

    So far, the complexity of the LLR generation was significantly reduced by 1) applying the bit-flipping strategy, 2) applying the QR decomposition and 3) replacing the Euclidean-Norm with the Manhattan-Norm. In Section 3 we will show that further comprehensive optimizations are still possible. Importantly, the transformations will maintain I/O consistency.

    3 LLR Generator Optimization

    In this section we will propose further techniques to decrease the implementation complexity of the LLR computation. In a first step, we apply the selective and incremental update approach to reduce the number of required update operations. In a second step, we reduce the number of required computations and memory accesses per update operation by performing algebraic simplifications and strength reductions on the low-level dataflow. Importantly, we will also replace all multiplications by shift and add operations, which enables a more efficient implementation. After introducing the proposed optimization techniques, we will estimate the achievable gain.

    3.1 Optimization Technique 1: Selective and Incremental Update Approach

    3.1.1 Overview

    A list generator that works with the Euclidean-norm provides a set $\mathcal{L}$ of s with $\|y - Hs\|^2$ minimized. By applying the QR decomposition, the minimization of $\|y - Hs\|^2$ is transformed to the minimization of $\|\hat{y} - Rs\|^2$. As mentioned in Section 2.2.1, this minimization can be explained on a spanning tree. In this tree each node at level $i \in \{1, 2, \ldots, N_t\}$ is uniquely described by a partial symbol vector $s^i = [s_i, s_{i+1}, \ldots, s_{N_t}]$. The PED of a partial symbol vector $s^i$ is given by (2). The PED increment $\|e_i(s^i)\|^2$ can be computed with (3).

    As indicated in Section 2.2.2, the bit-flipping strategy can be applied for reducing the complexity of the LLR generation. When flipping the b-th bit of the j-th scalar symbol in $\mathcal{L}$ to get $\mathcal{L}^0_{j,b}$ and $\mathcal{L}^1_{j,b}$, an original partial symbol vector $s^i = [s_i, \ldots, s_j, \ldots, s_{N_t}]$ is flipped to $s^{i(0)}_{j,b} = [s_i, \ldots, s^0_{j,b}, \ldots, s_{N_t}]$ and $s^{i(1)}_{j,b} = [s_i, \ldots, s^1_{j,b}, \ldots, s_{N_t}]$ respectively. Note that $s^0_{j,b}$ means that the bit has been flipped to 0 and $s^1_{j,b}$ means that the bit has been flipped to 1.

    Considering the explanations above and the optimizations proposed in Section 2.2.2, the task of the LLR generator can be summarized and formulated as follows:

    1. Calculation of the Partial Manhattan Distance (PMD) increments, $\|e_i(s^{i(0)}_{j,b})\|_1$ and $\|e_i(s^{i(1)}_{j,b})\|_1$, for flipped partial symbol vectors in $\mathcal{L}^0_{j,b}$ and $\mathcal{L}^1_{j,b}$ with

        \|e_i(s^{i(0)}_{j,b})\|_1 = \Big\| \hat{y}_i - \sum_{k=i}^{j-1} R_{ik} s_k - R_{ij} s^0_{j,b} - \sum_{k=j+1}^{N_t} R_{ik} s_k \Big\|_1
        \|e_i(s^{i(1)}_{j,b})\|_1 = \Big\| \hat{y}_i - \sum_{k=i}^{j-1} R_{ik} s_k - R_{ij} s^1_{j,b} - \sum_{k=j+1}^{N_t} R_{ik} s_k \Big\|_1.    (12)

    2. Update of PMD 0 and PMD 1, $T^0_i(s^{i(0)}_{j,b})$ and $T^1_i(s^{i(1)}_{j,b})$, for flipped partial symbol vectors with

        T^0_i(s^{i(0)}_{j,b}) = T^0_{i+1}(s^{i+1(0)}_{j,b}) + \|e_i(s^{i(0)}_{j,b})\|_1
        T^1_i(s^{i(1)}_{j,b}) = T^1_{i+1}(s^{i+1(1)}_{j,b}) + \|e_i(s^{i(1)}_{j,b})\|_1.    (13)

    3.1.2 Optimization 1

    Since we rely on the bit-flipping strategy, the following can be observed: when flipping the b-th bit of the j-th scalar symbol in the partial symbol vectors, only $\{s^i\}$ with $i \in \{1, \ldots, j\}$ are influenced, but $\{s^i\}$ with $i \in \{j + 1, \ldots, N_t\}$ remain unchanged. Hence, we only need to calculate (12) and (13) for $i \in \{1, \ldots, j\}$. Such a selective updating reduces the number of computations significantly.
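A rough count illustrates the saving of the selective update. The sketch below only counts how many tree levels are re-evaluated per candidate vector and ignores the cost of each evaluation; this simplified cost model and the function name `level_updates` are assumptions for illustration, not figures from the paper.

```python
def level_updates(n_t, q):
    """Level evaluations of (12)/(13) per candidate vector.

    Returns (full, selective): 'full' re-evaluates all N_t levels for each of
    the N_t * q flipped bits, 'selective' only levels 1..j for a bit of symbol j.
    """
    full = n_t * q * n_t
    selective = sum(j * q for j in range(1, n_t + 1))
    return full, selective

full_2x2, selective_2x2 = level_updates(2, 6)   # 2x2 64-QAM: 24 versus 18
full_4x4, selective_4x4 = level_updates(4, 6)   # 4x4 64-QAM: 96 versus 60
```

Under this model the ratio of selective to full updates is $(N_t + 1)/(2N_t)$, so the relative saving grows with the number of antennas.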

    3.1.3 Optimization 2

    We can rewrite (12) as

        \|e_i(s^{i(0)}_{j,b})\|_1 = \Big\| \hat{y}_i - \sum_{k=i}^{N_t} R_{ik} s_k + R_{ij}(s_j - s^0_{j,b}) \Big\|_1 = \Big\| e_i(s^i) - \underbrace{R_{ij}(s^0_{j,b} - s_j)}_{\Delta e^{i(0)}_{j,b}} \Big\|_1
        \|e_i(s^{i(1)}_{j,b})\|_1 = \Big\| e_i(s^i) - \underbrace{R_{ij}(s^1_{j,b} - s_j)}_{\Delta e^{i(1)}_{j,b}} \Big\|_1.    (14)

    This has two advantages: First, only one scalar symbol in the partial symbol vectors is modified. Second, we can reuse $e_i(s^i)$, since it has already been computed by the list generator. If the intermediate results of $e_i(s^i)$ are temporarily stored and accessible by the LLR generator, we only need to calculate $\Delta e^{i(0)}_{j,b}$ and $\Delta e^{i(1)}_{j,b}$ and reuse $e_i(s^i)$ from the storage. With this incremental update approach, the complexity for calculating $\|e_i(s^{i(0)}_{j,b})\|_1$ and $\|e_i(s^{i(1)}_{j,b})\|_1$ is considerably reduced.
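The incremental update in (14) can be verified numerically by comparing it against a direct re-evaluation of the error term with the flipped symbol. The function names and the numeric values below are hypothetical and only serve this check.

```python
def error_term(y_hat_i, R_row, s, i):
    """e_i(s^i) = y_hat_i - sum_{k >= i} R_ik s_k (0-based indices)."""
    return y_hat_i - sum(R_row[k] * s[k] for k in range(i, len(s)))

def incremental_error_term(e_i, R_ij, s_j, s_j_flipped):
    """e_i of the flipped vector via (14): reuse e_i and subtract R_ij (s' - s_j)."""
    return e_i - R_ij * (s_j_flipped - s_j)

y_hat_i = 1.5 + 0.25j
R_row = [2 + 0j, 0.5 - 1j]          # row i = 0 of R
s = [1 + 3j, -3 - 1j]               # original candidate, 16-QAM grid
j, s_j_flipped = 1, -3 + 1j         # one bit of symbol j flipped (Q-axis shift)

e_stored = error_term(y_hat_i, R_row, s, 0)        # available from the list generator
direct = error_term(y_hat_i, R_row, [s[0], s_j_flipped], 0)
incremental = incremental_error_term(e_stored, R_row[j], s[j], s_j_flipped)
```

The direct evaluation needs one complex multiplication per remaining level entry, whereas the incremental form needs only the single product $R_{ij}(s^0_{j,b} - s_j)$, which Section 3.2 further reduces to shifts and additions.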

    3.2 Optimization Technique 2: Algebraic Simplification and Strength Reduction

    Practical communication systems adopt Gray-coded modulation schemes. Two examples of Gray-coded 16-QAM constellations are shown in Fig. 4. Figure 4a illustrates the scheme in 3GPP LTE and IEEE 802.16e-2005 (WiMAX) and Fig. 4b illustrates a common Gray-coded scheme that is used in other systems. As can be seen, the $N_b/2$ most significant bits (two for 16-QAM) out of the $N_b$ bits determine the position of the modulated signal on the I-axis and the $N_b/2$ least significant bits determine the position on the Q-axis. The characteristic that the position in the I/Q constellation diagram is determined by specific bits in the data word is very common for Gray-coded schemes. Because of this attribute, $s^0_{j,b} - s_j$ and $s^1_{j,b} - s_j$ have a real or imaginary part that is zero. We can exploit this observation for applying algebraic simplifications.

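    Let $\beta_b(s_j)$ denote the b-th bit in the scalar symbol $s_j$ and let $\delta(b)$ denote the shift distance of constellations. The latter is relevant when flipping the b-th bit of $s_j$ from 0 to 1, i.e. $s^1_{j,b} - s^0_{j,b}$ equals $\delta(b)$ for a shift on the I-axis and $\sqrt{-1}\,\delta(b)$ for a shift on the Q-axis. Note that $\delta(b)$ is a real number.

    As mentioned above, when $0 \le b < N_b/2$, the constellation-shift is on the Q-axis:

        \Re(\Delta e^{i(0)}_{j,b}) = \Im(R_{ij})\,\beta_b(s_j)\,\delta(b)
        \Im(\Delta e^{i(0)}_{j,b}) = -\Re(R_{ij})\,\beta_b(s_j)\,\delta(b)
        \Re(\Delta e^{i(1)}_{j,b}) = \underbrace{\Im(R_{ij})\,\beta_b(s_j)\,\delta(b)}_{\Re(\Delta e^{i(0)}_{j,b})} - \Im(R_{ij})\,\delta(b)
        \Im(\Delta e^{i(1)}_{j,b}) = \Re(R_{ij})\,\delta(b) \underbrace{-\,\Re(R_{ij})\,\beta_b(s_j)\,\delta(b)}_{\Im(\Delta e^{i(0)}_{j,b})}

    In contrast, when $N_b/2 \le b < N_b$, the constellation-shift is on the I-axis:

        \Re(\Delta e^{i(0)}_{j,b}) = -\Re(R_{ij})\,\beta_b(s_j)\,\delta(b)
        \Im(\Delta e^{i(0)}_{j,b}) = -\Im(R_{ij})\,\beta_b(s_j)\,\delta(b)
        \Re(\Delta e^{i(1)}_{j,b}) = \Re(R_{ij})\,\delta(b) \underbrace{-\,\Re(R_{ij})\,\beta_b(s_j)\,\delta(b)}_{\Re(\Delta e^{i(0)}_{j,b})}
        \Im(\Delta e^{i(1)}_{j,b}) = \Im(R_{ij})\,\delta(b) \underbrace{-\,\Im(R_{ij})\,\beta_b(s_j)\,\delta(b)}_{\Im(\Delta e^{i(0)}_{j,b})}    (15)

    Figure 4 Examples of Gray-coded 16-QAM constellations. a The scheme in 3GPP LTE and IEEE 802.16e-2005; b a common scheme in other systems.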

    On parallel programmable architectures that support predication, the use of conditional executions does not reduce the number of operations. Since this feature is often present in massive parallel architectures, we did not exploit conditional executions based on $\beta_b(s_j) \in \{0, 1\}$ in the formulations above. However, if run-time conditional executions do not hamper the efficiency on the targeted architecture, we can refine the above formulations further. Since $\beta_b(s_j) \in \{0, 1\}$, either $\Delta e^{i(0)}_{j,b}$ or $\Delta e^{i(1)}_{j,b}$ must be 0:

    With $0 \le b < N_b/2$ and $\beta_b(s_j) = 0$:

        \Delta e^{i(0)}_{j,b} = 0
        \Re(\Delta e^{i(1)}_{j,b}) = -\Im(R_{ij})\,\delta(b)
        \Im(\Delta e^{i(1)}_{j,b}) = \Re(R_{ij})\,\delta(b)    (16)

    With $0 \le b < N_b/2$ and $\beta_b(s_j) = 1$:

        \Re(\Delta e^{i(0)}_{j,b}) = \Im(R_{ij})\,\delta(b)
        \Im(\Delta e^{i(0)}_{j,b}) = -\Re(R_{ij})\,\delta(b)
        \Delta e^{i(1)}_{j,b} = 0    (17)

    With $N_b/2 \le b < N_b$ and $\beta_b(s_j) = 0$:

        \Delta e^{i(0)}_{j,b} = 0
        \Re(\Delta e^{i(1)}_{j,b}) = \Re(R_{ij})\,\delta(b)
        \Im(\Delta e^{i(1)}_{j,b}) = \Im(R_{ij})\,\delta(b)    (18)

    With $N_b/2 \le b < N_b$ and $\beta_b(s_j) = 1$:

        \Re(\Delta e^{i(0)}_{j,b}) = -\Re(R_{ij})\,\delta(b)
        \Im(\Delta e^{i(0)}_{j,b}) = -\Im(R_{ij})\,\delta(b)
        \Delta e^{i(1)}_{j,b} = 0    (19)

    The major computations in the above formulations are $\Re(R_{ij})\,\delta(b)$ and $\Im(R_{ij})\,\delta(b)$. These multiplications can be converted to simple bit-shifts and additions if the original input signal y is properly scaled. This is often the case, because QAM constellation points are usually scaled for normalized average power. For instance, in IEEE 802.16e-2005, QPSK, 16-QAM and 64-QAM are scaled by $1/\sqrt{2}$, $1/\sqrt{10}$ and $1/\sqrt{42}$, respectively. If we cancel the scaling at the receiver side and restore the original QAM constellations instead, which is possible because the I and Q values of constellations come from a specific set, $\Re(R_{ij})\,\delta(b)$ and $\Im(R_{ij})\,\delta(b)$ can be computed with bit-shifts and additions. For Gray-coded 16-QAM and 64-QAM schemes, $|\delta(b)| \in \{2, 4, 6, 8, 10, 12, 14\}$. Therefore the multiplication by $|\delta(b)|$ can be efficiently implemented with at most two bit-shifts and one addition or subtraction. Some examples ($\ll$ denotes the left bit-shift operation):

        x \cdot 6 = x \cdot 2 + x \cdot 4 = (x \ll 1) + (x \ll 2)
        x \cdot 12 = x \cdot 4 + x \cdot 8 = (x \ll 2) + (x \ll 3)
        x \cdot 10 = x \cdot 2 + x \cdot 8 = (x \ll 1) + (x \ll 3)
        x \cdot 14 = x \cdot 16 - x \cdot 2 = (x \ll 4) - (x \ll 1)
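The strength reduction for the factors $|\delta(b)| \in \{2, 4, \ldots, 14\}$ can be written as a small lookup of shift-and-add recipes. The sketch below works on Python integers, which stand in for fixed-point values; the function name and table layout are assumptions for this example.

```python
# Shift-and-add recipes: at most two shifts and one addition or subtraction.
SHIFT_ADD = {
    2:  lambda x: x << 1,
    4:  lambda x: x << 2,
    6:  lambda x: (x << 1) + (x << 2),
    8:  lambda x: x << 3,
    10: lambda x: (x << 1) + (x << 3),
    12: lambda x: (x << 2) + (x << 3),
    14: lambda x: (x << 4) - (x << 1),
}

def shift_add_multiply(x, factor):
    """Multiply the integer x by an even factor in {2, ..., 14} without '*'."""
    return SHIFT_ADD[factor](x)

products = [shift_add_multiply(5, f) for f in (2, 4, 6, 8, 10, 12, 14)]
# products == [10, 20, 30, 40, 50, 60, 70]
```

On a processor, the sign of $\delta(b)$ would be handled separately, for example by choosing between an addition and a subtraction in the subsequent update.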

    The proposed optimizations decrease the complexity of the LLR computation significantly. However, due to the lack of low-level (bit-level) instructions, the proposed optimizations are typically not efficiently implementable on state-of-the-art processors. To overcome this issue, we will propose specific instructions and demonstrate them on a CGA processor.

    3.3 Estimation of Achievable Gain

    To estimate the gain that is achievable by implementing the proposed optimizations, we compare the optimized LLR generator to a direct implementation of (11). For this comparison we estimate the reduction of real additions, bit-shifts and memory operations (load and store operations). Note that the optimized LLR generator no longer contains multiplications. We calculate the number of low-level operations based on fixed-point arithmetic. The overhead of address generation is considered as well. Since we rely on the SSFE algorithm for list generation, the addresses can be computed with bit-shift and add operations only.

    Figure 5 shows the reduction of operations for 2×2 and 4×4 transmissions with 16/64-QAM modulation schemes. We chose 2×2 and 4×4 transmissions because they are commonly used in commercial systems, such as WiMAX and 3GPP LTE. 16-QAM and 64-QAM are considered, because for lower order modulation schemes, such as QPSK and BPSK, simple exhaustive search can be applied and therefore the proposed optimizations are not relevant. As can be seen in Fig. 5, multiple options for m are investigated. As mentioned above, the parameter m determines the search range during list generation and therefore the complexity of the algorithm instance.

    Figure 5 Reductions of additions, bit-shifts and memory operations of the proposed LLR generator compared to a reference based on (11). Besides, all multiplications were removed. Horizontal axis: search range m of SSFE; vertical axis: reduction rate (scale 0.2 to 1.1); curves: ADDITION, BIT-SHIFT, MEM. LD/ST. Panels and investigated m:
    (a) 2×2 16-QAM: [1 2], [2 4], [4 4], [4 8], [4 16], [8 16]
    (b) 2×2 64-QAM: [2 4], [4 8], [4 16], [8 16], [8 32], [8 64]
    (c) 4×4 16-QAM: [1 1 1 2], [1 1 2 4], [1 1 4 4], [1 2 4 8], [1 2 4 16], [2 4 8 16]
    (d) 4×4 64-QAM: [1 1 2 4], [1 2 4 8], [1 2 4 16], [1 2 8 16], [1 2 8 32], [1 4 8 64]

    In addition to the complete removal of expensive multiplications, we can observe a significant reduction of additions, bit-shifts and memory operations. Specifically, 26% to 83% of additions, 76% to 94% of bit-shifts and 63% to 91% of memory operations were eliminated for the case study. When comparing Fig. 5a–d, we can notice that the gain increases with the modulation size, with more antennas and with larger m. The results show that the proposed optimizations lead to substantial improvements and are therefore very relevant.

    4 Implementation Overview

    4.1 Targeted Throughput and Algorithm Instance

    Our work targets a minimum throughput of 120 Mbps for a 2×2 near-ML soft-output 64-QAM transmission. A previously reported MIMO receiver, based on the ADRES processor, delivers similar throughput in linear hard-output MIMO detection mode [4]. We focus on 2×2 64-QAM and 16-QAM systems for two main reasons: 1) this transmission scheme is part of all major wireless communication standards and 2) the complexity is lower compared to 4×4, which makes the implementation on programmable architectures more feasible.

    To take advantage of soft-decoding, we target not only high throughput but also high communication performance (BER). Therefore we allow at most 0.1 dB SNR degradation with regard to the maximal obtainable performance (ML detection; m = [C, C] in SSFE). To fulfill this specification with the typically lowest required complexity, we select the SSFE algorithm instance to be m = [1, 16] for 16-QAM and m = [1, 64] for 64-QAM. For the BER performance evaluation, we use the 3GPP/3GPP2 Spatial Channel Models (SCM): Suburban macro, Urban macro and Urban micro. The starting point for the implementations is a manually written low-level C code for the SSFE list generator, the optimized LLR generator and the reference LLR generator which is based on (11).

    4.2 Outline

    In Section 3.3 we showed that the number of operations and memory accesses of the proposed LLR generator is significantly lower compared to the reference LLR generator. However, to obtain a higher gain from the proposed optimizations, the architecture has to support certain low-level instructions. We will first evaluate the effective gain by comparing both LLR generators on the TI TMS320C6416 processor in Section 5. In Section 6, we will propose specialized instructions for improving the implementation efficiency. Subsequently, the benefit of these instructions will be demonstrated on the basis of an extended CGA processor. Section 7 shows the implementation of the proposed MIMO detection algorithm as an ASIC. By having all of these implementations available, fundamental conclusions about the feasibility of soft-output MIMO detectors on massive parallel processors can be made.

    5 Implementation on a State-of-the-Art TI Processor

    5.1 Architecture

    We chose the TI TMS320C6416 VLIW DSP processor as a representative state-of-the-art reference architecture. It includes eight parallel FUs that are organized in two clusters. Each FU can execute a 32-bit instruction per cycle. The level-1 memory consists of a 16 KByte direct-mapped instruction cache (L1P) and a 16 KByte 2-way set-associative data cache (L1D). More information can be found in [28].

    5.2 Implementation and Results

    We implemented the SSFE list generator, the reference LLR generator (V0) as well as the optimized LLR generator (V1) on the TMS320C6416.

    Table 1 shows the mapping results, which are based on the decoding complexity of one 2 × 2 64-QAM MIMO symbol. From the number of instructions we can observe that the complexity of LLR generation is indeed dominant for soft-output MIMO detection. Therefore focusing on the optimization of the LLR generation is essential. When comparing both LLR generator implementations we can notice that the proposed algorithm significantly reduces the number of instructions, the number of L1D accesses and the number of L1D misses. Remarkably, the number of L1D misses has been reduced by a factor of more than 100 (1,260 versus 12), while the number of L1D accesses drops by almost a factor of five. This results from the fact that the proposed algorithm requires less intermediate storage and therefore fewer accesses than the reference algorithm.

    From Table 1 we can further observe that the cycle count of LLR V1 is higher than the cycle count of LLR V0. At first sight this seems confusing because the number of instructions is actually lower for LLR V1. However, the result can be explained as follows: Since the TI processor does not offer specialized low-level instructions, the innermost loop of the optimized algorithm requires many standard instructions and a huge number of live registers (for storing all intermediate values). For this reason, the compiler fails to apply software pipelining techniques efficiently and, as a consequence, the number of cycles is higher than for LLR V0. Nevertheless, if we assume that the TI compiler can map the algorithm very efficiently, i.e. with an Instructions per Cycle (IPC) value of 6 (as in the case of the list generator), 13,841 cycles are required for decoding one MIMO symbol. This optimistic assumption translates to a throughput of less than 2 Mbps even with the four-way SIMD supported by the TMS320C6416 processor (800 MHz clock frequency). Clearly, for meeting the targeted 120 Mbps throughput, a processor with specialized instructions and more parallelism is necessary.
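
The instruction and cycle counts in Table 1 can be turned into IPC values to make the argument above concrete. The short Python sketch below does this and also reproduces the optimistic cycle count under the IPC = 6 assumption. The dictionary layout and the function names are illustrative only and are not part of the original implementation.

```python
import math

# Instruction and cycle counts per 2x2 64-QAM MIMO symbol, copied from Table 1.
TABLE1 = {
    "list_gen": {"instructions": 6209, "cycles": 1057},
    "llr_v0": {"instructions": 234898, "cycles": 57189},
    "llr_v1": {"instructions": 76834, "cycles": 59479},
}


def ipc(entry):
    """Achieved instructions per cycle for one kernel."""
    return entry["instructions"] / entry["cycles"]


def optimistic_cycles(kernels, assumed_ipc=6):
    """Cycle count if all listed kernels were scheduled at the assumed IPC."""
    total = sum(TABLE1[k]["instructions"] for k in kernels)
    return math.ceil(total / assumed_ipc)


for name, entry in TABLE1.items():
    print(f"{name}: IPC = {ipc(entry):.2f}")

# List generator plus optimized LLR generator, both assumed to run at IPC = 6.
print("optimistic cycles:", optimistic_cycles(["list_gen", "llr_v1"]))
```

With these numbers the list generator reaches an IPC of about 5.9 and LLR V0 about 4.1, whereas LLR V1 stays at roughly 1.3, which is consistent with software pipelining failing for the optimized kernel.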

    Table 1 Mapping results on the TI TMS320C6416.

                    List G.    LLR G.       LLR G.
                    (SSFE)     V0 (ref.)    V1 (opt.)
    Instructions    6,209      234,898      76,834
    Cycles          1,057      57,189       59,479
    L1D accesses    791        52,651       11,236
    L1D misses      96         1,260        12


    6 Implementation on an Enhanced ADRES CGA Processor

    6.1 Architecture

    As demonstrated in Section 5, achieving the targeted throughput of 120 Mbps requires a processor which offers more parallelism and specialized instructions. In this work we investigate the ADRES CGA processor template [22]. An instance of the processor template is shown in Fig. 6. As can be seen, the parameterizable template consists of a Coarse Grain Array (CGA) of densely interconnected FUs that have local Register Files (RFs) and individual configuration memories (loop buffers). In addition, a few VLIW FUs are present. The VLIW FUs and a limited subset of the CGA FUs are connected to the global (shared) data RF. This shared data RF makes it possible to exchange data between both types of FUs. Since the VLIW FUs and the CGA FUs operate in a time-multiplexed manner, two modes are available: VLIW mode and CGA mode. All FUs support SIMD. For our explorations, we rely on the DRESC C compiler framework [22]. The compiler supports both the VLIW mode and the CGA mode. In general, loops are mapped on the CGA section and the rest of the code is scheduled on the VLIW section.

    The ADRES template makes it possible to instantiate a processor with a specific amount of ILP and DLP. By changing the size of the array (number of FUs), the amount of supported ILP can be tuned. By changing the number of SIMD slots in each FU, the amount of supported DLP can be tuned. The required amount of ILP and DLP is application dependent. To determine the best combination of ILP/DLP, i.e. the one that fulfills the performance requirements with the lowest implementation complexity, extensive explorations are typically needed. In Section 6.3 we will show these explorations for the MIMO detector design.

    Figure 6 ADRES instance with 16 CGA FUs and three VLIW FUs. (Blocks shown: instruction fetch and dispatch, branch control, CGA and VLIW mode control, instruction cache, configuration memories, data memory, global data register file, global predication register file, VLIW section with three VLIW FUs, CGA section with CGA FUs and local RFs.)

    6.2 Application Specific Instructions

    In this section we propose Application Specific Instructions (ASIs) to increase the efficiency of the MIMO detector implementation. An ASI is a large cluster of connected operators. Packing cascaded operators into one instruction makes it possible to execute many operators in one clock cycle, which in turn leads to a higher throughput. In addition, by relying on ASIs, the requirements on intermediate storage are reduced. Furthermore, ASIs reduce the size of the optimization problem for the compiler. As a consequence, the compiler can allocate and schedule resources in a more effective and efficient way. In the targeted MIMO detector design the computations are dominated by only a few equations. Therefore the overhead of implementing ASIs for these equations will be acceptable. An estimation of the cost will be given in Section 6.4.

    6.2.1 Overview of the Instructions

    In our work, we design ASIs based on algorithmic insights. Candidates for ASIs are especially the computationally dominant parts. In Table 1 it has been shown that in a soft-output MIMO detector the LLR generator is dominant. Therefore designing ASIs only for the LLR generator could be sufficient. However, to further increase the efficiency, we also consider the list generator as a candidate. In total we designed four ASIs, denoted as ASI0-3. ASI0 is designed for the LLR generation; ASI1-3 for the list generation. ASI1 is used for the list generation at antenna 2 (Nt); ASI2 and ASI3 are used for the list generation at antenna 1 (other than Nt). An overview of the designed ASIs with information about implemented equations and operation count is provided in Table 2. A schematic of the datapath implementations is shown in Fig. 7. It can be noticed that a rather large number of operations has been included within an ASI. Because of the low-level algorithm optimizations of the SSFE list generator and the LLR generator, ASIs are mostly based on low-cost operators. Except for ASI3, no multiplications are required. As can be expected, ASI0 is the most important one in terms of number of executions. Specifically, when the detector is configured for 2 × 2 64-QAM (m = [1, 64]), ASI0 is executed 1,152 times to detect one MIMO symbol, whereas ASI1-3 are executed only 64 times each. When defining the degree of parallelization, this unbalanced distribution of execution time has to be considered.

    Table 2 Overview of application specific instructions.

                 ASI0        ASI1        ASI2         ASI3
    Function     LLR G.      List G.     List G.      List G.
                 (opt.)      (ant. 2)    (ant. 1a)    (ant. 1b)
    Equation     (13, 14)    (4)         (4)          (5)
    Multip.      0           0           0            2
    Add./sub.    8           7           8            12
    Abs.         2           2           0            0
    Shift        4           4           8            4
    Mux.         4           0           0            4

    Figure 7 Datapath implementation of ASI0-3 with embedded control logic: (a) ASI0 (LLR generator), (b) ASI1 (list generator for antenna 2), (c) ASI2 (list generator for antenna 1a), (d) ASI3 (list generator for antenna 1b). Except for ASI3, no multipliers are used.

    6.2.2 ASI0 (LLR Generation)

    The datapath of ASI0 is illustrated in Fig. 7a. Re() and Im() denote the real and imaginary part of a signal; PED_inc denotes ei(si). The datapath part 1 calculates (Rij)| (b)| and (Rij)| (b)| as required for ei(0)j,b and ei(1)j,b. Because of the applied low-level optimizations (see Section 3.2), the computation can be performed with shift and add operations. Designed for (14), part 2 calculates |((ei(si)) (Rij)| (b)|)|, |((ei(si)) (Rij)| (b)|)| or |((ei(si))(Rij)| (b)|)|, |((ei(si))(Rij)| (b)|)|. Part 3 calculates Ti(0)j,b and Ti(1)j,b as formulated in (13).

    The computations that the ASI has to perform depend on Nb, b and sj,b. Instead of multiple fixed ASIs, only one flexible ASI, which supports the required parameter range, is implemented. To provide the necessary flexibility, a small control logic is embedded in the datapath. Embedding the control logic in the datapath typically reduces the cost of providing this flexibility.

    6.2.3 ASI1-3 (List Generation)

    The datapaths of ASI1-3 are shown in Fig. 7b-d. ASI1 implements (4) and ASI2-3 implement (5). The upper datapath part of ASI1 generates the symbol for antenna 2 (s = 1..64) with the required data format (see Section 3.2). The multiplications, additions and subtractions in (4) have been implemented with specific SH-A/S units. As can be seen in Fig. 7b, an SH-A/S unit consists of two shifters, two adder/subtractors and an embedded control logic. This efficient implementation was enabled by applying low-level optimizations. The division in (5), which has a low duty cycle, is considered part of the channel matrix pre-processing. However, the multiplication with 1/Rii is part of ASI3. The lowermost datapath parts of ASI1 and ASI3 compute the PED increment.
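
To illustrate why multiplications by constellation points can be replaced by shifts and additions, the following Python sketch multiplies an integer by the odd amplitude levels ±1, ±3, ±5, ±7 of one 64-QAM dimension using one shift and at most one add or subtract. It is only an illustration of the strength-reduction idea: the function name, the decomposition table and the plain integer data format are assumptions, and the actual SH-A/S unit in Fig. 7b may be organized differently.

```python
# Hypothetical decomposition per amplitude level: (shift amount, correction).
# correction = +1 adds x, -1 subtracts x, 0 leaves the shifted value unchanged.
#   1*x = (x << 0)        3*x = (x << 2) - x
#   5*x = (x << 2) + x    7*x = (x << 3) - x
DECOMPOSITION = {1: (0, 0), 3: (2, -1), 5: (2, +1), 7: (3, -1)}


def shift_add_multiply(x, level):
    """Multiply integer x by an odd level in {+-1, +-3, +-5, +-7} without a multiplier."""
    shift, correction = DECOMPOSITION[abs(level)]
    product = x << shift
    if correction > 0:
        product += x
    elif correction < 0:
        product -= x
    # A negative level only flips the sign of the result.
    return product if level > 0 else -product


if __name__ == "__main__":
    for level in (-7, -5, -3, -1, 1, 3, 5, 7):
        print(level, shift_add_multiply(13, level))
```

The sketch covers a single real dimension; a complex operand would need the same operation on its real and imaginary parts.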

    6.2.4 Implementation

    To estimate the maximal delay and area, the proposed ASIs have been implemented in VHDL and synthesized with the TSMC 90 nm General-Purpose (GP) library. The signal s has been quantized with 8 bit (4 bit for Re() and 4 bit for Im()) and all other data signals with 16 bit. For the standard-cell synthesis Synopsys Design Compiler was used (optimization constraints: min. delay and min. area; worst-case design corner). As in [4], we target running the ADRES processor at 400 MHz clock frequency. To estimate the number of clock cycles required for executing a certain ASI, we take the following into account:

    - Critical path delay of an ASI based on synthesis results
    - Additional overhead for integrating ASIs into FUs (i.e. delay of large multiplexers)

    Table 3 shows the synthesis results, the required data input/output Bit-Width (BW) and the estimated number of clock cycles. Although the ASIs combine a substantial number of cascaded operators, only two or three clock cycles are required for their execution. By inserting pipeline registers in the ASIs, the total number of required clock cycles for the MIMO detection can potentially be reduced. Nevertheless, for the following ILP/DLP explorations, we rely on the clock cycle numbers provided in Table 3 (worst-case estimation).

    6.3 ILP/DLP Explorations

    6.3.1 Initial Code-Transformations

    We can apply pre-compiler code transformations to further improve the efficiency. Instead of executing the list generator and the LLR generator independently of each other, we can merge the loops of these components and execute the whole code in one common loop. By performing this optimization, storage requirements are reduced and the locality of data accesses is improved. For decoding one 2 × 2 64-QAM MIMO symbol, the transformed code with ASIs requires only 2,577 instructions and 36 L1D accesses, whereas the original code requires 76,834 instructions and 11,236 L1D accesses on the TMS320C6416. This is an improvement of 30× for the number of instructions and of more than 300× for the number of memory accesses. These results clearly show the benefit of ASIs.

    Table 3 Implementation of ASI in TSMC 90 nm.

                                       ASI0      ASI1     ASI2      ASI3
    Max. delay (ns)                    4.91      3.55     2.42      6.19
    Area (µm²)                         13,070    9,058    12,278    20,104
    Required clock cycles @ 400 MHz    2         2        2         3
    Data input BW (per operand)        52        32       36        48
    Data output BW                     32        56       32        56
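
The effect of merging the two loops can be sketched in a few lines of Python. The example below is a generic illustration and not the authors' code: a max-log style reduction (minimum distance per bit value) stands in for the LLR generator, `toy_ped` is a made-up distance function, and all names are hypothetical. The point is only that the merged version consumes each distance immediately and never stores the candidate list.

```python
import math


def llr_separate_loops(symbols, ped, num_bits):
    """Baseline: generate the full candidate list first, then reduce it."""
    ped_list = [ped(s) for s in symbols]  # intermediate storage for every candidate
    best = [[math.inf, math.inf] for _ in range(num_bits)]
    for s, d in zip(symbols, ped_list):  # second loop re-reads the stored list
        for b in range(num_bits):
            bit = (s >> b) & 1
            best[b][bit] = min(best[b][bit], d)
    return [m0 - m1 for m0, m1 in best]


def llr_merged_loop(symbols, ped, num_bits):
    """Merged loop: each distance is consumed right after it is produced."""
    best = [[math.inf, math.inf] for _ in range(num_bits)]
    for s in symbols:
        d = ped(s)  # no candidate list is kept
        for b in range(num_bits):
            bit = (s >> b) & 1
            best[b][bit] = min(best[b][bit], d)
    return [m0 - m1 for m0, m1 in best]


def toy_ped(s):
    """Made-up distance metric used only to exercise the two variants."""
    return (s - 5) ** 2


if __name__ == "__main__":
    print(llr_separate_loops(range(16), toy_ped, 4))
    print(llr_merged_loop(range(16), toy_ped, 4))
```

Both functions return the same values; the merged variant simply avoids the intermediate list, which is the storage and locality benefit described above.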

    6.3.2 Explorations and Results

    We consider a one-processor solution for soft-output MIMO detection as sufficient because of the following previous improvements: 1) architecture-friendly algorithm design, 2) extension with ASIs and 3) code transformations to increase the advantages of ASIs.

    The ILP/DLP explorations (array size and SIMD width) are combined with loop transformations to improve the scheduling density for a chosen configuration. During the explorations, all FUs in the CGA are assumed to support ASI0 (LLR generation), whereas ASI1, ASI2 and ASI3 (list generation) are only supported in one VLIW FU. This decision is based on the knowledge that ASI0 is executed more often than the other ASIs. As in the NXP EVP processor (16-way) [29] or in the SODA processor (32-way) [21], we exploit very wide SIMD slots. Since the proposed algorithm is explicitly designed for DLP architectures, SIMD can be fully exploited without causing a major overhead. Therefore the throughput scales linearly with the number of SIMD slots.

    The results of the ILP/DLP explorations are summarized in Table 4. The VLIW and the CGA cycle counts indicate how many cycles have been executed on the corresponding section (see Fig. 6). The Instructions Per Cycle (IPC) metric indicates how well the available ILP has been exploited. A system-level metric that indicates how efficiently architectural resources have been utilized is given by Mbps/FU/SIMD.

    From the results we can see that the targeted 120 Mbps for 2 × 2 64-QAM is achievable with a feasible amount of parallelization. For instance, 192/368 Mbps (64/16-QAM) is obtainable on an ADRES instance with eight FUs, each with 16-way SIMD. A commercial design with comparable complexity is the NXP EVP processor, which has ten FUs, of which six support 16-way SIMD [29]. It should be mentioned that the instruction issue of the ADRES and NXP EVP FUs is different. From Table 4 we can further observe that the efficiency of resource utilization decreases with the size of the FU array. For instance, for 2 × 2 64-QAM, a 2 × 4 array achieves an IPC of 6.8 with a scheduling density of 85%. However, a 4 × 4 array achieves only an IPC of 11.4 with a scheduling density of 71.25%. The Mbps/FU/SIMD metric provides a similar indication. This behavior can be explained as follows: Increasing the array size results in an exponential increase in complexity. Because of this high complexity, the compiler can no longer perform an efficient resource allocation and scheduling, and therefore the scheduling density goes down.

    Among the explored options, the 2 × 4 array with 16-way SIMD gives the best throughput and the best scheduling density (85%). Therefore we select this instance for further calculations.
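
The throughput and scheduling-density figures of the exploration can be recomputed from the cycle counts. The Python sketch below assumes that the throughput is the clock frequency divided by the total cycle count, multiplied by the number of bits per MIMO symbol (two antennas times log2 of the QAM order) and by the SIMD width; this relation is inferred from the numbers in Table 4 rather than stated explicitly, and the function names are illustrative.

```python
import math

F_CLK_MHZ = 400  # ADRES clock frequency targeted in the text


def bits_per_symbol(num_antennas, qam_order):
    """Number of coded bits carried by one MIMO symbol."""
    return num_antennas * int(math.log2(qam_order))


def throughput_mbps(total_cycles, simd_width, qam_order, num_antennas=2):
    """Throughput in Mbps when every SIMD slot decodes one MIMO symbol."""
    per_slot = F_CLK_MHZ / total_cycles * bits_per_symbol(num_antennas, qam_order)
    return per_slot * simd_width


def scheduling_density(ipc, num_fus):
    """Fraction of FU issue slots that are actually used."""
    return ipc / num_fus


if __name__ == "__main__":
    # (rows, cols, SIMD width, total cycles, IPC) for 64-QAM, taken from Table 4.
    for rows, cols, simd, cycles, ipc in [(2, 4, 16, 397, 6.8), (4, 4, 8, 276, 11.4)]:
        tp = throughput_mbps(cycles, simd, 64)
        density = scheduling_density(ipc, rows * cols)
        print(f"{rows}x{cols}, {simd}-way: {tp:.1f} Mbps, density {density:.2%}")
```

Under this assumption the computed values agree with the tabulated throughputs to within about 1 Mbps, and the densities come out as 85% and 71.25%.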

    Table 4 ILP/DLP explorations targeting 120 Mbps throughput with 2 × 2 64-QAM.

    CGA size    SIMD    Total     VLIW      CGA       IPC     Total TP (Mbps)        TP/FU/SIMD
                        cycles    cycles    cycles
    64-QAM
    2 × 4       16      397       42        355       6.8     12.1 × 16 = 192.6      1.51
    4 × 4       8       276       49        227       11.4    17.4 × 8 = 139.2       1.09
    6 × 4       8       228       45        183       15.5    21.1 × 8 = 168.8       0.88
    8 × 4       8       212       49        163       18.1    22.6 × 8 = 180.8       0.71
    16-QAM
    2 × 4       16      139       40        99        5.6     23.0 × 16 = 368.0      2.88
    4 × 4       8       118       43        75        8.2     27.1 × 8 = 216.8       1.69
    6 × 4       8       105       44        61        10.6    30.5 × 8 = 244.0       1.27
    8 × 4       8       98        44        54        11.4    32.7 × 8 = 261.6       1.02

    6.4 Area and Power Estimations

    In order to get an idea of the cost of a soft-output MIMO detector implementation on a massive parallel processor, we roughly estimate the area and power consumption of a representative ADRES instance. For the estimation, we start from an ADRES template instance with the following configuration:

    - Three VLIW FUs with 64 bit wide datapath
    - Eight CGA FUs with 64 bit wide datapath (2 × 4 array)
    - 512K data memory
    - 32K instruction cache

    We extend this template instance to support the required 16-way SIMD for ASIs. From Table 3 it can be seen that an operand width of 64 bit is sufficient for moving data in and out of ASIs. Therefore we choose the bit-width of an ASI SIMD slot to be 64 bit. We consider the following modifications:

    - Add the datapath of ASI1, ASI2 and ASI3 with 16-way support to one VLIW FU (as considered in the ILP/DLP exploration in Section 6.3.2)
    - Increase the VLIW global data register file from 4K to 8K (a size of 8K is sufficient for MIMO detection, because basically only one VLIW FU is active and because the deployed ASIs reduce the intermediate storage requirements)
    - Add the datapath of ASI0 with 16-way support to all eight CGA FUs (see the ILP/DLP exploration result in Section 6.3.2)
    - Extend the default local CGA register file size by a factor of 16 (because of the 16-way SIMD of ASI0)

    Note that the extension to 16-way SIMD effectively causes more overhead than considered here. For instance, we neglected the impact on interconnect. However, this is not an issue if we assume that the routing for the 16-way extension is feasible by employing semi-custom design techniques [24].

    The area consumption of the design was estimated based on synthesis results of the components. Therefore the obtained results are considered as a lower bound. Figure 8 shows the area breakdown. The total area in TSMC 90 nm GP technology is about 9 mm². The datapath of the FUs occupies 40% of the total area, including the 27% consumed by the ASIs.

    We roughly estimate the power consumption of the MIMO detector in 64-QAM mode based on statistical power simulations and experience/results from previous designs. We assume that the extended ADRES instance operates 10% of the time in VLIW mode and 90% in CGA mode (see Table 4). Based on this rough estimation, the ADRES consumes about 160 mW in VLIW mode and about 400 mW in CGA mode. The average power consumption is therefore about 376 mW at 400 MHz clock frequency.
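
The average follows from weighting the two mode powers by their assumed duty cycles. As a worked equation (the symbols P_avg, P_VLIW and P_CGA are introduced here for illustration only):

```latex
P_{\mathrm{avg}} = 0.1 \cdot P_{\mathrm{VLIW}} + 0.9 \cdot P_{\mathrm{CGA}}
                 = 0.1 \cdot 160\,\mathrm{mW} + 0.9 \cdot 400\,\mathrm{mW}
                 = 16\,\mathrm{mW} + 360\,\mathrm{mW}
                 = 376\,\mathrm{mW}
```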

    Figure 8 Estimated area breakdown of the ADRES instance with 2 × 4 CGA and 16-way SIMD (total 9 mm²): ASIs 27%, Data Memory 16%, RFs CGA 16%, Instr. Cache 15%, FUs CGA 10%, RFs VLIW 7%, Config. Memories 5%, FUs VLIW 3%, Peripherals 1%.

    7 Implementation as ASIC

    7.1 Architecture

    The ASIC implementation is based on the same algorithm as the ADRES implementation. It supports 2 × 2 near-ML soft-output MIMO detection for 16-QAM and 64-QAM. The architecture, which relies on a rather high degree of data parallelism and on pipelining, can be seen in Fig. 9. Application Specific Block (ASB) 0, which performs the LLR computation, basically consists of six parallel ASI0 datapaths. With this degree of parallelization, one ASB0 can compute the LLR for q bits (q = 6 in 64-QAM) and for one antenna simultaneously. ASB1 and ASB2/3, which implement ASI1, ASI2 and ASI3 respectively, are the functional blocks for list generation. Comparison Blocks (CPBs) are required for selecting the symbol with the highest probability. The control unit, which generates control signals for the datapath and the output, is implemented as a Finite State Machine (FSM). In this architecture the number of clock cycles required to detect one symbol corresponds to the number of candidate symbols. Since we chose an algorithm instance in which 16/64 candidate symbols are considered for 16/64-QAM modulation, 16/64 clock cycles are required to detect one symbol. This translates to a throughput of 200 Mbps in 16-QAM mode and 75 Mbps in 64-QAM mode when assuming a clock frequency of 400 MHz. Opportunities to increase the throughput of this architecture include:

    - Computing the candidate symbols in parallel (i.e. instantiating more blocks)
    - Inserting more pipeline registers and increasing the clock frequency

    Figure 9 Architecture of the MIMO detector ASIC for 2 × 2 16/64-QAM. (Blocks shown: ASB1 and ASB2/3 for antenna 2 and antenna 1, ASB0 (6x), CPB (6x), control unit; inputs R, y, 1/R, start, qam, clk, rst; outputs LLR00, LLR01, LLR10, LLR11, valid, busy.)

    In general, it can be pessimistically assumed that the throughput scales linearly with area and power. More information about the scalability of this architecture can be found in [9].
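
The ASIC throughput figures quoted in this section follow directly from the one-candidate-per-cycle schedule. A minimal Python sketch, assuming one candidate symbol is processed per clock cycle and that each MIMO symbol carries two antennas times log2 of the QAM order bits (the function name is illustrative):

```python
import math


def asic_throughput_mbps(qam_order, f_clk_mhz=400, num_antennas=2):
    """Throughput in Mbps when one candidate symbol is processed per clock cycle."""
    cycles_per_symbol = qam_order  # 16 candidates for 16-QAM, 64 for 64-QAM
    bits = num_antennas * int(math.log2(qam_order))
    return f_clk_mhz / cycles_per_symbol * bits


if __name__ == "__main__":
    print(asic_throughput_mbps(16))  # 200.0
    print(asic_throughput_mbps(64))  # 75.0
```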

    7.2 Area and Power Estimations

    The ASIC was implemented in VHDL, synthesized for TSMC 90 nm GP technology with Synopsys Design Compiler, and placed and routed with Cadence SoC Encounter. The resulting layout confirms that a 400 MHz clock frequency is feasible. Figure 10 shows the area breakdown of the design. As expected, the area is clearly dominated by the LLR computation blocks. This confirms once more that the optimization of the LLR computation is vital. The total area is 0.3 mm². Based on statistical activity, the ASIC implementation consumes about 25 mW of power. Because the ASIC is based on the optimized SSFE and LLR algorithm, it is more efficient than other state-of-the-art ASICs. A comparison is provided in [9].

    Figure 10 Area breakdown of the ASIC implementation (total 0.3 mm²): ASB0 (LLR G.) 76%, ASB1-3 (List G.) 11%, Top-Level Register 9%, CPB 4%, Ctrl. Unit 0%.

    8 Comparison

    Table 5 summarizes the maximal achievable throughput, the area consumption and the power consumption of the MIMO detector implementations considered in this work. As shown in Section 5.2, the achievable throughput on the TI TMS320C6416, which is a state-of-the-art VLIW processor, is less than 2 Mbps for 64-QAM. A soft-output MIMO detector implementation on a conventional VLIW processor is therefore not feasible. However, by adding support for specialized instructions, the throughput of processor implementations can be increased significantly. The result of the ADRES implementation shows that soft-output MIMO detection can be deployed on programmable architectures and that a considerably high throughput can be achieved. Nevertheless, when compared to the ASIC implementation, the ADRES solution consumes 12× more area and 6× more power (for 64-QAM). Considering that both rely on the same datapath, this overhead arises mainly from data transport, data storage, control overhead and inefficiency. The considered ADRES instance supports ASIs as well as generic instructions. Therefore different algorithms can be mapped on the architecture. However, since the introduced ASIs are very specific, typically only the proposed MIMO detector can benefit from them.
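
The section does not spell out how the factors of 12 and 6 are obtained. One reading that reproduces them, assuming the comparison is normalized to the 64-QAM throughput of each implementation, is sketched below in Python. The input values are the ones quoted in Sections 6.3, 6.4 and 7; the normalization itself and all names are assumptions made for this illustration.

```python
# 64-QAM figures quoted in the text for the selected ADRES instance and the ASIC.
ADRES = {"throughput_mbps": 192, "area_mm2": 9.0, "power_mw": 376}
ASIC = {"throughput_mbps": 75, "area_mm2": 0.3, "power_mw": 25}


def cost_per_mbps(impl, key):
    """Area or power spent per Mbps of delivered throughput."""
    return impl[key] / impl["throughput_mbps"]


def normalized_ratio(key):
    """ADRES-to-ASIC ratio of the throughput-normalized cost (assumed metric)."""
    return cost_per_mbps(ADRES, key) / cost_per_mbps(ASIC, key)


if __name__ == "__main__":
    print(f"area ratio:  {normalized_ratio('area_mm2'):.1f}")
    print(f"power ratio: {normalized_ratio('power_mw'):.1f}")
```

Under this assumption the ratios come out close to 12 for area and close to 6 for power, matching the factors quoted above; the raw, non-normalized ratios would be larger.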

    Because of many differences, such as provided flexibility, BER performance, technology or accuracy of estimations, a fair quantitative comparison with work in the literature is difficult. Nevertheless, the following overview gives an idea of the efficiency of related implementations: The references [1, 4, 8, 15, 16] are based on simple linear hard-output detection. Although the complexity of linear hard-output detection is much lower compared to near-ML soft-output detection, the implementation of [16] consumes about 1 mW power while offering less than 50 Mbps throughput for 2 × 2 64-QAM. Reference [32] implements a near-ML hard-output detector on an Nvidia 9600GT floating-point Graphical Processor Unit (GPU), which includes 64 stream processors and 512 MB DDR3 memory. The stream processors are clocked at 1.9 GHz and the memory at 2 GHz. It is interesting to observe that only 15 Mbps throughput for near-ML hard-output 4 × 4 64-QAM detection is achievable on the mentioned GPU. Reference [31] implements a 4 × 4 linear soft-output detector which achieves a throughput of 600 Mbps. Nevertheless, the LLR computation complexity for linear detection is significantly lower than for near-ML detection [31]. Besides, the multi-core floating-point architecture of [31] is basically a reconfigurable ASIC rather than a processor and the


    Table 5 Comparison of soft-output MIMO detection on different architectures.

                                TI processor     ADRES proc.    ASIC    Difference
                                (without ASI)    (with ASI)             ADRES/ASIC
    Throughput 64-QAM (Mbps)


    11. Gries, M., Keutzer, K., Meyr, H., & Martin, G. (2005). Building ASIPS: The mescal methodology. Berlin: Springer.

    12. Guo, Z., & Nilsson, P. (2006). Algorithm and implementation of the K-Best sphere decoding for MIMO detection. IEEE Journal on Selected Areas in Communications, 24(3), 491-503.

    13. Hochwald, B. M., & ten Brink, S. (2003). Achieving near-capacity on a multiple-antenna channel. IEEE Transactions on Communications, 51(3), 389-399.

    14. Ienne, P., & Leupers, R. (2006). Customizable embedded processors: Design technologies and applications. San Francisco: Morgan Kaufmann.

    15. Jafri, A. R., Karakolah, D., Baghdadi, A., & Jezequel, M. (2009). ASIP-based flexible MMSE-IC linear equalizer for MIMO turbo-equalization applications. In Design, automation and test in Europe (DATE).

    16. Janhunen, J., Silven, O., Juntti, M., & Myllyla, M. (2008). Software defined radio implementation of K-Best list sphere detector algorithm. In International conference on embedded computer systems (IC-SAMOS) (pp. 100-107).

    17. Koike, T., Seki, Y., Murata, H., Yoshida, S., & Araki, K. (2005). FPGA implementation of 1 Gbps real-time 4 × 4 MIMO-MLD. Vehicular Technology Conference, 2, 1110-1114.

    18. Li, M., Bougard, B., Lopez, E., Bourdoux, A., Novo, D., Van Der Perre, L., et al. (2008). Selective spanning with fast enumeration: A near maximum-likelihood MIMO detector designed for parallel programmable baseband architectures. In IEEE intern. conference on communications (ICC) 2008 (pp. 737-741).

    19. Li, M., Bougard, B., Naessens, F., Van Der Perre, L., & Catthoor, F. (2008). An implementation friendly low complexity multiplierless LLR generator for soft MIMO sphere decoders. In IEEE workshop on signal processing systems (SIPS).

    20. Li, M., Bougard, B., Xu, W., Novo, D., Van Der Perre, L., & Catthoor, F. (2008). Optimizing near-ML MIMO detector for SDR baseband on parallel programmable architectures. In Design, automation and test in Europe (DATE) (pp. 444-449).

    21. Lin, Y., Lee, H., Woh, M., Harel, Y., Mahlke, S., Mudge, T., et al. (2007). SODA: A high-performance DSP architecture for software-defined radio. IEEE Micro, 27(1), 114-123.

    22. Mei, B., Lambrechts, A., Mignolet, J. Y., Verkest, D., & Lauwereins, R. (2005). Architecture exploration for a reconfigurable architecture template. IEEE Design and Test of Computers, 22(2), 90-101.

    23. Nilsson, A., Tell, E., & Liu, D. (2008). An 11 mm² 70 mW fully-programmable baseband processor for mobile WiMAX and DVB-T/H in 0.12 µm CMOS. In Intern. solid-state circuits conference (ISSCC) (pp. 266-612).

    24. Noll, T. G., Weiss, O., & Gansen, M. (2001). A flexible datapath generator for physical oriented design. In European solid-state circuits conf. (ESSCIRC) (pp. 393-396).

    25. Paulraj, A. J., Gore, D. A., Nabar, R. U., & Bolcskei, H. (2004). An overview of MIMO communications - a key to gigabit wireless. Proceedings of the IEEE, 92(2), 198-218.

    26. Ramacher, U. (2007). Software-defined radio prospects for multistandard mobile phones. Computer, 40(10), 62-69.

    27. Shariat-Yazdi, R., & Kwasniewski, T. (2007). Reconfigurable K-Best MIMO detector architecture and FPGA implementation. In International symposium on intelligent signal processing and communication systems (ISPACS) (pp. 349-352).

    28. Texas Instruments (2005). Datasheet of the TMS320C6416 fixed-point digital signal processor.

    29. van Berkel, K., Heinle, F., Meuwissen, P., Moerman, K., & Weiss, M. (2005). Vector processing as an enabler for software-defined radio in handheld devices. Journal on Applied Signal Proc. (EURASIP), 2005, 2613-2625.

    30. Wang, R., & Giannakis, G. B. (2004). Approaching MIMO channel capacity with reduced-complexity soft sphere decoding. IEEE Wireless Communications and Networking Conference (WCNC), 3, 1620-1625.

    31. Wu, D., Eilert, J., & Liu, D. (2009). Implementation of a high-speed MIMO soft-output symbol detector for software defined radio. Journal of Signal Processing Systems, 1-11. ISSN 1939-8018. doi:10.1007/s11265-009-0369-9.

    32. Wu, M., Gupta, S., Sun, Y., & Cavallaro, J. R. (2009). A GPU implementation of a real-time MIMO detector. In IEEE workshop on signal processing systems (SiPS09).

    33. Wu, M., Sun, Y., & Cavallaro, J. R. (2009). Reconfigurable real-time MIMO detector on GPU. In Asilomar conf. on signals, systems and computers (ASILOMAR09).

    Robert Fasthuber received the MSc degree in Hardware/Software Systems Engineering from the University of Applied Science (FH) Hagenberg, Austria, in 2007. In September 2007 he became a researcher at the Interuniversity MicroElectronics Center (IMEC) Belgium and at the Katholieke Universiteit (K.U.) Leuven. He has been a PhD student since July 2008. His research focuses on technology-aware low-power architectures for Software Defined Radio (SDR) and Cognitive Radio (CR) implementations.

    Min Li received the BE degree (with the highest honor) in July 2001 from Zhejiang University, Hangzhou, China. From September 2001 to September 2004 he was a postgraduate student at Zhejiang University. From January 2003 to September 2003 he was an employee at Lucent Bell Labs Research China, working on network processors. From September 2003 to September 2004 he was employed at Microsoft Research Asia, working on low-power mobile computing. From September 2004 to September 2009 he was a PhD researcher at IMEC Belgium and a PhD student at K.U. Leuven. Since October 2009 he has been an employed researcher at IMEC. His main technical interests are low-power signal processing and low-power implementations.

    David Novo is a member of the wireless group at IMEC and a PhD candidate at the K.U. Leuven. He received the MSc degree in Electronic Engineering from the University Autonoma of Barcelona, Spain, in 2005. His research interests include energy-efficient circuits, architectures and systems for wireless communication with special focus on Software Defined Radio implementations.

    Praveen Raghavan received the bachelor degree from the National Institute of Technology, Trichy, India and the master degree in Electrical Engineering from Arizona State University. He received his PhD in Electrical Engineering from the K.U. Leuven in 2009. As a wireless systems researcher, he is in charge of the next generation platform architecture in the framework of the IMEC SDR/CR program. Besides, he is coordinating PhD students in the field of low-power design at IMEC. His research interests include low-power design, low-power architectures, system design, and SDR/CR.

    Liesbet Van der Perre received the MSc degree in Electrical Engineering from the K.U. Leuven, Belgium, in 1992. The research for her thesis was completed at the Ecole Nationale Superieure de Telecommunications in Paris. She graduated with a PhD in Electrical Engineering from the K.U. Leuven in 1997, on the topic of radio propagation modelling. At IMEC, she was a system architect for OFDM ASICs and the project leader for the Turbo codec. She is the scientific director of the wireless research group for digital baseband SDR/CR. She is an author and co-author of over 150 scientific publications.

    Francky Catthoor is a fellow at IMEC Belgium and an IEEE fellow. He received the Engineering degree and a PhD in Electrical Engineering from the K.U. Leuven, Belgium, in 1982 and 1987 respectively. Between 1987 and 1999, he headed research domains in the area of architectural and system-level synthesis methodologies, within the DESICS (formerly VSDM) division at IMEC. His main current research activities belong to the field of architecture design methods and system-level explorations for power and memory footprints within real-time constraints, oriented towards data storage management, global data transfer optimization and concurrency exploitation. Platforms that contain both customizable/configurable architectures and (parallel) programmable instruction-set processors are targeted. Deep-submicron technology issues are also included.

