J Sign Process Syst (2011) 64:75–92
DOI 10.1007/s11265-010-0499-0
Exploration of Soft-Output MIMO Detector Implementations on Massive Parallel Processors
Robert Fasthuber · Min Li · David Novo · Praveen Raghavan · Liesbet Van Der Perre · Francky Catthoor
Received: 13 November 2009 / Revised: 11 May 2010 / Accepted: 12 May 2010 / Published online: 8 June 2010
© Springer Science+Business Media, LLC 2010
Abstract Emerging Software Defined Radio (SDR) baseband platforms are based on multiple processors with massive parallelism. Although the computational power of these platforms would theoretically enable SDR solutions with advanced wireless signal processing, existing work still implements rather basic algorithms. For instance, current Multiple-Input Multiple-Output (MIMO) detector implementations are typically based on simple linear hard-output and not on advanced near-Maximum Likelihood (ML) soft-output detection. However, only the latter makes it possible to exploit the full potential of MIMO technology. In this work, we explore the feasibility of advanced soft-output near-ML MIMO detectors on massive parallel processors. Although such detectors are considered to be very challenging due to their high computational complexity, we combine architecture-friendly algorithm design, application specific instructions and instruction-level/data-level parallelism explorations to make SDR solutions feasible. We show that, by applying the proposed combination of techniques, it is possible to obtain SDR implementations which can deliver data rates that are sufficient for future wireless systems. For example, a 2 × 4 Coarse Grain Array (CGA) processor with 16-way Single Instruction Multiple Data (SIMD) can deliver 192/368 Mbps throughput for 2 × 2 64/16-QAM transmissions. Finally, we estimate the area and power consumption of the programmable solution and compare it against a traditional Application Specific Integrated Circuit (ASIC) approach. This enables us to draw conclusions from the cost perspective.

R. Fasthuber (B) · M. Li · D. Novo · P. Raghavan · L. Van Der Perre · F. Catthoor
IMEC, Kapeldreef 75, 3001 Leuven, Belgium
e-mail: firstname.lastname@example.org

M. Li
e-mail: email@example.com

D. Novo
e-mail: firstname.lastname@example.org

P. Raghavan
e-mail: email@example.com

L. Van Der Perre
e-mail: firstname.lastname@example.org

F. Catthoor
e-mail: email@example.com
Keywords MIMO · SDR · SSFE · LLR · CGA · ASIC
With the exploding design and processing cost in the deep sub-micron era, programmable or reconfigurable baseband solutions are becoming popular. The Software Defined Radio (SDR) paradigm, which was mainly successful in the base-station and military segments, is emerging in the handset market. Parallel instruction set architectures, especially those which combine Instruction Level Parallel (ILP) and Data Level Parallel (DLP) features [4, 21, 23, 26, 29], are becoming prevalent. Most of these published architectures offer massive parallelism, i.e. they include multiple independent computational processing units and offer a data parallelism of 100. For instance, the NXP EVP processor includes ten Functional Units (FUs) and six of them support 16-way Single Instruction Multiple Data (SIMD). The SODA processor includes four Processing Elements (PEs), each supporting 32-way SIMD instructions. Theoretically, these massive parallel processors would enable SDR implementations of advanced wireless signal processing algorithms. However, only simple SDR systems and algorithms have been demonstrated and reported in literature.
Multiple-Input Multiple-Output (MIMO) technology offers increased spectral efficiency compared to single antenna systems. For this reason, it has become the basis of all upcoming wireless communication standards, such as IEEE 802.11n, WiMAX, 3GPP LTE and 3GPP2 UMB. Supporting advanced MIMO technology is therefore a necessity for future SDR systems. However, the implementations in [21, 23, 29] do not support MIMO technology. The references [4, 31] demonstrate MIMO processing, but based on simple linear detection, which does not fully exploit the potential of MIMO technology. The implementation of MIMO processing on a Sandblaster processor in  does not include the computationally dominant soft-output computation. Wu et al. demonstrate advanced MIMO processing on a floating-point Graphics Processing Unit (GPU). However, the energy efficiency of such a solution is typically insufficient for wireless devices.
In a MIMO Space Division Multiplexing (SDM) receiver, the MIMO detector recovers the multiple transmitted data streams. For the implementation of the detector, a wide range of different detection algorithms is available. Linear detection has a low complexity, but suffers from poor Bit-Error-Rate (BER) performance. In contrast, soft-output Maximum Likelihood (ML) detection offers maximal performance but at the cost of very high complexity. Near-ML detection typically provides the best trade-off. Recently, a near-ML Selective Spanning with Fast Enumeration (SSFE) detector has been proposed and implemented for SDR systems [18, 20]. The proposed implementation is based on hard-output detection. However, with hard-output detection, a large part of the remarkable potential of MIMO technology is still not exploited. The key reason is that modern Forward Error Correction (FEC) decoders, such as Turbo and Low Density Parity Check (LDPC) decoders, require soft information as input to deliver the best possible BER performance. In fact, soft-output near-ML MIMO detectors bring 2–4 dB Signal-to-Noise-Ratio (SNR) gain compared to their hard-output counterparts and 6–12 dB SNR gain compared to linear detectors. Efficient implementations of soft-output near-ML MIMO detectors, which are capable of approaching the Shannon limit, are therefore in high demand.
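To make the complexity gap between linear and ML detection concrete, the following minimal sketch contrasts a zero-forcing linear detector with an exhaustive hard-output ML search for a 2 × 2 system. It is an illustration under stated assumptions, not the SSFE detector discussed in this paper: the function names (`qam_points`, `zf_detect`, `ml_detect`), the unit-energy square QAM grid and the example channel values are all hypothetical choices made here. The exhaustive search visits C^2 candidate vectors (256 for 16-QAM, 4096 for 64-QAM), whereas zero forcing needs one matrix inversion and two slicing operations.

```python
import itertools
import math


def qam_points(q):
    """Square QAM constellation with 2**q points (q even), unit average energy."""
    m = 2 ** (q // 2)                              # levels per real dimension
    levels = [2 * i - (m - 1) for i in range(m)]   # e.g. -3, -1, 1, 3 for 16-QAM
    scale = math.sqrt(2 * (m * m - 1) / 3)         # root of the unscaled mean energy
    return [complex(re, im) / scale for re in levels for im in levels]


def nearest(points, z):
    """Slice z to the closest constellation point."""
    return min(points, key=lambda p: abs(z - p))


def zf_detect(h, y, points):
    """Zero-forcing detection for a 2 x 2 channel: invert H, then slice per stream."""
    (a, b), (c, d) = h
    det = a * d - b * c
    z0 = (d * y[0] - b * y[1]) / det
    z1 = (-c * y[0] + a * y[1]) / det
    return [nearest(points, z0), nearest(points, z1)]


def ml_detect(h, y, points):
    """Hard-output ML detection by exhaustive search over all C**2 candidates."""
    def distance(s):
        # Squared Euclidean distance ||y - H s||^2
        return sum(abs(y_i - sum(h_ij * s_j for h_ij, s_j in zip(row, s))) ** 2
                   for y_i, row in zip(y, h))
    return list(min(itertools.product(points, repeat=2), key=distance))


# Noise-free usage example with an arbitrary (hypothetical) channel matrix.
points = qam_points(4)                             # 16-QAM
h = [[1 + 0.5j, 0.3], [0.2j, 0.8 - 0.1j]]
s = [points[3], points[10]]
y = [sum(h_ij * s_j for h_ij, s_j in zip(row, s)) for row in h]
print("ZF recovers s:", zf_detect(h, y, points) == s)
print("ML recovers s:", ml_detect(h, y, points) == s)
print("ML candidates searched:", len(points) ** 2)
```

Without noise both detectors return the transmitted vector; the performance difference described above only appears once noise is added and the channel is poorly conditioned, which this sketch does not attempt to measure.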
Our work explores the feasibility of advanced soft-output MIMO detector implementations on processors with massive parallelism. We specifically consider the TI TMS320C6416 Very Long Instruction Word (VLIW) processor and the ADRES Coarse Grain Array (CGA) processor in our explorations.
First, we design an architecture-friendly algorithm with low complexity. The resulting algorithm, which is mostly based on area and energy-efficient operators, makes it possible to fully exploit the abundant parallelism of SDR platforms. Second, we combine Application Specific Instruction (ASI) design and code transformations to significantly reduce the number of required computations and required memory accesses. Then, we perform the dimensioning of ILP/DLP for a given throughput requirement. We show that, by applying the proposed combination of techniques, it is feasible to obtain SDR implementations which can deliver data rates that are sufficient for future wireless systems. For instance, a 2 × 4 CGA processor with 16-way SIMD can deliver 192/368 Mbps throughput for 2 × 2 64/16-Quadrature Amplitude Modulation (QAM) transmissions. To advance the feasibility study further, we estimate the area and power consumption of the programmable solution and compare it against a traditional Application Specific Integrated Circuit (ASIC) design. To draw conclusions, we take existing work on Application Specific Instruction Set Processors (ASIPs) into account.
This paper builds on the previous work presented in . The main extensions of  are: 1) design of different ASIs, 2) mapping and ILP/DLP explorations, 3) comparison with an ASIC approach. The latter leverages ASIC design results previously published in .
The remaining part of this paper is structured as follows: Section 2 explains the MIMO system model and reviews the algorithmic background of soft-output MIMO detection. In Section 3 the architecture-friendly algorithm design of the Log-Likelihood-Ratio (LLR) generator is explained. Section 4 provides an overview of subsequent implementation and exploration experiments. In Section 5 the mapping results for the TI TMS320C6416 processor are given. In Section 6 application specific instructions are proposed, code transformations and ILP/DLP explorations are shown and implementation results for an ADRES based solution are provided. Section 7 presents the design of an ASIC reference. In Section 8 the examined implementations and existing work are compared. Finally, Section 9 concludes the work.
This section reviews the MIMO system model and explains the algorithmic background of MIMO signal detection. This background is essential in particular for Section 3.
2.1 MIMO System Model
The MIMO system model used in this paper is illustrated in Fig. 1. For the sake of completeness, the Forward Error Correction (FEC) blocks are also shown. The numbers of transmit and receive antennas are denoted as $N_t$ and $N_r$, respectively. For a $C$-QAM modulation, a symbol represents one out of $C = 2^q$ constellation points. Note that for 16-QAM a symbol consists of 4 bits and for 64-QAM of 6 bits. In each transmission, the transmitter maps one $qN_t \times 1$ binary vector $\mathbf{x}$ to an $N_t \times 1$ symbol vector $\mathbf{s}$. The transmission of a vector $\mathbf{s}$ over a flat-fading MIMO channel can be modeled as $\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}$. Here, $\mathbf{y}$ denotes an $N_r \times 1$ received symbol vector, $\mathbf{H}$ is an $N_r \times N_t$ channel matrix and $\mathbf{n}$ is a noise vector whose entries are independent complex Gaussian random variables with mean zero and variance $N_0/2$.
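The system model above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not code from the paper, and it rests on several assumptions: the helper names (`qam_constellation`, `map_bits`, `transmit`) are hypothetical, the bit-to-symbol labelling is plain natural binary (the mapping used in the paper is not specified in this section), the constellation is scaled to unit average energy, and the variance N0/2 is interpreted as applying to each real dimension of a noise entry. The channel is stored as N_r rows of N_t entries so that the product Hs is defined.

```python
import math
import random


def qam_constellation(q):
    """Return the C = 2**q points of a square QAM grid (q even), unit average energy."""
    m = 2 ** (q // 2)                              # levels per real dimension
    levels = [2 * i - (m - 1) for i in range(m)]   # e.g. -3, -1, 1, 3 for 16-QAM
    scale = math.sqrt(2 * (m * m - 1) / 3)         # root of the unscaled mean energy
    return [complex(re, im) / scale for re in levels for im in levels]


def map_bits(x, q, n_t):
    """Map a binary vector x of length q*n_t to n_t symbols.

    Natural binary labelling is an assumption made for this sketch.
    """
    assert len(x) == q * n_t
    points = qam_constellation(q)
    s = []
    for k in range(n_t):
        chunk = x[k * q:(k + 1) * q]               # q bits per transmit antenna
        index = int("".join(str(b) for b in chunk), 2)
        s.append(points[index])
    return s


def transmit(h, s, n0, rng):
    """Return y = H s + n, with h given as n_r rows of n_t complex entries."""
    sigma = math.sqrt(n0 / 2)                      # assumed: N0/2 per real dimension
    y = []
    for row in h:
        noise = complex(rng.gauss(0, sigma), rng.gauss(0, sigma))
        y.append(sum(h_ij * s_j for h_ij, s_j in zip(row, s)) + noise)
    return y


# Usage example: one 2 x 2 16-QAM transmission over a random flat-fading channel.
rng = random.Random(0)
q, n_t, n_r = 4, 2, 2
x = [rng.randint(0, 1) for _ in range(q * n_t)]    # q*N_t = 8 bits per vector
s = map_bits(x, q, n_t)
h = [[complex(rng.gauss(0, math.sqrt(0.5)), rng.gauss(0, math.sqrt(0.5)))
      for _ in range(n_t)] for _ in range(n_r)]
y = transmit(h, s, n0=0.1, rng=rng)
print(len(x), len(s), len(y))
```

For 2 × 2 64-QAM the same call with `q = 6` maps 12 bits per transmitted vector, which is the quantity that determines how many LLR values a soft-output detector has to produce per received vector.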
2.2 MIMO Signal Detection
The task of a MIMO detector is to recover the symbol vector $\mathbf{s}$ that was sent by the transmitter. Soft-output MIMO detectors provide not only the most likely symbol vector $\mathbf{s}$ (as hard-output detectors do), but also, for each bit in $\mathbf{s}$, the Log-Likelihood-Ratio (LLR), which expresses as the logarithm of a probability ratio how likely the bit is to be logical 0 or 1. Modern FEC decoders, such as Turbo and LDPC decoders, which are an essential