
Journal of VLSI Signal Processing 39, 63–77, 2005. © 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.

A Scalable System Architecture for High-Throughput Turbo-Decoders

MICHAEL J. THUL, FRANK GILBERT, TIMO VOGT, GERD KREISELMAIER AND NORBERT WEHN
Microelectronic System Design Research Group, University of Kaiserslautern, Erwin-Schroedinger-Strasse,

67663 Kaiserslautern, Germany

Received March 20, 2003; Revised May 27, 2003; Accepted June 3, 2003

First online version published in September, 2004

Abstract. The need for higher data rates is ever rising as wireless communications standards move from the third to the fourth generation. Turbo-Codes are the prevalent channel codes for wireless systems due to their excellent forward error correction capability. So far, research has mainly focused on components of high-throughput Turbo-Decoders. In this paper we explore the Turbo-Decoder design space anew, both under system design and deep-submicron implementation aspects.

Our approach incorporates all levels of design, from I/O behavior down to floorplanning, taking deep-submicron effects into account. Its scalability allows us to derive optimized architectures tailored to the given throughput and target technology. We present results for 3GPP compliant Turbo-Decoders beyond 100 Mbit/s synthesized on a 0.18 µm standard cell library.

Keywords: Turbo-Decoder, high-throughput, wireless, parallel decoding, interleaving, VLSI architectures

1. Introduction

Turbo-Codes, proposed in 1993 [1], show outstanding error correction capability and are thus among the most advanced channel coding techniques. As applications for wireless systems create demand for more and more data to be transferred, high-throughput connections become a necessity. Thus Turbo-Codes, already part of today's 3GPP wireless communications standard [2], will be used for higher throughputs in the future.

Turbo-Encoders are straightforward to implement, whereas the design of a Turbo-Decoder is a challenging and complex task. For high throughput, highly parallelized decoder architectures are needed. Turbo-Decoders consist of component decoders which exchange information iteratively. Between iterations, this information is reordered following an interleaving scheme. Previous work has mainly focused on the parallelization of the component decoders, namely the commonly used maximum-a-posteriori (MAP) Decoder [3, 4]. The interleaving has just recently been identified as the major bottleneck and is under study for parallelization [5–8].

Up to now, architectural decisions for parallelization have mainly focused on metrics for memory demand, arithmetic complexity and throughput. The resulting high-throughput decoder architectures are complex designs with millions of gates. Implementing these architectures in deep-submicron technologies will lead to global interconnect, resulting in timing closure and clock skew problems. Therefore high-throughput Turbo-Decoder systems must be designed with special focus on interconnect, using state-of-the-art synthesis methodologies.

In this work we focus on parallelization and partitioning for minimized communication, synchronization, global interconnect and control. In [9], state-of-the-art synthesis is shown to work well for blocks with less than 100 K gates. Thus we limit the partitions within our design such that none exceeds this size. In this paper, we consider the complete Turbo-Decoder system with I/O, component decoders and interleaving network.

The paper is structured as follows: The basics of Turbo-Codes are reviewed in Section 2 before Section 3 presents the MAP-Algorithm that is employed in the component decoders. The related work is stated in Section 4 and our new approach briefly discussed in Section 5, which leads to the parallelization layers we examine in full detail in Section 6. In Section 7 we motivate our design decisions, whose results we present in Section 8. The conclusions are drawn in Section 9 and future work identified.

2. Turbo-Codes

Forward error correction is enabled by introducing parity bits. In Turbo-Codes, the original information (x^s), denoted as systematic information, is transmitted together with the parity information (x^{1p}, x^{2p}). For the Third Generation Partnership Project (3GPP) [2], the encoder consists of two recursive systematic convolutional (RSC) encoders with constraint length K = 4, which can also be interpreted as 8-state finite state machines. One RSC encoder works on the block of information in its original sequence, the other one in an interleaved sequence (see Fig. 1(a)). On the receiver side a corresponding component decoder for each of them exists. The MAP-Decoder has been recognized as the component decoder of choice as it is superior to the Soft-Output Viterbi Algorithm (SOVA) in terms of communications performance and implementation scalability [10].

The soft-output of each component decoder (Λ) is modified to reflect only its own confidence (z) in the received information bit of being sent either as "0" or "1". These confidences are exchanged between the decoders to bias their next estimations iteratively (see Fig. 1(b)). During this exchange, the produced information is interleaved following the same scheme as in the encoder.

Figure 1. Turbo-Encoder and Turbo-Decoder.

The exchange continues until a stop criterion, see [11], is fulfilled. The last soft-output is not modified and becomes the soft-output of the Turbo-Decoder (Λ_2). Its sign represents the 0/1 decision and its magnitude the confidence of the Turbo-Decoder in it.

3. The MAP Algorithm

Given the received samples of systematic and parity bits (channel values) for the whole block (y_0^N, where N is the block length), the MAP algorithm computes the probability for each bit to have been sent as d_k = 0 or d_k = 1. The logarithmic likelihood ratio (LLR) of these probabilities is the soft-output, denoted as:

Λ_k = log ( Pr{d_k = 1 | y_0^N} / Pr{d_k = 0 | y_0^N} ).    (1)

Equation (1) can be expressed using three probabilities, which refer to the encoder states S_m^k, where k ∈ {0..N} and m, m' ∈ {1..8}:

The branch metric γ_{m,m'}^{k,k+1}(d_k) is the probability that a transition between S_m^k and S_{m'}^{k+1} has taken place. It is derived from the received signals, the a-priori information given by the previous decoder, the code structure and the assumption of d_k = 0 or d_k = 1; for details see [12]. From these branch metrics the probability α_m^k that the encoder reached state S_m^k, given the initial state and the received sequence y_0^k, is computed through a forward recursion:

α_{m'}^k = Σ_m α_m^{k−1} · γ_{m,m'}^{k−1,k}.

Performing a backward recursion yields the probability β_{m'}^{k+1} that the encoder has reached the (known) final state given the state S_{m'}^{k+1} and the remainder of the received sequence y_{k+1}^N:

β_m^k = Σ_{m'} β_{m'}^{k+1} · γ_{m,m'}^{k,k+1}.


αs and βs are both called state metrics. Equation (1) can be rewritten as:

Λ_k = log ( Σ_m Σ_{m'} α_m^k · β_{m'}^{k+1} · γ_{m,m'}^{k,k+1}(d_k = 1) / Σ_m Σ_{m'} α_m^k · β_{m'}^{k+1} · γ_{m,m'}^{k,k+1}(d_k = 0) ).    (2)

The original probability-based formulation as presented here involves a lot of multiplications and has thus been ported to the logarithmic domain to become the Log-MAP Algorithm: Multiplications turn into additions, and additions into maximum selections with additional correction terms (also known as the Jacobian logarithm), see [12] for details.

Arithmetic complexity can further be reduced by omitting the correction term (Max-Log-MAP Algorithm), which in general leads to a slight loss in communications performance (about 0.1–0.3 dB). However, depending on the accuracy of channel estimation and quantization, the Max-Log-MAP Algorithm may outperform a Log-MAP Algorithm which uses false assumptions [13]. Log-MAP and Max-Log-MAP algorithm are common practice in state-of-the-art implementations.
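To make the Log-MAP/Max-Log-MAP distinction concrete, the following minimal Python sketch (not the paper's VHDL model; function names and data layout are our own) implements the max* operator and one step of a log-domain forward recursion over a generic trellis:

```python
import math

def max_star(a, b, log_map=True):
    # Jacobian logarithm: ln(e^a + e^b) = max(a, b) + ln(1 + e^-|a-b|).
    # Dropping the correction term yields the Max-Log-MAP approximation.
    m = max(a, b)
    if log_map:
        m += math.log1p(math.exp(-abs(a - b)))
    return m

def forward_step(alpha_prev, gamma, predecessors, log_map=True):
    # Log-domain forward recursion: multiplications become additions,
    # additions become max* selections over all predecessor states.
    alpha = {}
    for m_next, preds in predecessors.items():
        acc = None
        for m in preds:
            term = alpha_prev[m] + gamma[(m, m_next)]
            acc = term if acc is None else max_star(acc, term, log_map)
        alpha[m_next] = acc
    return alpha
```

The backward recursion and the LLR of Eq. (2) follow the same pattern, with max* replacing the sums over (m, m').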

The branch metrics are computed along with the α recursion, and the soft-output in parallel to the β recursion; therefore, only one set of state metrics, either αs or βs, has to be stored. In Fig. 2 the recursions associated with branch metric calculation and state metric storage are depicted with slim lines, whereas thick lines indicate recursions with combined state metric retrieval and LLR calculation.

The data dependency throughout the whole block can be "loosened" by starting the recursions at arbitrary positions in the block using approximated initialization values (see Fig. 2). For this, a recursion on a certain number of preceding bits (acquisition length AL) must be performed to obtain sufficiently accurate estimates. Windowing [3] exploits this property to divide the data into sub-blocks. These sub-blocks can be processed in a serial (SMAP) or parallel (PMAP) manner. Figure 2(b) shows a windowed SMAP with a sub-block length of WL.

Figure 2. Windowing schemes for the MAP algorithm.

Memory is reduced by storing the state metrics of one sub-block only and decoding several sub-blocks sequentially on the same hardware. The optional mapping of sub-blocks to individual nodes for parallel processing allows hardware to be traded off against latency. Section 6.2 presents a detailed discussion of further hardware implications of windowing.
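As an illustration of windowing with acquisition, the short Python sketch below (our own formulation, not from the paper; the WL and AL values are purely illustrative) partitions a block into sub-blocks and attaches to each the acquisition region from which the backward recursion obtains its initial state metrics:

```python
def windows_with_acquisition(block_len, wl, al):
    # Each sub-block [start, end) gets an acquisition run of up to al bits
    # beyond its end (clamped at the block boundary), giving the backward
    # recursion sufficiently accurate initial state metric estimates.
    windows = []
    for start in range(0, block_len, wl):
        end = min(start + wl, block_len)
        windows.append({"window": (start, end),
                        "acquisition": (end, min(end + al, block_len))})
    return windows

# Example: a 5114-bit 3GPP block, WL = 64, AL = 32 (illustrative values).
print(windows_with_acquisition(5114, 64, 32)[:2])
```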

4. Related Work

Part of this paper was presented at the 2002 IEEE Workshop on Signal Processing Systems [14].

On MAP-Decoder optimizations, numerous publications exist. A thorough compilation can be found in [15] and its references; some important examples are as follows: MAP-Decoder processing scheme optimizations are presented in [4, 16, 17], implementation issues are discussed in [10, 18], and high-throughput MAP-Decoders in [3, 4, 19].

High-throughput Turbo-Decoders, however, are only recently emerging as the problem of concurrent interleaving is tackled. In [20], a scalable Turbo-Decoder architecture can be found that completely evades that problem: throughput increase is realized by cascading several component decoders. Although this circumvents the interleaver related problems, it is very inefficient in terms of area and is not taken into further consideration here.

In general, concurrent interleaving leads to memory access conflicts. One approach is to completely avoid the conflicts through careful design of the interleaver tables: The T@MPO core by IMEC provides a complete Turbo Codec with encoder and decoder [21]. It employs the conflict-free interleaver patterns presented in [7] and stores them for table lookup. While this approach plays an important role in providing low latency, it hampers general use, as only the T@MPO specific interleaver patterns can be employed. In [6], another high-throughput Turbo-Decoder with dedicated interleavers is presented that apparently uses an interleaver generation pattern from [22] to perform the interleaving "on-the-fly".


Though the same limitations on applications apply as for the T@MPO core, it comes with the advantage that no interleaver patterns have to be looked up. Thus no interleaver RAMs (or ROMs [21]) are needed, leading to a significant decrease in area and power consumption. Both approaches, however, limit the range of applications to those that allow the design aspects of the Turbo-Decoder to influence design decisions taken for the communications system.

Standard conformity can only be guaranteed when any given interleaver pattern can be used. Therefore the conflicts during interleaving cannot be avoided but must be resolved. When the interleaver pattern is known at design time, an optimal schedule for read/write sequences can be derived that minimizes the access conflicts and turns them into wait states [5]. Those wait states, introduced by stalling interleaving operations that would cause conflicts, however, increase the latency of the whole decoder.

In [8, 23] we have presented architectures capable of concurrent interleaving without stalling. The interleaver patterns, however, are still recommended to be known at design time for optimal buffer sizing, which ensures perfect communication over the interleaving network with minimum area overhead. This approach is presented and extended in Section 6.3.

I/O and interface issues have, to the best of our knowledge, not been investigated in depth yet.

5. New Approach

Efficiency of VLSI-architectures, modeled for example by [24] as

Efficiency = Throughput / (Area · Energy/Task),    (3)

has long been the only metric for design space exploration. Interconnect, communication and the feasibility of synthesis-based approaches in deep-submicron technologies, however, are key issues for implementation, which limit the design space significantly.

Figure 3. Parallelization layers of Turbo-Decoder Systems (Extract, Lower Levels omitted).

State-of-the-art synthesis tools can synthesize up to 100 K gates flat [9]. Thus complex designs must be composed of blocks with this size limitation. Top-level routing complexity, which leads to timing closure problems [25], has to be reduced by minimizing communication between the building blocks.

Our approach is to design synthesizable, mostly independent sub-blocks of less than 100 K gates with minimized global communication. Global communication is replaced by local communication through appropriate sub-block design.

6. Parallelization Layers

Within the design space (see Fig. 3) we mainly focus on the Turbo-Decoder level, in which we identified three distinct architectural layers: the I/O interface layer, the component decoder layer and the interleaver network layer. Investigations on lower levels will only be treated for the component decoder layer where appropriate.

6.1. I/O Interface Layer

The total time for processing a block is t_t = t_I + t_D + t_O, with t_I, t_O and t_D the times for input, output, and decoding. Assuming that each phase can be sped up by a factor N_x through appropriate parallelization, the processing times can be denoted as t_I = (BL/N_I) · t_cycle, t_O = (BL/N_O) · t_cycle, and t_D = (BL · 2 · It · C / N_D) · t_cycle, with BL the block length, It the number of iterations, and C a constant overhead factor between 1 and 1.5. Thus the system throughput T_S = BL/t_t is

T_S = BL / ( (BL/N_I) · t_cycle + (BL · 2 · It · C / N_D) · t_cycle + (BL/N_O) · t_cycle )
    = 1 / ( 1/N_I + 1/N_O + 2 · It · C / N_D ) · f_cycle.    (4)

Figure 4. I/O vs. Decoder parallelization.

The starting point of optimizations is a serial MAP with N_I = N_O = N_D = 1, i.e. no parallelization (see Figs. 2(b) and 4(a)). Decoding time is the dominant factor, but maximizing the parallelism of the decoder core alone (Fig. 4(b)) leads only to T_S < 1/2 · f_cycle for N_D → ∞. See Fig. 5(a), where 10 iterations and an overhead factor of C = 1.2 are assumed (experiments on realistic scenarios yielded C ≤ 1.18). As soon as t_D ≤ t_I + t_O, Amdahl's Law applies and decreasing t_D does not significantly improve system throughput.

t_I and t_O both occur in the denominator, hence the limitation to 1/2 · f_cycle. However, combining the input and output phases to perform in- and output in parallel comes at no extra cost: The input memories (channel values) are filled while the output memories (LLR) are read. No conflicts occur and Eq. (4) becomes:

T_ConcurrentIO = 1 / ( 1/N_IO + 2 · It · C / N_D ) · f_cycle,    (5)

see also Figs. 4(c) and 5(a). The throughput, however, still saturates quickly with increasing N_D (see Fig. 5(a)). Two countermeasures are possible: parallelized I/O and performing I/O and decoding concurrently.

Parallelized I/O is transferring N_IO data packets per cycle in parallel (see Fig. 4(d)). For this, N_IO times more I/O-ports and N_IO times broader busses are needed. The memories, however, are hardly affected as only the aspect ratio of the RAMs changes. The throughput can no longer be depicted by a simple plot; Figure 5(b) shows a two-dimensional throughput function.

Concurrent decoding and I/O (see Fig. 4(e)) leads to a demand for an additional set of I/O RAMs: one set for I/O and the other for decoding. As the I/O RAMs make up 80% of the total area, we would pay a major area penalty. On the other hand, assuming t_IO < t_D, relative throughput becomes a linear function of N_D (see Fig. 5(b)). Which parallelization is chosen is highly dependent on the design context.
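Equations (4) and (5) are easy to evaluate numerically. The sketch below (our own helper functions; the parameter values are illustrative and do not reproduce the exact figures of Table 7, which include further implementation effects) shows the saturation behavior discussed above:

```python
def throughput_serial_io(f_cycle, n_i, n_o, n_d, it, c):
    # Eq. (4): input, decoding and output phases run strictly one after another.
    return f_cycle / (1.0 / n_i + 1.0 / n_o + 2.0 * it * c / n_d)

def throughput_concurrent_io(f_cycle, n_io, n_d, it, c):
    # Eq. (5): input of the next block overlaps output of the previous one.
    return f_cycle / (1.0 / n_io + 2.0 * it * c / n_d)

f, it, c = 166e6, 6, 1.2  # clock, iterations, overhead factor (assumed values)
for n_d in (1, 4, 8, 16):
    print(n_d,
          round(throughput_serial_io(f, 1, 1, n_d, it, c) / 1e6, 1),
          round(throughput_concurrent_io(f, 2, n_d, it, c) / 1e6, 1))
```

As N_D grows, the serial-I/O variant saturates towards f_cycle/2, while the concurrent-I/O variant saturates towards N_IO · f_cycle.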


Figure 5. Throughput depending on parallelization levels and degrees.

6.2. Component Decoder Layer

Efficient implementation of the MAP-Decoder strongly depends on the chosen parallelization level. We refer to decoder parallelization as presented in [4, 15]. Due to interleaving, the second decoder (MAP2) has to wait for completion of the first decoder (MAP1) and vice versa, see Fig. 1. Thus only one MAP-Decoder, performing alternating MAP1 and MAP2 operations, is sufficient. Windowing is applied to divide the whole data block into sub-blocks, see Section 3.

The lowest level of parallelization we want to consider here is the recursion-unit level. It determines the number of state metrics which are updated per cycle. This and all lower levels of parallelization (as mentioned in [15]) are fully parallelized, i.e. all states of a Turbo-Code are calculated in parallel on dedicated hardware using parallel arithmetic units.

Figure 7 shows the architecture for a SMAP-Decoder corresponding to the windowing scheme in Fig. 2(b) without parallelization on recursion level. The forward recursion, backward acquisition, and backward recursion are performed serially on a single state metric unit. During forward recursion the α state metrics are stored in the α-RAM. These values are read during backward recursion to calculate the LLRs. Only one interleaver table is used, performing indirect addressing, see Section 6.3. The newly calculated LLRs are written back to the previously read address, therefore only one LLR-RAM is needed. The occurring read-write conflict during backward recursion/LLR calculation is solved by introducing a buffer of window size. This buffer stores the calculated branch metrics during forward recursion, which are then reused during backward recursion, so the LLRs used for their calculation can safely be overwritten by the newly produced values.

Exploiting the parallelism of the SMAP on recursion level leads to dedicated state metric units for the acquisition, the first, and the second recursion. The architecture of a SMAP with three recursion units depends on the sliding window scheme. According to the direction of the processing, two variants are shown (see Fig. 8). The windowing scheme with storage of the α state metrics during forward recursion is shown in Fig. 8(a). The LLRs are calculated in parallel to the backward recursion and are therefore in reverse order. Each recursion unit works in parallel to the others, thus requiring buffering of branch metrics.

Figure 6. Serial vs. Parallel Windowing Schemes.

Figure 7. SMAP-Decoder Architecture with one recursion unit.

The processing starts in step 1 by writing branch metrics for the first window into buffer 1 (w1). These values are read out of this buffer during the forward recursion (r1); in parallel, the backward acquisition of the second window with additional branch metric buffering in buffer 2 (w2) takes place (step 2). From step 3 on, all three recursion units are active: backward acquisition and branch metric buffering (w3) is done on the third window, whereas forward recursion is performed on the second window (r2) and backward recursion/LLR calculation is done on the first window (r1). From step 4 on, buffer 1 can be written again, because each buffer is busy (either written or read) for only three consecutive steps, and the sequence is repeated. Thus three buffers are sufficient.
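This three-buffer rotation can be checked with a small schedule simulation (a sketch under our own naming; windows and steps are numbered as in the text, buffers 0-indexed):

```python
NUM_WINDOWS, NUM_BUFFERS = 6, 3

for step in range(1, NUM_WINDOWS + 3):
    ops = []
    if step <= NUM_WINDOWS:                 # acquisition + branch metric write (w)
        ops.append(f"w{step} -> buf{(step - 1) % NUM_BUFFERS}")
    if 1 <= step - 1 <= NUM_WINDOWS:        # forward recursion read (r)
        ops.append(f"fwd r{step - 1} <- buf{(step - 2) % NUM_BUFFERS}")
    if 1 <= step - 2 <= NUM_WINDOWS:        # backward recursion / LLR read (r)
        ops.append(f"bwd r{step - 2} <- buf{(step - 3) % NUM_BUFFERS}")
    print(f"step {step}:", ", ".join(ops))
```

Each buffer is written at step k and last read at step k + 2, so it can be reused at step k + 3, confirming that a modulo-3 rotation never overwrites live data.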

Figure 8(b) shows the windowing scheme with storage of the β state metrics during backward recursion; the LLRs are produced in sequential order. If a Cyclic Redundancy Check (CRC) is attached to the data block, a simple on-the-fly calculation of the CRC sum for iteration control during MAP1 operation is possible. This advantage has to be paid for by additional latency and area. In contrast to Fig. 8(a), the buffers are busy for four consecutive steps. Although only three buffers are active in one step (e.g. at step 4), the data in the inactive buffer (in this example buffer 3) has to be preserved for backward and forward recursion. Therefore four branch metric buffers are necessary.

Both architectures perform in-place storage of the LLRs and only one interleaver table is used. Because LLRs are constantly read and written during decoding, two LLR-RAMs are necessary. A SMAP with storage of the α state metrics is superior in area and latency compared to a SMAP with storage of the β state metrics and is therefore chosen.

When the production of LLRs in the original sequence is mandatory, the SMAP with storage of the α state metrics can be extended with two 1-bit shift registers of window size which revert the bit sequence.

Figure 8. Windowing Schemes for SMAP with three recursionunits.

Through the additional latency for reverting, the latencies of both architectures become equal. The 1-bit shift registers, however, are significantly smaller than a branch metric buffer of window size. The SMAP with storage of the α state metrics thus maintains its clear area advantage even when used in conjunction with on-the-fly CRC calculation.

High-throughput MAP-Decoders can be built by using multiple MAP-Decoder units or by parallelizing the MAP-Decoder itself, decoding multiple windows in parallel (see Fig. 6). On the window level, a window requires a recursion for acquisition and subsequently a first and a second recursion for producing forward and backward state metrics. These recursions are the entities on that level and may be carried out in a pipelined fashion [4, 15] or folded onto the same hardware.

6.3. Interleaver Network Layer

Interleaving is scrambling the processing order to break up neighborhood relations. It is essential for the performance of Turbo-Codes. The interleaver specifies a permutation of the original processing order of the data sequence. Although in many systems, like UMTS, the interleaving process is specified by an algorithm, the algorithm is not used to do the interleaving but to build an interleaver table. Interleaver and deinterleaver tables contain one-to-one mappings of source addresses to destination addresses. Table 1 shows an example for reordering six LLRs to perform interleaving. Deinterleaving restores the original sequence again. (A 3GPP compliant table, for example, contains up to 5114 entries.)

One LLR has to be read for every LLR produced, compare Fig. 1. If a maximum of one LLR is produced per time-step (SMAP), interleaving can be performed on the fly through indirect addressing. Moreover, the interleaving and deinterleaving process can share one table: For interleaving, the table entries are used to address the source operands, whereas for deinterleaving the table entries are used as target addresses. Thus the order of LLRs within the RAM remains the same and in-place storage of the processed LLRs is enabled [26]; a short sketch of this shared-table mechanism follows Table 1. The implications of this on the possible sharing of a single LLR RAM have already been treated above in Section 6.2.

Table 1. Interleaver and deinterleaver tables for six LLRs.

Address Interleaved Address Deinterleaved

1 3 1 6

2 6 2 4

3 5 3 1

4 2 4 5

5 4 5 3

6 1 6 2
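The shared-table mechanism of Table 1 can be expressed in a few lines of Python (a sketch for illustration only; the paper's hardware uses indirect RAM addressing instead, and addresses are 0-indexed here):

```python
table = [2, 5, 4, 1, 3, 0]  # interleaver column of Table 1, 0-indexed

def interleave_read(llr_ram):
    # Interleaving: the table entries address the SOURCE operands.
    return [llr_ram[src] for src in table]

def deinterleave_write(llrs):
    # Deinterleaving: the same entries serve as TARGET addresses, so the
    # LLR order inside the RAM never changes (in-place storage, [26]).
    ram = [None] * len(llrs)
    for k, dst in enumerate(table):
        ram[dst] = llrs[k]
    return ram

ram = ["L1", "L2", "L3", "L4", "L5", "L6"]
print(deinterleave_write(interleave_read(ram)))  # original order restored
```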

Table 2. Interleaver table with according RAMs.

Source RAM   Rel. Addr.   Addr.   Interleaved   Target RAM   Rel. Addr.
⇒    1           1          1          3             1            3
     1           2          2          6             2            3
     1           3          3          5             2            2
⇒    2           1          4          2             1            2
     2           2          5          4             2            1
     2           3          6          1             1            1

The parallel architectures needed for higher throughput, however, produce and consume more than one LLR per time-step. Thus, multiple LLRs have to be read and written concurrently. Let the number of LLRs read and written be denoted as N.

The problem is best illustrated by taking the interleaver table of Table 1 for two concurrently produced LLRs and assigning its addresses to two individual RAMs. Table 2 shows the interleaver table entries together with the associated RAMs and relative addresses. (From now on, only the interleaver is mentioned. Of course, the same concepts apply to the deinterleaver as well.)

The number of write accesses can be determined from the interleaver tables and the production scheme: Assuming that the two LLRs are produced in the order of ascending relative addresses (i.e. in the first time-step at the absolute addresses 1 and 4) and interleaving is performed according to Table 2, Table 3 shows the resulting number of write accesses; a short sketch reproducing these counts follows the table.

In the first time-step, for example, one LLR is read from source RAM 1 (Addr. 1) and written to target RAM 1 (Addr. 3). The other one is read concurrently from source RAM 2 (Addr. 1) and written to target RAM 1 (Addr. 2), which results in two concurrent write accesses for target RAM 1.

Although each RAM is on average accessed once per time-step, the worst case is two concurrent accesses. Buffers have to be introduced to match the worst case to the average case.

Table 3. Write accesses to LLR RAMs.

Time-step   Write accesses to RAM 1   Write accesses to RAM 2
    1                  2                         0
    2                  0                         2
    3                  1                         1
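The access counts of Table 3 follow mechanically from Table 2; the short Python sketch below (our own formulation) reproduces them:

```python
interleaved = {1: 3, 2: 6, 3: 5, 4: 2, 5: 4, 6: 1}   # Table 2, absolute addresses
target_ram = lambda addr: 1 if addr <= 3 else 2      # addresses 1-3 -> RAM 1, 4-6 -> RAM 2

# Two producers emit LLRs with ascending relative addresses: (1,4), (2,5), (3,6).
for step, sources in enumerate([(1, 4), (2, 5), (3, 6)], start=1):
    writes = {1: 0, 2: 0}
    for src in sources:
        writes[target_ram(interleaved[src])] += 1
    print(f"time-step {step}: RAM 1 = {writes[1]}, RAM 2 = {writes[2]}")
```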


Such buffers, capable of reading multiple data in one cycle while outputting only one, have been presented in [8]. There, such buffers were added in front of each target RAM. Although this provides the desired functionality, it comes with undesirable implementation properties. A more sophisticated approach to Interleaver-Bottleneck-Breakers (IBB) was presented in [23], which is discussed in more detail and enhanced in the following:

A local IBB cell (see Fig. 9) is assigned to each LLR producer (which is a SMAP unit in our architecture). These cells are connected in a ring structure, limiting communication to the local producer, the local target RAM, and the left and right neighbors. Thus, we only have to deal with local interconnect.

Each LLR RAM is local to one producer. Interleaving is realized by passing the data plus associated target RAM and address from the local producer to the local IBB cell. Depending on its destination, it is either given back to be stored in the local LLR RAM or handed on to other IBB cells until its final destination is reached. There the IBB cell hands the data plus address to the associated producer for storage in its local LLR RAM.

In the ring interleaver bottleneck breaker (RIBB), each cell has a local LLR distributor, which only has to decide whether the incoming data set is to be stored in the local RAM or has to be sent left or right. The direction for non-local data is determined based on the shortest path to the target RAM. Two additional LLR distributors are necessary in each cell for the left and right inputs from the neighboring cells (see Fig. 9).

Figure 9. RIBB cell.

Data coming in from the left side can either be fed through or stored to the local RAM. The same holds for data from the right side, leading to very simple control.

In general, the interleaving architecture can be viewed in a networking context [27]. Thus, the RIBB cells can be regarded as nodes in a communication network and their interconnect as edges between those nodes. Each node is connected to Δ nodes via output ports and to Δ nodes via input ports. The local input from the producer and the local output to the local RAM are neglected for the time being. During interleaving, a data packet has to travel a certain distance through the network. Given the average distance D̄, the number of nodes N, and Δ, the traffic caused in the network can be derived.

The following assumptions are made for our investigations: The target addresses of incoming data are equally distributed over the address space, such that each incoming data has the probability of 1/N to be targeted at any specific node k. This holds for interleavers with good communications performance due to their spreading property. Moreover, each out-port is supposed to transmit at most one data/address packet per clock cycle. This is due to the implementation of the buffers being capable of storing multiple packets while retrieving only one.

As every node communicates with all other nodes, we can use the assumptions above to model our communication independent of the actual number of data to be transmitted. For the capacity it is sufficient to assume that each node transmits one packet to all other nodes. Out of N packets sent per node, one is kept locally and N − 1 are transmitted to neighboring nodes. Thus, in the N cycles needed to address all nodes, N · (N − 1) packets are transmitted through the network. As each edge can communicate one packet per cycle, the capacity of the network is sufficient when in those N cycles equal or less than N packets pass through each out-port.

Let E be the number of edges, Δ the node out-degree, and d_{j,k} the length of the shortest path between the nodes j and k; then the average distance between any two nodes is

D̄ = 1/(N (N − 1)) · Σ_{j,k ∈ [1..N], j ≠ k} d_{j,k}.

The N · (N − 1) messages must be conveyed in N cycles and each message passes on average D̄ edges. Therefore, (N − 1) · D̄ edges are passed each cycle. A necessary but not sufficient condition for a network capable of interleaving is that not more than E edges can be passed within one clock cycle, i.e.

E ≥ D̄ · (N − 1)
E/N ≥ D̄ · (N − 1)/N    (6)
Δ ≥ D̄ · (1 − 1/N),

which also determines the minimum out-degree given the average distance and the number of nodes (please note that the average distance, in general, depends on the number of nodes). For the RIBB, the average distance can be determined as:

D̄ = N · (N/2 + 2 · Σ_{k=1}^{N/2−1} k) / (N · (N − 1)) = N² / (4 · (N − 1)).    (7)

Using Eq. (7) in Eq. (6), and setting Δ = 2 as is appropriate for the ring, yields:

2 ≥ N² / (4 · (N − 1)) · (N − 1)/N = N/4    (8)

N ≤ 8.

Thus a ring type connection is not capable of connecting more than 8 producers. We therefore improve the scheme and expand it towards a chordal ring with Δ = 3 (see Fig. 10). A simple routing scheme is employed: The data direction (left/right) is derived from the target node number and maintained until the target is reached. The additional edges are used when the target is located more than W steps away in the given direction.

Figure 10. Extract of a chordal ring with W = 4.

Table 4. Average number of registers per buffer.

N R̄Left R̄Local R̄Right R̄Chord

2 2 38 2 –

4 4 34 4 –

6 6 29 7 –

8 17 19 17 –

12 9 22 8 3

16 17 16 15 4

20 16 28 17 12

22 18 26 18 17


With W the number of nodes being "by-passed" through the additional edges, the average distance for a chordal ring can be denoted as:

D̄ = 1/(N − 1) · [ Σ_{k=1}^{N/2} ( (k mod W) + ⌊k/W⌋ ) + Σ_{k=1}^{N/2−1} ( 1 + ((k − 1) mod W) + ⌊(k − 1)/W⌋ ) ].    (9)

Evaluating Eq. (9) in the context of Eq. (6) yields an upper bound of 24 nodes for W ∈ {3, 4, 5} and less for any other W. However, as the latency of the whole network is determined by the worst traffic over a single edge instead of the average traffic, this bound cannot be reached. We found the chordal ring with W = 4 best balanced with respect to traffic over individual nodes. It allows us to connect up to 22 nodes.
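The bounds derived from Eqs. (6), (7) and (9) can be verified numerically. The sketch below (our own helper functions; even N assumed, as in the derivation) recovers N ≤ 8 for the ring and the average-traffic upper bound of 24 for the chordal ring with W = 4:

```python
def avg_distance_ring(n):
    # Eq. (7): average shortest-path distance on a ring (out-degree 2, even n).
    return n * n / (4.0 * (n - 1))

def avg_distance_chordal(n, w):
    # Eq. (9): average distance on a chordal ring whose chords bypass w nodes.
    s = sum((k % w) + k // w for k in range(1, n // 2 + 1))
    s += sum(1 + ((k - 1) % w) + (k - 1) // w for k in range(1, n // 2))
    return s / (n - 1)

def capacity_ok(d_bar, delta, n):
    # Necessary condition (6): out-degree must cover the average edge traffic.
    return delta >= d_bar * (1.0 - 1.0 / n)

print(max(n for n in range(2, 40, 2) if capacity_ok(avg_distance_ring(n), 2, n)))        # 8
print(max(n for n in range(4, 40, 2) if capacity_ok(avg_distance_chordal(n, 4), 3, n)))  # 24
```

Note that this checks only the average-traffic condition; the practical limit of 22 nodes quoted above stems from the worst-case traffic over individual edges.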

The size of the buffers shown in Fig. 9 is dominated by the number of registers R. This number can be obtained at design time by profiling the RTL model. We derived the number of registers in each buffer for all UMTS interleavers through automated profiling. Thus we can guarantee error-free inter-node communication for any interleaver length specified by [2], which ensures standard conformity. The average number of registers per buffer is depicted in Table 4.

7. Scalable Architecture

Implementation of scalable architectures requires the selection of appropriate levels of parallelization from a large design space.


Table 5. 3GPP compliant SMAP at 133 MHz, 3.8 Mbit/s at 6 Turbo-Iterations.

Type of units          # of units   Area (mm²)   Area (gate equivalents)   Area ratio (%)
I/O-Data RAM               4           1.891            154 K                  60.4
Interleaver RAM            1           0.655             53 K                  20.9
Branch-metric RAM          1           0.040              3 K                   1.3
State-metric RAM           1           0.248             20 K                   7.9
State-metric unit          1           0.070              6 K                   2.3
LLR-calculation unit       1           0.136             11 K                   4.3
Interface/control          1           0.091              7 K                   2.9
Total area                             3.131            255 K                 100

For this, we analyzed a SMAP based Turbo-Decoder implementation according to Fig. 2(b), a windowed serial MAP with one recursion unit, to identify optimization potential. Table 5 shows the synthesis results for this Turbo-Decoder on a 0.18 µm standard cell library using Synopsys Design Compiler. Two conclusions can be drawn immediately.

First, more than 80% of the area in Table 5 is dedicated to input and output RAMs. Thus parallelization of the decoder itself only affects 20% of the total area and must be more efficient (according to Eq. (3)) than duplication of complete MAP- or Turbo-Decoders. Second, using multiple, dedicated state-metric units for acquisition, forward and backward recursion increases the total area only slightly. Yet it improves throughput at least by a factor of two, compare Fig. 2(b) and Fig. 6(a) and see also [10]. Using dedicated units for each task also increases the maximum clock frequency, as multiplexers for hardware sharing are removed from the critical path.

MAP-Decoder throughput can be further increased following two different approaches: unrolling and pipelining of the recursions (parallelization on the window level) or decoding of multiple windows in parallel on dedicated sub-block units (parallelization on the component decoder level). For the decision which of these approaches should be followed, the 100 K gates limit for modules to be synthesized comes into effect.

Unrolling and pipelining the recursions, as in [3, 4, 15], results in a large number of recursion units per module. Even with maximum hardware folding, these architectures are too large to be implemented in a 100 K gate module, thus leading to timing closure problems when using state-of-the-art synthesis.

Figure 11. Double Windowing Scheme—Dataflow partitioning into sub-blocks and mapping onto SMAP units.

We therefore parallelize on the component decoder level, which allows us to distribute the recursion units over several modules. A double windowing scheme is employed (see Fig. 11): The input data is partitioned into sub-blocks according to the optimized PMAP scheme in Fig. 6(c), where one window translates to one sub-block. Each sub-block is associated with one component decoder. This minimizes communication and synchronization of adjacent component decoders for acquisition and state-metric exchange. A SMAP scheme (as in Fig. 6(a)) is employed by each decoder to process its own sub-block. Thus we process multiple sub-blocks in parallel, each with a dedicated SMAP unit. This even allows us to integrate one SMAP unit and the associated interleaving node within one module with a complexity of less than 100 K gates. Figure 11 shows the overall architecture of a four SMAP implementation together with the mapping of the sub-blocks onto it. From this, the communication needs of the SMAP units (at the borders of their sub-blocks) become obvious.
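The double windowing partitioning can be sketched as follows (our own formulation; block, sub-block and window lengths are illustrative):

```python
def double_windowing(block_len, n_smap, wl):
    # Outer level: one contiguous sub-block per SMAP unit (PMAP scheme).
    # Inner level: each unit decodes its sub-block window by window (SMAP scheme).
    sub_len = -(-block_len // n_smap)  # ceiling division
    mapping = []
    for unit in range(n_smap):
        start = unit * sub_len
        end = min(start + sub_len, block_len)
        windows = [(s, min(s + wl, end)) for s in range(start, end, wl)]
        mapping.append((unit, (start, end), windows))
    return mapping

for unit, sub_block, windows in double_windowing(5114, 4, 64):
    print(f"SMAP unit {unit}: sub-block {sub_block}, {len(windows)} windows")
```

Only the sub-block borders require acquisition and state-metric exchange between adjacent units, which is what keeps inter-module communication local.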

The inevitable high-throughput communication between nodes is implemented using one of our interleaving networks (see Section 6.3), depending on the degree of parallelization. Eight nodes can be connected using the ring topology and up to 22 nodes with the chordal ring topology. Using these architectures allows a floorplan in which the longest physical distance over which two nodes have to communicate is one tile for an eight SMAP architecture and two tiles for a sixteen SMAP architecture (see Fig. 12).


Figure 12. Sample Floorplan for eight (a) and sixteen (b) sub-blocks.

The wirelength is hence independent of the number of nodes. Global communication is limited to the latency-insensitive distribution of input and output data.

8. Results

We developed a synthesizable VHDL model with fully parameterizable quantization, normalization, Max-Log-MAP or Log-MAP selection, window length, acquisition length, maximum block length, interleaving network topology and number of SMAP units in the component decoder. The model implements a 3GPP standard [2] compliant Turbo-Decoder, yet with scalable throughput.

Table 6. Implementation results, 166 MHz clock frequency and 6 Turbo-Iterations.

Parallel SMAP Units N_D      1      4      6      6       6       8      8      16     16
Parallel I/O N_IO            1      1      1      2    con. I/O   1      2      1      2

Area per SMAP Unit (mm²)
I/O-RAM                    2.364  0.853  0.795  0.753   1.590   0.607  0.773  0.49   0.68
Local RAM                  0.312  0.288  0.288  0.288   0.288   0.288  0.288  0.288  0.288
Logic                      0.529  0.529  0.529  0.529   0.529   0.529  0.529  0.529  0.529

Area per RIBB Cell (mm²)
Interleaver-RAM            0.655  0.457  0.436  0.436   0.436   0.329  0.329  0.250  0.250
Local buffer                NA    0.154  0.134  0.134   0.134   0.208  0.208  0.239  0.239
Logic                       NA    0.016  0.016  0.016   0.016   0.020  0.020  0.023  0.023

Interface/Ctrl. (mm²)      0.040  0.050  0.063  0.063   0.083   0.075  0.075  0.09   0.09
Total area (mm²)           3.900  9.290  13.287 12.999  18.041  16.000 17.250 29.194 32.234
Energy per block (µJ)      48.65  51.68  55.21  50.87   55.21   57.64  55.18  71.81  65.02
Throughput (Mbit/s)        11.7   39.0   50.6   59.6    72.6    59.7   72.7   81.9   108.6
Efficiency (norm.)         1      1.32   1.12   1.47    1.19    1.05   1.24   0.633  0.84

Communications performance has been evaluated using a 3GPP compliant downlink simulation chain (static cases, multi-path cases, etc.). The model has been synthesized on a 0.18 µm standard cell library using Synopsys Design Compiler.

Table 6 presents the synthesis results for various architecture derivatives. Timing and area are extracted using the worst-case library, while energy consumption is derived with the best-case library. (Please note that through a typo in the original publication [14] the energy was falsely declared in mJ, whereas in fact it is in µJ.)


Table 7. Throughput in Mbit/s for various N_D, N_IO using a 5114-bit block.

Number of SMAP units (ND)

NIO 1 2 4 6 8 10 12 14 16 18 20 22

1 12.3 23.0 39.0 50.6 59.7 64.6 72.1 76.9 81.9 84.2 88.6 89.3

2 12.7 24.7 44.2 59.6 72.7 80.2 92.1 99.9 108.6 112.7 120.7 122.0

4 13.0 25.6 47.4 65.5 81.6 91.2 106.8 117.6 129.8 135.6 147.4 149.4

8 13.1 26.1 49.1 68.9 86.9 97.9 116.2 128.9 143.8 151.0 165.8 168.3

Maximum clock frequencies of 166 MHz for Log-MAP and about 200 MHz for Max-Log-MAP can be achieved. For our design example we consider the use of the Log-MAP algorithm and 6 Turbo iterations.

The throughput for various degrees of parallelization on different layers is depicted in Table 7. The importance of I/O parallelization for higher degrees of component decoder parallelization is evident. For 100 Mbit/s, for example, several implementations with various degrees of parallelization are worth taking into account, see Table 7. Which one is actually chosen is highly dependent on the design context.

Though we can precisely predict the throughput from our model, area and power figures are estimates only, as data about the specific layout, clock tree etc. are not available at this stage. Thus exact costs of interconnect, important for the trade-off between busses, logic and memory in the case of I/O parallelization, must be determined using the specific target technology later on. Moreover, no low-power techniques, like clock-gating or voltage scaling, are employed in these examples yet. Neither is an elaborate memory hierarchy derived for a specific point in the design space. For high degrees of parallelization, different memory generators can be used for smaller RAMs, which leads to significantly lower area and power consumption. For the 16 producers with an I/O parallelization of two, we could thus lower the energy consumption by 44.9%, leading to an energy consumption of 35.85 µJ/block or 7 nJ/bit. This is not shown in detail, as our intention is only to provide a design space exploration, which limits the range of promising implementations on which different optimizations should be applied before a specific one is chosen.

9. Conclusion

We combine the design space exploration of high-throughput Turbo-Decoders with state-of-the-art synthesis methodologies. The appropriate levels of parallelism are identified, allowing us to derive, for a given throughput, the most efficient architecture implementable using current design tools and technologies. Our synthesis results for 3GPP standard compliant high-throughput Turbo-Decoders demonstrate our approach.

Future work will focus on further improving the interleaving network and on closer integration of deep-submicron place and route aspects. Detailed comparisons of our synthesis approach to custom designs are also envisioned.

Acknowledgments

Part of this work has been sponsored by a cooperation with ST Microelectronics. Our special thanks go to Friedbert Berens from the Advanced System Technology Group of STM, Geneva, Switzerland.

References

1. C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon Limit Error-Correcting Coding and Decoding: Turbo-Codes," in Proc. 1993 International Conference on Communications (ICC '93), Geneva, Switzerland, 1993, pp. 1064–1070.

2. Third Generation Partnership Project. 3GPP home page. www.3gpp.org.

3. H. Dawid, G. Gehnen, and H. Meyr, "MAP Channel Decoding: Algorithm and VLSI Architecture," in VLSI Signal Processing VI, IEEE, 1993, pp. 141–149.

4. A. Worm, H. Lamm, and N. Wehn, "Design of Low-Power High-Speed Maximum a Posteriori Decoder Architectures," in Proc. 2001 Design, Automation and Test in Europe (DATE '01), Munich, Germany, 2001, pp. 258–265.

5. T. Richter and G. Fettweis, "Parallel Interleaving on Parallel DSP Architectures," in Proc. 2002 Workshop on Signal Processing Systems (SiPS '02), 2002, pp. 195–200.

6. R. Dobkin, M. Peleg, and R. Ginosar, "Parallel VLSI Architecture for MAP Turbo Decoder," in 13th International Symposium on Personal, Indoor and Mobile Radio Communications 2002, 2002, vol. 1, pp. 384–388.

7. A. Giulietti, L. van der Perre, and M. Strum, "Parallel Turbo Decoding Interleavers: Avoiding Collisions in Accesses to Storage Elements," Electronics Letters, vol. 38, no. 5, 2002, pp. 232–234.

8. M.J. Thul, N. Wehn, and L.P. Rao, "Enabling High-Speed Turbo-Decoding Through Concurrent Interleaving," in Proc. 2002 IEEE International Symposium on Circuits and Systems (ISCAS '02), Phoenix, Arizona, USA, 2002, pp. 897–900.

9. D. Sylvester and K. Keutzer, "Rethinking Deep-Submicron Circuit Design," IEEE Computer, vol. 32, no. 11, 1999, pp. 25–33.

10. J. Vogt, K. Koora, A. Finger, and G. Fettweis, "Comparison of Different Turbo Decoder Realizations for IMT-2000," in Proc. 1999 Global Telecommunications Conference (Globecom '99), Rio de Janeiro, Brazil, 1999, vol. 5, pp. 2704–2708.

11. F. Gilbert, F. Kienle, and N. Wehn, "Low Complexity Stopping Criteria for UMTS Turbo-Decoders," in Proc. 2003-Spring Vehicular Technology Conference (VTC Spring '03), Jeju, Korea, 2003.

12. P. Robertson, P. Hoeher, and E. Villebrun, "Optimal and Sub-Optimal Maximum a Posteriori Algorithms Suitable for Turbo Decoding," European Transactions on Telecommunications (ETT), vol. 8, no. 2, 1997, pp. 119–125.

13. F. Kienle, H. Michel, F. Gilbert, and N. Wehn, "Efficient MAP-Algorithm Implementation on Programmable Architectures," in Kleinheubacher Berichte 2003, Miltenberg, Germany, vol. 46, 2002, to appear.

14. M.J. Thul, F. Gilbert, T. Vogt, G. Kreiselmaier, and N. Wehn, "A Scalable System Architecture for High-Throughput Turbo-Decoders," in Proc. 2002 Workshop on Signal Processing Systems (SiPS '02), San Diego, California, USA, 2002, pp. 152–158.

15. A. Worm, "Implementation Issues of Turbo-Decoders," PhD Thesis, Institute of Microelectronic Systems, Department of Electrical Engineering and Information Technology, University of Kaiserslautern, ISBN 3-925178-72-4, 2001.

16. C. Schurgers, M. Engels, and F. Catthoor, "Energy Efficient Data Transfer and Storage Organization for a MAP Turbo Decoder Module," in Proc. 1999 International Symposium on Low Power Electronics and Design (ISLPED '99), San Diego, California, USA, 1999, pp. 76–81.

17. Z. Wang, H. Suzuki, and K.K. Parhi, "VLSI Implementation Issues of Turbo Decoder Design for Wireless Applications," in Proc. 1999 Workshop on Signal Processing Systems (SiPS '99), Taipei, Taiwan ROC, 1999, pp. 503–512.

18. A. Worm, H. Michel, F. Gilbert, G. Kreiselmaier, M.J. Thul, and N. Wehn, "Advanced Implementation Issues of Turbo-Decoders," in Proc. 2nd International Symposium on Turbo Codes & Related Topics, Brest, France, 2000, pp. 351–354.

19. Z. Wang, Z. Chi, and K. Parhi, "Area-Efficient High Speed Decoding Schemes for Turbo/MAP Decoders," in Proc. 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), Salt Lake City, UT, USA, vol. 4, 2001, pp. 2633–2636.

20. T. Miyauchi, K. Yamamoto, T. Yokokawa, M. Kan, Y. Mizutani, and M. Hattori, "High-Performance Programmable SISO Decoder VLSI Implementation for Decoding Turbo Codes," in Global Telecommunications Conference, 2001 (GLOBECOM '01), San Antonio, TX, USA, vol. 1, 2001, pp. 305–309.

21. A. Giulietti, B. Bougard, V. Derudder, S. Dupont, J.-W. Weijers, and L. van der Perre, "A 80 Mb/s Low-Power Scalable Turbo Codec Core," in Custom Integrated Circuits Conference (CICC) 2002, Orlando, Florida, USA, 2002, pp. 389–392.

22. S. Crozier, "New High-Spread High-Distance Interleavers for Turbo-Codes," in Proc. 20th Biennial Symposium on Communications, Kingston, Canada, 2000, pp. 3–7.

23. M.J. Thul, F. Gilbert, and N. Wehn, "Optimized Concurrent Interleaving for High-Throughput Turbo-Decoding," in Proc. 9th IEEE International Conference on Electronics, Circuits and Systems (ICECS '02), Dubrovnik, Croatia, 2002, pp. 1099–1102.

24. P. Pirsch, Architectures for Digital Signal Processing, John Wiley & Sons, Inc., 1998.

25. L. Benini and G.D. Micheli, "Networks on Chips: A New SoC Paradigm," IEEE Computer, vol. 35, no. 1, 2002, pp. 70–78.

26. C. Schurgers, F. Catthoor, and M. Engels, "Optimized MAP Turbo Decoder," in Proc. 2000 Workshop on Signal Processing Systems (SiPS '00), Lafayette, Louisiana, USA, 2000, pp. 245–254.

27. M.J. Thul, F. Gilbert, and N. Wehn, "Concurrent Interleaving Architectures for High-Throughput Channel Coding," in Proc. 2003 Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Hong Kong, P.R. China, 2003.

Michael J. Thul studied Electrical Engineering at the University of Kaiserslautern and received his diploma in 2000. He is working towards his PhD at the same university. His research interests are efficient data transfer and storage concepts for distributed high-throughput architectures, with special emphasis on channel coding.
[email protected]

Frank Gilbert studied Computer Science at the University of Kaiserslautern and received his diploma in 1998. He finished his PhD in Electrical Engineering at the same university in 2003. His research interests include low-power techniques, multi-processor architectures, and parallel [email protected]


Timo Vogt studied Electrical Engineering at the University of Kaiserslautern and received his diploma in 2001. He is working towards his PhD at the same university. His research interests are VLSI architectures and interconnect issues for [email protected]

Gerd Kreiselmaier studied Electrical Engineering at the University of Kaiserslautern and received his diploma in 1998. He is working towards his PhD at the same university. His research interests are efficient VLSI architectures for multi-mode channel [email protected]

Norbert Wehn is professor in the Department of Electrical Engineering at the University of Kaiserslautern and chair of the Microelectronic System Design Research Group. His interests include VLSI architectures, design methodologies, and embedded DRAM. Wehn has a diploma and PhD from the Darmstadt University of Technology. He is a member of VDE, the IEEE, and the board of the European Design Automation Association. He was program chair of the DATE Conference 2003 and is general chair of DATE [email protected]