Design Space Exploration of Hard-Decision Viterbi Decoding Algorithm and VLSI Implementation

    This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

    Design Space Exploration of Hard-Decision Viterbi Decoding: Algorithm and VLSI Implementation

    Irfan Habib, Özgün Paker, Member, IEEE, and Sergei Sawitzki, Member, IEEE

    Abstract: The Viterbi algorithm is widely used as a decoding technique for convolutional codes as well as a bit detection method in storage devices. The design space for VLSI implementation of Viterbi decoders is huge, involving choices of throughput, latency, area, and power. Even for a fixed set of parameters like constraint length, encoder polynomials, and trace-back depth, the task of designing a Viterbi decoder is quite complex and requires significant effort. Sometimes, due to incomplete design space exploration or incorrect analysis, a suboptimal design is chosen. This work analyzes the design complexity by applying most of the known VLSI implementation techniques for hard-decision Viterbi decoding to a different set of code parameters. The conclusions are based on real designs for which actual synthesis and layouts were obtained. In the authors' view, due to the depth covered, it is the most comprehensive analysis of the topic published so far.

    Index Terms: Viterbi algorithm, VLSI design, survey.

    I. INTRODUCTION

    The Viterbi decoder is one of the most widely used components in digital communications and storage devices. Although its VLSI implementation has been studied in depth over the last decades, every new design still starts with design space exploration. This can be partially explained by the fact that the design space is huge. In addition, the optimization criteria and the design figures keep changing with the advancement in CMOS technology and design tools.

    Different design aspects of the Viterbi decoder have been studied in a number of research papers [1]-[7]. However, most researchers concentrate on one specific component of the design (e.g., path metric unit or survivor memory unit). Somewhat more general studies are presented in [1], [8], and [9]. Still, in the authors' view, a systematic and comprehensive analysis summarizing and characterizing as many of the trade-offs and implementation techniques as possible is missing. This contribution presents such a survey, providing designers with clear guidelines and references to find the best solution for every specific case.

    The scope of the paper is limited to the classical Viterbi algorithm. Trellis decoding techniques like MAP decoding and the T- and M-algorithms are not discussed in detail. In addition, almost all of the studied design optimizations, although proven by synthesis for a specific target CMOS technology, are technology independent, so the conclusions should remain valid for the next two to three solid-state device generations. For this reason, transistor and interconnect level improvements are not considered in this paper.

    Manuscript received July 30, 2008; revised December 04, 2008.
    I. Habib is with the Corporate I&T, NXP Semiconductors, 5656AE Eindhoven, The Netherlands (e-mail: [email protected]).
    Ö. Paker is with the Advanced R&D, ST-Ericsson, 5656AE Eindhoven, The Netherlands (e-mail: [email protected]).
    S. Sawitzki is with the University of Applied Sciences FH Wedel, 22880 Wedel, Germany (e-mail: [email protected]).
    Digital Object Identifier 10.1109/TVLSI.2009.2017024

    We assume that the reader is familiar with convolutional codes and basic implementation techniques of the Viterbi algorithm, so we limit the discussion of these subjects in the following sections to a minimum. A good summary of references to both topics can be found in [8] and [10].

    The rest of this paper is organized as follows. Section II gives a brief description of convolutional codes and the Viterbi algorithm and introduces the parameters influencing the design complexity. Sections III-V discuss design decisions for every subunit of the decoder. Finally, Section VI draws some conclusions.

    All area and timing results presented in this paper are based on the standard cell 90-nm CMOS technology.

    II. GENERAL STRUCTURE OF THE VITERBI ALGORITHM

    The Viterbi decoding algorithm is the most popular method to decode convolutional error correcting codes. In a convolutional encoder, an input bitstream is passed through a shift register. Input bits are combined using binary single-bit addition (XOR) with several outputs of the shift register cells. The resulting output bitstreams represent the encoded input bitstream. Generally speaking, every input bit is encoded using n output bits, so the coding rate is defined as 1/n (or k/n if k input bits are used). The constraint length K of the code is defined as the length of the shift register plus one. Finally, generator polynomials define which bits in the input stream have to be added to form the output. An encoder is completely described by n polynomials of degree K-1 or less. Fig. 1 shows an example of a simple convolutional encoder together with the corresponding parameters.
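    The encoder described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name conv_encode, the default generator polynomials (7 and 5 in octal, a common textbook choice for K = 3) and the tap ordering are assumptions and are not taken from Fig. 1.

```python
def conv_encode(bits, polys=(0b111, 0b101), K=3):
    """Encode a list of bits with a rate-1/n convolutional code (sketch).

    polys: one K-bit generator polynomial per output bit; the MSB taps the
           current input bit (assumed convention).
    """
    state = 0  # contents of the K-1 shift register cells
    out = []
    for b in bits:
        window = (b << (K - 1)) | state            # current input plus register
        for g in polys:
            # XOR (parity) of all tapped positions gives one output bit
            out.append(bin(window & g).count("1") & 1)
        state = window >> 1                        # shift register advances
    return out


print(conv_encode([1, 0, 1, 1]))
```

    Each input bit produces len(polys) output bits, so the sketch has coding rate 1/n with n = 2 for the default polynomials.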

    As can be easily recognized, a convolutional encoder forms a finite-state machine (FSM), whose state is described by the contents of the shift register. Every new input bit processed by the encoder leads to a state transition. One widely used presentation form of these state transitions is the so-called trellis diagram. An example of such a diagram is shown in Fig. 2. Note that for the state encoding, represents the least significant bit and the most significant bit respectively. A transition in the trellis diagram is called a branch. For a binary convolutional code, every state in the trellis has two incoming branches representing the transmission of one and zero, respectively.

    1063-8210/$25.00 © 2009 IEEE

    Authorized licensed use limited to: The University of Toronto. Downloaded on July 13, 2009 at 10:08 from IEEE Xplore. Restrictions apply.


    Fig. 1. Convolutional encoder (index denotes timing in clock cycles).

    Fig. 2. Trellis diagram.

    Every state transition in the encoder results in one codeword being produced. After transmission over a noisy channel, the codewords may get corrupted. The Viterbi decoder reconstructs the initial input sequence of the encoder by calculating the most probable sequence of state transitions. This is done by tracing the trellis in a reverse manner while looking at the sequence of incoming codewords. Every codeword is produced as a result of a combination of a certain input bit with a specific encoder state. This means that the likelihood of the state transitions can be calculated even if received codewords are corrupted. By comparing the likelihood values, some transitions can be discarded immediately, thereby pruning the search space. Once the state transition sequence is determined, the reconstruction of the transmitted bit sequence is trivial.

    At the top level, the Viterbi decoder consists of three units: the branch metric unit (BMU), the path metric unit (PMU), and the survivor memory unit (SMU). The BMU calculates the distances from the received (noisy) symbols to all legal codewords. In case of the encoder represented in Fig. 1, the only legal codewords are 00, 01, 10, and 11. The measure calculated by the BMU can be the Hamming distance in case of hard input decoding or the Manhattan/Euclidean distance in case of soft input decoding (e.g., every incoming symbol is represented using several bits).

    The PMU accumulates the distances of the single codeword metrics produced by the BMU for every state. Under the assumption that zero or one was transmitted, the corresponding branch metrics are added to the previously stored path metrics, which are initialized with zero values. The resulting values are compared with each other, and the smaller value is selected and stored as the new path metric for each state. In parallel, the corresponding bit decision (zero or one) is transferred to the SMU while the inverse decision is discarded. The structure of the add-compare-select circuit which performs the operations described above will be discussed in detail later.

    Finally, the SMU stores the bit decisions produced by the PMU for a certain defined number of clock cycles (referred to as traceback depth, TBD) and processes them in a reverse manner called backtracking. Starting from a random state, all state transitions in the trellis will merge to the same state after TBD (or fewer) clock cycles. From this point on, the decoded output sequence can be reconstructed.

    Fig. 3. Radix-2 BMU.
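    The interplay of the three units can be illustrated with a compact software model. The following is a sketch under stated assumptions, not the hardware architecture studied here: the function name viterbi_decode and the (7, 5) octal polynomials are hypothetical, the encoder is assumed to start in the all-zero state (so only state 0 starts with metric zero), and the traceback runs once over the whole block instead of using a sliding window of depth TBD.

```python
def viterbi_decode(received, polys=(0b111, 0b101), K=3):
    """Hard-decision Viterbi decoding of a rate-1/n code (software sketch)."""
    n = len(polys)
    n_states = 1 << (K - 1)
    INF = float("inf")

    def branch(state, b):
        # Encoder model: expected codeword and next state for input bit b.
        window = (b << (K - 1)) | state
        word = [bin(window & g).count("1") & 1 for g in polys]
        return word, window >> 1

    pm = [0] + [INF] * (n_states - 1)    # path metrics (known start state)
    decisions = []                        # survivor memory
    for t in range(0, len(received), n):
        symbol = received[t:t + n]
        new_pm = [INF] * n_states
        prev = [None] * n_states
        for s in range(n_states):
            if pm[s] == INF:
                continue                  # state not reachable yet
            for b in (0, 1):
                word, nxt = branch(s, b)
                bm = sum(x != y for x, y in zip(word, symbol))  # BMU: Hamming
                cand = pm[s] + bm                               # PMU: add
                if cand < new_pm[nxt]:                          # compare-select
                    new_pm[nxt] = cand
                    prev[nxt] = (s, b)
        pm = new_pm
        decisions.append(prev)

    # SMU: trace back from the best final state.
    state = min(range(n_states), key=lambda s: pm[s])
    bits = []
    for prev in reversed(decisions):
        state, b = prev[state]
        bits.append(b)
    return bits[::-1]


# Codeword for input 1011 with the second symbol corrupted in one bit.
print(viterbi_decode([1, 1, 1, 1, 0, 0, 0, 1]))
```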

    Coding rate 1/n, constraint length K, traceback depth TBD, and the number of bits representing each input value (referred to as softbits or input bit width) are the key parameters influencing the performance and design of the Viterbi decoder. In the following sections, the effects of changing these parameters are discussed for every decoder unit separately.

    III. BRANCH METRIC UNIT

    A. Design Space

    The BMU is typically the smallest unit of the Viterbi decoder. Its complexity increases exponentially with n (the reciprocal of the coding rate) and also with the number of samples processed by the decoder per clock cycle (radix factor, e.g., radix-2 corresponds to one sample per clock cycle). The complexity increases linearly with softbits. So, the area and throughput of the BMU can be completely described by these factors.

    As the BMU is not the critical block in terms of area or throughput, its design looks quite straightforward. The version calculating the Hamming distance for coding rate 1/2 presented in Fig. 3 performs perfectly in terms of both area and throughput. A BMU calculating squared Euclidean or Manhattan distance is slightly more complex but can be easily mapped to an array of adders and subtracters as well.

    B. Branch Metric Precision

    The BM should be able to hold the maximum possible difference (distance) between ideal and received symbols. For hard input (softbits = 1), the Hamming distance between a coded word and its complement is the maximum possible BM (i.e., the distance between codewords consisting of all 0s and all 1s). In such a case, the maximum distance is equal to the symbol length n. So, BM precision is given by

    (1)
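    As a hedged illustration of this sizing argument, the sketch below computes the number of unsigned bits needed to hold every distance from 0 up to the maximum. It assumes the common rule of ceil(log2(max + 1)) bits, which may differ in form from (1); the function names are hypothetical.

```python
def bits_needed(max_value):
    """Unsigned bit width able to hold every value in 0..max_value."""
    return max(1, max_value.bit_length())


def hard_bm_precision(n):
    """Hard input: the largest Hamming distance equals the symbol length n."""
    return bits_needed(n)


for n in (2, 3, 4):
    print(n, hard_bm_precision(n))
```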


    Fig. 4. Radix-4 BMU.

    In soft input decoding, each received symbol is represented by more than one bit (softbits > 1). In this case, the maximum possible branch metric (distance) is calculated as

    (2)

    C. Radix-2 BMU

    The hardware cost of the BMU from Fig. 3, in terms of the areas of basic components, can be expressed as follows:

    (3)

    where the individual area terms denote the area of a half adder, full adder, and inverter, respectively.

    D. Radix-4 BMU

    An implementation of a radix-4 BMU (processing two codewords per clock cycle) is shown in Fig. 4. The underlying radix-2 BMs are added to form radix-4 BMs [11]. For the radix-4 BMU, the area cost is expressed as follows:

    (4)
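    The way radix-4 BMs are formed from radix-2 BMs can be sketched as follows. This is an illustrative software model only, assuming hard input and Hamming distance; the function names radix2_bms and radix4_bms are hypothetical.

```python
from itertools import product


def radix2_bms(symbol, n=2):
    """Hamming distance from one received n-bit symbol to every n-bit codeword."""
    return {cw: sum(a != b for a, b in zip(cw, symbol))
            for cw in product((0, 1), repeat=n)}


def radix4_bms(sym_a, sym_b, n=2):
    """Radix-4 BMs: each is the sum of the two underlying radix-2 BMs."""
    bm_a, bm_b = radix2_bms(sym_a, n), radix2_bms(sym_b, n)
    return {(ca, cb): bm_a[ca] + bm_b[cb] for ca in bm_a for cb in bm_b}


bms = radix4_bms((1, 1), (0, 0))
print(len(bms))  # number of radix-4 branch metrics for n = 2
```

    The table sizes (2^n entries for radix-2, 2^(2n) for radix-4) reflect the exponential growth of BMU complexity with n and with the radix factor.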

    BMs are generated assuming that every possible combination of n bits is a valid encoder output. But as an n value beyond 4 is seldom used and the radix is limited to factor 4, the area of the BMU is negligible compared to other units. For radix-8 designs, the BMU will require considerably more area and also become slower. In general, radix-8 Viterbi decoder designs are very uncommon due to their overall complexity.

    Fig. 5. Radix-2 ACS.

    For radix factors beyond 2, combined with soft input decoding, the BMU can also be implemented as a set of lookup tables or a cascaded combination of adders and lookup tables. Depending on the cell library and the technology used, such implementations may occupy less silicon area than the straightforward approach. Still, taking into account the moderate area of the BMU, these area savings are almost invisible at the top level.

    IV. PATH METRIC UNIT

    A. Design Space

    The PMU is a critical block both in terms of area and throughput. The key problem of the PMU design is the recursive nature of the add-compare-select (ACS) operation (path metrics calculated in the previous clock cycle are used in the current clock cycle, as Fig. 5 shows).

    Optimization techniques for the PMU in Viterbi decoder implementation are shown in Fig. 6. In order to increase the throughput or to reduce the area, optimizations can be introduced at the algorithmic, word, or bit level. To obtain a very high throughput, parallelism at the algorithmic level is exploited. By algorithmic transformations, the Viterbi decoding is converted to a purely feedforward computation. This allows independent processing of input blocks. The algorithmic parallel block processing methods intend to achieve unlimited concurrency by independent block decoding of the input stream. These techniques result in quite high area figures. But as technological advancements are making the devices shrink, they are getting more attractive. Still, for a specific case, if the required throughput can be achieved by utilizing word or bit level optimization techniques, there is no specific need to use algorithmic transformations.

    Word level optimizations work on folding (serialization) or unfolding (parallelization) the ACS recursion loop. In the folding technique, the same ACS is shared among a certain set of states. This technique trades off throughput for area. This is an area efficient approach for low throughput decoders, though in case of folding, routing of the PMs becomes quite complex. With unfolding, two or more trellis stages are processed in a single recursion (this is called look-ahead). If the look-ahead is short, then the area penalty is not high. Radix-4 look-ahead (i.e., processing two bits at a time) is a commonly used technique to increase the decoder's throughput.
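    The equivalence behind radix-4 look-ahead can be checked with a small software model: one recursion that selects among all two-stage paths yields the same path metrics as two consecutive radix-2 recursions. The sketch assumes a K = 3 trellis with the shift-register state update used by nxt; all names and the example metric values are hypothetical.

```python
def nxt(s, b, K=3):
    """Next state when input bit b is shifted into state s."""
    return ((b << (K - 1)) | s) >> 1


def radix2_step(pm, bm, K=3):
    """One trellis stage; bm[s][b] is the metric of the branch leaving s with input b."""
    new = [float("inf")] * len(pm)
    for s, m in enumerate(pm):
        for b in (0, 1):
            t = nxt(s, b, K)
            new[t] = min(new[t], m + bm[s][b])      # 2-way compare-select
    return new


def radix4_step(pm, bm_a, bm_b, K=3):
    """Two trellis stages in one recursion: 4-way select over two-stage paths."""
    new = [float("inf")] * len(pm)
    for s, m in enumerate(pm):
        for b1 in (0, 1):
            mid = nxt(s, b1, K)
            for b2 in (0, 1):
                t = nxt(mid, b2, K)
                new[t] = min(new[t], m + bm_a[s][b1] + bm_b[mid][b2])
    return new


pm = [0, 3, 1, 2]
bm_a = [[0, 2], [1, 1], [2, 0], [1, 1]]
bm_b = [[1, 1], [0, 2], [1, 1], [2, 0]]
print(radix4_step(pm, bm_a, bm_b) == radix2_step(radix2_step(pm, bm_a), bm_b))
```

    In hardware, the single 4-way selection replaces two sequential 2-way selections, which is where the throughput gain and the additional area come from.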


    Fig. 6. Design space for the VLSI implementation of the Viterbi Algorithm.

    Fig. 7. Bit-level optimization techniques.

    Bit level techniques optimize the add-compare-select operations in the PM computation and decision process. Bit level optimizations can be used to speed up the ACS operation itself. These techniques differ from word level techniques by having a lower area penalty. Fig. 7 shows the various bit level optimization techniques. Compare-select-add (CSA) is a retimed variation of the ACS operation, as shown in Fig. 8. It computes all possible paths originating from a node and, based on the decision, retains the minimum PM path. Thus, addition and comparison can be processed in parallel.

    Further, addition and comparison operations can be implemented using different arithmetic. The carry-look-ahead (CLA) technique [23] uses carry look-ahead adders for the addition and comparison (subtraction) operations. Ripple-add carry-chain compare exploits bit level parallelism between addition and comparison operations. In this technique, subtraction is used for comparison. Both addition and subtraction operations start with the least significant bit (LSB). Thus, as soon as the LSB of the sum is computed, it can be used for comparison (see Fig. 9). In the carry save approach, instead of propagating the carry from the LSB to the most significant bit (MSB), the carry generated during each bit addition is saved. The comparison starts with the MSB, taking into account the saved carry bits. This makes the comparison operation quite complex. The advantage, however, is that the addition and comparison operations can be pipelined at bit level to reduce the critical path. Due to the complex compare operation, there is a significant area penalty.

    In the bit serial approach, the addition and comparison operations are carried out in a serial manner. It uses a 1-bit adder over as many cycles as the operands have bits to compute a multi-bit addition (or subtraction). Thus, an area saving in addition and comparison operations is achieved.
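    The bit serial principle can be modeled in software as a single full adder reused once per clock cycle, LSB first. This is a behavioral sketch only; the function name bit_serial_add and its interface are hypothetical.

```python
def bit_serial_add(a, b, width):
    """Add two unsigned numbers with one full adder reused over `width` cycles."""
    carry, result = 0, 0
    for i in range(width):                       # one clock cycle per bit, LSB first
        x, y = (a >> i) & 1, (b >> i) & 1
        s = x ^ y ^ carry                        # full-adder sum output
        carry = (x & y) | (carry & (x ^ y))      # full-adder carry output
        result |= s << i
    return result, carry                         # sum modulo 2**width, final carry


print(bit_serial_add(5, 3, 4))
```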

    Fig. 8. CSA.

    A conventional fully parallel implementation of the Viterbi decoder implies using one radix-2 ACS per state of the trellis, processing one stage of the trellis at a time. For decoders with very high throughput or very low area constraints, some optimization techniques at system level appear attractive. These techniques achieve lower area figures or higher throughput figures than the conventional fully parallel implementation of the algorithm would provide.

    The area and throughput comparisons of system level techniques are presented in Table I. Throughput and complexity are normalized to a radix-2, fully parallel (1 ACS/state) Viterbi decoder. The symbols used in the table denote the constraint length, the number of look-ahead steps, and the number of ACS units.

    In this paper, only bit level optimizations are discussed in depth, since analysis of the algorithmic and word level techniques concerning their effects on area and throughput is rather straightforward, as the summary in Table I suggests.

    B. Path Metric Precision

    The register temporarily storing path metrics should be wide enough to avoid overflow errors in the PMU operations. Modulo arithmetic is usually used for this purpose [29]. The PM bitwidth is determined as follows:

    (5)

    where is given as

    (6)
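    The idea of modulo arithmetic for path metrics can be sketched as follows: metrics are allowed to wrap around, and two metrics are compared by the sign bit of their wrapped difference. This is a generic illustration under the assumption that the true difference between any two path metrics stays below 2**(width - 1); it does not reproduce (5) or (6), and the function names are hypothetical.

```python
def mod_less(a, b, width):
    """True if metric a is smaller than b under modulo-2**width arithmetic.

    Valid while the true difference |a - b| is below 2**(width - 1).
    """
    diff = (a - b) & ((1 << width) - 1)      # wrapped subtraction
    return (diff >> (width - 1)) == 1        # sign bit of the difference


def acs_mod(pm0, bm0, pm1, bm1, width):
    """Add-compare-select on wrapped metrics; returns (new metric, decision bit)."""
    mask = (1 << width) - 1
    c0, c1 = (pm0 + bm0) & mask, (pm1 + bm1) & mask
    return (c1, 1) if mod_less(c1, c0, width) else (c0, 0)


# True sums are 16 and 17; with 4-bit registers they wrap to 0 and 1.
print(acs_mod(14, 2, 15, 2, 4))
```

    Because only differences matter, no explicit rescaling of the stored path metrics is needed as long as the register width respects the bound on the metric spread.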


    Fig. 9. Parallelism between addition and comparison operations in ripple adder configuration (pm1(x) and bm1(x) denote the xth bit in the numerical representation of the corresponding metric, respectively).

    TABLE I. THROUGHPUT AND COMPLEXITY COMPARISON OF ALGORITHMIC AND WORD LEVEL OPTIMIZATION TECHNIQUES

    C. Choice Among Various PMU Implementations

    As mentioned before, there are several options to implement the ACS operation. Radix-2 ACS (see Fig. 5) is a basic design with the lowest area and throughput figures. Other techniques such as CSA (see Fig. 8) and radix-4 (see Fig. 10) provide a higher throughput than radix-2 ACS at the cost of higher area, as will be discussed later in this section. Note that the architecture of the adders and subtracters for the ACS implementation is an orthogonal design choice. More specifically, the addition and subtraction operations can be described either in a behavioral fashion (i.e., giving the freedom to choose a particular implementation to the synthesis tool) or by manually instantiating specific arithmetic circuits such as ripple carry or carry look-ahead (CLA) adders and subtracters.

    Fig. 10. Radix-4 ACS with hierarchical comparisons.

    As the PMU is a fairly regular unit where a basic module (ACS) is instantiated several times (e.g., once per trellis state in case of the fully parallel approach), it is especially important to optimize this module. For a multi-bit addition, the delay of a CLA implemented in a logarithmic configuration [23] can be expressed as

    (7)

    where the individual terms denote the delays of logic components such as XOR, AND, OR, and full adder, respectively. Similarly, the delay through a ripple carry adder can be expressed as a function of half-adder and full-adder delays

    (8)

    As indicated in [23], the efficiency of CLA is better than that of ripple carry only for large operand bit widths.

    Fig. 5 highlights the critical path in radix-2 ACS. When implemented in CLA configuration, each circle represents an addition/subtraction at PM precision. Assuming no parallelism between addition and comparison operations, the critical path delay can be expressed as

    (9)

    For ACS based on ripple-carry adders, the critical path delay can be expressed as

    (10)

    With CLA implementation, we can observe a logarithmic increase in critical path delay with path metric precision rather than a linear one as in ripple implementation. As mentioned earlier, by using ripple configuration in radix-2 ACS, bit level parallelism between addition and comparison can be exploited (see Fig. 9). Therefore, the delay of the addition and comparison operations is just one full adder delay more than the ripple carry addition at PM precision. When using CLA for the addition operation in radix-2 ACS, the parallelism between addition and comparison operations cannot be fully exploited. Thus, addition and comparison operations require nearly twice the delay of a CLA addition at PM precision. Therefore, using CLA for addition in radix-2 ACS for low PM precision values will not improve the critical path delay significantly. The advantage of using CLA becomes apparent when the number of precision bits increases. The critical path delay for CLA grows logarithmically with the PM precision, while for ripple carry it grows linearly.
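    The trade-off described above can be made tangible with a toy delay model. The unit delays and the CLA formula below are illustrative assumptions, not values from the synthesis experiments: a full adder is assumed to cost two gate delays, and a tree CLA is assumed to need 2*log2(bits) carry levels plus two gate delays for setup and sum.

```python
import math


def ripple_acs_delay(bits, t_fa=2.0):
    """ACS with ripple adders: add and compare overlap bit by bit,
    so the total is one full-adder delay more than a single ripple addition."""
    return (bits + 1) * t_fa


def cla_add_delay(bits, t_gate=1.0):
    """Assumed tree CLA model: 2*log2(bits) carry levels plus setup and sum."""
    return (2 * math.ceil(math.log2(bits)) + 2) * t_gate


def cla_acs_delay(bits, t_gate=1.0):
    """ACS with CLA: addition and comparison are serialized, about two CLA delays."""
    return 2 * cla_add_delay(bits, t_gate)


for bits in (4, 8, 16):
    print(bits, ripple_acs_delay(bits), cla_acs_delay(bits))
```

    Under these assumed delays, ripple carry is faster at low precision and CLA takes over as the precision grows; the actual crossover point depends on the cell library and is not predicted by this sketch.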

    Fig. 8 highlights the critical path of a CSA implementation. If the adders of Fig. 8 are implemented using the CLA technique, the delay can be expressed as

    (11)

    In case of a ripple carry implementation of the same adders, the delay can be expressed as

    (12)

    In case of CSA, the critical path delay using ripple carry adders grows linearly with the PM precision, while using CLA it grows logarithmically. Here the advantage of using CLA becomes apparent. Thus, for higher values of PM precision, CSA using CLA adders and subtracters is a better design choice to achieve high throughput.

    Fig. 10 highlights the critical path through a radix-4 ACS. It can be observed that hierarchical comparisons cause a major delay in this design. An alternative version compares the sums pair-wise and then encodes the results of the comparison, as shown in Fig. 11. Depending on the PM precision value, the critical path through the radix-4 ACS becomes shorter in this case. Encoder and MUX can also be combined to form an integrated selection unit. The delay of radix-4 ACS using CLA for the addition and comparison operations can be expressed as

    (13)

    The delay of radix-4 ACS using the ripple carry technique for the addition operation can be expressed as

    (14)

    As radix-4 ACS is larger than radix-2 ACS or CSA, a variety of configurations are possible in its implementation. Area and throughput requirements determine the choice of a certain configuration. For radix-4, the critical path delay grows logarithmically with the PM precision for CLA and linearly for ripple carry adders. As compared to the delay of radix-2 ACS, there is an overhead of an encoder. For the cascade comparison as shown in Fig. 10 the delay will be . For low values of K and softbits, the PM precision is low. Using CLA for a lower number of bits does not provide any throughput benefit.

    Fig. 11. Radix-4 ACS using parallel comparisons.

    To compare the designs in terms of area and logic delay and to determine the optimal implementation of each design, radix-2 ACS and CSA as well as radix-4 ACS units were implemented and synthesized using the above mentioned techniques. Different implementations of the designs were synthesized with a relaxed timing constraint to be able to see the effect of the architectural choices. Figs. 12 and 13 show the plots of area and critical path delay of different PMUs as obtained from synthesis experiments. As expected, the conventional radix-2 ACS is a better choice in terms of area, provided that the throughput requirement is met. For instance, the gain in area for the CLA implementation can be up to 50%.

    The advantage of the CLA becomes visible only for higher PM precision values. For instance, using carry-lookahead in the CSA technique achieves 68% higher throughput at a cost of 62% higher area compared to a radix-2 ACS with ripple carry adders.

    In [7], it is mentioned that CSA can achieve twice as much throughput as ACS. However, from Fig. 13, it is clear that CSA does not have a critical path half that of ACS in any implementation. This difference can be explained by the fact that the data in [7] come from the prelayout analysis for a single ACS/CSA operation and a fixed parameter set, whereas Fig. 13 is based on post-layout analysis of the complete PMU for many different parameter values. Contrary to the widespread opinion, we conclude that the throughput advantage of CSA over ACS is much less than a factor of 2, compared to an almost doubled area cost.


    Fig. 12. Area figures for various ACS implementation techniques.

    Fig. 13. Critical path delay for various ACS implementation techniques.

    D. Behavioral Versus Structural Synthesis

    A commonly acknowledged issue in VLSI design is the proper choice of HDL description and coding style. It has been observed that a behavioral description of combinational logic and its structural description produce different synthesis results. Structural description of combinational logic was especially popular at the times when synthesis tools were not intelligent enough. Nowadays, the tools are getting more efficient, therefore behavioral description has become a suitable option. This is quite evident from the graphs in Figs. 12 and 13. As shown in the figures, the behavioral description of the ACS operation has comparable delay and area to the structural one. The reason is that behavioral description gives the synthesis tools the freedom to optimally choose logic for implementation. For example, given certain timing constraints, the tool opts for CLA for addition and comparison operations during synthesis of the ACS unit, even though a behavioral model is used.

    TABLE II. AREA AND TIMING COMPARISON OF BIT-SERIAL AND BEHAVIORAL DESCRIPTION OF ACS FOR TWO PARAMETER SETS

    E. Bit-Serial (BS) ACS

    A BS design for ACS has been proposed in [24] for a high constraint length, low throughput Viterbi decoder. This design uses some circuit level optimizations and custom designed FIFOs and aims to reduce power and area. The bit serial approach uses a 1-bit adder over a number of cycles equal to the PM precision to implement an addition (or subtraction) of that width. Considerable area saving is expected if the PM precision is large. Referring to (5) and (6), the PM precision is high for high values of K and softbits. To compare the bit-serial approach with other designs, a corresponding architecture was implemented with a standard cell library. The area and delay comparisons are shown in Table II. An area reduction by nearly a factor of two is seen for BS in one of the two configurations of Table II, while in the other the area advantage is negligible. BM precision and PM precision determine the number of 1-bit adders (area for the addition operation) in the bit-serial approach and consequently the area advantage of using bit-serial over bit-parallel. Thus, the bit-serial approach has an advantage in cases where the throughput requirements are met while the constraint length is large.

    Note that two different clock frequencies are required for a PMU built upon bit-serial ACS, since the path metric unit needs several clock cycles to process a single input symbol, while other units typically run at a rate of one symbol per clock cycle.

    F. Critical Factors in PMU Design

    The PMU contains a feedback loop. Therefore, in general, throughput cannot be increased by pipelining (though, as already mentioned, some look-ahead techniques allow loop unrolling and pipelining, but are extremely expensive in terms of area and power consumption). Thus, the PMU is a bottleneck in achieving high throughput for the Viterbi decoder. At the same time, it occupies a substantial share of the area of the whole design. The most critical design parameters for the PMU are the input bitwidth, i.e., softbits, and the number of bits to process per clock cycle (radix-2 versus radix-4). These parameters will be discussed in the following subsections in more detail.

    1) Softbits: The number of bits per symbol, i.e., softbits, influences not only the area and delay of the PMU but also the performance of the Viterbi decoder in terms of bit error rate (BER). Fig. 14 shows the dependency of BER on the E_b/N_0 (energy per bit/noise) ratio for an additive white Gaussian noise (AWGN) channel. As seen in the figure, there is a significant gain in BER when moving from hard input to 4 bits per symbol. A further increase yields almost invisible gains (indeed, the curves for softbits = 5 and 6 are almost identical). Note that with channel models other than AWGN, increasing the input bitwidth beyond 4 bits may still make sense. However, the designer should always keep in mind that increasing the softbits parameter causes an increase in branch metric and path metric precision (and area) and in most cases decreases the throughput, as illustrated by Figs. 15-17 for radix-2 decoders with different K values (conventional ACS, behavioral description).

    Fig. 14. BER versus E_b/N_0 with varying input bitwidth.

    Fig. 15. Critical path delay versus input bitwidth for various constraint lengths (Radix-2).

    Fig. 15 shows the effect of input bitwidth on the critical path delay for various K. As expected, in the majority of cases the critical path delay increases linearly.

    Fig. 16 displays the variation in PMU area with increasing softbits for various K. A clear linear increase in area is visible in all cases as well, since BM and PM precisions increase linearly with the input bitwidth and the number of full adders, multiplexers, and registers grows at the same rate. Consequently, the graph in Fig. 16 is a good illustration of what one would expect from basic theoretical analysis.

    The increase in area for lower values of K is not very apparent in Fig. 16. That is why Fig. 17 was created to show the percentage increase in area over the area of the PMU for a hard-input decoder. It also shows that there is a significant area penalty when increasing the softbits for all values of K. From (5) and (6), one can conclude that PM precision increases with the logarithm of K. At the same time, it increases linearly with input bitwidth; thus the influence of softbits on PM precision is more pronounced for lower values of K. This explains why the relative increase in PMU area is highest for the smallest K.

    Fig. 16. PMU area versus input bitwidth for various constraint lengths (Radix-2).

    Fig. 17. Increase in area versus input bitwidth for various constraint lengths (Radix-2).

    2) Radix-2 versus Radix-4: Fig. 18 shows the throughput achievable with radix-2 and radix-4 decoder implementations. These results are based on timing-driven synthesis to achieve the highest throughput figures possible. The resulting throughputs are calculated by the synthesis tool before layout, taking into account all clock-related issues such as clock skew, uncertainty, and insertion delay. To check the back-end effects, a number of layouts were produced as well. The results indicate that, on average, the deviation between pre- and post-layout critical path delay is less than 5%.

    When using the radix-4 technique, two output bits are produced per clock cycle. Therefore, a higher throughput can be achieved even if the decoder is running at a lower clock frequency. Clock frequency becomes critical, especially in the deep submicron range. A higher clock frequency leads to crosstalk issues and higher power consumption, besides decreasing signal reliability. That is why radix-4 is potentially a suitable choice in the deep submicron domain, especially if high throughput (e.g., 700 Mb/s) is targeted without pushing the operating clock frequency to its limit. However, especially for higher K values, the area penalty of radix-4 compared to radix-2 can become quite significant, as Fig. 19 shows. Alternatively, the radix-4 technique can be used to achieve higher throughput at a higher area cost.

    Fig. 18. Maximum throughput versus constraint length for Radix-2 and Radix-4 .

    Fig. 19. PMU area versus constraint length for Radix-2 and Radix-4 .

    Fig. 20. Area ratio between Radix-4 and Radix-2 versus K (timing-optimized synthesis, ).

    Fig. 21. Throughput ratio between Radix-4 and Radix-2 versus constraint length .

    Figs. 20 and 21 show the relative increase in area and throughput for the radix-4 PMU compared to the radix-2 PMU for different values of K. Since the logic of the radix-4 ACS is more complex than that of the radix-2 ACS, the maximum clock frequency at which it can run is lower. This implies that, when optimizing for the highest achievable throughput, a radix-4 ACS is not exactly a factor of 2 faster than a radix-2 ACS, as one would expect in the case of synthesis under relaxed timing constraints. Still, depending on the constraint length, throughput can be increased by 40-70% (see Fig. 21), provided that the PMU area penalty of a factor of 3.4-4.2 (see Fig. 20) can be afforded. Especially when radix-2 designs cannot meet the throughput requirements, this factor is a reasonable price to pay. Note that for , comparable results were reported in [7]. Since the PMU is only one part of the Viterbi decoder, the overall penalty paid in terms of silicon area should be scaled accordingly (the area breakdown of a whole Viterbi decoder for will be presented below).
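    The throughput argument reduces to simple arithmetic: throughput equals decoded bits per clock cycle times clock frequency. The following Python sketch uses only figures quoted in the text (the 700 Mb/s target and the 40-70% throughput gain); the function names are illustrative.

```python
def required_clock_mhz(throughput_mbps, bits_per_cycle):
    """Clock frequency needed to reach a target throughput."""
    return throughput_mbps / bits_per_cycle


def implied_clock_ratio(throughput_gain):
    """f_max(radix-4) / f_max(radix-2) implied by a measured throughput gain.

    Radix-4 decodes 2 bits per cycle, so gain = 2 * clock_ratio.
    """
    return throughput_gain / 2


radix2_clock = required_clock_mhz(700, bits_per_cycle=1)  # radix-2: 1 bit per cycle
radix4_clock = required_clock_mhz(700, bits_per_cycle=2)  # radix-4: 2 bits per cycle
low_ratio = implied_clock_ratio(1.4)   # 40% throughput gain
high_ratio = implied_clock_ratio(1.7)  # 70% throughput gain
```

    Under these assumptions, a 700 Mb/s radix-4 decoder needs a 350 MHz clock instead of 700 MHz, and the reported gains correspond to a radix-4 ACS that closes timing at roughly 70-85% of the radix-2 clock frequency.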

    Figs. 22-25 summarize throughput and area tradeoffs in the radix-2 and radix-4 PMU designs for different constraint lengths and input bitwidths.

    To prove our findings on the PMU design, radix-2 and radix-4 Viterbi decoders for , and were implemented. The trace-back depth was selected based on simulations for a wireless fading channel under very noisy conditions (the value results from a baseband design project carried out by one of the authors). As known from the literature, in the case of an AWGN channel, lower values for TBD, e.g., 5K to 6K, would suffice. The synthesis was run for worst-case military conditions. The post-layout netlists were simulated with timing to determine the maximum possible throughput. Two cases, referred to as constrained and relaxed synthesis, were considered to get more objective figures. Constrained synthesis runs with the tightest timing constraint settings possible to obtain a decoder with maximum throughput. Relaxed synthesis sets minimal timing constraints to get the lowest area figure. The results are summarized in Table III, which also shows the overall area of the Viterbi decoder. Based on these figures, the radix-4-based Viterbi decoder achieves 55% and 100% higher throughput at a factor of 2 and 1.8 larger area for timing-driven and relaxed synthesis, respectively. Thus, the area penalty is affordable compared to the increase in throughput.

    Fig. 22. PMU delay for different values of K and input bitwidth (Radix-2).

    Fig. 23. PMU delay for different values of K and input bitwidth (Radix-4).

    Fig. 24. PMU area for different values of K and input bitwidth (Radix-2).

    Fig. 25. PMU area for different values of K and input bitwidth (Radix-4).

    TABLE III. DESIGN COMPARISON OF THE CONSTRAINED AND RELAXED SYNTHESIS OF RADIX-2 AND RADIX-4 DECODER

    Table IV shows the detailed area breakdown for all subunits of the Viterbi decoders for timing-driven and relaxed synthesis. It is clearly visible that in the case of highly constrained synthesis, the worst area penalty occurs in the PMU. Because the critical path lies in the PMU, timing-constrained synthesis requires more hardware-intensive transformations there. The BMU, the trace-back logic, and the LIFO are much less critical in terms of delay and can meet the constraints without significant area penalty. During the timing-driven (constrained) synthesis, the radix-2 PMU becomes twice as large and the radix-4 PMU more than 3 times larger in area compared to the relaxed case. In general, the radix-4 PMU leaves the synthesis tool somewhat more freedom for timing optimization since its logic is deeper and more complex than that of radix-2.

    A factor that favors the radix-4 ACS is that it achieves the same throughput at a lower clock frequency. This helps the other modules of the Viterbi decoder, such as the survivor memory unit (SMU). The reason is that the trace-back-based SMU design requires one or more SRAMs, and the use of fast SRAM results in high area and power consumption. The design of the SMU is discussed in the next section.

    Note that power dissipation aspects of the PMU have not been discussed so far. As will be shown later, the power consumption of the PMU is a comparatively small fraction of the overall power budget of a Viterbi decoder. That is why minimizing power consumption should not play a dominating role in the design of the ACS architecture. For low-power decoders, general optimization techniques to reduce power dissipation, such as clock gating, can be applied to any of the PMU architectures introduced in this section [30], [31].

    V. SURVIVOR SEQUENCE UPDATE AND STORAGE UNIT

    A. Design Space

    The survivor sequence update and storage unit (briefly called survivor memory unit, SMU) receives decisions from the PMU and produces the decoded sequence. Fig. 26 shows the design space for the SMU implementation. The implementation techniques for this unit can be divided into two classes: standard and adaptive.

    TBD determines the number of memory access operations required by the SMU. The value of the TBD parameter depends on the transmission channel conditions (or the channel model used in the simulation). On noisy channels, trace-back paths need longer to merge, so (since a Viterbi decoder is usually designed for the worst case) TBD values become larger. On the other hand, with a fixed trace-back depth and good channel conditions (higher SNR), the SMU may perform redundant memory access operations, since paths tend to merge faster. Adaptive SMU implementations are used to reduce the number of these redundant accesses.

    TABLE IV. AREA COMPARISON (IN MILLIMETERS SQUARED) BETWEEN CONSTRAINED AND RELAXED SYNTHESIS OF RADIX-2 AND RADIX-4 DECODER

    Fig. 26. Design space for SMU.

    There are a number of adaptive techniques mentioned in the literature. Although they all require additional hardware to be added to the SMU, the average power dissipation of the unit can be reduced due to a lower number of memory accesses. In [32], [33], the adaptive techniques known as the M-algorithm and the T-algorithm are discussed, which reduce the number of states to consider by eliminating the states with too large a path metric, basically dynamically adapting the number of ACS operations to calculate and trace-back paths to store. This idea is further developed in [39] as adaptive approximation by adding a pruning unit to the decoder's core. The pruning unit limits the number of trace-back paths to follow, again aiming at reducing power consumption. Moreover, an SMU with dynamic prediction combined with the trace-forward technique (see below) is introduced in [34] for Viterbi decoders used in wireless applications, which reduces the number of memory accesses by more than 70% in certain cases.

    It is quite difficult to give general guidelines for the usage of adaptive techniques in the SMU, since their positive effects are very channel/application dependent (as mentioned in all references). For some designs, a very noisy channel can even lead to some data loss in the decoder, so additional hardware is required for recovery [32]. The only way to get figures for a specific design is experimentation and simulation with accurate channel characteristics in every single application case. Unfortunately, all references known to the authors only include simulation data for AWGN channels, so the claimed power saving figures are questionable if, e.g., wireless channels with multipath and fading are considered. That is why the scope of this paper is limited to standard nonadaptive SMU designs. Nevertheless, for Viterbi decoder designs with a strong focus on power reduction, an adaptive SMU is definitely something to consider, provided the designer is willing to spend some time on simulating the decoder with proper channel models. A detailed overview of adaptive trellis decoding techniques can be found in [40].

    Fig. 27. Simple register exchange network corresponding to the code introduced in Fig. 1.

    Fig. 28. General trace-back scheme [11].

    For a standard (nonadaptive) Viterbi decoder, there are two major techniques to implement the SMU: register exchange (RE) and trace-back (TB). RE uses a set of multiplexers and registers to store and update the survivor sequence of each state for trace-back depth (TBD) trellis stages. It is highly regular, area efficient, and has low latency. However, due to frequent switching of states in the registers (bits are physically shifted from one stage to the next), it consumes a lot of power, especially for larger values of K and TBD. The structure of the RE network mirrors the trellis of the convolutional code, as shown in Fig. 27 for a simple code. Since it can be assumed that after TBD cycles all survivor paths have merged, the output of any register from the last stage of the RE network can be used as the output of the Viterbi decoder.
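    As an illustration of the register exchange principle, the following Python sketch updates one survivor register per state from the PMU decision bits. It assumes a state numbering in which the newest input bit is the least significant state bit and a decision bit that selects between the two predecessor states. The names and this convention are assumptions for illustration, not the paper's implementation.

```python
def predecessor(ns, decision, K):
    """Predecessor of state ns selected by a decision bit (assumed numbering)."""
    return (ns >> 1) | (decision << (K - 2))


def register_exchange_step(regs, decisions, K, tbd):
    """Copy the survivor of the selected predecessor and append the newest bit.

    regs[s] holds the survivor sequence of state s, oldest bit first,
    truncated to the last tbd trellis stages.
    """
    new_regs = []
    for ns in range(1 << (K - 1)):
        p = predecessor(ns, decisions[ns], K)
        seq = regs[p] + [ns & 1]  # ns & 1 is the input bit that leads into ns
        new_regs.append(seq[-tbd:])
    return new_regs


def re_output(regs, tbd):
    """Decoded bit once the registers are full: oldest bit of any register."""
    return regs[0][0] if len(regs[0]) == tbd else None
```

    Every register of every state is rewritten in each call, which mirrors the high switching activity that makes RE power hungry for large K and TBD.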

    The TB technique stores the decisions in an SRAM and reads them in reverse order during the trace-back cycle. As the memory is updated much less frequently than the registers in the RE technique, the power consumption is significantly reduced.

    TB is split into three major operations: write, merge, and decode, as shown in Fig. 28. If a survivor path is traced back for TBD stages, then the state at which all paths merge can be determined. Symbols traced back beyond this state can be used as decoded output. In each cycle, 1 column of decisions is written into memory and k columns are read for trace-back and decode. Pointers are used to keep track of the columns read. The 1-pointer technique uses a single pointer multiple times to read the columns. Multiple pointers are used in the k-pointer technique for reading k columns. The k-pointer technique requires inefficient and expensive multi-port or multi-bank memory, so it is rather unsuitable for a VLSI implementation.
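    The merge and decode operations can be sketched in a few lines of Python. The sketch assumes that the decision memory is a list of columns (one decision bit per state and trellis stage), that the newest input bit is the least significant state bit, and that the newest TBD stages form the merge phase. The function names and this data layout are illustrative assumptions.

```python
def trace_back(columns, start_state, K):
    """Follow the decisions from the newest column back to the oldest one.

    Returns the input bits along the survivor path (oldest first) and the
    state in which the path started.
    """
    state, bits = start_state, []
    for column in reversed(columns):
        bits.append(state & 1)  # input bit that led into `state`
        state = (state >> 1) | (column[state] << (K - 2))  # step to the predecessor
    bits.reverse()
    return bits, state


def decode_window(columns, start_state, K, tbd):
    """Discard the newest tbd stages (merge phase) and keep the older bits."""
    bits, _ = trace_back(columns, start_state, K)
    return bits[:max(len(bits) - tbd, 0)]
```

    Each loop iteration corresponds to one column read, so the number of memory accesses grows with TBD, which is why the trace-back depth matters for SMU power.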

    The trace-forward (TF) technique [4] is used to eliminate the merge stage (i.e., the buffer referred to as the merge block in Fig. 28), which estimates the starting state for the decode operation. This technique uses a set of registers that store the encoded state. In every clock cycle, the decision bits coming from the PMU are used to update these registers with new states corresponding to the previous trellis stage. After TBD clock cycles, all registers are expected to converge to the same state, which is at the same time the starting state for the decode operation. A refined version of this technique is introduced in [37].
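    A minimal Python sketch of the trace-forward idea follows. It assumes one register per state, initialized with the state's own index, and a state numbering in which the newest input bit is the least significant state bit. The names are hypothetical.

```python
def trace_forward(columns, K):
    """Propagate, for every state, the state its survivor path passed through
    when trace-forward started (one register per state)."""
    n_states = 1 << (K - 1)
    tf = list(range(n_states))  # each register starts with its own state index
    for column in columns:
        # every state inherits the register of its selected predecessor
        tf = [tf[(ns >> 1) | (column[ns] << (K - 2))] for ns in range(n_states)]
    return tf


def converged(tf):
    """True once all registers hold the same state."""
    return len(set(tf)) == 1
```

    Once `converged(tf)` holds, `tf[0]` is the state from which the decode operation can start for the columns written before trace-forward began, so no merge read is needed.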

    Pre-traceback (PTB) [4] is a technique used to reduce the memory access frequency. Instead of writing decisions at each stage processed by the PMU, several stages are pre-traced and written as one composite decision. Thus, only one column write operation is required for each such group of trellis stages processed by the PMU. The pre-trace is implemented by an RE network of typically 3 or 4 stages.
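    The composite decision can be sketched in Python as follows. The sketch assumes that m consecutive decision columns (with m at most K - 1) are merged by a small register exchange network into one m-bit value per state, with the newest input bit in the least significant state bit. The symbol m and the function names are introduced here for illustration only.

```python
def predecessor(ns, decision, K):
    """Predecessor of state ns selected by a decision bit (assumed numbering)."""
    return (ns >> 1) | (decision << (K - 2))


def pretrace(columns, K):
    """Merge m = len(columns) decision columns into one composite column."""
    n_states = 1 << (K - 1)
    comp = [0] * n_states
    for column in columns:
        # register exchange: inherit the predecessor's bits, append the new decision
        comp = [(comp[predecessor(ns, column[ns], K)] << 1) | column[ns]
                for ns in range(n_states)]
    return comp


def jump_back(state, composite, m, K):
    """State m stages earlier, obtained with a single composite read (m <= K - 1)."""
    return (state >> m) | (composite[state] << (K - 1 - m))
```

    With m = 2, the decision memory is written only every second trellis stage, which is the mechanism by which a 2-level PTB halves the memory access frequency.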

    Note that, especially for small values of TBD, some trace-back techniques such as trace-forward can show worse BER performance than conventional trace-back [40].

    All techniques mentioned above involve several accesses to the memory in the same clock cycle. This requires some design decisions with respect to the memory architecture. The memory can be organized as multi-port or as single-port with an increased access rate. If the required access rate is too high, several single-port memories can be interleaved in a ping-pong manner. The particular memory architecture strongly depends on the actual design case (e.g., availability of multi-port memory, maximum clock speed and width of the memory, etc.).

    Throughput and area figures for the SMU can be calculated quite easily. For RE and TB, the critical path delay is determined by the flip-flop toggle rate and the SRAM access time, respectively. The area of both increases linearly with TBD and exponentially with K

    (15)

    where the two remaining factors are the number of trace-back memory blocks (4 with the straightforward single-port-memory approach, 3 with folded read-write access, 2 if trace-forward is used, 1 for RE) and the area of a flip-flop plus multiplexer for RE or the area of an SRAM bit for TB, respectively.
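    The scaling described around (15) can be explored numerically. The following Python sketch assumes a product form suggested by the text: number of memory blocks times TBD times the number of states, 2^(K-1), times the area of one storage bit. This form, the function name, and the unit-less example values are assumptions made for illustration.

```python
def smu_area(n_blocks, tbd, K, bit_area):
    """Assumed SMU area model: linear in TBD, exponential in K.

    n_blocks: number of trace-back memory blocks (1 for register exchange)
    bit_area: area of a flip-flop plus multiplexer (RE) or of an SRAM bit (TB)
    """
    return n_blocks * tbd * (1 << (K - 1)) * bit_area


# Hypothetical per-bit areas in arbitrary units, only to show the scaling
re_area = smu_area(n_blocks=1, tbd=128, K=7, bit_area=1.0)
tb_area = smu_area(n_blocks=4, tbd=128, K=7, bit_area=0.25)
```

    In this model, a four-block TB memory matches the RE area only if an SRAM bit is four times smaller than a flip-flop plus multiplexer, and incrementing K by one doubles either figure.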

    As can be seen, the area and critical path delay of the SMU are fixed by the technology. The most important decision for the designer is the architecture choice between RE and TB, which is mainly driven by power consumption. Therefore, the remaining analysis is dedicated entirely to power dissipation aspects.

    B. Implementation

    Simulations of power dissipation were run for a radix-2 Viterbi decoder with a TBD of 128. The middle column of Table V shows the corresponding power dissipation profile. As the decision memory in the SMU is synthesized using a standard cell library, it has a high power consumption (110.9 mW compared to 6.1 mW for the rest of the SMU). Therefore, for low-power designs, the SMU should be optimized, e.g., by using low-power, low-leakage SRAMs.

    TABLE V. POWER DISSIPATION BREAKDOWN FOR RADIX-2, , VITERBI DECODER WITH RUNNING AT 500 MHZ WITH 2-LEVEL PTB AND AT 167 MHZ USING REGISTER EXCHANGE

    TABLE VI. CELL AREA AND POWER DISSIPATION OF DIFFERENT PARTS OF THE TRACE-BACK SMU

    TABLE VII. DESIGN CHOICES FOR DECISION MEMORY

    Such SRAM modules are available with power dissipation as low as 0.0324 mW/MHz. Using these memories in the Viterbi decoder from Table V would bring the power dissipation down to 22.3 mW at 250 MHz access frequency. In addition, it can be observed from Table VI that the PTB occupies only 1.7% of the SMU area and dissipates only 1.9% of the total SMU power. This makes PTB an attractive solution to reduce the access frequency of the decision memory, which enables the use of low-power, high-density SRAMs operating at a lower frequency.
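    The access frequency and the per-module power can be related by simple arithmetic. The Python sketch below assumes that an m-level PTB divides the memory access frequency by m and that SRAM power scales linearly with access frequency at the quoted 0.0324 mW/MHz. It covers a single SRAM module only, and the function names are illustrative.

```python
def access_frequency_mhz(pmu_clock_mhz, ptb_levels):
    """Decision memory access frequency with an m-level pre-traceback."""
    return pmu_clock_mhz / ptb_levels


def sram_power_mw(access_mhz, mw_per_mhz=0.0324):
    """Power of one SRAM module, assuming linear scaling with access frequency."""
    return access_mhz * mw_per_mhz


f_access = access_frequency_mhz(500, 2)  # decoder at 500 MHz with 2-level PTB
p_module = sram_power_mw(f_access)
```

    Under these assumptions, the access frequency is 250 MHz and one module dissipates about 8.1 mW; the number of modules and the remaining SMU logic determine the total.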

    The right column of Table V shows the power dissipation profile for the Viterbi decoder using the RE technique for the SMU. In this case, 72% of the total power is dissipated in the SMU. Compared to the TB implementation, which costs 57% even with non-optimal standard-cell-based memories, this is quite high. As discussed above, the size of the RE network is determined by K and TBD. Therefore, we can conclude that for high values of K and TBD, RE is not a suitable design choice in terms of power.

    As already discussed, for decoders running at very high speed, SRAMs of matching clock frequency may not be available. The memory architecture depends on the frequency at which the PMU is running and the maximum access frequency of the SRAMs. Table VII summarizes the guidelines for the memory architecture choice depending on these two clock frequencies: the maximum access frequency of the SRAMs and the operating frequency of the PMU, i.e., the frequency at which the decision columns are generated.

    TABLE VIII. CRITERIA FOR OPTIMIZATION

    TABLE IX. SUMMARY OF THE CHARTS, FIGURES, AND TABLES

    VI. CONCLUSION

    In this paper, a comprehensive analysis of the Viterbi decoder design space is presented. Table VIII summarizes the importance of the different subunits of the decoder depending on the optimization criteria (the BMU is not mentioned, since its influence on area, power, and throughput is negligible). As shown in Table VIII, the PMU is critical for throughput, while the SMU is critical for latency and power consumption.

    The most significant contributions of the paper can be summarized as follows:

    • detailed presentation of the design space for the hard-decision Viterbi decoder and each of its subunits;
    • comparison of different adder architectures for the PMU implementation (ripple-carry, carry-lookahead, and bit-serial);
    • quantitative comparison of behavioral and structural HDL coding styles;
    • quantitative comparison of different ACS architectures; in particular, it has been shown that the general assumptions of CSA being better than ACS and of the radix-4 technique being inefficient in terms of area do not hold in some cases;
    • analysis of the influence of timing constraints (relaxed vs. constrained synthesis) on the design choices;
    • detailed analysis of the impact of constraint length and input bitwidth on area and throughput;
    • detailed analysis of different SMU implementation techniques and their impact on power consumption based on post-layout simulation.

    For a better overview of the material, Table IX summarizes the figures, charts, and tables related to different design aspects to help the reader find the proper guidelines for choosing a Viterbi decoder design.

    This work discusses most of the known VLSI implementation techniques for the hard-decision Viterbi algorithm in standard cell CMOS technology and carefully analyzes the tradeoffs and dependencies between different design decisions. To the best of the authors' knowledge, this is the most comprehensive analysis of hard-decision Viterbi algorithm VLSI implementation based on actual designs, including post-layout experiments, published so far.

    ACKNOWLEDGMENT

    The authors would like to thank Dr. N. Engin for valuable discussions on hardware-software tradeoffs in Viterbi implementation and A. Vaassen for supporting the layout experiments.

    REFERENCES

    [1] H. D. Lin and D. G. Messerschmit, "Algorithms and architectures for concurrent Viterbi decoding," in Proc. IEEE Int. Conf. Commun., Jun. 1989, pp. 836-840.

    [2] G. Fettweis and H. Meyr, "High rate Viterbi processor: A systolic array solution," IEEE J. Sel. Areas Commun., vol. 8, no. 8, pp. 1520-1534, Oct. 1990.

    [3] G. Feygin, G. Gulak, and F. Pollar, "Survivor sequence memory management in Viterbi decoder," in Proc. IEEE Int. Sym. Circuits Syst., Jun. 1991, vol. 5, pp. 2967-2970.

    [4] P. J. Black and T. H.-Y. Meng, "Hybrid survivor path architectures for Viterbi decoders," in Proc. Int. Conf. Acoust. Speech, Signal Process., 1993, vol. 1, pp. 433-436.

    [5] M. Boo, F. Arguello, J. D. Bruguera, R. Doallo, and E. L. Zapata, "High-performance VLSI architecture for the Viterbi algorithm," IEEE Trans. Commun., vol. 45, no. 2, pp. 168-176, Feb. 1997.

    [6] K. K. Parhi, "An improved pipelined MSB-first add-compare select unit structure for Viterbi decoders," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 3, pp. 504-511, Mar. 2004.

    [7] Y. Engling, S. A. Augsburger, W. R. Davis, and B. Nicolic, "A 500-mb/s soft-output Viterbi decoder," IEEE J. Solid-State Circuits, vol. 38, no. 7, pp. 1234-1241, Jul. 2005.

    [8] G. Fettweis and H. Meyr, "High speed parallel Viterbi decoding: Algorithm and VLSI architecture," IEEE Commun. Mag., vol. 29, no. 5, pp. 46-55, May 1991.

    [9] H. M. H. Dawid and G. Fettweis, "A CMOS IC for gb/s Viterbi decoding: System design and VLSI implementation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 4, no. 1, pp. 17-31, Mar. 1996.

    [10] R. D. Wesel, Wiley Encyclopedia of Telecommunications. New York: Wiley, 2003, ch. Convolutional Codes.


    [11] P. J. Black and T. H. Y. Meng, A 140-mb/s, 32-state, radix-4 Viterbidecoder, IEEE J. Solid-State Circuits, vol. 27, no. 12, pp. 18771885,Dec. 1992.

    [12] I. Lee and J. L. Sonntag, A new architecture for the fast Viterbi al-gorithm communications,IEEE Trans. Commun., vol. 51, no. 10, pp.16241628, Oct. 2003.

    [13] G. Fettweis, H. Dawid, and H. Meyr, Minimized method Viterbi de-coding: 600 Mb/s per chip, in Proc. GLOBECOM 90, Dec. 1990, vol.

    3, pp. 17121716.[14] G. Fettweis and H. Meyr, Cascaded feedforward architectures for par-allel Viterbi decoding, in Proc. IEEE Int. Sym. Circuits Syst., May1990, pp. 978981.

    [15] P. J. Black and T. H. Y. Meng, A 1-Gb/s, four-state, sliding blockViterbi decoder, IEEE J. Solid-State Circuits, vol. 32, no. 6, pp.797805, Jun. 1997.

    [16] G. Fettweis and H. Meyr, Parallel Viterbi decoding by breaking thecompare-select feedback bottleneck,IEEE Trans. Commun., vol. 37,no. 8, pp. 785790, Aug. 1989.

    [17] C. B. Shung, H.-D. Ling, R. Cypher, P. H. Siegel, and H. K. Thapar,Area-efficient architectures for the Viterbi algorithm part I: Theory,

    IEEE Trans. Commun. , vol. 41, no. 4, pp. 636644, Apr. 1993.[18] C. B. Shung, H.-D. Ling, R. Cypher, P. H. Siegel, and H. K. Thapar,

    Area-efficient architectures for the Viterbi algorithm part II: Applica-tions, IEEE Trans. Commun., vol. 41, no. 5, pp. 802807, May 1993.

    [19] T. Jarvinen, P. Salmela, T. Sipila, and J. Takala, Systematic approach

    for path metric access in Viterbi decoders,IEEE Trans. Commun., vol.53, no. 5, pp. 755759, May 2005.

    [20] M. Benaissa and Y. Zhu, A novel high-speed configurable Viterbi de-coder for broadband access,J. Appl. Signal Process. EURASIP JASP,no. 13, pp. 13171327, 2003.

    [21] J. J. Kong and K. K. Parhi, K-nested layered look-ahead method andarchitectures for high throughput Viterbi decoder, in Proc. IEEE Work-shop Signal Process. Syst., Aug. 2003, pp. 99104.

    [22] J. J. Kong and K. K. Parhi, Low-latency architectures forhigh-throughput rate Viterbi decoders, IEEE Trans. Very LargeScale Integr. (VLSI) Syst., vol. 12, no. 6, pp. 642651, Jun. 2004.

    [23] V. Madisetti, VLSI Digital Signal Processors: An Introduction toRapid Prototyping and Digital Synthesis. Newton, MA: Butter-worth-Heinemann, Jun. 1995.

    [24] Y. N. Chang, H. Suzuki, and K. K. Parhi, A 2-Mb/s 256-state 10-mwrate-1/3 Viterbi decoder,IEEE Trans. Very Large Scale Integr. (VLSI)Syst., vol. 35, no. 6, pp. 826834, Jun. 2000.

    [25] G. Fettweis and H. Meyr, A 100 mbit/s Viterbi decoder chip: Novelarchitecture and its realization, in Proc. IEEE Int. Conf. Commun.,Apr. 1990, vol. 3, pp. 463467.

    [26] V. S. Gierenz, O. Weiss, T. G. Noll, I. Carew, J. Ashley, and R.Karabed, A 550 Mb/s radix-4 bit-level pipelined 16-state 0.25-mCMOS Viterbi decoder, in Proc. IEEE Int. Conf. ASAP, 2000, pp.195201.

    [27] T. Gemmeke, M. Gansen, and T. G. Noll, Implementation of scalablepower and area efficient high-throughput Viterbi decoders, IEEE J.Solid-State Circuits, vol. 37, no. 7, pp. 941948, Jul. 2002.

    [28] A. K. Yeung and J. M. Rabaey, 210 Mb/s radix-4 bit-level pipelinedViterbi decoder, in Proc. IEEE Int. Solid-State Circuit Conf. , Feb.1995, pp. 8889.

    [29] A. Hekstra, An alternative to metric rescaling in Viterbi decoders,IEEE Trans. Commun. , vol. 37, no. 11, pp. 12201222, Nov. 1989.

    [30] S. Ranpara, On a viterbi decoder design for low power dissipation,

    Masters thesis, Bradley Dept. Elect. Comput. Eng., Virginia Polytech.Inst. State Univ., Blacksburg, Apr. 1999.

    [31] I. Bogdan, M.Munteanu, P.A. Ivey, N.L. Seed, and N.Powell,Powerreduction techniques for a viterbi decoder implementation, in Proc.3rd Int. Workshop (ESDLPD), Rapello, Italy, Jul. 2000, pp. 59.

    [32] M. H. Chang, W. T. Lee, M. C. Lin, and L. G. Chen, IC design of anadaptive Viterbi decoder,IEEE Trans. Consum. Electron., vol. 42, no.1, pp. 5262, Feb. 1996.

    [33] E. Boutillon and L. Gonzalez, Trace back technques adapted to thesurviving memory managementin the m algorithm, inProc. Int. Conf.(ICASSP), Istanbul, Turkey, Jun. 2000, pp. 33663369.

    [34] C.-C. Lin, Y. H. Shih, H. C. Chang, and C. Y. Lee, Design of apower-reduction Viterbi decoder for WLANapplications,IEEE Trans.Circuits Syst. I, Reg. Papers, vol. 52, no. 6, pp. 11481156, Jun. 2005.

    [35] G. Forney, The Viterbi algorithm, Proc. IEEE, vol. 61, no. 3, pp.268278, Mar. 1973.

    [36] G. Feygin and P. G. Gulak, Architectural tradeoffs for survivorsequence memory management in Viterbi decoders, IEEE Trans.Commun., vol. 41, no. 3, pp. 425429, Mar. 1993.

    [37] Y. Gang, A. T. Erdogan, and T. Arslan, "An efficient pre-traceback architecture for the Viterbi decoder targeting wireless communication applications," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 9, pp. 1918–1927, Sep. 2006.

    [38] G. Fettweis, "Algebraic survivor memory management design for Viterbi decoders," IEEE Trans. Commun., vol. 43, no. 9, pp. 2458–2463, Sep. 1995.

    [39] R. Henning and C. Chakrabarti, "An approach for adaptively approximating the Viterbi algorithm to reduce power consumption while decoding convolutional codes," IEEE Trans. Signal Process., vol. 52, no. 5, pp. 1443–1451, May 2004.

    [40] M. Kamuf, "Trellis decoding: From algorithm to flexible architectures," Ph.D. dissertation, Dept. Electrosci., Lund Univ., Lund, Sweden, Mar. 2007.

    Irfan Habib received the B.S. degree in electronics engineering from Zakir Husain College of Engineering and Technology, Aligarh Muslim University (AMU), Aligarh, in 2004 and the M.S. degree in VLSI design, tools, and technology from Indian Institute of Technology, Delhi, India, in 2006.

    He is with NXP Semiconductors, Eindhoven, The Netherlands, where he designs high speed IOs. Previously, he worked on Viterbi decoder designs at Philips Research Labs., Eindhoven, The Netherlands. His research interests include VLSI architectures and implementation of digital communication techniques and hardware-software codesign.

    Özgün Paker (M'08) received the B.Sc. degree in electronics and communication from the Technical University of Istanbul, Istanbul, Turkey, in 1996 and the M.Sc. degree in computer systems and the Ph.D. degree in information technology from Technical University of Denmark, Lyngby, Denmark, in 1996 and 2002, respectively.

    He is currently with ST-Ericsson, Eindhoven, The Netherlands. In 2003, he joined Philips Research as a Senior Scientist mainly involved in research supporting the semiconductor division of Philips Electronics B.V. In 2006, he joined the Research organization of NXP Semiconductors that spun out from Philips. His research interests include low power, low cost VLSI implementation of programmable DSP systems. In particular, his current focus is on key baseband processing involved in wireless connectivity and cellular applications that employ MIMO technology as well as software-defined radio architectures enabling multi-channel, multi-standard real-time baseband processing. He is author or coauthor of several distinguished scientific publications including journal and conference papers and has 6 US and international patents and pending patent applications.

    Dr. Paker is a member of the Computer and Communications Society of the IEEE.

    Sergei Sawitzki (S'00–M'02) was born in Dresden, Germany, in 1975. He received the M.Sc. degree in computer science and the Ph.D. degree (with honors) in computer engineering from Dresden University of Technology, Dresden, Germany, in 1998 and 2002, respectively.

    In 2001, he joined Philips Research Laboratories, Eindhoven, The Netherlands, where he was active for almost 6 years as a Research Scientist and Senior Scientist. From 2006 to 2008, he was working as a Senior Scientist for the Research Division of NXP Semiconductors, Eindhoven, The Netherlands. In October 2008, he joined the faculty of FH Wedel University of Applied Sciences in Wedel, Germany, where he is currently teaching electronics and digital design. His research interests include reconfigurable computing, VLSI architectures and design, and high-speed digital signal processing for digital communication systems. He is the author or coauthor of more than 40 peer-reviewed scientific publications including book chapters, journal and conference papers. He has ten pending U.S. and international patents and patent applications.

    Dr. Sawitzki is a member of program committees of Field Programmable Logic and Applications Conference and IEEE Computer Society Symposium on VLSI as well as several other international conferences and workshops. He is a member of the IEEE Computer Society.