[IEEE 2005 2nd International Symposium on Wireless Communication Systems - Siena, Italy (05-09 Sept. 2005)] 2005 2nd International Symposium on Wireless Communication Systems - Array Processors for Viterbi Decoder



  • Array Processors for Viterbi Decoder

    Appaya Devaraj S and Nastooh Avessta

    Dept. of Information Technology, University of Turku, Turku, Finland

    e-mail: {adswam, nastooh.avessta}@utu.fi

    Abstract- Wireless receivers are often characterized as portable and battery operated. As such, they are bound by a tight set of constraints such as power consumption, area usage, and throughput speed. Parallel implementation of operations increases the speed of computation without an undue increase in clock frequency. Thus, high throughput is achieved without excessive power consumption. The apparent tradeoff for the throughput improvement of parallel implementation is larger area usage. Hence, there is a need to find an optimal implementation in terms of area, speed and complexity. In this paper, an exhaustive set of parallel implementations of the Viterbi decoder is developed by different arrangements of Processing Elements (PEs) and schedules. An objective comparison among the various implementations is performed to select the optimal implementation.

    Keywords- Spatial-temporal mapping; array processor; Viterbi decoder; design space exploration


    I. INTRODUCTION

    Low power and high performance are essential factors in wireless communication. Advances in VLSI technology have increased both the available room and the maximum achievable clock frequency in a chip. To obtain higher throughput, designers may rely either on parallel implementation or on a higher clock frequency. However, the latter suffers from higher power consumption and cross talk anomalies [1]. An alternative solution is to increase the computational speed through parallelism. Pipelining is the simplest form of machine parallelism and is universally applied in all types of computing systems. Beyond simple pipelining, there are several other methods to exploit parallelism in order to improve performance [2].

    Convolutional codes are widely used in wireless communication, as exemplified by IS-95 and 3G [3]. The Viterbi algorithm is commonly used for decoding convolutional codes in digital communications [4]. Due to its prominent role in digital communication, a plethora of optimized implementations have been considered. For example, [5] explains matrix-vector multiplication schemes and a strongly connected trellis for efficient implementation of the Viterbi algorithm, while [6] describes a matrix multiplication type implementation without a strongly connected trellis, exploiting dual-dimensional parallelization for the Viterbi algorithm. In [7], three levels of parallelism for high-speed implementation are presented. Instead of a matrix multiplication type architecture, a unidirectional ring architecture is presented in [8]. Reference [9] summarizes the previous approaches and presents perfect shuffle concepts for the Viterbi implementation. In contrast to the works mentioned above, in this paper various parallel implementations of the Viterbi decoder with different architectures and schedules are realized and evaluated. The objective is to select the optimal implementation.

    Section 2 and Section 3 provide an overview of the Viterbi algorithm and array architectures, respectively. Exhaustive sets of mappings and schedules for the implementation are discussed in Section 4. Section 5 provides the evaluations of the various drafted architectures. Observations and conclusions are presented in Section 6.

    II. VITERBI DECODER SETUP

    The Viterbi algorithm finds the most likely sequence of state transitions in a state diagram, given a sequence of received symbols corrupted by noise [10]. In Fig. 1, the states of the encoder are shown inside the circles and its output/input on the arcs. The state transitions considered here can be visualized by the trellis diagram shown in Fig. 2. The state transitions in the trellis are annotated on the first column of Fig. 2. The transitions due to inputs 0 and 1 are indicated for the first time step in Fig. 2. At each time index, there are three major steps in Viterbi detection.

    1. Generate branch metrics for all possible transitions among all pairs of states in the given trellis.

    2. Update the path metric of every state by adding the branch metrics to the corresponding previous path metrics. The pair of paths that leads to a state is compared, and the path with the minimum value is stored as the survivor path in the survivor matrix.

    3. Trace the survivor matrix back L (window size) stages to determine the most likely transitions.

    In this paper, Hamming distance is used as the distance measure to find the most likely sequence of symbols.
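    The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation evaluated in this paper: the generator polynomials (octal 7 and 5), the state encoding, and the function names `step`, `encode` and `viterbi_decode` are assumptions, since the exact code of Fig. 1 cannot be recovered from the text. The traceback here runs over the whole received block rather than over a sliding window of L stages.

```python
from itertools import product

# Assumed generator polynomials (octal 7 and 5) for a 4-state, rate-1/2 code.
# The code drawn in Fig. 1 of the paper may use different polynomials.
G = (0b111, 0b101)
N_STATES = 4


def step(state, bit):
    """Return (next_state, output_pair) for one encoder transition."""
    reg = (bit << 2) | state                      # newest input bit is the MSB
    out = tuple(bin(reg & g).count("1") % 2 for g in G)
    return reg >> 1, out


def encode(bits):
    """Encode a list of bits, starting from the all-zero state."""
    state, coded = 0, []
    for b in bits:
        state, pair = step(state, b)
        coded.extend(pair)
    return coded


def viterbi_decode(received):
    """Hard-decision Viterbi decoding with Hamming-distance branch metrics."""
    inf = float("inf")
    pairs = [tuple(received[i:i + 2]) for i in range(0, len(received), 2)]
    path_metric = [0] + [inf] * (N_STATES - 1)    # encoder starts in state 0
    survivors = []                                # survivor matrix, one row per time step
    for rx in pairs:
        new_metric = [inf] * N_STATES
        row = [None] * N_STATES
        for state, bit in product(range(N_STATES), (0, 1)):
            nxt, out = step(state, bit)
            # Step 1: branch metric = Hamming distance to the received pair.
            branch = sum(a != b for a, b in zip(out, rx))
            # Step 2: add, compare, select the smaller path metric.
            candidate = path_metric[state] + branch
            if candidate < new_metric[nxt]:
                new_metric[nxt] = candidate
                row[nxt] = (state, bit)
        path_metric = new_metric
        survivors.append(row)
    # Step 3: trace the survivor matrix back from the best final state.
    state = min(range(N_STATES), key=lambda s: path_metric[s])
    decoded = []
    for row in reversed(survivors):
        state, bit = row[state]
        decoded.append(bit)
    return decoded[::-1]


if __name__ == "__main__":
    message = [1, 0, 1, 1, 0, 0, 1]
    received = encode(message)
    received[3] ^= 1                              # inject a single channel error
    print(viterbi_decode(received) == message)
```

    The inner loop over all (state, input) pairs is the part that the array architectures in the following sections distribute over PEs: each trellis node corresponds to one add-compare-select operation.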

    III. ARRAY PROCESSORS

    Parallel algorithm expressions may be derived by two approaches:

    1) Vectorisation of the sequential algorithm expressions.

    2) Spatial-temporal mapping.

    Vectorising compilers may not be as effective as spatial-temporal mapping in extracting the inherent concurrency of the operations. Therefore, it is advantageous to employ parallel expressions such as Dependence Graphs (DGs) or Signal Flow Graphs (SFGs) to describe an algorithm, whilst exploiting algorithmic concurrencies [11]. The next step is to map the parallel expressions onto a suitable processor array. Array processors are used for high-speed processing and to reduce the inherent complexity of the design. Array architectures include SIMD arrays, MIMD arrays, and systolic arrays.

    0-7803-9206-X/05/$20.00 2005 IEEE

    IV. ARCHITECTURES FOR VITERBI DECODERS

    Many research works, such as [5] and [6], map DGs directly to systolic arrays, but in this paper we adopt DG mapping to SFG arrays [11]. The main reasons are that the SFG defines array structures with minimum timing constraints and that formal transformations from an SFG to a systolic array can be developed [11]. The two steps involved in mapping a DG to an SFG array are processor assignment and scheduling. The nodes of the DG along the projection vector d are assigned to a common PE, where the projection maps the DG onto a lower dimensional lattice of points known as the processor space or base.


    The scheduling vector s specifies the sequence of operations. A plane normal to s is referred to as a hyperplane (HP). The schedule is permissible if and only if all the dependency arcs flow in the same direction across the hyperplanes and the hyperplanes are not parallel to the projection vector d [11]. In this paper, the trellis diagram shown in Fig. 2 is considered as the DG, and different implementation types (SFGs) for Viterbi decoding are obtained by changing the projection and schedule vectors. In the implementation types discussed in this section, shown in Fig. 3, Fig. 4 and Fig. 5, nodes in the trellis are projected onto a processor base as shown by the dashed lines, and the connections among the PEs in the projected nodes correspond to the arc numbers in the trellis diagram. Fig. 3 and Fig. 4 show Type I and Type II; Type III and Type IV are shown in Fig. 5, where s1 and s2 indicate the schedules of Type III and Type IV, respectively. In Type I, d and s are along the X-axis, the same direction as that of the dataflow. In Type III, d is diagonal and s is along the X-axis. Type I and Type II share the same d. Likewise, d is the same for Type III and Type IV, resulting in a similar architecture. However, different schedules lead to different numbers of computational clock cycles.

    Figure 1. State Diagram for 4-state, 1/2 Convolution Code

    Figure 2. Trellis Diagram

    Figure 3. Type-I, d and s Along X-axis (panels: trellis diagram, projected nodes)

    Figure 4. Type-II, d Along X-axis and s is Diagonal (panels: trellis diagram, projected nodes)
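    The effect of the projection and schedule vectors can be illustrated with a small Python sketch. It assumes a coordinate system that is not stated in the paper: the trellis of Fig. 2 is treated as a 7 x 4 grid of nodes with x as the time index and y as the state index, the X-axis direction is taken as (1, 0), and the diagonal direction as (1, 1). The names `TYPES`, `processor_of`, `hyperplane_of` and `summarize` are hypothetical. Only the condition that the hyperplanes are not parallel to d is checked; the arc-direction condition would additionally require the trellis arcs.

```python
from itertools import product

N_STATES, DEPTH = 4, 7            # 4 states, window of 7 time steps (Fig. 2)

# Assumed directions: x = time index, y = state index.
X_AXIS, DIAGONAL = (1, 0), (1, 1)
TYPES = {
    "I":   {"d": X_AXIS,   "s": X_AXIS},
    "II":  {"d": X_AXIS,   "s": DIAGONAL},
    "III": {"d": DIAGONAL, "s": X_AXIS},
    "IV":  {"d": DIAGONAL, "s": DIAGONAL},
}

# Every trellis node is one DG node (time, state).
NODES = list(product(range(DEPTH), range(N_STATES)))


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def processor_of(node, d):
    """Nodes lying on the same line along d are assigned to the same PE."""
    normal = (-d[1], d[0])        # a vector orthogonal to d indexes the processor base
    return dot(normal, node)


def hyperplane_of(node, s):
    """Nodes with the same value of s . node lie on the same hyperplane."""
    return dot(s, node)


def summarize(d, s):
    """Return (number of PEs, number of hyperplanes) for one mapping."""
    if dot(s, d) == 0:
        raise ValueError("hyperplanes are parallel to d: schedule not permissible")
    n_pes = len({processor_of(n, d) for n in NODES})
    n_hps = len({hyperplane_of(n, s) for n in NODES})
    return n_pes, n_hps


if __name__ == "__main__":
    for name, vectors in TYPES.items():
        print(name, summarize(**vectors))
```

    Under these assumptions, changing d changes only the number of PEs and changing s changes only the number of hyperplanes, which is the separation between area and time that the remaining sections rely on.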

    V. EVALUATION

    In Type I, d is along the X-axis, so the processor base has 4 PEs, and the number of PEs is independent of the depth of the trellis. However, in Type III, d is diagonal to the trellis, so the processor base has 10 PEs, each corresponding to one of the dashed lines from the trellis, and the number of PEs depends on the depth of the trellis. In Type II, s is diagonal to the trellis, and in Type I, s is along the X-axis, which gives 10 and 7 hyperplanes (indicated by HPi in Fig. 4), respectively. The time taken to complete the whole operation depends on the number of hyperplanes, as the PEs in the various hyperplanes are scheduled one after the other.

    Figure 5. Type-III & IV, d is Diagonal and s1 Along X-axis and s2 Diagonal

    Figure 6. Active PEs vs Clock Cycle for Type I and III (horizontal axis: Time)

    Solid circles in Fig. 6 and Fig. 7 indicate the active PEs at the corresponding time step for the various types. All the PEs on the same hyperplane are executed at the same time, but if some elements need inputs from previous elements that are yet to be executed, they must wait. For example, at the 3rd clock cycle, PEs 3, 6 and 9 in Fig. 4 were to be executed (as they are all in HP3), but element 6 needs input from 4 and element 9 needs input from 6, which were not executed before the 3rd clock cycle. Therefore, element 6 is executed at the 5th clock cycle, and at the next clock (6th clock cycle) element 9 is executed. The length of the longest path in the DG is the lower bound for the total time required for executing the algorithm, independent of the number of PEs used [12]. The longest path in the trellis diagram (Fig. 2) takes 7 clock cycles. The number of clock cycles taken by each of the four types equals its number of hyperplanes. The performance of the various types is tabulated in Table I, where the average processor utility is calculated using (1).

    $$\text{Avg}_{\text{utilisation}} = \frac{\text{Total Number of Active PEs}}{\text{Total Number of PEs}} \times 100 \tag{1}$$

    where the total number of active PEs and the total number of PEs are considered for the window size 7 (Fig. 2), and can be calculated from Fig. 6 and Fig. 7.

    TABLE I

    Type                         I      II     III    IV
    Total Clock Cycles           7      10     7      10
    No. of PEs                   4      4      10     10
    Avg. Processor utility (%)   100    70     40     28

    VI. OBSERVATIONS AND CONCLUSION

    By rearranging the projected nodes in Type I and III, a square array (Fig. 8) is obtained, which is similar to the state diagram in Fig. 1. Path metrics are calculated for all the states at every clock cycle to get the survivor matrix. Therefore, in Type I all 4 PEs are active in all clock cycles (Fig. 6, Type I), which leads to 100 percent processor utility. The total computation time in Type I is 7 clock cycles, which is the minimum; hence it is the optimal implementation.
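    The utility figures in Table I can be cross-checked with a short Python sketch. It rests on one interpretation that the text does not spell out: the "total number of PEs" in (1) is counted over the whole window, that is, as the number of PEs multiplied by the number of clock cycles, and the total number of active PEs is taken as one activation per trellis node (4 states x 7 time steps). The names `TABLE_I` and `avg_utilisation` are hypothetical.

```python
# Assumption: one PE activation per trellis node in the window of Fig. 2.
ACTIVE_SLOTS = 4 * 7

# (total clock cycles, number of PEs) per type, copied from Table I.
TABLE_I = {"I": (7, 4), "II": (10, 4), "III": (7, 10), "IV": (10, 10)}


def avg_utilisation(cycles, pes, active=ACTIVE_SLOTS):
    """Equation (1), with the denominator read as PE-cycle slots in the window."""
    return 100.0 * active / (pes * cycles)


if __name__ == "__main__":
    for name, (cycles, pes) in TABLE_I.items():
        print(f"Type {name}: {avg_utilisation(cycles, pes):.0f} %")
```

    With this reading, the sketch returns the four utility values listed in Table I, which suggests that the idle slots in Types II to IV come entirely from the extra PEs or the extra clock cycles.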

    From Table I it is noted that the processor utility is higher for the types that have a structure similar to the state diagram. It should also be noted that in Type I the schedule vector is in the same direction as the dataflow in the trellis. Furthermore, as the schedule vector changes from the direction of the dataflow, as in Type II, the processor utility decreases. Future studies will consider the addition of fault tolerance for types that are not fully utilized.

    Figure 7. Active PEs vs Clock Cycle for Type II and IV (horizontal axis: Time)

    Figure 8. Square Array by Rearranging Type I

    In general, a change in the projection vector affects the area needed for implementation, and a change in the schedule vector affects the total time needed for the computation. In this paper, various admissible parallel implementations of Viterbi decoders were systematically drafted, and Type I was found to be the optimal implementation.


    REFERENCES

    [1] William J. Dally and John W. Poulton, Digital Systems Engineering, Cambridge University Press, March 1998.

    [2] Krste Asanovic, Vector Microprocessors, University of California at Berkeley, May 1998.

    [3] Ranpara, S., Dong Sam Ha, A low-power Viterbi decoder design for wireless communications applications, ASIC/SOC Conference, 1999. Proceedings. Twelfth Annual IEEE International, 15-18 Sept. 1999, pp. 377-381.

    [4] John G. Proakis, Digital Communications, McGraw-Hill College, 3rd edition, March 1, 1995.

    [5] Chang, C.-Y., Yao, K., Systolic array processing of the Viterbi algorithm, Information Theory, IEEE Transactions on, vol. 35, Issue: 1, Jan. 1989, pp. 76-86.

    [6] Kuei Ann Wen, Jau Yien Lee, Parallel processing for Viterbi algorithm, Electronics Letters, vol. 24, Issue: 17, 18 Aug. 1988, pp. 1098-1099.

    [7] Fettweis, G., Meyr, H., High-speed parallel Viterbi decoding: algorithm and VLSI-architecture, Communications Magazine, IEEE, vol. 29, Issue: 5, May 1991, pp. 46-55.

    [8] W. Bliss, J. Girard, J. Avery, M. Lightner and L. Scharf, A modular architecture for dynamic programming and maximum likelihood sequence estimation, Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '86, vol. 11, Apr. 1986, pp. 357-360.

    [9] Gulak, P.G., Kailath, T., Locally connected VLSI architectures for the Viterbi algorithm, Selected Areas in Communications, IEEE Journal on, vol. 6, Issue: 3, April 1988, pp. 527-537.

    [10] Lou, H.-L., Implementing the Viterbi algorithm, Signal Processing Magazine, IEEE, vol. 12, Issue: 5, Sept. 1995, pp. 42-52.

    [11] S.Y. Kung, VLSI Array Processors, Prentice-Hall, 1988.

    [12] S.K. Rao, T. Kailath, Regular iterative algorithms and their implementation on processor arrays, Proc. IEEE, vol. 76, Mar. 1988, pp. 259-269.


