Doctoral Thesis: Design and Implementation of High-Performance Radix-4 Turbo Decoder for Multiple 4G Standards (Kim, Ji-Hoon), School of Electrical Engineering & Computer Science, Division of Electrical Engineering, KAIST, 2009

  • Doctoral Thesis

    Design and Implementation of

    High-Performance Radix-4 Turbo Decoder

    for Multiple 4G Standards

    ( Kim, Ji-Hoon)

    School of Electrical Engineering & Computer Science

    Division of Electrical Engineering

    KAIST

    2009

  • Design and Implementation of

    High-Performance Radix-4 Turbo Decoder

    for Multiple 4G Standards

    Advisor: Professor Park, In-Cheol

    by

    Kim, Ji-Hoon

    School of Electrical Engineering & Computer Science

    Division of Electrical Engineering

    KAIST

    A thesis submitted to the faculty of the KAIST in partial

    fulfillment of the requirements for the degree of Doctor of

    Philosophy in Engineering in the School of Electrical

    Engineering and Computer Science, Division of Electrical

    Engineering

    Daejeon, Korea

    2009. 4. 29.

    Approved by

    Professor In-Cheol Park

    Professor Park, In-Cheol

    Advisor

  • DEE 20047150. Kim, Ji-Hoon. Design and Implementation of High-Performance Radix-4 Turbo Decoder for Multiple 4G Standards. School of Electrical Engineering & Computer Science, Division of Electrical Engineering. 2009. 85 p. Advisor: Prof. Park, In-Cheol. Text in English.

    Abstract

    Recently, turbo codes have been adopted for high-speed data transmission in 4G communication systems such as Mobile WiMAX (IEEE 802.16e standard) and 3GPP-LTE, in double-binary and single-binary form, respectively. In particular, the double-binary convolutional turbo code (CTC) offers advantages over the classical single-binary turbo code. However, compared with the classical single-binary turbo code, the nonbinary turbo code is much more complex to implement in hardware, and its decoding requires more memory, especially for storing the extrinsic information exchanged between the two soft-input soft-output (SISO) decoders. Additionally, due to the iterative decoding behavior, implementing a high-performance turbo decoder for next-generation mobile communication systems is challenging. Also, as the need to support multiple standards in a single handheld device increases, the efficient implementation of the advanced channel decoder, which is the most area-consuming and computationally intensive block in the baseband modem, becomes more important.

    To deal with these issues in resource-limited handheld systems, this dissertation presents solutions at every level: algorithm, architecture, and implementation. As algorithmic solutions, two techniques are proposed that are especially suitable for nonbinary and high-radix single-binary turbo decoding. The first, an energy-efficient SISO decoder based on border metric encoding, eliminates the complex dummy calculation at the cost of a small memory that holds encoded border metrics. Because the border memory is small and accessed infrequently, the energy consumed for SISO decoding is greatly reduced. The second, which reduces the memory size required for double-binary turbo decoding, is a new method to convert symbol-level extrinsic information to bit-level information and vice versa. By exchanging bit-level extrinsic information, the number of extrinsic information values to be exchanged in double-binary turbo decoding is reduced to the same amount as in single-binary turbo decoding. Since the extrinsic information memory is large, the proposed method is effective in reducing the total memory size needed in a double-binary turbo decoder.

    To verify the proposed algorithmic solutions, two chips have been implemented. The first chip contains a double-binary turbo decoder for the Mobile WiMAX standard with a dedicated hardware interleaver, fabricated in a 0.13 µm CMOS process. The decoder is based on a time-multiplexing architecture consisting of a single optimized SISO decoder and a low-complexity hardware interleaver, and it provides up to 50 Mb/s at 200 MHz with a simple early stopping criterion that exploits the bit-level extrinsic information. The second chip implements a unified radix-4 turbo decoder architecture that supports both Mobile WiMAX and 3GPP-LTE. To achieve a decoding rate of more than 100 Mb/s, the chip consists of eight retimed radix-4 SISO decoders and a dual-mode parallel hardware interleaver supporting both standards. It achieves more than 400 Mb/s at 250 MHz with a simple early stopping criterion. Because the retiming technique makes the peak operating frequency relatively high, the supply voltage can be scaled down, and the chip achieves an energy efficiency of 0.34 nJ/bit/iteration while still delivering more than 100 Mb/s with eight fixed iterations.

  • Contents

    CHAPTER 1 INTRODUCTION
      1.1 Motivation
      1.2 Previous Works
      1.3 Contributions

    CHAPTER 2 BACKGROUNDS
      2.1 Digital Communication System
      2.2 Introduction to Turbo Codes
        2.2.1 Turbo Code Encoder Structure
        2.2.2 Turbo Decoding
        2.2.3 Decoding Algorithm for Turbo Codes
      2.3 Turbo code in Mobile WiMAX
        2.3.1 Encoding
        2.3.2 Decoding
      2.4 Turbo code in 3GPP-LTE
        2.4.1 Encoding
        2.4.2 Decoding

    CHAPTER 3 BORDER METRIC ENCODING
      3.1 Radix-4 SISO Decoding
        3.1.1 Sliding Window for nonbinary SISO Decoding
      3.2 Proposed Border Metric Encoding
      3.3 Experimental Results

    CHAPTER 4 BIT-LEVEL EXTRINSIC INFORMATION EXCHANGE
      4.1 Extrinsic Information in Double-Binary Turbo Codes
        4.1.1 Symbol-level Extrinsic Information in Double-Binary Turbo Codes
        4.1.2 Memory Requirement in Double-Binary Turbo Decoder
      4.2 Proposed Bit-Level Extrinsic Information Exchange
        4.2.1 Bit-level Extrinsic Information for Double-Binary Turbo Codes
        4.2.2 Symbol-to-Bit Conversion of Extrinsic Information
        4.2.3 Bit-to-Symbol Conversion of Extrinsic Information
      4.3 Experimental Results
        4.3.1 Hardware Implementation

    CHAPTER 5 A 50MBPS DOUBLE-BINARY CIRCULAR TURBO DECODER FOR MOBILE WIMAX
      5.1 Proposed Chip Architecture
        5.1.1 Low-Complexity SISO Decoder Design
        5.1.2 Bit-level Extrinsic Information Exchange
        5.1.3 Dedicated Hardware Interleaver
        5.1.4 Dedicated Double-Flow Hardware Interleaver
        5.1.5 Early Stopping Criterion
      5.2 Implementation Results

    CHAPTER 6 A UNIFIED PARALLEL RADIX-4 TURBO DECODER FOR MOBILE WIMAX AND 3GPP-LTE
      6.1 Proposed Chip Architecture
        6.1.1 Parallel Turbo Decoding
        6.1.2 Unified Radix-4 SISO Decoder with Retiming
        6.1.3 Memory-Sharing with Bit-level Extrinsic Information
        6.1.4 Dual-Mode Hardware Interleaver
      6.2 Implementation Results

    CHAPTER 7 CONCLUSIONS

    REFERENCE

  • List of Figures

    Figure 1.1: The Need for Supporting Multiple Standards
    Figure 1.2: Research Overview
    Figure 1.3: Proposed Solutions for Nonbinary CTC Decoder Implementation
    Figure 2.1: Model of a digital communication system
    Figure 2.2: Turbo code encoder structure
    Figure 2.3: A turbo decoder structure
    Figure 2.4: Double-binary CRSC constituent encoder used by WiMAX
    Figure 2.5: A decoder for the WiMAX turbo code
    Figure 2.6: A Turbo Encoder for 3GPP-LTE
    Figure 2.7: A Trellis Diagram of a 3GPP-LTE Turbo Encoder
    Figure 3.1: Trellis Diagrams
    Figure 3.2: Sliding window diagrams
    Figure 3.3: 3-bit border metric encoding function
    Figure 3.4: BER performance comparison with 8 iterations for 4800-bit frame
    Figure 3.5: BER performance of 1920-bit frame according to the number of iterations
    Figure 4.1: Memory Requirements in Double-Binary Turbo Decoder
    Figure 4.2: Block diagram of the proposed bit-level extrinsic information exchange
    Figure 4.3: Proposed Bit-Level Extrinsic Information Exchange
    Figure 4.4: Comparison of BER performance of 8 iterations for 1920-bit frame
    Figure 4.5: Block diagram of the proposed double-binary turbo decoder
    Figure 4.6: Block diagram and complexity of the proposed bit-to-symbol converter
    Figure 5.1: Branch metric memory width comparison
    Figure 5.2: Block diagram of the proposed two converters
    Figure 5.3: Interleaving procedure for the WiMAX
    Figure 5.4: Interleaver structure based on the incremental calculation
    Figure 5.5: Need of LIFO for Interleaved Address
    Figure 5.6: Double-flow hardware interleaver based on incremental calculation
    Figure 5.7: Double-flow hardware interleaver based on incremental calculation
    Figure 5.8: Block diagram of the proposed double-binary turbo decoder
    Figure 5.9: Average number of iterations for the proposed turbo decoder
    Figure 5.10: Comparison of BER performance for 1920-bit frame
    Figure 5.11: Die photo of the proposed double-binary turbo decoder chip
    Figure 6.1: Overall Unified Turbo Decoder Architecture with Time-Multiplexing
    Figure 6.2: The Proposed Chip Architecture with Eight SISO Decoders
    Figure 6.3: Add-Compare-Select (ACS) block with Retiming
    Figure 6.4: Sliding Window with Register Retiming
    Figure 6.5: Input Frame Memory Configurations
    Figure 6.6: Dual-Mode Dedicated Hardware Interleaver
    Figure 6.7: FER Performance and Average Iteration Number with Early Termination in an AWGN Channel
    Figure 6.8: Memory Size Reduction in the Proposed Architecture
    Figure 6.9: Micrograph of the Chip

  • List of Tables

    Table 1.1 Differences between the 3GPP-LTE and Mobile WiMAX Turbo Codes
    Table 3.1 Simulation environment
    Table 3.2 Encoded values for border metrics
    Table 3.3 Single-port SRAM size required for a SISO decoder
    Table 3.4 Energy consumptions of SISO decoders
    Table 4.1 Simulation environment
    Table 4.2 Memory Configuration for one SISO Decoder
    Table 4.3 Memory Configuration for the Extrinsic Information
    Table 4.4 Single-port SRAM Size required for the Turbo Decoder
    Table 5.1 CTC Interleaver Parameters for WiMAX
    Table 5.2 Single-port SRAM Size Required for the Turbo Decoder
    Table 6.1 Comparison of Decoder Implementation

  • List of Abbreviations

    4G: 4th Generation

    RSC: Recursive Systematic Convolutional

    CTC: Convolutional Turbo Code

    SISO: Soft-Input Soft-Output

    LLR: Log Likelihood Ratio

    ML: Maximum Likelihood

    SOVA: Soft-Output Viterbi Algorithm

    MAP: Maximum a posteriori

    APP: a posteriori Probability

    ECC: Error Correction Coding

    FEC: Forward Error Correction

    BPSK: Binary Phase Shift Keying

    OFDMA: Orthogonal Frequency Division Multiple Access

    NLOS: Non-Line-of-Sight

    AWGN: Additive White Gaussian Noise

    ARP: Almost Regular Permutation

    QPP: Quadratic Polynomial Permutation

  • Chapter 1

    Introduction

    The turbo code, introduced in 1993, is one of the most powerful forward error correction channel codes and provides near-optimal bit-error rates (BERs), that is, within 0.5 dB of Shannon's limit at a BER of 10^-5 [1]. Owing to this remarkable performance, turbo codes have been adopted in many standardized mobile radio systems.

    Recent advances in convolutional turbo codes (CTCs) have attracted much interest in their applications. The conventional CTC suffers from a high error floor due to its relatively small minimum Hamming distance, and from performance degradation due to puncturing. The nonbinary CTC has recently emerged and appears to remedy many flaws of the classical single-binary CTC [2]. In addition, the concept of the tail-biting convolutional code has been applied to the CTC. The tail-biting code, also called the circular code, improves the spectral efficiency of the CTC because it needs no tail bits to terminate the encoder state.

    Recently, turbo codes have been adopted for high-speed data transmission in 4G mobile communication systems such as Mobile WiMAX (IEEE 802.16e standard) and 3GPP-LTE, in double-binary and single-binary form, respectively.

    1.1 Motivation

    There has been little research dedicated to the hardware implementation of the double-binary turbo decoder, although previous work on classical single-binary turbo codes can be applied to nonbinary turbo codes [4]-[11]. Compared with the classical single-binary turbo code, the nonbinary turbo code is much more complex to implement in hardware, and its decoding requires more memory, especially for storing the extrinsic information exchanged between the two soft-input soft-output (SISO) decoders.

    In addition, as the need to support multiple standards in a single handheld device increases, as shown in Figure 1.1, the efficient implementation of the advanced channel decoder, which is the most area-consuming and computationally intensive block in the baseband modem, becomes more important. Accordingly, a unified decoder architecture that supports multiple standards becomes necessary, since separate implementations for different standards require considerable hardware resources and occupy a large silicon area. Since the turbo codes adopted in 3GPP-LTE and Mobile WiMAX differ from each other, as listed in Table 1.1, the efficient implementation of a unified turbo decoder supporting both 3GPP-LTE and Mobile WiMAX is important for future mobile handheld devices. Also, due to the iterative decoding behavior and long critical path, implementing a high-performance turbo decoder for next-generation mobile communication systems is challenging.

    Figure 1.1: The Need for Supporting Multiple Standards

    1.2 Previous Works

    There have been studies on double-binary turbo decoding that aim to lower the hardware complexity [12]-[14]. As a double-binary SISO decoding algorithm based on the maximum a posteriori (MAP) algorithm [1], the constant log-MAP algorithm has been reported for double-binary turbo decoding [12]. By using a constant correction term in the log-MAP algorithm for double-binary SISO decoding, a performance improvement was observed.
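    The role of the constant correction term can be illustrated with the pairwise max* operation that underlies log-domain MAP decoding. The sketch below compares the exact Jacobian logarithm, the max-log approximation, and a constant-correction variant. The correction value `c` and threshold `t` are assumed example values chosen for illustration, not the constants used in [12], and all function names are hypothetical.

```python
import math

def max_star_exact(a, b):
    """Jacobian logarithm ln(e^a + e^b), the exact log-MAP operation."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_star_max_log(a, b):
    """Max-log-MAP: the correction term is dropped entirely."""
    return max(a, b)

def max_star_constant(a, b, c=0.375, t=2.0):
    """Constant log-MAP: add a fixed correction c only when |a - b| < t.

    c and t are assumed example values, not taken from [12].
    """
    return max(a, b) + (c if abs(a - b) < t else 0.0)

if __name__ == "__main__":
    for a, b in [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]:
        print(a, b,
              max_star_exact(a, b),
              max_star_max_log(a, b),
              max_star_constant(a, b))
```

    When the two inputs are close, the constant variant lies nearer to the exact value than plain max does, and it needs only a comparison and an addition rather than a lookup table.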

    Due to the tail-biting property, the initial values of the forward and backward metrics are not explicitly specified. In [13], a simple method to determine the initial state in circular turbo decoding is presented. It has been reported that using the information from the previous iteration gives better performance and lower computational complexity than the pre-computing method [15].

    To reduce the large extrinsic information memory, two techniques have been introduced in [14]. The first, bitwise approximation of the extrinsic information, reduces the three extrinsic information values of double-binary turbo decoding to two by modifying the SISO decoding structure. However, it severely degrades BER performance, by about 0.5 dB or more. It is also well known that non-uniform quantization can be applied to reduce the extrinsic information memory size, since the extrinsic information does not need to be exact in decoding [4][5]. Exploiting this property, the second technique uses a block-scaling method in which a common shift index is shared by the three extrinsic information values. This method greatly reduces the extrinsic information memory size with negligible performance degradation, although the number of extrinsic information values is still three.

    Table 1.1 Differences between the 3GPP-LTE and Mobile WiMAX Turbo Codes

    Item                        | 3GPP-LTE                            | Mobile WiMAX
    RSC code: type              | Single-Binary                       | Double-Binary
    RSC code: constraint length | 4                                   | 4
    Trellis termination         | Appending the bits that make both   | Tail-Biting (Circular Coding)
                                | encoder states all zero and sending |
                                | the resulting codes                 |
    Interleaver: type           | QPP Interleaver                     | ARP Interleaver
    Interleaver: frame size (N) | N = 40 + 8f,    0 ≤ f ≤ 59          | 24, 36, 48, 72, 96, 108,
                                | N = 512 + 16f,  0 ≤ f ≤ 32          | 120, 144, 180, 192, 216,
                                | N = 1024 + 32f, 0 ≤ f ≤ 32          | 240, 480, 960, 1440,
                                | N = 2048 + 64f, 0 ≤ f ≤ 64          | 1920, 2400 (pairs)
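    The frame sizes listed in Table 1.1 can be enumerated programmatically. The sketch below assumes that the 3GPP-LTE entry follows the pattern N = base + step·f over four ranges, which gives 188 distinct sizes from 40 to 6144 bits, and that the Mobile WiMAX sizes are counted in bit pairs, so the frame length in bits is twice the listed value. Function and variable names are illustrative only.

```python
# (base, step, largest f) for each of the four 3GPP-LTE size ranges (assumed pattern).
LTE_RANGES = [(40, 8, 59), (512, 16, 32), (1024, 32, 32), (2048, 64, 64)]

# Mobile WiMAX frame sizes in bit pairs, as listed in Table 1.1.
WIMAX_PAIRS = [24, 36, 48, 72, 96, 108, 120, 144, 180, 192, 216,
               240, 480, 960, 1440, 1920, 2400]

def lte_frame_sizes():
    """Return the sorted list of distinct 3GPP-LTE frame sizes in bits."""
    sizes = set()
    for base, step, f_max in LTE_RANGES:
        for f in range(f_max + 1):
            sizes.add(base + step * f)  # range boundaries overlap; the set removes repeats
    return sorted(sizes)

def wimax_frame_bits():
    """Return the Mobile WiMAX frame sizes in bits (two bits per pair)."""
    return [2 * n for n in WIMAX_PAIRS]

if __name__ == "__main__":
    lte = lte_frame_sizes()
    print(len(lte), lte[0], lte[-1])
    print(wimax_frame_bits())
```

    Under these assumptions, the largest WiMAX frame of 2400 pairs corresponds to the 4800-bit frame and 960 pairs to the 1920-bit frame named in the figure captions.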

    In addition, there have been several turbo decoder implementations for single-binary turbo codes [9][20]. To support multiple 3G standards, such as CDMA2000 and W-CDMA, a programmable single-instruction multiple-data (SIMD) processor has been proposed for interleaving, in order to provide interleaved data at the speed of the hardware SISO [20]. Compared to a ROM-based interleaver, which needs a large ROM to store all possible interleaved patterns, this approach achieves the small area, high performance, and low power consumption of hardware, as well as the flexibility and programmability of software needed to support multiple standards.

    Also, to support higher user data rates of up to 24 Mb/s, a radix-4 log-MAP turbo decoder for 3GPP-HSDPA has been introduced in [9]. The log-MAP SISO decoder processes two received symbols per clock cycle using a windowed radix-4 architecture, doubling the throughput for a given clock rate over a similar radix-2 architecture.
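    The idea behind radix-4 processing can be sketched in software: two consecutive trellis stages are merged into one step with four branches leaving each state, so the state metrics advance by two symbols per step. The example below uses a max-log forward recursion on a toy 8-state shift-register trellis with random integer branch metrics. The trellis, the metrics, and all names are assumptions for illustration and do not model the encoder of any standard or the architecture in [9].

```python
import random

NUM_STATES = 8            # constraint length 4 gives 2^(4-1) = 8 states
NEG_INF = float("-inf")

def next_state(state, bit):
    """Toy shift-register trellis (an assumption, not a standard's encoder)."""
    return ((state << 1) | bit) & (NUM_STATES - 1)

def radix2_step(alpha, gamma):
    """One max-log forward step; gamma[state][bit] is the branch metric."""
    new = [NEG_INF] * NUM_STATES
    for s in range(NUM_STATES):
        for u in (0, 1):
            ns = next_state(s, u)
            new[ns] = max(new[ns], alpha[s] + gamma[s][u])
    return new

def radix4_step(alpha, gamma_a, gamma_b):
    """Two stages merged into one step: four branches leave each state."""
    new = [NEG_INF] * NUM_STATES
    for s in range(NUM_STATES):
        for u1 in (0, 1):
            mid = next_state(s, u1)
            for u2 in (0, 1):
                ns = next_state(mid, u2)
                metric = alpha[s] + gamma_a[s][u1] + gamma_b[mid][u2]
                new[ns] = max(new[ns], metric)
    return new

def demo(num_stages=6, seed=1):
    """Run both recursions over the same metrics and return both results."""
    rng = random.Random(seed)
    gammas = [[[rng.randint(-8, 8) for _ in (0, 1)] for _ in range(NUM_STATES)]
              for _ in range(num_stages)]
    a2 = [0] * NUM_STATES
    for g in gammas:
        a2 = radix2_step(a2, g)
    a4 = [0] * NUM_STATES
    for k in range(0, num_stages, 2):
        a4 = radix4_step(a4, gammas[k], gammas[k + 1])
    return a2, a4

if __name__ == "__main__":
    a2, a4 = demo()
    print(a2 == a4)
```

    Both recursions produce identical state metrics, but the radix-4 version takes half as many steps, which is where the throughput doubling at a fixed clock rate comes from. The price is a wider add-compare-select operation per step.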

    1.3 Contributions

    The major contribution of this dissertation is to present algorithmic modifications for low-complexity hardware implementation, architectural solutions, and several optimizations for high-performance turbo decoding capable of supporting two 4G communication standards, as illustrated in Figure 1.2 and Figure 1.3. The contributions can be categorized as follows.

    The first one is the energy-efficient SISO decoding structure for nonbinary turbo decoders. With the border metric encoding scheme, the complex dummy calculation in nonbinary turbo decoding can be avoided at the cost of a small memory that holds encoded border metrics. Because the border memory is small and accessed infrequently, the energy consumed for SISO decoding is greatly reduced.

    The second one is the bit-level extrinsic information exchange. To reduce the memory size required for double-binary turbo decoding, a new method to convert symbol-level extrinsic information to bit-level information and vice versa is presented. By exchanging bit-level extrinsic information rather than symbol-level extrinsic information, the number of extrinsic information values to be exchanged in double-binary turbo decoding is reduced to the same amount as in single-binary turbo decoding. Compared to the bitwise approximation of extrinsic information in [14], the proposed method does not require any modification of the conventional double-binary SISO decoder structure; it deals only with the symbol-to-bit and bit-to-symbol conversion of the extrinsic information for the double-binary turbo code. Since the extrinsic information memory is large, the proposed method is effective in reducing the total memory size needed in a double-binary turbo decoder with negligible performance degradation.
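    To make the memory saving concrete, the following sketch shows one generic way to move between symbol-level and bit-level values for a bit pair (a, b). It uses max-log marginalization for the symbol-to-bit direction and an independence assumption for the bit-to-symbol direction. This is an illustrative stand-in and not the conversion derived later in this thesis. The metric ordering [L00, L01, L10, L11] and the function names are assumptions.

```python
def symbol_to_bit(sym):
    """Convert symbol metrics [L00, L01, L10, L11] into two bit-level values.

    Each bit value is (best metric with the bit equal to 1) minus
    (best metric with the bit equal to 0), a max-log marginalization.
    """
    l_a = max(sym[2], sym[3]) - max(sym[0], sym[1])
    l_b = max(sym[1], sym[3]) - max(sym[0], sym[2])
    return l_a, l_b

def bit_to_symbol(l_a, l_b):
    """Rebuild symbol metrics relative to symbol 00, treating the bits as independent."""
    return [0.0, l_b, l_a, l_a + l_b]

if __name__ == "__main__":
    # Metrics built from independent bits survive the round trip.
    print(symbol_to_bit(bit_to_symbol(3.0, -2.0)))
    # General symbol metrics do not: two values cannot carry three.
    original = [0.0, 1.5, -0.5, 2.0]
    print(bit_to_symbol(*symbol_to_bit(original)))
```

    Relative to symbol 00, three symbol-level values are stored per pair, whereas the bit-level form stores two, one per bit, which matches the count for single-binary decoding. The second call in the example shows that such a conversion is lossy in general.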

    Figure 1.2: Research Overview
    [Figure labels: Algorithm, Architecture, Implementation; Nonbinary Max-log-MAP, Border Metric Encoding, Bit-level Extrinsic Info., ARP/QPP Interleaving, Time-Multiplexing, Parallel Turbo Decoding, Unified SISO Decoding, Memory Sharing, 2 Chips in 130nm CMOS, Interconnect Issue, Speed / Area Tradeoff]

    The third one is the whole decoder architecture of the double-binary circular turbo decoder for Mobile WiMAX. To lower the overall hardware complexity, in addition to the above methods, a dedicated hardware interleaver is designed. By generating the interleaved addresses on the fly, the proposed turbo decoder achieves small area and low power consumption, since there is no need for a large interleaver memory. Also, to reduce the critical path delay, a retimed architecture for double-binary SISO decoding is presented. In addition, to avoid unnecessary iterations under good channel conditions, a simple early stopping criterion for the double-binary turbo decoder is presented. The proposed stopping criterion uses the sign values of the incoming bit-level extrinsic information and the hard-decision values.
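    A sign-based stopping test of this kind can be sketched as follows. The exact rule used in the thesis is not given in this chapter, so the comparison below, which stops when every extrinsic sign agrees with the corresponding hard decision, is an assumed reading, and the function name and the convention that a positive value means bit 1 are hypothetical.

```python
def should_stop(extrinsic_llrs, hard_decisions):
    """Return True when every extrinsic sign agrees with its hard decision.

    Assumed convention: a positive value maps to bit 1, otherwise bit 0.
    """
    if len(extrinsic_llrs) != len(hard_decisions):
        raise ValueError("inputs must have the same length")
    for llr, bit in zip(extrinsic_llrs, hard_decisions):
        sign_bit = 1 if llr > 0 else 0
        if sign_bit != bit:
            return False      # at least one disagreement: keep iterating
    return True

if __name__ == "__main__":
    print(should_stop([1.2, -0.5, 3.0], [1, 0, 1]))
    print(should_stop([1.2, 0.5, 3.0], [1, 0, 1]))
```

    Only sign bits are compared, so in hardware such a check reduces to XOR gates and an OR reduction over the frame, with no extra arithmetic.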

    Finally, to support multiple 4G mobile communication systems such as Mobile WiMAX and 3GPP-LTE, which require high-speed data transmission, a unified parallel radix-4 turbo decoder architecture is proposed.
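    On-the-fly address generation, mentioned above for the dedicated hardware interleaver, can be illustrated with the QPP interleaver used by 3GPP-LTE, whose address is (f1·i + f2·i²) mod K. The sketch below replaces the multiplications with a running sum, which is one common way to compute such addresses incrementally. It is not the interleaver design of this thesis. The parameters K = 40, f1 = 3, f2 = 10 are the values for the smallest LTE block size, and the function names are illustrative.

```python
def qpp_direct(k, f1, f2):
    """Reference: evaluate (f1*i + f2*i^2) mod k for every index i."""
    return [(f1 * i + f2 * i * i) % k for i in range(k)]

def qpp_incremental(k, f1, f2):
    """Generate the same addresses with additions only.

    Uses pi(i+1) - pi(i) = f1 + f2*(2i + 1), so the step itself
    grows by 2*f2 at every index.
    """
    addr = 0
    step = (f1 + f2) % k
    step_inc = (2 * f2) % k
    out = []
    for _ in range(k):
        out.append(addr)
        addr = (addr + step) % k
        step = (step + step_inc) % k
    return out

if __name__ == "__main__":
    direct = qpp_direct(40, 3, 10)
    incremental = qpp_incremental(40, 3, 10)
    print(direct == incremental)
    print(sorted(incremental) == list(range(40)))
```

    Because both running values stay below K, each modulo can be implemented as a compare and subtract, so no multiplier, divider, or address table is needed.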

    Figure 1.3: Proposed Solutions for Nonbinary CTC Decoder Implementation
    [Figure labels: Bit-level Extrinsic Info., Memory Sharing, Parallel Turbo Decoding, Register Retiming, Hardware Interleaver, Border Metric Encoding, Branch Metric Optimization, loss ~ 0.15 dB, No Error Floor, WiMAX / 3GPP-LTE, for Radix-4 Processing, w/o memory]

  • Chapter 2

    Backgrounds

    The efficient design of a communication system that enables reliable high-speed service is challenging. Efficient design refers to the efficient use of the primary communication resources, namely, power and bandwidth. The reliability of such a system is usually measured by the signal-to-noise ratio (SNR) required to achieve a specific error rate. A bandwidth-efficient communication system that is as reliable as possible while using as low an SNR as possible is therefore desired.

    Error correction coding (ECC) is a technique that improves the reliability of

    communication over a noisy channel. The use of the appropriate ECC allows a

    communication system to operate at very low error rates, using low to moderate SNR

    values, enabling reliable high-speed multimedia services over a noisy channel.

    2.1 Digital Communication System

    The information source generates a message containing information that is to be

    transmitted to the receiver. In a digital communication system, shown in Figure 2.1, the

    outputs of the information source are converted into a sequence of bits. This sequence of

    bits might contain too much redundancy. Ideally, the source encoder removes redundancy

    and represents the source output sequence with as few bits as possible. Note that the

    redundancy in the source is different from the redundancy inserted intentionally by the

    error correcting code.

    The encrypter encodes the data for security purposes. Encryption is the most effective way to achieve data security. The three components, information source, source encoder and encrypter, can be seen as a single component called the source. The binary sequence is the output of the source. The number of bits the source generates per second is the data rate, in units of bits per second (bps or bits/s).

    The primary goal of the channel encoder is to increase the reliability of transmission

    within the constraints of signal power, system bandwidth and computational complexity.

    This can be achieved by introducing structured redundancy into transmitted signals.

    Channel coding is used in digital communication systems to correct transmission errors

    caused by noise, fading and interference. The channel encoder assigns to each message a

    longer message called a codeword. This usually results in either a lower data transmission

    rate or increased channel bandwidth relative to an un-coded system. To make the

    communication system less vulnerable to channel impairments, the channel encoder

    generates codewords that are as different as possible from one another.
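    As a small illustration of structured redundancy, the sketch below encodes 4-bit messages with a systematic (7,4) Hamming code and checks by brute force how different the codewords are from one another. The code is chosen only because it is small enough to enumerate. It is unrelated to the turbo codes discussed in this thesis, and the names are illustrative.

```python
from itertools import product

# Parity contribution of each message bit in a systematic (7,4) Hamming code.
PARITY_ROWS = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)]

def encode(msg):
    """Map a 4-bit message to a 7-bit codeword: message bits, then 3 parity bits."""
    parity = [0, 0, 0]
    for bit, row in zip(msg, PARITY_ROWS):
        if bit:
            parity = [p ^ r for p, r in zip(parity, row)]
    return tuple(msg) + tuple(parity)

def hamming_distance(x, y):
    """Number of positions in which two codewords differ."""
    return sum(a != b for a, b in zip(x, y))

def minimum_distance():
    """Smallest distance between any two distinct codewords."""
    words = [encode(m) for m in product((0, 1), repeat=4)]
    return min(hamming_distance(x, y)
               for i, x in enumerate(words)
               for y in words[i + 1:])

if __name__ == "__main__":
    print(encode((1, 0, 0, 0)))
    print(minimum_distance())
```

    Each 4-bit message becomes a 7-bit codeword, so the code rate is 4/7, which is the lower data rate mentioned above. In return, any two codewords differ in at least three positions, so a single bit error can be corrected.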

    Since the transmission medium is a waveform medium, the sequence of bits generated

    by the channel encoder cannot be transmitted directly through this medium. The main

    goals of modulation are not only to match the signal to the transmission medium, enable

    simultaneous transmission of a number of signals over the same physical medium and

    increase the data rate, but also to achieve this by the efficient use of the two primary

    resources of a communication system, namely, transmitted power and channel bandwidth.

    A communication channel refers to the combination of physical medium (copper wires, radio medium or optical fiber) and electronic or optical devices (equalizers, amplifiers) that are part of the path followed by a signal, as shown in Figure 2.1. Channel noise, fading and interference corrupt the transmitted signal and cause errors in the received signal. This thesis considers only AWGN-type channels, which ultimately limit system performance. Note that many interference sources and background noise can be modeled as AWGN due to the central limit theorem.

    Figure 2.1: Model of a digital communication system

    At the receiving end of the communication system, the demodulator processes the

    channel-corrupted transmitted waveform and makes a hard or soft decision on each symbol.

    If the demodulator makes a hard decision, its output is a binary sequence and the

    subsequent channel decoding process is called hard-decision decoding. A hard decision in

    the demodulator results in some irreversible information loss. If the demodulator passes the

    soft output of the matched filter to the decoder, the subsequent channel decoding process is

    called soft-decision decoding.

    The channel decoder works separately from the modulator/demodulator and has the

    goal of estimating the output of the source encoder based on the encoder structure and a

    decoding algorithm. In general, with soft-decision decoding, approximately 2 dB and 6 dB

    of coding gain with respect to hard-decision decoding can be obtained in AWGN channels

    and fading AWGN channels, respectively.

    If encryption is used, the decrypter converts encrypted data back into its original form.

    The source decoder transforms the sequence at its input based on the source encoding rule

    into a sequence of data, which will be used by the information sink to construct an estimate

    of the message. These three components, decrypter, source decoder and information sink

    can be represented as a single component called the sink, as far as the rest of the

    communication system is concerned. The binary sequence is the input to the sink.

    2.2 Introduction to Turbo Codes

    It is well known from information theory that a random code of sufficient length is

    capable of approaching the Shannon limit, provided one uses maximum likelihood (ML)

    decoding. Unfortunately, the complexity of ML decoding increases with the size of

    codeword up to the point where decoding becomes impractical. Thus, a practical decoding

    of long codes requires that the code possess some structure. Coding theorists have been

    trying to develop codes that combine two seemingly conflicting principles: (a)

    randomness, to achieve high coding gain and so approach the Shannon limit, and (b)

    structure to make decoding practical. In 1993, Berrou et al. introduced a new coding

    scheme that combines these two seemingly conflicting principles in an elegant way. They

    introduced randomness through an interleaver and structure by employing parallel

    concatenated convolutional codes. These codes are called turbo codes and offer an

    excellent tradeoff between complexity and error correcting capability. Concatenated codes

    are very powerful error correcting codes that are capable of closely approaching the

    Shannon limit by using iterative decoding [1].

    2.2.1 Turbo Code Encoder Structure

    A turbo code encoder consists of three building blocks: constituent encoders,

    interleavers and a puncturing unit. The constituent encoders are used in parallel and each

    interleaver scrambles the information symbols before feeding them into the corresponding

    constituent encoder. The puncturing unit is used to achieve higher code rates. In general,

    turbo codes can have more than two parallel constituent convolutional encoders, where

    each encoder is fed with a scrambled version of the information symbol u. Figure 2.2(a)

    shows the general architecture of turbo codes, where the outputs u and P_i (i = 1, ..., F) are

    known as the systematic part and the parity part, respectively. In practice, most

    applications use only two constituent encoders where only the input to the second encoder

    is scrambled as shown in Figure 2.2(b).

    Figure 2.2: Turbo code encoder structure. (a) General structure of turbo codes. (b)

    Typical structure of turbo codes.

    2.2.1.1 The Constituent Encoders

    Turbo codes use recursive systematic convolutional (RSC) encoders. The use of

    recursive or feedback encoders prevents the encoders from being driven back to the all-zero

    state by zero symbols. Since u is permuted before entering ENC2, it is likely that one of

    the RSC code outputs will have high weight. This discussion does not mean that turbo

    codes exhibit very high minimum distances. In fact, achieving high minimum distances

    requires the use of a well designed interleaver of sufficient length. Finding such an

    interleaver is not trivial. The systematic part helps the iterative decoding to provide better

    convergence. Note that the systematic part prevents the turbo codes from being

    catastrophic if no data puncturing is involved. If the systematic part is punctured, two

    different input sequences can produce the same codeword making the codes catastrophic.

    Since repetition codes are not good codes, the systematic part from only one of the

    constituent encoders is transmitted.

    2.2.1.2 Interleaving

    Interleaving refers to the process of permuting symbols in the information sequence

    before it is fed to the second constituent encoder. The primary function of the interleaver is

    the creation of a code with good distance properties. Note that interleaving alone cannot

    achieve good distance properties unless it is used together with recursive constituent

    encoders. De-interleaving acts on the interleaved information sequence and restores the

    sequence to its original order.

    Achieving good distance properties is a common criterion for interleaver design. This

    fits very well with the concept of maximum likelihood (ML) decoding. Unfortunately,

    turbo decoding is not guaranteed to perform a ML decoding, because of the independence

    assumption made on the sequence to be decoded and the probabilistic information (known

    as extrinsic information) passed between constituent decoders. This suggests an additional

    design criterion based on the correlation between the extrinsic information.

    2.2.1.3 Puncturing

    Puncturing refers to the process of removing certain bits from the codeword. The

    purpose of puncturing is to increase the overall code rate. It is common to puncture only

    the parity symbols of the first and second encoders.
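
    As a minimal sketch of the idea, the following Python fragment punctures a rate-1/3

    codeword to rate 1/2 by keeping every systematic bit and alternating between the two

    parity streams. The alternating pattern and the function names puncture and code_rate

    are assumptions chosen for illustration; they are not taken from any specific standard.

```python
def puncture(systematic, parity1, parity2):
    """Keep every systematic bit and alternate between the two parity streams.

    Assumed pattern (illustrative only): parity1 at even k, parity2 at odd k.
    """
    out = []
    for k, (s, p1, p2) in enumerate(zip(systematic, parity1, parity2)):
        out.append(s)
        out.append(p1 if k % 2 == 0 else p2)
    return out


def code_rate(num_info_bits, num_tx_bits):
    """Overall code rate = information bits / transmitted bits."""
    return num_info_bits / num_tx_bits


systematic = [1, 0, 1, 1]
parity1 = [1, 1, 0, 0]
parity2 = [0, 1, 1, 0]

unpunctured_rate = code_rate(4, 3 * len(systematic))   # all three streams sent
transmitted = puncture(systematic, parity1, parity2)
punctured_rate = code_rate(4, len(transmitted))         # half of the parity removed
```

    Only parity bits are dropped here, which matches the common practice described above.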

    2.2.2 Turbo Decoding

    The iterative turbo decoder consists of two component decoders serially concatenated

    via an interleaver, identical to the one in the encoder, as shown in Figure 2.3.

    The first SISO decoder takes as input the received information sequence $y_k^s$ and the

    received parity sequence generated by the first encoder, $y_k^{p1}$. The decoder then produces

    extrinsic information, denoted as $L_{e1}$, which is interleaved and used to produce an

    improved estimate of the a priori probabilities of the information sequence for the second

    decoder.

    The other two inputs to the second SISO decoder are the interleaved received

    information sequence $\tilde{y}_k^s$ and the received parity sequence produced by the second

    encoder, $y_k^{p2}$. The second SISO decoder also produces extrinsic information, $L_{e2}$, which is

    used to improve the estimate of the a priori probabilities for the information sequence at

    the input of the first SISO decoder. The decoder performance can be improved by this

    iterative operation relative to a single operation serial concatenated decoder. The feedback

    loop is a distinguishing feature of this decoder, and the name turbo code refers to

    the principle of the turbo engine. After a certain number of iterations, the soft

    outputs of both SISO decoders stop producing further performance improvements. Then

    the last stage of decoding makes a hard decision after de-interleaving the log-likelihood

    ratio (LLR), denoted as $L_r$.

    2.2.3 Decoding Algorithm for Turbo Codes

    Turbo codes require SISO decoders to generate extrinsic information and LLRs. Either

    the maximum a posteriori (MAP) algorithm [1] or the soft-output Viterbi algorithm

    (SOVA) can be used for the component decoders. MAP-based turbo decoders generally

    have much better performance than SOVA-based turbo decoders. In this work, we focus

    on the MAP algorithm.

    2.2.3.1 MAP Algorithm

    Let $u = (u_1, u_2, \ldots, u_N)$ be a set of binary variables representing information bits, where

    $N$ denotes the frame size. In the systematic encoders, one of the outputs, $x^s = (x_1^s, x_2^s, \ldots, x_N^s)$,

    is identical to the information sequence $u$. The other is the parity information sequence

    output $x^p = (x_1^p, x_2^p, \ldots, x_N^p)$. The noisy versions of the outputs are $y^s = (y_1^s, y_2^s, \ldots, y_N^s)$

    and $y^p = (y_1^p, y_2^p, \ldots, y_N^p)$. Let $R_1^N = (R_1, R_2, \ldots, R_k, \ldots, R_N)$ denote the received

    sequence, where $R_k = (y_k^s, y_k^p)$.

    We assume that binary phase shift keying (BPSK) modulation is used to map each binary

    symbol into a signal from the { +1, -1} modulation signal set. In the MAP decoder, the

    decoder decides whether uk = +1 or uk = -1 depending on the following log-likelihood ratio

    (LLR).

    \[ L_R(u_k) = \log \frac{P(u_k = +1 \mid R_1^N)}{P(u_k = -1 \mid R_1^N)} \tag{2.1} \]

    In the final operation, the decoder makes a hard decision by comparing LR(uk) to a

    threshold equal to zero, as shown in the expression (2.2).

    \[ \hat{u}_k = \begin{cases} 1 & \text{if } L_R(u_k) \ge 0 \\ 0 & \text{otherwise} \end{cases} \tag{2.2} \]

    Figure 2.3: A turbo decoder structure

    We can compute the APPs in (2.1) as

    \[ P(u_k = +1 \mid R_1^N) = \sum_{U^+} P(S_{k-1} = s', S_k = s \mid R_1^N) = \frac{\sum_{U^+} P(S_{k-1} = s', S_k = s, R_1^N)}{P(R_1^N)} \tag{2.3} \]

    where $S_k$ is the encoder state at time $k$, $U^+$ is the set of pairs $(s', s)$ for the state transitions

    $(S_{k-1} = s') \to (S_k = s)$ which correspond to the event $u_k = +1$, and $U^-$ is similarly defined.

    Also

    \[ P(u_k = -1 \mid R_1^N) = \sum_{U^-} P(S_{k-1} = s', S_k = s \mid R_1^N) = \frac{\sum_{U^-} P(S_{k-1} = s', S_k = s, R_1^N)}{P(R_1^N)} \tag{2.4} \]

    The log-likelihood ratio LLR is then

    \[ L_R(u_k) = \log \frac{\sum_{U^+} P(S_{k-1} = s', S_k = s, R_1^N)}{\sum_{U^-} P(S_{k-1} = s', S_k = s, R_1^N)} \tag{2.5} \]

    By several applications of Bayes rule, we have

    \[
    \begin{aligned}
    P(s', s, R_1^N) &= P(s', s, R_1^{k-1}, R_k, R_{k+1}^N) \\
    &= P(R_{k+1}^N \mid s', s, R_1^{k-1}, R_k)\, P(s', s, R_1^{k-1}, R_k) \\
    &= P(R_{k+1}^N \mid s', s, R_1^{k-1}, R_k)\, P(s, R_k \mid s', R_1^{k-1})\, P(s', R_1^{k-1}) \\
    &= P(R_{k+1}^N \mid s)\, P(s, R_k \mid s')\, P(s', R_1^{k-1}) \\
    &= \beta_k(s)\, \gamma_k(s', s)\, \alpha_{k-1}(s')
    \end{aligned}
    \tag{2.6}
    \]

    The log-likelihood ratio LLR can be written as

    \[ L_R(u_k) = \ln \frac{\sum_{U^+} \alpha_{k-1}(s')\, \gamma_k(s', s)\, \beta_k(s)}{\sum_{U^-} \alpha_{k-1}(s')\, \gamma_k(s', s)\, \beta_k(s)} \tag{2.7} \]

    where $\alpha_{k-1}(s')$ is the forward metric, $\beta_k(s)$ is the backward metric and $\gamma_k(s', s)$ is the

    branch metric. They are defined as

    \[ \alpha_k(s) = P(S_k = s, R_1^k) \tag{2.8} \]

    \[ \gamma_k(s', s) = P(S_k = s, R_k \mid S_{k-1} = s') \tag{2.9} \]

    \[ \beta_k(s) = P(R_{k+1}^N \mid S_k = s) \tag{2.10} \]

    We can obtain $\alpha_k(s)$ defined in (2.8) as

    \[
    \begin{aligned}
    \alpha_k(s) &= P(s, R_1^k) \\
    &= \sum_{s'} P(s', s, R_1^k) \\
    &= \sum_{s'} P(s, R_k \mid s', R_1^{k-1})\, P(s', R_1^{k-1}) \\
    &= \sum_{s'} P(s, R_k \mid s')\, P(s', R_1^{k-1}) \\
    &= \sum_{s'} \gamma_k(s', s)\, \alpha_{k-1}(s')
    \end{aligned}
    \tag{2.11}
    \]

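    We can obtain $\beta_k(s)$ defined in (2.10) as

    \[
    \begin{aligned}
    \beta_{k-1}(s') &= P(R_k^N \mid s') \\
    &= \sum_{s} P(s, R_k^N \mid s') \\
    &= \sum_{s} P(R_{k+1}^N \mid s', s, R_k)\, P(s, R_k \mid s') \\
    &= \sum_{s} P(R_{k+1}^N \mid s)\, P(s, R_k \mid s') \\
    &= \sum_{s} \beta_k(s)\, \gamma_k(s', s)
    \end{aligned}
    \tag{2.12}
    \]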

    The recursion for $\alpha_k(s)$ is initialized according to

    \[ \alpha_0(s) = \begin{cases} 1 & s = 0 \\ 0 & s \ne 0 \end{cases} \tag{2.13} \]

    which makes the reasonable assumption that the component encoder is initialized to the

    zero state. The recursion for $\beta_k(s)$ is initialized according to

    \[ \beta_N(s) = \begin{cases} 1 & s = 0 \\ 0 & s \ne 0 \end{cases} \tag{2.14} \]

    which assumes that termination bits have been appended at the end of the data word so

    that the component encoder is again in state zero at time $N$.

    All that remains at this point is the computation of $\gamma_k(s', s) = P(s, R_k \mid s')$. Observe that

    $\gamma_k(s', s)$ may be written as

    \[
    \begin{aligned}
    \gamma_k(s', s) &= \frac{P(s', s)}{P(s')} \cdot \frac{P(s', s, R_k)}{P(s', s)} \\
    &= P(s \mid s')\, P(R_k \mid s', s) \\
    &= P(u_k)\, P(R_k \mid u_k)
    \end{aligned}
    \tag{2.15}
    \]

    where the event $u_k$ corresponds to the event $s' \to s$. Note that $P(s \mid s') = P(s' \to s) = 0$ if $s$ is

    not a valid state from state $s'$. Hence, $\gamma_k(s', s) = 0$ if $s' \to s$ is not valid and, otherwise,

    \[ \gamma_k(s', s) = \frac{P(u_k)}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{\| y_k - x_k \|^2}{2\sigma^2} \right] \tag{2.16} \]

    where it is assumed that codes are transmitted on an AWGN channel and $\sigma^2$ is the noise

    variance.

    2.2.3.2 Max-log-MAP Algorithm

    In order to avoid the complexity of multiplication and division operation in (2.7), (2.11),

    and (2.12), the computations are converted into logarithmic domain. The metrics in the

    new domain are defined as follows:

    \[ \bar{\alpha}_k(s) = \ln(\alpha_k(s)) \tag{2.17} \]

    \[ \bar{\beta}_k(s) = \ln(\beta_k(s)) \tag{2.18} \]

    \[ \bar{\gamma}_k(s', s) = \ln(\gamma_k(s', s)) \tag{2.19} \]

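    The expression (2.17) is rewritten as

    \[
    \begin{aligned}
    \bar{\alpha}_k(s) &= \ln(\alpha_k(s)) \\
    &= \ln\Big( \sum_{s'} \alpha_{k-1}(s')\, \gamma_k(s', s) \Big) \\
    &= \ln\Big( \sum_{s'} \exp\big( \bar{\alpha}_{k-1}(s') + \bar{\gamma}_k(s', s) \big) \Big)
    \end{aligned}
    \tag{2.20}
    \]

    These log-domain forward metrics are initialized as

    \[ \bar{\alpha}_0(s) = \begin{cases} 0 & s = 0 \\ -\infty & s \ne 0 \end{cases} \tag{2.21} \]

    The expression (2.18) is rewritten as

    \[
    \begin{aligned}
    \bar{\beta}_{k-1}(s') &= \ln(\beta_{k-1}(s')) \\
    &= \ln\Big( \sum_{s} \exp\big( \bar{\beta}_k(s) + \bar{\gamma}_k(s', s) \big) \Big)
    \end{aligned}
    \tag{2.22}
    \]

    with initial conditions

    \[ \bar{\beta}_N(s) = \begin{cases} 0 & s = 0 \\ -\infty & s \ne 0 \end{cases} \tag{2.23} \]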

    under the assumption that the encoder has been terminated.

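    As before, $L_R(u_k)$ is computed as

    \[
    \begin{aligned}
    L_R(u_k) &= \ln \frac{\sum_{U^+} \alpha_{k-1}(s')\, \gamma_k(s', s)\, \beta_k(s)}{\sum_{U^-} \alpha_{k-1}(s')\, \gamma_k(s', s)\, \beta_k(s)} \\
    &= \ln\Big[ \sum_{U^+} \exp\big( \bar{\alpha}_{k-1}(s') + \bar{\gamma}_k(s', s) + \bar{\beta}_k(s) \big) \Big] \\
    &\quad - \ln\Big[ \sum_{U^-} \exp\big( \bar{\alpha}_{k-1}(s') + \bar{\gamma}_k(s', s) + \bar{\beta}_k(s) \big) \Big]
    \end{aligned}
    \tag{2.24}
    \]

    These expressions can be simplified by using the approximation

    \[ \max(x, y) \approx \ln(e^x + e^y) \tag{2.25} \]

    Given the max function, we may now rewrite (2.20), (2.22), and (2.24) as

    \[ \bar{\alpha}_k(s) = \max_{s'} \big[ \bar{\alpha}_{k-1}(s') + \bar{\gamma}_k(s', s) \big] \tag{2.26} \]

    \[ \bar{\beta}_{k-1}(s') = \max_{s} \big[ \bar{\beta}_k(s) + \bar{\gamma}_k(s', s) \big] \tag{2.27} \]

    \[
    \begin{aligned}
    L_R(u_k) &= \max_{U^+} \big[ \bar{\alpha}_{k-1}(s') + \bar{\gamma}_k(s', s) + \bar{\beta}_k(s) \big] \\
    &\quad - \max_{U^-} \big[ \bar{\alpha}_{k-1}(s') + \bar{\gamma}_k(s', s) + \bar{\beta}_k(s) \big]
    \end{aligned}
    \tag{2.28}
    \]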

    As shown in the above operations, the multiplications in the MAP are replaced by

    additions in the Max-log-MAP, which results in the low complexity of Max-log-MAP. The

    calculation of $\bar{\gamma}_k(s', s)$ will be given in Section 2.2.3.3.
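
    The size of the approximation in (2.25) can be checked numerically. The sketch below

    compares the exact Jacobian logarithm ln(e^x + e^y) with the plain max used by

    Max-log-MAP; the function names max_star and max_log are illustrative assumptions.

```python
import math


def max_star(x, y):
    """Exact Jacobian logarithm ln(e^x + e^y), written in a numerically stable form."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def max_log(x, y):
    """Max-log approximation of ln(e^x + e^y): the correction term is dropped."""
    return max(x, y)


def approximation_error(x, y):
    """Difference between the exact value and the max-log approximation."""
    return max_star(x, y) - max_log(x, y)
```

    The dropped correction term ln(1 + e^{-|x-y|}) lies between 0 and ln 2, and it is

    largest when the two arguments are equal.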

    2.2.3.3 Calculation of Branch Metrics and Extrinsic Information

    The extrinsic information takes the role of a priori information in the iterative decoding

    algorithm.

    \[ L_e(u_k) = \ln\left( \frac{P(u_k = +1)}{P(u_k = -1)} \right) \tag{2.29} \]

    The a priori term $P(u_k)$ shows up in (2.16) in an expression for $\gamma_k(s', s)$. In the log-

    domain, (2.16) becomes

    \[ \bar{\gamma}_k(s', s) = \ln P(u_k) - \ln(\sqrt{2\pi}\,\sigma) - \frac{\| y_k - x_k \|^2}{2\sigma^2} \tag{2.30} \]

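    Now observe that we may write from (2.29)

    \[
    \begin{aligned}
    P(u_k) &= \left( \frac{\exp[-L_e(u_k)/2]}{1 + \exp[-L_e(u_k)]} \right) \exp[u_k L_e(u_k)/2] \\
    &= A_k \exp[u_k L_e(u_k)/2]
    \end{aligned}
    \tag{2.31}
    \]

    where the first equality follows since it equals

    \[
    \begin{aligned}
    &\left( \frac{\sqrt{P_-/P_+}}{1 + P_-/P_+} \right) \sqrt{P_+/P_-} = P_+ \quad \text{when } u_k = +1 \text{ and} \\
    &\left( \frac{\sqrt{P_-/P_+}}{1 + P_-/P_+} \right) \sqrt{P_-/P_+} = P_- \quad \text{when } u_k = -1,
    \end{aligned}
    \tag{2.32}
    \]

    where we have defined $P_+ \equiv P(u_k = +1)$ and $P_- \equiv P(u_k = -1)$ for convenience.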
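
    The identity in (2.31) is easy to verify numerically. The following sketch rebuilds the

    a priori probability from the extrinsic LLR for both values of u_k; the function name

    prior_from_llr and the test probability 0.8 are assumptions made for illustration.

```python
import math


def prior_from_llr(u, le):
    """P(u_k = u) for u in {+1, -1}, rebuilt from L_e = ln(P+ / P-) as in (2.31)."""
    a_k = math.exp(-le / 2.0) / (1.0 + math.exp(-le))   # common factor A_k
    return a_k * math.exp(u * le / 2.0)


p_plus = 0.8                                  # assumed example value of P(u_k = +1)
le = math.log(p_plus / (1.0 - p_plus))        # extrinsic LLR from (2.29)
recovered_plus = prior_from_llr(+1, le)
recovered_minus = prior_from_llr(-1, le)
```

    Because A_k does not depend on u_k, it is common to all branches at time k and cancels

    when the LLR is formed.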

    Substitution of (2.31) into (2.30) yields

    \[ \bar{\gamma}_k(s', s) = \ln\left( \frac{A_k}{\sqrt{2\pi}\,\sigma} \right) + \frac{u_k L_e(u_k)}{2} - \frac{\| y_k - x_k \|^2}{2\sigma^2} \tag{2.33} \]

    where we will see that the first term may be ignored.

    Thus, the extrinsic information received from a companion decoder is included in the

    computation through the branch metric $\bar{\gamma}_k(s', s)$. The rest of the algorithm proceeds as before

    using equations (2.26), (2.27) and (2.28).

    Using the fact that

    \[
    \begin{aligned}
    \| y_k - x_k \|^2 &= (y_k^s - x_k^s)^2 + (y_k^p - x_k^p)^2 \\
    &= (y_k^s)^2 - 2 x_k^s y_k^s + (x_k^s)^2 + (y_k^p)^2 - 2 x_k^p y_k^p + (x_k^p)^2
    \end{aligned}
    \tag{2.34}
    \]

    and that only the terms dependent on $U^+$ or $U^-$, namely $2 x_k^s y_k^s$ and $2 x_k^p y_k^p$, survive after

    the subtraction in (2.26), (2.27) and (2.28), (2.33) is rewritten as follows.

    \[ \bar{\gamma}_k(s', s) = \frac{u_k L_e(u_k)}{2} + \frac{x_k^s y_k^s + x_k^p y_k^p}{\sigma^2} \tag{2.35} \]

    Given $L_C = 2/\sigma^2$, we have

    \[ \bar{\gamma}_k(s', s) = \frac{u_k L_e(u_k)}{2} + \frac{L_C}{2} x_k^s y_k^s + \frac{L_C}{2} x_k^p y_k^p \tag{2.36} \]

    Upon substitution of (2.36) into (2.28), we have

    \[
    \begin{aligned}
    L_R(u_k) &= \max_{U^+} \Big[ \bar{\alpha}_{k-1}(s') + \frac{u_k L_e(u_k)}{2} + \frac{L_C}{2} x_k^s y_k^s + \frac{L_C}{2} x_k^p y_k^p + \bar{\beta}_k(s) \Big] \\
    &\quad - \max_{U^-} \Big[ \bar{\alpha}_{k-1}(s') + \frac{u_k L_e(u_k)}{2} + \frac{L_C}{2} x_k^s y_k^s + \frac{L_C}{2} x_k^p y_k^p + \bar{\beta}_k(s) \Big]
    \end{aligned}
    \tag{2.37}
    \]

    Now note that $\frac{u_k L_e(u_k)}{2} + \frac{L_C}{2} x_k^s y_k^s = \frac{L_e(u_k)}{2} + \frac{L_C}{2} y_k^s$ under the first max(·) operation in

    (2.37) and $\frac{u_k L_e(u_k)}{2} + \frac{L_C}{2} x_k^s y_k^s = -\left( \frac{L_e(u_k)}{2} + \frac{L_C}{2} y_k^s \right)$ under the second max(·) operation,

    since $u_k = x_k^s = +1$ on $U^+$ and $u_k = x_k^s = -1$ on $U^-$. Using the

    definition of max(·), it is easy to see that these terms may be isolated out so that

    \[
    \begin{aligned}
    L_R(u_k) &= L_C y_k^s + L_e(u_k) + \max_{U^+} \Big[ \bar{\alpha}_{k-1}(s') + \frac{L_C}{2} x_k^p y_k^p + \bar{\beta}_k(s) \Big] \\
    &\quad - \max_{U^-} \Big[ \bar{\alpha}_{k-1}(s') + \frac{L_C}{2} x_k^p y_k^p + \bar{\beta}_k(s) \Big]
    \end{aligned}
    \tag{2.38}
    \]

    The interpretation of this new expression for $L_R(u_k)$ is that the first term is likelihood

    information received directly from the channel, the second is extrinsic likelihood

    information received from a companion decoder, and the third term ($\max_{U^+}[\cdot] - \max_{U^-}[\cdot]$) is

    extrinsic likelihood information to be passed to a companion decoder. Note that this third

    term is likelihood information gleaned from received parity not available to the companion

    decoder. Using the notation $L_{e,OUT}(u_k)$ for extrinsic information to be passed and $L_{e,IN}(u_k)$

    for extrinsic information received, we have

    \[ L_R(u_k) = L_{e,IN}(u_k) + L_C y_k^s + L_{e,OUT}(u_k) \tag{2.39} \]

    Extrinsic information which will be passed to the companion decoder is calculated as

    follows.

    \[ L_{e,OUT}(u_k) = L_R(u_k) - L_C y_k^s - L_{e,IN}(u_k) \tag{2.40} \]
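
    The recursions above can be collected into a small SISO decoder. The following Python

    sketch applies (2.26), (2.27), (2.28), (2.36) and (2.40) to a 4-state RSC code with feedback

    polynomial 1 + D + D^2 and parity polynomial 1 + D^2. The code, the bit mapping (1 to +1,

    0 to -1), the 0-based time index and the untermininated trellis (all end states equally

    likely) are assumptions chosen to keep the example short; they are not the configuration

    used later in this thesis.

```python
NEG_INF = float("-inf")
N_STATES = 4


def step(state, u):
    """One transition of the assumed 4-state RSC code; returns (next_state, parity)."""
    s1, s2 = state >> 1, state & 1
    a = u ^ s1 ^ s2              # feedback bit (1 + D + D^2)
    p = a ^ s2                   # parity bit (1 + D^2)
    return (a << 1) | s1, p


def bpsk(bit):
    """Assumed mapping: bit 1 -> +1, bit 0 -> -1."""
    return 1.0 if bit else -1.0


def encode(bits):
    """Return the BPSK systematic and parity sequences, starting from state 0."""
    state, xs, xp = 0, [], []
    for u in bits:
        state, p = step(state, u)
        xs.append(bpsk(u))
        xp.append(bpsk(p))
    return xs, xp


def max_log_map(ys, yp, le_in, lc):
    """Max-log-MAP SISO decoding; returns (LLRs, extrinsic outputs)."""
    n = len(ys)

    def gamma(k, s, u):
        # Log-domain branch metric of (2.36) for the branch leaving state s with input u.
        _, p = step(s, u)
        xu, xpar = bpsk(u), bpsk(p)
        return 0.5 * xu * le_in[k] + 0.5 * lc * (xu * ys[k] + xpar * yp[k])

    # Forward recursion (2.26); the encoder starts in state 0 as in (2.21).
    alpha = [[NEG_INF] * N_STATES for _ in range(n + 1)]
    alpha[0][0] = 0.0
    for k in range(n):
        for s in range(N_STATES):
            for u in (0, 1):
                nxt, _ = step(s, u)
                alpha[k + 1][nxt] = max(alpha[k + 1][nxt], alpha[k][s] + gamma(k, s, u))

    # Backward recursion (2.27); no termination, so every end state starts at 0.
    beta = [[0.0] * N_STATES for _ in range(n + 1)]
    for k in range(n - 1, -1, -1):
        for s in range(N_STATES):
            beta[k][s] = max(beta[k + 1][step(s, u)[0]] + gamma(k, s, u) for u in (0, 1))

    # LLR (2.28) and extrinsic output (2.40).
    llr, le_out = [], []
    for k in range(n):
        best = {0: NEG_INF, 1: NEG_INF}
        for s in range(N_STATES):
            for u in (0, 1):
                nxt, _ = step(s, u)
                best[u] = max(best[u], alpha[k][s] + gamma(k, s, u) + beta[k + 1][nxt])
        llr.append(best[1] - best[0])
        le_out.append(llr[k] - lc * ys[k] - le_in[k])
    return llr, le_out
```

    In a full turbo decoder, le_out from one SISO decoder would be interleaved (or

    de-interleaved) and supplied as le_in to the other.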

    2.3 Turbo code in Mobile WiMAX

    Mobile WiMAX is a rapidly growing broadband wireless access technology based on the

    IEEE 802.16 standard [3]. It utilizes Orthogonal Frequency Division Multiple Access

    (OFDMA) as the radio access method for improved multipath performance in

    non-line-of-sight (NLOS) environments and promises to deliver high data rates over large areas to a

    large number of users in the near future. This exciting addition to current broadband options

    such as DSL, cable, and Wi-Fi promises to rapidly provide broadband access to locations

    in the world's rural and developing areas where broadband is currently unavailable, as well

    as competing for urban market share.

    Recently, to improve system gain and non-line-of-sight (NLOS) coverage, the

    double-binary tail-biting convolutional turbo code (CTC) has been adopted in the IEEE 802.16

    standard (WiMAX) because of the following advantages over the classical single-binary turbo code.

    Double-binary turbo codes double the decoding rates in a hardware implementation,

    because they allow memory access of two bits at each time instant. The reason for

    such decoding rates is the fact that the extrinsic information, which must be passed

    to the next decoder after interleaving or de-interleaving, represents two bits at each

    time instant. Doubling the decoding rates leads to a reduction in the latency of the

    decoder by one half.

    Double-binary turbo codes reduce the sensitivity to puncturing. This can be

    explained as follows. Since the rate 1/2 double-binary recursive systematic

    convolutional (RSC) encoder produces two parity streams, most of the code rates

    can be obtained by simply ignoring one of these parity streams and puncturing the

    other (if necessary). Ignoring one of the two parity streams results in a new RSC

    encoder with a single parity stream. This single parity stream is less punctured

    compared to similar single-binary convolutional RSC encoders, which results in

    less sensitivity to puncturing.

    Double-binary turbo codes reduce the correlation effects between component

    decoders, which leads to improved convergence [3].

    For practical purposes, it is important to reduce the computational complexity of turbo

    decoding. An approach for reducing the computational complexity of maximum a

    posteriori (MAP) decoding [1] has been introduced in the previous section for single-

    binary turbo codes, where there are only two branches entering and leaving each state.

    As opposed to single-binary codes where only two branches enter and leave each state,

    in double-binary turbo codes there are four branches entering and leaving each state.

    2.3.1 Encoding

    The CRSC constituent encoder used by WiMAX is shown in Figure 2.4. The encoder is

    fed blocks of k message bits which are grouped into N = k/2 couples. In Figure 2.4, A

    represents the first bit of the couple, and B represents the second bit. The two parity bits

    are denoted W and Y. For ease of exposition, subscripts are left off the figure, but below a

    single subscript is used to denote the time index $k \in \{0, \ldots, N-1\}$ and an optional second

    subscript is used on the parity bits W and Y to indicate which of the two constituent encoders

    produced them.

    Let the vector $S_k = [S_{k,1}\; S_{k,2}\; S_{k,3}]^T$, $S_{k,m} \in \{0, 1\}$, denote the state of the encoder at time

    k. Note that although the inputs and outputs of the encoder are defined over GF(4), only

    binary values are stored within the shift register and thus the encoder has just eight states.

    The encoder state at time k+1 is related to the state at time k by

    \[ S_{k+1} = G S_k + X_k \tag{2.41} \]

    where

    \[ X_k = \begin{bmatrix} A_k + B_k \\ B_k \\ B_k \end{bmatrix}, \qquad G = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \tag{2.42} \]

    Because of the tailbiting nature of the code, the block must be encoded twice by each

    constituent encoder. During the first pass at encoding, the encoder is initialized to the

    all-zeros state, $S_0 = [0\; 0\; 0]^T$. After the block is encoded, the final state of the encoder $S_N$ is

    used to derive the circulation state

    \[ S_c = (I + G^N)^{-1} S_N \tag{2.43} \]

    where the above operations are over GF(2). In practice, the circulation state Sc can be

    found from SN by using a lookup table [3]. Once the circulation state is found, the data is

    encoded again. This time, the encoder is set to start in state Sc and will be guaranteed to

    also end in state Sc.
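
    The two-pass procedure can be sketched as follows. This Python fragment implements the

    state update (2.41) with the matrices of (2.42) and finds the circulation state by testing all

    eight candidates against (I + G^N) S_c = S_N, which is equivalent to (2.43). The brute-force

    search stands in for the lookup table mentioned above, and the function names are

    illustrative assumptions. For this G, seven zero-input updates return every state to itself,

    so I + G^N is not invertible when N is a multiple of 7 and the sketch rejects such lengths.

```python
from itertools import product

# State-transition matrix G of (2.42); all arithmetic is over GF(2).
G = ((1, 0, 1),
     (1, 0, 0),
     (0, 1, 0))


def next_state(state, a, b):
    """S_{k+1} = G S_k + X_k with X_k = [a + b, b, b]^T, as in (2.41)-(2.42)."""
    x = (a ^ b, b, b)
    return tuple(
        (sum(G[r][c] * state[c] for c in range(3)) + x[r]) % 2
        for r in range(3)
    )


def run(state, couples):
    """Feed a sequence of (A, B) couples through the state update."""
    for a, b in couples:
        state = next_state(state, a, b)
    return state


def circulation_state(couples):
    """Return S_c such that encoding from S_c also ends in S_c."""
    n = len(couples)
    if n % 7 == 0:
        raise ValueError("I + G^N is singular when N is a multiple of 7")
    s_n = run((0, 0, 0), couples)            # first pass from the all-zeros state
    zero_input = [(0, 0)] * n
    for cand in product((0, 1), repeat=3):
        g_n_cand = run(cand, zero_input)     # G^N applied to the candidate
        if tuple(c ^ g for c, g in zip(cand, g_n_cand)) == s_n:
            return cand
    raise RuntimeError("no circulation state found")
```

    Starting the second pass from the returned state gives a final state equal to the initial

    one, which is the tailbiting property used by the decoder.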

    The first encoder operates on the data in its natural order, yielding parity couples {Wk,1,

    Yk,1}. The second encoder operates on the data after it has been interleaved.

    2.3.2 Decoding

    Decoding is complicated by the fact that the constituent codes are double-binary and

    circular. As with conventional turbo codes, decoding involves the iterative exchange of

    extrinsic information between the two component decoders. While decoding can be

    performed in the probability domain, the log-domain is preferred since the low complexity

    Max-log-MAP algorithm can then be applied. Unlike the decoder for a single-binary turbo

    code, which can represent each binary symbol as a single log-likelihood ratio, the decoder

    for a double-binary code requires three log-likelihood ratios. For example, the likelihood

    ratios for message couple $(A_k, B_k)$ can be represented in the form

    \[ \Lambda_{a,b}(A_k, B_k) = \log \frac{P(A_k = a, B_k = b)}{P(A_k = 0, B_k = 0)} \tag{2.44} \]

    where (a, b) can be (0, 1), (1, 0), or (1, 1).

    Figure 2.4: Double-binary CRSC constituent encoder used by WiMAX

    An iterative decoder that can be used to decode the WiMAX turbo code is shown in

    Figure 2.5. The goal of each of the two constituent decoders is to update the set of log-

    likelihood ratios associated with each message couple. In the figure and in the following

    discussion, $\Lambda^{(i)}_{a,b}(A_k, B_k)$ denotes the set of LLRs corresponding to the message couple at

    the input of the decoder and $\Lambda^{(o)}_{a,b}(A_k, B_k)$ is the set of LLRs at the output of the decoder.

    Each decoder is provided with $\Lambda^{(i)}_{a,b}(A_k, B_k)$ along with the received values of the parity

    bits generated by the corresponding encoder (in LLR form). Using these inputs and

    knowledge of the code constraints, it is able to produce the updated LLRs $\Lambda^{(o)}_{a,b}(A_k, B_k)$ at

    its output.

    As with single-binary turbo codes, extrinsic information is passed to the other

    constituent decoder instead of the raw LLRs. This prevents the positive feedback of

    previously resolved information. Extrinsic information is found by simply subtracting the

    appropriate input LLR from each output LLR, as indicated in Figure 2.5.

    The extrinsic information that is passed between the two decoders must be interleaved

    or de-interleaved so that it is in the proper sequence at the input of the other decoder.

    Figure 2.5: A decoder for the WiMAX turbo code

    2.3.2.1 Max-log-MAP Algorithm for Decoding

    The extension of Max-log-MAP algorithms to the double-binary case is fairly

    straightforward. In the double-binary turbo codes, the three log-likelihood ratio outputs of

    the k-th symbol are expressed as follows.

    \[
    \begin{aligned}
    \Lambda_k^{(z)} &= \max_{(s_k \to s_{k+1},\, z)} \big[ \alpha_k(s_k) + \gamma_{k+1}(s_k, s_{k+1}) + \beta_{k+1}(s_{k+1}) \big] \\
    &\quad - \max_{(s_k \to s_{k+1},\, 00)} \big[ \alpha_k(s_k) + \gamma_{k+1}(s_k, s_{k+1}) + \beta_{k+1}(s_{k+1}) \big]
    \end{aligned}
    \tag{2.45}
    \]

    where z belongs to {01, 10, 11}, $s_k$ is the state of an encoder at time k, and $\alpha$, $\beta$ and

    $\gamma$ are the forward, backward, and branch metrics, respectively. The metrics are calculated

    as expressed in equations (2.46), (2.47) and (2.48), where A is the set of states at time k-1

    connected to state $s_k$, and B is the set of states at time k+1 connected to state $s_k$.

    \[ \alpha_k(s_k) = \max_{s_{k-1} \in A} \big( \alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k) \big) \tag{2.46} \]

    \[ \beta_k(s_k) = \max_{s_{k+1} \in B} \big( \beta_{k+1}(s_{k+1}) + \gamma_{k+1}(s_k, s_{k+1}) \big) \tag{2.47} \]

    \[
    \begin{aligned}
    \gamma_{k+1}(s_k, s_{k+1}) &= \ln\big( P(y_k \mid x_k)\, P(u_k = z) \big) \\
    &= \frac{L_c}{2} \big( x_k^{s1} y_k^{s1} + x_k^{s2} y_k^{s2} + x_k^{p1} y_k^{p1} + x_k^{p2} y_k^{p2} \big) + L_{e,IN}^{(z)}
    \end{aligned}
    \tag{2.48}
    \]

    where z belongs to {00, 01, 10, 11}, $u_k$ is the input symbol consisting of two bits, $P(u_k)$

    is the a priori probability of $u_k$, and $x_k$ and $y_k$ are the transmitted and received codewords

    associated with $u_k$, respectively. The superscripts p and s denote the parity bits and

    systematic bits, respectively. In (2.48), $L_{e,IN}^{(z)}$ is the extrinsic information received from the

    other SISO decoder and the code is assumed to be transmitted through an AWGN channel

    with a noise variance $\sigma^2$. Since the Max-log-MAP decoding algorithm is independent of

    the signal-to-noise ratio (SNR), $L_c = 2/\sigma^2$ is usually set to a constant value, although it

    can be obtained from channel estimation [8].

    After the turbo decoder has completed a fixed number of iterations or met some other

    convergence criteria, a final decision on the bits must be made. This is accomplished by

    computing the LLR of each bit in the couple (Ak, Bk) according to

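    \[
    \begin{aligned}
    \Lambda(A_k) &= \max\big( \Lambda_k^{(10)}, \Lambda_k^{(11)} \big) - \max\big( \Lambda_k^{(00)}, \Lambda_k^{(01)} \big) \\
    \Lambda(B_k) &= \max\big( \Lambda_k^{(01)}, \Lambda_k^{(11)} \big) - \max\big( \Lambda_k^{(00)}, \Lambda_k^{(10)} \big)
    \end{aligned}
    \tag{2.49}
    \]

    where $\Lambda_k^{(00)} = 0$. The hard bit decisions can be found by comparing each of these likelihood

    ratios to a threshold.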
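
    As a small worked example of (2.49), the sketch below converts the three symbol LLRs of

    one couple into two bit LLRs and then into hard decisions. The function names, the sample

    values and the threshold of zero are assumptions made for illustration.

```python
def bit_llrs(sym_llr):
    """Bit LLRs of (A_k, B_k) from symbol LLRs keyed by '01', '10', '11'.

    Each symbol LLR is relative to the couple 00, so the entry for '00' is fixed to 0.
    """
    lam = {"00": 0.0}
    lam.update(sym_llr)
    llr_a = max(lam["10"], lam["11"]) - max(lam["00"], lam["01"])
    llr_b = max(lam["01"], lam["11"]) - max(lam["00"], lam["10"])
    return llr_a, llr_b


def hard_decision(llr, threshold=0.0):
    """Assumed rule: decide 1 when the LLR exceeds the threshold, else 0."""
    return 1 if llr > threshold else 0


llr_a, llr_b = bit_llrs({"01": -1.0, "10": 2.0, "11": 0.5})
decided = (hard_decision(llr_a), hard_decision(llr_b))
```

    With these sample values the couple 10 has the largest symbol LLR, and the bit-level

    decisions agree with it.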

    2.4 Turbo code in 3GPP-LTE

    The newly evolved standard, 3GPP long term evolution (3GPP-LTE), which is the

    successor to the GSM/UMTS mobile standards, is considered to be a major step towards 4th

    generation (4G) mobile broadband systems. The channel coding in LTE involves a turbo

    code with an internal interleaver based on the quadratic permutation polynomial (QPP).
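
    A QPP interleaver maps index i to (f1 i + f2 i^2) mod K. The sketch below builds such a

    permutation and applies it in both directions. The parameter set K = 40, f1 = 3, f2 = 10

    is believed to correspond to the smallest block size of the LTE specification, but it should

    be treated here as an illustrative assumption, as should the function names.

```python
def qpp_permutation(k, f1, f2):
    """Return the list pi with pi[i] = (f1*i + f2*i^2) mod k."""
    return [(f1 * i + f2 * i * i) % k for i in range(k)]


def interleave(seq, perm):
    """Output position i takes the input element at position perm[i]."""
    return [seq[p] for p in perm]


def deinterleave(seq, perm):
    """Inverse of interleave: put element i back at position perm[i]."""
    out = [None] * len(perm)
    for i, p in enumerate(perm):
        out[p] = seq[i]
    return out


perm = qpp_permutation(40, 3, 10)     # assumed example parameters
data = list(range(100, 140))
restored = deinterleave(interleave(data, perm), perm)
```

    Because the permutation is given by a closed-form polynomial, addresses can be computed

    on the fly instead of being stored in a table.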

    2.4.1 Encoding

    Figure 2.6 shows the structure of a 3GPP-LTE turbo encoder. The transfer function of

    each component encoder is given as the following equation.

    \[ G(D) = \left[ 1, \; \frac{g_1(D)}{g_0(D)} \right] \tag{2.50} \]

    where $g_0(D) = 1 + D^2 + D^3$ and $g_1(D) = 1 + D + D^3$.

    The trellis diagram of a 3GPP-LTE turbo encoder is shown in Figure 2.7. A trellis

    diagram is a state diagram which explicitly shows all possible state transitions of the

    component encoder at each discrete time instant. The component encoder has 8 states.

    Since turbo codes are recursive, it is not possible to terminate the trellis by transmitting

    zero tail bits. Trellis termination means driving the encoder to the all-zero state. This is

    required at the end of each block to make sure that the initial state for the next block is the

    all-zero state. The tail bits depend on the state of the component encoder after N

    information bits. A simple solution to this problem is shown in Figure 2.6. A switch in each

    parallel component encoder is in position A for the first N clock cycles and in position

    B for 3 additional cycles. This will drive the encoder to the all-zero state. Trellis

    termination is based on setting the input to the first shift register to zero. This will flush the

    register with zeros after 3 shifts. The transmitted bits for trellis termination shall then be:

    \[ x^{s}_{N+1},\, x^{p1}_{N+1},\, x^{s}_{N+2},\, x^{p1}_{N+2},\, x^{s}_{N+3},\, x^{p1}_{N+3},\, x'^{s}_{N+1},\, x^{p2}_{N+1},\, x'^{s}_{N+2},\, x^{p2}_{N+2},\, x'^{s}_{N+3},\, x^{p2}_{N+3} \]

    where N is the number of bits, the first six tail bits come from the first component encoder

    and the last six (with the prime marking the systematic bits) come from the second.

    Figure 2.6: A Turbo Encoder for 3GPP-LTE

    Figure 2.7: A Trellis Diagram of a 3GPP-LTE Turbo Encoder
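
    The component encoder of (2.50) and its termination can be sketched in a few lines of

    Python. The register ordering (newest bit first), the function names and the return

    conventions are assumptions made for illustration; the polynomials are those of (2.50).

```python
def rsc_step(state, u):
    """One clock of the 8-state RSC encoder with g0 = 1 + D^2 + D^3, g1 = 1 + D + D^3.

    state = (s1, s2, s3), where s1 is the most recently stored register bit.
    Returns (next_state, parity_bit).
    """
    s1, s2, s3 = state
    a = u ^ s2 ^ s3          # input to the first register (feedback taps D^2, D^3)
    p = a ^ s1 ^ s3          # parity output (feedforward taps 1, D, D^3)
    return (a, s1, s2), p


def encode(bits):
    """Encode from the all-zero state; returns (parity_bits, final_state)."""
    state, parity = (0, 0, 0), []
    for u in bits:
        state, p = rsc_step(state, u)
        parity.append(p)
    return parity, state


def terminate(state):
    """Three tail steps with the switch in position B.

    The input is taken from the feedback, so the first register is fed with zero.
    Returns (tail_systematic, tail_parity, final_state).
    """
    tail_sys, tail_par = [], []
    for _ in range(3):
        u = state[1] ^ state[2]      # cancels the feedback, so a = 0 in rsc_step
        state, p = rsc_step(state, u)
        tail_sys.append(u)
        tail_par.append(p)
    return tail_sys, tail_par, state
```

    Each component encoder contributes three systematic and three parity tail bits, which

    gives the twelve termination bits listed above.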

    2.4.2 Decoding

    Decoding of single-binary turbo codes based on the MAP algorithm is

    described in Section 2.2.3. In this section, radix-4 single-binary turbo decoding based on

    Max-log-MAP is presented.

    2.4.2.1 Radix-4 Single-Binary Max-log-MAP Algorithm for Decoding

    By merging two trellis sections, the SISO decoder can process two bits per cycle

    [18][19]. Accordingly, the forward, backward, and branch metrics, denoted by $\alpha$, $\beta$, and $\gamma$,

    respectively, can be defined as follows.

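    $$\alpha_k(s_k) = \max_{s_{k-2},\, s_{k-1}} \left( \alpha_{k-2}(s_{k-2}) + \gamma_k(s_{k-2}, s_k) \right) \qquad (2.51)$$

    $$\beta_k(s_k) = \max_{s_{k+1},\, s_{k+2}} \left( \beta_{k+2}(s_{k+2}) + \gamma_{k+2}(s_k, s_{k+2}) \right) \qquad (2.52)$$

    $$\gamma_k(s_{k-2}, s_k) = \ln\left( P(\mathbf{y}_{k-1} \mid \mathbf{x}_{k-1})\, P(v_{k-1}) \right) + \ln\left( P(\mathbf{y}_k \mid \mathbf{x}_k)\, P(v_k) \right)$$

    $$\qquad = x^{s}_{k-1} y^{s}_{k-1} + x^{p}_{k-1} y^{p}_{k-1} + L_{e,IN}(v_{k-1}) + x^{s}_{k} y^{s}_{k} + x^{p}_{k} y^{p}_{k} + L_{e,IN}(v_k) \qquad (2.53)$$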

    where $s_k$ is the state of the encoder at time k and $v_k$ is the input bit. Also, $P(v_k)$ is the a priori

    probability of $v_k$, and $\mathbf{x}_k$ and $\mathbf{y}_k$ are the transmitted and received codewords associated with $v_k$,

    respectively. The superscripts p and s denote the parity bits and systematic bits,

    respectively. In (2.53), $L_{e,IN}(v_k)$ is the extrinsic information received from the other SISO

    decoder. As indicated in (2.46)-(2.48) and (2.51)-(2.53), radix-4 single-binary SISO

    decoding is almost the same as double-binary SISO decoding, which enables an

    efficient unified SISO decoder implementation for both.
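    The merging of two trellis sections can be checked numerically with a short sketch. The code below is an illustration under stated assumptions, not the decoder of this thesis: it assumes the 3GPP-LTE 8-state trellis (feedback 1 + D^2 + D^3, feedforward 1 + D + D^3), a bipolar mapping of 0 to -1 and 1 to +1, an extrinsic term that is added only on branches with input bit 1, and integer metrics. All function names are hypothetical.

```python
NEG_INF = float("-inf")


def step(state, u):
    """Return (next_state, parity) for input bit u in the assumed 8-state trellis."""
    s1, s2, s3 = (state >> 2) & 1, (state >> 1) & 1, state & 1
    a = u ^ s2 ^ s3                       # feedback 1 + D^2 + D^3
    parity = a ^ s1 ^ s3                  # feedforward 1 + D + D^3
    return (a << 2) | (s1 << 1) | s2, parity


def branch_metric(state, u, obs):
    """Radix-2 branch metric for one trellis section; obs = (y_sys, y_par, l_ext)."""
    y_sys, y_par, l_ext = obs
    _, parity = step(state, u)
    x_sys, x_par = 2 * u - 1, 2 * parity - 1      # assumed bipolar mapping
    return x_sys * y_sys + x_par * y_par + u * l_ext


def forward_radix2(alpha, obs):
    """One Max-log-MAP forward step over a single trellis section."""
    new = [NEG_INF] * 8
    for s in range(8):
        for u in (0, 1):
            nxt, _ = step(s, u)
            new[nxt] = max(new[nxt], alpha[s] + branch_metric(s, u, obs))
    return new


def forward_radix4(alpha, obs_a, obs_b):
    """One forward step over two merged sections: four branches enter each state."""
    new = [NEG_INF] * 8
    for s in range(8):
        for u1 in (0, 1):
            mid, _ = step(s, u1)
            for u2 in (0, 1):
                nxt, _ = step(mid, u2)
                merged = branch_metric(s, u1, obs_a) + branch_metric(mid, u2, obs_b)
                new[nxt] = max(new[nxt], alpha[s] + merged)
    return new


alpha0 = [0] * 8
obs_a, obs_b = (3, -2, 1), (-1, 4, -2)            # arbitrary integer test inputs
assert forward_radix4(alpha0, obs_a, obs_b) == forward_radix2(
    forward_radix2(alpha0, obs_a), obs_b)
```

    The merged branch metric is the sum of two radix-2 branch metrics, which matches the two groups of terms in (2.53), and the maximization runs over four incoming branches instead of two.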

  • 35

    Chapter 3

    Border Metric Encoding

    This chapter presents an energy-efficient soft-input soft-output (SISO) decoder based

    on border metric encoding, which is especially suitable for nonbinary circular / high-radix

    single-binary turbo codes. In the proposed method, the size of the branch memory is

    reduced to half and the dummy calculation is removed at the cost of a small-sized memory

    that holds encoded border metrics. Due to the infrequent accesses to the border memory

    and its small size, the energy consumed for SISO decoding is reduced by 25.3%.

    3.1 Radix-4 SISO Decoding

    As expressed in the previous chapter, the metric calculation complexity of the

    nonbinary / high-radix single-binary turbo codes is higher than that of the single-binary

    turbo codes. For the double-binary / radix-4 single-binary turbo codes, the number of

    branches connected to each trellis state is increased from two to four as shown in Figure

    3.1. Since a max operation with four operands can be implemented by using three max

    operations with two operands as shown in (3.1), the hardware complexity is almost three

    times higher than that of the classical single-binary turbo codes if the four-operand max

    operation is computed in a cycle.

    $$\max(a, b, c, d) = \max\left( \max(a, b),\ \max(c, d) \right) \qquad (3.1)$$

    It is possible to compute the four-operand max operation serially using a two-operand max

    operator, but this structure requires more than one cycles and additional buffers to hold the

    intermediate values. Moreover, the serial max computation results in severe throughput

    degradation, as the forward and backward metrics are recursively defined using the

  • 36

    previously calculated metrics. Compared to the single-binary SISO decoders [6][7], the

    wordlength of internal metrics should be increased in hardware implementation, as the

    number of terms to be added in the branch metric calculation is increased from three to

    five as expressed in (2.48).
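    The two ways of computing the four-operand max discussed above can be sketched as follows. The function names are hypothetical; the tree form mirrors (3.1), and the serial form models a single reused two-operand max.

```python
def max4_tree(a, b, c, d):
    """Three two-operand max operations in two levels, as in (3.1)."""
    return max(max(a, b), max(c, d))


def max4_serial(a, b, c, d):
    """One reused two-operand max: three dependent steps, one intermediate value."""
    result = a
    for operand in (b, c, d):
        result = max(result, operand)     # each step waits for the previous result
    return result


print(max4_tree(3, -1, 7, 2), max4_serial(3, -1, 7, 2))
```

    Both forms use three comparisons. The tree needs three comparators working within one cycle, while the serial form needs one comparator and three dependent steps, which is the throughput penalty noted above for recursively defined metrics.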

    3.1.1 Sliding Window for Nonbinary SISO Decoding

    The sliding window technique is effective in reducing the memory size required to

    store metric values. A large frame is split into a number of small windows and the MAP

    decoding is applied to each window independently [16]. Figure 3.2(a) shows the

    Figure 3.1: Trellis Diagrams for (a) Radix-4 Single-Binary Turbo Code in 3GPP-LTE and (b) Double-Binary Turbo Code in Mobile WiMAX

    (figure omitted: 8-state trellises, states 0 to 7, with four branches per state; branch labels are systematic bits / parity bits, $x^{s}_{k} x^{s}_{k+1} / x^{p}_{k} x^{p}_{k+1}$ in panel (a) and $x^{s1}_{k} x^{s2}_{k} / x^{p1}_{k} x^{p2}_{k}$ in panel (b))

  • 37

    conventional sliding window diagram where forward metrics are calculated prior to

    backward metrics [6]. In the sliding window technique, however, the initial values at the

    border of each window are also required. To obtain the reliable initial values of each

    window, the dummy calculation is performed for the backward metrics as shown in Figure

    3.2(a). If the window size is sufficiently long, the initial values obtained by the dummy

    Figure 3.2: Sliding window diagrams (a) with dummy calculation and (b) with border memory

    (figure omitted: trellis time in blocks of L, from L to 6L, against processing time from 0 to 10T, for N = 7L. Panel (a) shows the branch metric calculation, dummy backward metric calculation, forward metric calculation, and backward metric calculation with LLR calculation; the dummy calculation has an active duty ratio of 1. Panel (b) shows the branch metric calculation, forward metric calculation, backward metric calculation, and LLR calculation, together with the border metric store and load; the load and store addresses advance by one window at a time, and the load/store active duty ratio is 2/WS)

  • 38

    calculation do not degrade performance.

    Another way to obtain reliable border metric values is to use those values of the

    previous iteration, which has been adopted for the classical radix-2 single-binary turbo

    codes [17]. In this dissertation, this approach is adapted to the double-binary / radix-4

    single-binary turbo codes. For each window, the last backward metric is stored in a memory

    called the border memory. The stored border metrics are loaded in the next iteration and

    used as the initial backward metric values at the borders, as illustrated in Figure 3.2(b).

    Since there is no stored value in the first iteration, all states at

    the borders are assumed to be equiprobable in the first iteration. Compared to the

    conventional method based on the dummy calculation, this approach results in slight

    performance degradation for the earlier iterations, but the performance degradation

    disappears after a few iterations. By using the metrics stored in the previous iteration, we

    can completely avoid the dummy backward metric calculation. Additionally, the size of the

    branch metric memory is halved, since the number of processes that access the

    branch metrics falls from four to two.

    3.2 Proposed Border Metric Encoding

    As described in the previous section, an additional memory is needed to hold the border

    metric values of the previous iteration. Although the sliding window with the border

    memory can eliminate the need for the dummy calculation, the border memory size is

    considerable. To achieve more area-efficient and energy-efficient turbo decoding, the

    border memory should be minimized. If the maximum frame size is Nmax, the number of

    states in the trellis is K, and state metric values are represented in P bits, then the border

    memory size (BM) is defined as follows.

    $$BM = \left( \frac{N_{max}}{WS} - 1 \right) \times K \times P \qquad (3.2)$$

    where WS is the window size. Since Nmax and K are fixed for a standard, the border

    memory size depends only on the window size and the wordlength of state metrics. To

    reduce the border memory size, we can either increase the window size or decrease the

  • 39

    wordlength of state metrics. Increasing the window size, however, increases the sizes of

    the memories storing the forward and branch metrics, and the window size is usually set to

    32 for an 8-state trellis. Therefore, we should decrease P to reduce the overall border memory

    size. Otherwise, the sliding window associated with the border memory may not be suitable

    for hardware implementation, because a large border memory is indispensable for

    3GPP-LTE, whose Nmax is 6144 (3072 in the case of radix-4 processing), and for WiMAX,

    whose Nmax is 2400.
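    As a worked instance of (3.2), the sketch below evaluates the border memory size for the WiMAX parameters used in this chapter. The function name is hypothetical. The example counts 7 stored state metrics rather than 8, an assumption taken from the normalization described in Section 3.3, which removes the need to store the metric of state 0. It also assumes the frame size is a multiple of the window size.

```python
def border_memory_bits(n_max, window_size, stored_states, bits_per_metric):
    """Border memory size from (3.2): one set of state metrics per window border."""
    num_borders = n_max // window_size - 1    # assumes n_max is a multiple of window_size
    return num_borders * stored_states * bits_per_metric


# WiMAX: Nmax = 2400, WS = 32, 7 stored states
print(border_memory_bits(2400, 32, 7, 10))    # 10-bit state metrics, no encoding
print(border_memory_bits(2400, 32, 7, 4))     # 4-bit encoded border metrics
```

    With these parameters the function yields 5180 bits without encoding and 2072 bits with 4-bit encoding, the border memory sizes listed in Table 3.3.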

    The reduction of the border memory can be realized by allowing a few values to represent

    the border metrics. Though the reliability of the border metric is slightly decreased due to

    the loss of accuracy, this can be totally recovered after a few trellis stages. A simple

    encoding with low hardware complexity is to floor the original metric value to

    Figure 3.3: 3-bit border metric encoding function

    (figure omitted: staircase plot of the encoded value against the original value, with levels at -64, -32, -16, 16, 32, and 64 on both axes)

    Table 3.1 Simulation environment

    SISO Algorithm    Max-log-MAP
    Window Size       32
    Quantization      Received input: (4, 2)
                      Branch Metric: (10, 2)
                      State Metrics: (10, 2)
                      Extrinsic Information: (8, 2)
                      LLR value: (11, 2)

  • 40

    the closest power-of-two number. The experimental environment for the WiMAX is

    indicated in Table 3.1, where (q, f) denotes a quantization scheme that uses q bits in total

    and f bits to represent the fractional part. The final quantization schemes shown in Table

    3.1 are determined by performing several simulations and referring to [6] and [8]. The

    encoding function for the proposed 3-bit encoding is depicted in Figure 3.3. The encoding

    function for the 4-bit encoding can be similarly defined. Possible values at the border are

    listed in Table 3.2. As the range of the original border metrics is [-512, +511], which can be

    represented with 10 bits, the proposed border metric encoding can be obtained by limiting

    the value into [-256, +256] for the 4-bit encoding and [-64, +64] for the 3-bit encoding and

    by allowing only power-of-two values. In Figure 3.4, the BER performance of the

    proposed encoding is compared with those of various methods. The schemes in which the

    border metric is initialized with the value of the previous iteration degrade the performance

    Figure 3.4: BER performance comparison with 8 iterations for 4800-bit frame.

    Table 3.2 Encoded values for border metrics

    Encoding Scheme    Encoded values for border metric
    4-bit encoding     ±256, ±128, ±64, ±32, ±16, ±8, ±4, 0
    3-bit encoding     ±64, ±32, ±16, 0

  • 41

    by about 0.02 dB in the waterfall region. If the SNR is higher than 1 dB, however, the

    proposed 4-bit encoding shows about 0.1 dB better performance than the classical method

    which uses the dummy calculation. It is well-known that we can obtain better performance

    for the Max-log-MAP algorithm by scaling the extrinsic information [10]. Since the floor

    function reduces the border metrics, its effect is similar to the extrinsic information scaling.

    When the SNR is high, the performance of the 3-bit encoding is degraded by about 0.1 dB

    compared to the other schemes as the values are restricted to a relatively small region. The

    size of the border memory can be reduced significantly by using the proposed encoding. As

    the iteration proceeds, the BER degradation resulting from the proposed encoding scheme

    becomes negligible as shown in Figure 3.5. It has been reported that we can achieve higher

    bandwidth efficiency for triple-binary turbo codes [18] and obtain better performance if the

    number of states in the trellis increases [19]. Since these two factors increase the

    complexity of the dummy backward metric calculation, the sliding window associated with

    the proposed border metric encoding can be more effective for future turbo codes.
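    The power-of-two flooring described in this section can be sketched as follows. This is a minimal illustration rather than the hardware encoder: the function name is hypothetical, and two details are assumptions inferred from Table 3.2 and Figure 3.3, namely that the flooring acts on the magnitude while the sign is preserved, and that magnitudes below the smallest level map to 0.

```python
def encode_border_metric(value, max_magnitude=256, min_magnitude=4):
    """Floor |value| to a power of two in [min_magnitude, max_magnitude], keeping the sign."""
    magnitude = min(abs(value), max_magnitude)          # limit the range first
    if magnitude < min_magnitude:
        return 0                                        # assumed: small values map to 0
    floored = 1 << (magnitude.bit_length() - 1)         # largest power of two <= magnitude
    return floored if value > 0 else -floored


# 4-bit encoding (defaults) and 3-bit encoding (levels 16, 32, 64)
print(encode_border_metric(100), encode_border_metric(-20), encode_border_metric(3))
print(encode_border_metric(100, 64, 16), encode_border_metric(-40, 64, 16))

# Number of distinct codes over the 10-bit range [-512, +511]
levels_4bit = {encode_border_metric(v) for v in range(-512, 512)}
levels_3bit = {encode_border_metric(v, 64, 16) for v in range(-512, 512)}
print(len(levels_4bit), len(levels_3bit))
```

    Under these assumptions the 10-bit range collapses to 15 distinct values for the 4-bit scheme and 7 for the 3-bit scheme, which fit in 4 and 3 bits respectively.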

    Figure 3.5: BER performance of 1920-bit frame according to the number of iterations

    (figure omitted: BER curves for the 1st, 2nd, 4th, 6th, and 8th iterations)

  • 42

    3.3 Experimental Results

    With the quantization indicated in Table 3.1, a Max-log-MAP decoder based on the

    proposed border metric encoding was described in Verilog-HDL and synthesized with a

    0.18 μm 4-metal CMOS standard-cell library and compiled SRAM memories. Design

    Compiler and Power Compiler of Synopsys were used for the synthesis and power

    estimation, respectively. Switching activities resulting from gate-level simulation were

    annotated for gate-level power estimation. The window size is set to 32 and the 4-bit

    border metric encoding is employed. In the hardware implementation, the forward metrics

    and backward metrics are normalized by subtracting the values of state 0, $\alpha_k(s_0)$ and

    $\beta_k(s_0)$, from the other metrics at the same trellis stage in order to avoid overflow in the state

    metrics, which also eliminates the need to store the metric value of state 0. Since the SISO

    decoder takes two systematic bits and two parity bits as inputs, the number of possible

    branch metrics is 16 while the number of possible branch metrics is 4 in the classical

    single-binary turbo codes. Among the 16 possible branch metrics, only 8 branch metrics

    are distinguishable and sufficient to derive the others. Although the number of branch

    metrics to be stored is reduced by half, the branch memory size is still considerable if the

    conventional sliding window with the dummy calculation is adopted as indicated in Table

    3.3. Even in the case that the sliding window is associated with the border memory, the

    total memory size is increased because of the border memory requirement. By applying the

    proposed border metric encoding method, the total memory size needed in the SISO

    decoder is reduced by 20.7% as summarized in Table 3.3.
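    The memory totals behind the 20.7% figure can be reproduced from the entries of Table 3.3. The sketch below only restates that arithmetic; the variable names are hypothetical.

```python
WINDOW = 32
BORDERS = 2400 // WINDOW - 1                 # window borders for the WiMAX frame

forward = 2 * WINDOW * (10 * 7)              # 2 banks, 7 state metrics of 10 bits
branch_dummy = 4 * WINDOW * (10 * 8)         # 4 banks, 8 branch metrics of 10 bits
branch_border = 2 * WINDOW * (10 * 8)        # 2 banks are enough with the border memory
border_raw = BORDERS * (10 * 7)              # unencoded 10-bit border metrics
border_4bit = BORDERS * (4 * 7)              # 4-bit encoded border metrics

totals = {
    "dummy calculation": forward + branch_dummy,
    "border memory, no encoding": forward + branch_border + border_raw,
    "border memory, 4-bit encoding": forward + branch_border + border_4bit,
}
baseline = totals["dummy calculation"]
for name, bits in totals.items():
    print(f"{name}: {bits} bits ({100 * bits / baseline:.1f}%)")
```

    The unencoded border memory almost exactly cancels the saving from the halved branch memory, so the reduction in total memory comes from the encoding step.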

    In Table 3.4, the energy consumption of the proposed SISO decoder is compared with

    that of the conventional decoder, which is measured for 1.2 dB SNR and 8 iterations at the

    operating frequency of 200 MHz. Due to the increased computational complexity of the

    double-binary turbo codes, the energy consumption of the SISO logic is also increased

    compared to the classical single-binary turbo codes [6][7]. As shown in Table 3.4, the energy

    consumption of the SISO logic is reduced by eliminating the dummy calculation. Also, as

    shown in Figure 3.2(b), the energy consumption of the border memory is very low because

    the memory is small and infrequently accessed. While processing a window, we need to

    access the border memory only twice, once to load and once to store. For the

  • 43

    case of dummy calculation, however, the dummy calculation logic should operate almost

    all the time as indicated in Figure 3.2(a). Therefore, the proposed SISO decoder can reduce

    the energy consumption by 25.3% compared to the conventional SISO decoder based on

    the dummy calculation and the table-based interleaver.
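    The 25.3% saving follows directly from the per-component values reported in Table 3.4. The short sketch below restates that arithmetic; the dictionary keys are hypothetical labels.

```python
# Energy per bit per iteration in pJ, copied from Table 3.4
with_dummy = {"siso_logic": 2347.9, "branch_memory": 1288.4, "forward_memory": 559.1}
with_border = {"siso_logic": 1876.9, "branch_memory": 649.9,
               "forward_memory": 559.1, "border_memory": 49.4}

total_dummy = sum(with_dummy.values())
total_border = sum(with_border.values())
saving_percent = 100 * (1 - total_border / total_dummy)

print(f"{total_dummy:.1f} pJ -> {total_border:.1f} pJ, saving {saving_percent:.1f}%")
```

    Most of the saving comes from the SISO logic and the branch memory, while the border memory adds only 49.4 pJ/bit/iter.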

    Table 3.4 Energy consumptions of SISO decoders

                      With Dummy Calculation       With Border Memory (4-bit Encoding)
    SISO Logic        2347.9 pJ/bit/iter           1876.9 pJ/bit/iter
    Branch Memory     1288.4 pJ/bit/iter           649.9 pJ/bit/iter
    Forward Memory    559.1 pJ/bit/iter            559.1 pJ/bit/iter
    Border Memory     N.A.                         49.4 pJ/bit/iter
    Total             4195.4 pJ/bit/iter (100%)    3135.3 pJ/bit/iter (74.7%)

    Table 3.3 Single-port SRAM size required for a SISO decoder

                      With Dummy Calculation          With Border Memory (No Encoding)    With Border Memory (4-bit Encoding)
    Forward Memory    2 banks, 32*(10*7) bits/bank    2 banks, 32*(10*7) bits/bank        2 banks, 32*(10*7) bits/bank
                      4480 bits                       4480 bits                           4480 bits
    Branch Memory     4 banks, 32*(10*8) bits/bank    2 banks, 32*(10*8) bits/bank        2 banks, 32*(10*8) bits/bank
                      10240 bits                      5120 bits                           5120 bits
    Border Memory     N.A.                            [(2400/32)-1]*(10*7) bits           [(2400/32)-1]*(4*7) bits
                                                      = 5180 bits                         = 2072 bits
    Total             14720 bits (100%)               14780 bits (100.4%)                 11672 bits (79.3%)

  • 44

    Chapter 4

    Bit-Level Extrinsic Information Exchange

    The nonbinary turbo code has many advantages over the single-binary turbo code, but

    its decoding requires more memory especially for storing the extrinsic information to be

    exchanged between the two soft-input soft-output (SISO) decoders. To reduce the memory

    size required for double-binary turbo decoding, this chapter presents a new method to

    convert the symbolic extrinsic information to the bit-level information and vice versa. By

    exchanging the bit-level extrinsic information, the number of extrinsic information values

    to be exchanged in double-binary turbo decoding is reduced to the same amount as single-

    binary turbo decoding. Since the size of the extrinsic information memory is significant,

    the proposed method is effective in reducing the total memory size needed in double-

    binary turbo decoders. A double-binary turbo decoder is designed for the WiMAX standard

    to verify the proposed method, which reduces the total memory size by 28.4%.

    4.1 Extrinsic Information in Double-Binary Turbo Codes

    A typical turbo decoder consists of two SISO decoders serially concatenated via an

    interleaver. Focusing on non-binary turbo decoding, we describe in this Section the

    conventional symbol-level extrinsic information and the implementation issues.

    4.1.1 Symbol-level Extrinsic Information in Double-Binary Turbo Codes

    In an m-ary turbo code where a symbol is represented in m bits, the number of possible

    symbol-level extrinsic information values is $2^m - 1$ [18]. Since the value of m is two for the

    double-binary turbo code, three symbol-level extrinsic information values are defined as

  • 45

    follows [13][14].

    $$L_e^{z} = \ln \frac{p(u_k = z)}{p(u_k = 00)} \iff p(u_k = z) = \exp[L_e^{z}] \, p(u_k = 00) \qquad (4.1)$$

    where z belongs to {01, 10, 11}, $u_k$ is the input symbol consisting of two bits, and $p(\cdot)$

    denotes probability. The extrinsic information is exchanged iteratively between the two

    SISO decoders during the whole decoding process. As indicated in (4.1), the extrinsic

    information in the double-binary turbo code is defined as the log-ratio of the probabilities of two input symbols,

    each of which consists of two bits. In nonbinary turbo decoding, more extrinsic

    information values have to be exchanged than in classical single-binary turbo

    decoding, which stores only one extrinsic information value. Therefore, a large memory is

    needed to store the increased number of extrinsic information values when implementing a

    nonbinary turbo decoder.
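    The relation in (4.1) between symbol probabilities and the three symbol-level extrinsic values can be illustrated with a small sketch. The function names and the example probabilities are hypothetical; the code only shows that three log-ratios relative to the symbol 00 are enough to recover all four symbol probabilities.

```python
import math

NONZERO_SYMBOLS = ("01", "10", "11")


def symbol_extrinsic(probabilities):
    """Three log-ratios against the reference symbol 00, following (4.1)."""
    reference = probabilities["00"]
    return {z: math.log(probabilities[z] / reference) for z in NONZERO_SYMBOLS}


def symbol_probabilities(extrinsic):
    """Invert (4.1): p(u = z) is proportional to exp(L_e^z), with weight 1 for 00."""
    weights = {"00": 1.0}
    weights.update({z: math.exp(extrinsic[z]) for z in NONZERO_SYMBOLS})
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}


example = {"00": 0.4, "01": 0.3, "10": 0.2, "11": 0.1}   # made-up probabilities
values = symbol_extrinsic(example)
print(values)
print(symbol_probabilities(values))
```

    Three values per two-bit symbol must be stored and exchanged, compared with one value per bit in the single-binary case, which is the source of the larger extrinsic memory discussed above.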

    4.1.2 Memory Requirement in Double-Binary Turbo Decoder

    A typical turbo decoder is based on the time-multiplex architecture that contains only

    one SISO decoder, one interleaver, and one extrinsic memory. The first and second SISO

    decoding processes of an iteration are time-multiplexed in the architecture [20]. To achieve

    a high throughput, several SISO decoders can be employed to decode the turbo code in

    parallel [18]. For the SISO decoder, the sliding window technique, where a large frame is

    split into a number of small windows and the MAP decoding is applied to each window

    independently, is widely used to reduce the memory needed to store metric values [16]. To

    avoid the complex dummy metric calculation required in the sliding window technique, we

    can adopt the border memory in nonbinary turbo decoding as described in the previous

    chapter. However, the extrinsic information memory cannot be reduced even if the

    sliding window technique is employed, and the size is determined by the largest frame size

    that is 2400 pairs in the WiMAX [3].

    The experimental environment for the WiMAX is indicated in Table 4.1, where (q, f)

    denotes a quantization scheme that uses q bits in total and f bits to represent the fractional

    part. Taking into account the quantization scheme indicated in Table 4.1, the memory size

    required for a double-binary SISO decoder is summarized in Table 4.2, and that for storing

  • 46

    extrinsic information values in Table 4.3. It is crucial to reduce the extrinsic information

    memory even if several SISO decoders are adopted for parallel decoding, as the extrinsic

    information memory is much bigger than the memory required in SISO decoding as

    indicated in Tables 4.2 and 4.3 and Figure 4.1.

    It has been report