
Contents lists available at ScienceDirect

INTEGRATION, the VLSI journal

journal homepage: www.elsevier.com/locate/vlsi

High-throughput partial-parallel block-layered decoding architecture for nonbinary LDPC codes

Huyen Pham Thi, Sabooh Ajaz, Hanho Lee⁎,1

Dept. of Information and Communication Engineering, Inha University, Incheon 22212, Republic of Korea

A R T I C L E I N F O

Keywords: Nonbinary LDPC; Iterative decoding; Min–max; Block-layered decoding

A B S T R A C T

This paper presents a novel forward-backward four-way merger min-max algorithm and a high-throughput decoder architecture for nonbinary low-density parity-check (NB-LDPC) decoding, which significantly reduces decoding latency. An efficient partial-parallel block-layered decoder architecture suitable for the proposed forward-backward four-way merger algorithm is presented to speed up the decoder convergence. Moreover, a parallel switch network architecture and a parallel-serial check node unit are also proposed to facilitate the implementation of the proposed decoder architecture. The proposed algorithm can reduce the number of check node processing steps by half. Consequently, the decoder architecture using the proposed algorithm can achieve a considerably higher throughput than previous works. Two quasi-cyclic NB-LDPC (QC-NB-LDPC) codes over GF(32), (837, 726) and (744, 653), are synthesized using a 90-nm CMOS technology. The implementation results demonstrate that the proposed decoder architecture can operate at a 370 MHz clock frequency, and the throughputs of these two codes are 92.6 Mbps and 118.86 Mbps, respectively.

1. Introduction

Low-density parity-check (LDPC) codes [1] have shown excellent error-correcting performance, close to the Shannon limit for long code lengths. In recent years, LDPC codes have attracted significant attention because of their excellent error-correcting performance, inherent parallelism, and high throughput potential [2–4]. Research in [5] has shown that nonbinary LDPC (NB-LDPC) codes defined over GF(q) convincingly outperform their binary counterparts in terms of error-correcting performance when the code length is moderate. Nonetheless, the NB-LDPC decoding algorithms require complex computations, and their architectures have very high complexity and large memory requirements.

Originally, the belief propagation (BP) algorithm was used for binary LDPC decoding. The BP algorithm for NB-LDPC decoding was then proposed [5]. In [6], a fast Fourier transform (FFT) in the probability domain was applied to the BP algorithm, which replaced the convolution operations with multiplications in the frequency domain to reduce the computational complexity of check node processing from $O(q^{d_c})$ to $O(q \log_2 q)$. Although the probability-domain algorithms are well known for their optimal performance, the tremendous number of additions and multiplications causes an exponential increase in hardware complexity. To address this problem, the logarithm-domain algorithms decode channel messages using log-likelihood ratio (LLR) values [7,8] instead of probabilities, so that the multiplications are replaced with additions. FFT-BP decoding in the logarithm domain for NB-LDPC codes [7] demonstrated advantages in both decoding complexity and numerical stability.

To implement a practical NB-LDPC decoder architecture, sub-optimal algorithms such as min-sum [9] and min-max [10] are used. Savin [10] used a max operation instead of a sum operation for check node processing. Because the max operation is easier to implement in VLSI than the sum operation, the min-max algorithm has been widely used to reduce hardware complexity [11–18]. Both of these algorithms apply the LLR values to decode channel messages. In addition, several algorithms have been proposed to simplify the hardware implementation of the NB-LDPC decoder, such as the trellis min-max (TMM) algorithm [17,18], the symbol-reliability-based algorithm [19], and the stochastic algorithm [20]. However, these algorithms suffer from performance degradation. The forward-backward min-max algorithm [10] is a well-known algorithm that provides good decoding performance and a simplified hardware implementation. The critical issue in the practical implementation of NB-LDPC decoders based on the forward-backward min-max algorithm is throughput, and the main bottleneck of the NB-LDPC decoder is the check node unit (CNU).

http://dx.doi.org/10.1016/j.vlsi.2017.05.005
Received 2 March 2016; Received in revised form 6 March 2017; Accepted 19 May 2017

⁎ Corresponding author.
1 Postal Address: Inha University, 100 Inha-ro, Nam-gu, Incheon 22212, Republic of Korea.
E-mail address: [email protected] (H. Lee).

INTEGRATION the VLSI journal 59 (2017) 52–63
Available online 19 May 2017
0167-9260/ © 2017 Elsevier B.V. All rights reserved.

Generally, the forward-backward recursive algorithm is the preferred choice in most existing NB-LDPC decoders [11–16] because of its simplified hardware implementation. It is a three-step process in which forward, backward, and merger steps are recursively performed. A decoder architecture based on the forward-backward recursive algorithm for NB-LDPC codes was first introduced in [11]. Let $d_c$ be the check node degree, i.e., the number of variable nodes connected to one check node. This decoder requires a computation time of $3 \times (d_c - 2)$ elementary computation steps (ECSs), in which $(d_c - 2)$ ECSs are spent on forward processing, $(d_c - 2)$ ECSs on backward processing, and $(d_c - 2)$ ECSs on merger processing. Each ECS takes q cycles, so a total of $3 \times (d_c - 2) \times q$ cycles are required for the decoding. In [14], the number of cycles per ECS is reduced to $n_m < q$ cycles by keeping $n_m$ values in each message vector, and the number of cycles for the decoding is $3 \times (d_c - 2) \times n_m$. To further reduce the decoding time, an overlapped scheduling of forward-backward check node processing is performed in [12], but the throughput improvement is still limited. Recently, a two-way merging algorithm [15] and a bidirectional recursion [21] have been introduced for check node processing to reduce the computation time to $(d_c - 1)$ ECSs at the cost of increased complexity. In [12,14,21], only $n_m < q$ values per message vector are kept to reduce the memory requirement and the number of cycles per ECS to $n_m$ cycles. However, the variable node processing architectures and the controller become complex, and additional memory is required to store the indices. By keeping q values per message vector [11,13,15,16], q cycles per ECS are required, and all q values need to be stored. This simplifies the variable node processing and the controller, and storing indices is not necessary. However, this approach clearly suffers from low throughput, especially for large $d_c$. To reduce the computation time, either the number of ECSs or the number of cycles per ECS needs to be reduced.

In this work, a novel forward-backward four-way merger min-max (FB4M-MM) algorithm keeping q values in each message vector is proposed to further reduce the decoding time to $(\lfloor d_c/2 \rfloor + 1)$ ECSs. Compared with [15,21], the proposed algorithm reduces the decoding time by almost half. An efficient partial-parallel block-layered decoder architecture is designed to implement the proposed algorithm. Moreover, a parallel switching network and a parallel-serial elementary computation unit (PS-ECU) architecture are also proposed to facilitate the implementation of the proposed decoder architecture.
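The step counts quoted above can be compared numerically. The following is a minimal sketch, not part of the original work: the function names are hypothetical, and the value $n_m = 16$ is an illustrative assumption (the text only states $n_m < q$). It evaluates the formulas for the (837, 726) code with $d_c = 27$ over GF(32).

```python
# Hypothetical helper names; formulas are the ones stated in the text.

def cycles_fb_recursive(dc, cycles_per_ecs):
    """Forward, backward and merger passes, (dc - 2) ECSs each, as in [11]/[14]."""
    return 3 * (dc - 2) * cycles_per_ecs

def ecs_two_way(dc):
    """ECS count of the two-way merger / bidirectional schemes [15,21]."""
    return dc - 1

def ecs_four_way(dc):
    """ECS count claimed for the proposed FB4M-MM algorithm."""
    return dc // 2 + 1

dc, q = 27, 32
n_m = 16  # assumed truncation length, for illustration only

print("FB recursive, q values  :", cycles_fb_recursive(dc, q), "cycles")
print("FB recursive, n_m values:", cycles_fb_recursive(dc, n_m), "cycles")
print("two-way merger          :", ecs_two_way(dc), "ECSs")
print("four-way merger         :", ecs_four_way(dc), "ECSs")
```

For $d_c = 27$ the four-way count is 14 ECSs against 26 for the two-way schemes, which is consistent with the "almost half" statement.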

The remainder of this paper is organized as follows. Section 2 gives a brief review of the NB-LDPC code and min-max decoding algorithms. Section 3 presents the parallel elementary computation unit (P-ECU) and the proposed forward-backward four-way merger min-max algorithm. In Section 4, the proposed CNU architecture is presented. The proposed partial-parallel block-layered decoder architecture is presented in Section 5. The implementation and comparison results are discussed in Section 6. Finally, conclusions are drawn in Section 7.

2. Review of NB-LDPC codes and decoding algorithms

2.1. NB-LDPC codes

Let α be a primitive element of the Galois field GF(q). Then GF(q) has the q elements $\{0, \alpha^0, \alpha^1, \ldots, \alpha^{q-2}\}$. Let H be the parity-check matrix of an (N, K) NB-LDPC code (N, K and M = N – K are the numbers of code symbols, information symbols and parity symbols, respectively), where the nonzero elements of H are nonzero symbols of GF(q) and are denoted $h_{mn}$ ($0 \le m < M$, $0 \le n < N$). An NB-LDPC code can be defined by a Tanner graph corresponding to the H matrix. In the Tanner graph, each row (column) of the H matrix is associated with a check (variable) node, and a check node is connected to a variable node if the corresponding entry in the H matrix is nonzero. Let $V = (V_{n_0}, V_{n_1}, \ldots, V_{n_{N-1}})$, where $V_{n_i} \in GF(q)$, be a codeword. In the parity-check equation of the NB-LDPC code, all check nodes must satisfy

$$\sum_{i=0}^{d_c - 1} h_{mn_i} V_{n_i} = 0 \tag{1}$$

There are several methods to generate NB-LDPC codes, such as using their binary images [22] and algebraic construction [23]. In this work, quasi-cyclic NB-LDPC (QC-NB-LDPC) codes, which are constructed by the algebraic construction method based on array dispersions of matrices in [23], are applied. QC-NB-LDPC codes are well known for their error-correcting performance and efficient parallel processing in decoders. However, the complexity of NB-LDPC codes is very high, especially when the Galois field order is increased. NB-LDPC codes over GF(32), which have been widely used in previous decoders [11–14,16,17,21], are applied in this work to achieve a suitable hardware complexity in the implementation as well as a fair comparison. A (31×31) matrix over GF(32) is generated using the method in [23]. Let $d_v$ be the variable node degree, i.e., the number of check nodes connected to one variable node. A submatrix of size ($d_v$, $d_c$) is then taken from the (31×31) matrix. Each element in the ($d_v$, $d_c$) submatrix is dispersed into either a zero matrix or an α-multiplied circulant permutation matrix (CPM) of size (q – 1) × (q – 1). The α-multiplied CPM has the following properties. The first row of the matrix has a single nonzero entry at the i-th position, which corresponds to the element $\alpha^i$ of the submatrix; the other entries are zero. Each of the other rows is constructed by a right cyclic shift of the previous row multiplied by α. As a result, the H matrix generated from the ($d_v$, $d_c$) submatrix has $d_v \times (q - 1)$ rows and $d_c \times (q - 1)$ columns. In this work, a ($d_v$, $d_c$) = (4, 27) submatrix is extended to generate the (837, 726) QC-NB-LDPC code. Fig. 1 shows the H matrix for the (837, 726) QC-NB-LDPC code over GF(32) and an $\alpha^2$-multiplied circulant permutation matrix.
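The dispersion rule described above can be sketched in a few lines. This is an illustrative sketch under assumptions, not the authors' construction code: the function names are hypothetical, and each entry is stored as an exponent `e` standing for $\alpha^e$, with `-1` standing for the zero element. Because every row is the previous row shifted right by one position and multiplied by α, the nonzero entry in column `c` always carries the value $\alpha^c$ (using $\alpha^{q-1} = 1$ on wrap-around).

```python
def alpha_multiplied_cpm(i, q):
    """Exponent form of the (q-1) x (q-1) alpha-multiplied CPM for element alpha^i.

    Entry e >= 0 means alpha**e; entry -1 means the zero element of GF(q).
    """
    n = q - 1
    rows = []
    for r in range(n):
        row = [-1] * n
        col = (i + r) % n      # right cyclic shift by one position per row
        row[col] = col         # multiplying by alpha each row gives alpha^col
        rows.append(row)
    return rows

def h_matrix_shape(dv, dc, q):
    """Size of H after dispersing a (dv, dc) submatrix."""
    return dv * (q - 1), dc * (q - 1)

if __name__ == "__main__":
    for row in alpha_multiplied_cpm(2, 8):   # alpha^2-multiplied CPM over GF(8)
        print(row)
    print(h_matrix_shape(4, 27, 32))         # (4, 27) submatrix over GF(32)
```

With $(d_v, d_c) = (4, 27)$ and q = 32 the shape helper returns 124 rows and 837 columns, matching the code length N = 837.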

2.2. Block-layered min-max decoding algorithm

Because a layered decoding algorithm can speed up the convergence of the iterative decoder by approximately half and reduce the memory requirement, it has been widely used in recent works [12–15,17,21]. Hence, an efficient horizontal layered decoding scheme is applied in this work. Moreover, the QC-NB-LDPC codes described above are used to exploit the capability of processing in parallel. A total of (q – 1) non-overlapped rows of the H matrix are gathered to form a layer. Consequently, a ($d_v$, $d_c$) QC-NB-LDPC code can be divided into $d_v$ layers. The proposed block-layered min-max decoding algorithm is presented in Algorithm 1. In this algorithm, the (q – 1) non-overlapped rows of the H matrix are processed simultaneously since each column of the layer has a weight of one. The $d_v$ layers are processed sequentially to complete one decoding iteration. For each transmitted symbol, the corresponding received symbol can be any of the q elements of GF(q). Therefore, the received messages are vectors of size q. Let $x_n$ be the n-th symbol of a received codeword, and let $s_n$ denote the most reliable symbol. The reliability information of the channel is a non-negative vector $L_n(a)$ of q LLRs, where $a \in \{0, \alpha^0, \alpha^1, \ldots, \alpha^{q-2}\}$ is the index in GF(q), as shown in the Initialization step of Algorithm 1. The sets $\tilde{L}_{nm}^{k,l}(a)$ and $R_{mn}^{k,l}(a)$ are the variable-to-check node (V2C) messages (from variable node n to check node m) and the check-to-variable node (C2V) messages (from check node m to variable node n) at the k-th iteration and l-th layer, respectively.

Fig. 1. (a) H matrix for (837, 726) QC-NB-LDPC code over GF(32), (b) $\alpha^2$-multiplied CPM.

In the first layer of the first iteration, the V2C messages are the reliability information of the channel, and the C2V messages are equal to zero. Furthermore, a normalization step is performed in the variable node processing to ensure numerical stability and that the smallest LLR value in each vector is always equal to zero. In the check node processing, the proposed FB4M-MM algorithm discussed later is applied to compute the C2V messages. The decoding process continues until the number of iterations reaches the maximum value $I_{max}$ or the parity-check equation is satisfied.

Algorithm 1. Block-layered min-max decoding algorithm

Initialization:
    $L_n(a) = \ln(\Pr(x_n = s_n \mid \text{channel}) / \Pr(x_n = a \mid \text{channel}))$;
    $L_n^{1,0}(a) = L_n(a)$; $R_{mn}^{0,1}(a) = 0$;
Iterations:
For ($k = 1$; $k \le I_{max}$; $k$++)
    For ($l = 1$; $l \le L$; $l$++)
        For ($m = 0$; $m < q - 1$; $m$++)
            VNP: $\tilde{L}_{nm}^{k,l}(a) = L_n^{k,l-1}(a) - R_{mn}^{k-1,l}(a)$;
                 $\tilde{L}_{nm}^{k,l} = \min_{a \in GF(q)} (\tilde{L}_{nm}^{k,l}(a))$;
                 $\tilde{L}_{nm}^{k,l}(a) = \tilde{L}_{nm}^{k,l}(a) - \tilde{L}_{nm}^{k,l}$;
            CNP: $R_{mn}^{k,l}(a)$ = FB4M-MM$\{\tilde{L}_{nm}^{k,l}(a),\ n \in N(m)\}$;
            VNP: $L_n^{k,l}(a) = \tilde{L}_{nm}^{k,l}(a) + R_{mn}^{k,l}(a)$;
        End
    End
    Decision: $\tilde{c}_n = \arg\min_a (L_n^{k,l}(a))$;
End
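The variable node processing (VNP) steps of Algorithm 1, including the normalization that keeps the smallest LLR at zero, can be illustrated with plain Python lists. This is a minimal sketch with hypothetical function names and made-up message values; it handles a single message vector of length q and leaves the check node processing abstract.

```python
def vnp_to_check(L_prev, R_prev):
    """V2C message: subtract the old C2V message, then normalize so min == 0."""
    diff = [l - r for l, r in zip(L_prev, R_prev)]
    smallest = min(diff)
    return [d - smallest for d in diff]

def vnp_update(L_tilde, R_new):
    """A posteriori update: add the freshly computed C2V message."""
    return [l + r for l, r in zip(L_tilde, R_new)]

def decide(L):
    """Hard decision: index of the smallest LLR (the most reliable symbol)."""
    return min(range(len(L)), key=L.__getitem__)

# Illustrative values for q = 4 (not taken from the paper).
L_prev = [5, 7, 6, 9]
R_prev = [1, 0, 3, 2]
L_tilde = vnp_to_check(L_prev, R_prev)
R_new = [2, 0, 1, 0]          # would come from the check node processing
L_post = vnp_update(L_tilde, R_new)
print(L_tilde, L_post, decide(L_post))
```

The normalization does not change which symbol has the smallest LLR; it only removes a common offset so that the fixed-point range is not exhausted over iterations.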

For the forward-backward min-max algorithm, several schemes such as the overlapped scheduling [12] and the forward-backward two-way merger min-max (FB2M-MM) algorithm [15] have been proposed to speed up the check node processing (CNP). The most recent of these, the FB2M-MM algorithm, requires $(d_c - 1)$ ECSs for the CNP. The FB2M-MM algorithm is given in Algorithm 2. In this algorithm, the forward and backward computations are performed in parallel during $(d_c - 1)$ ECSs, and the merger computation is divided into right merger and left merger computations. After $\lfloor d_c/2 \rfloor$ steps of forward and backward computations, the right and left merger computations are implemented independently and carried out in parallel.
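The forward, backward and merger recursions described above can be checked against a brute-force min-max check node on a toy example. The sketch below is an illustration under simplifying assumptions: all edge coefficients are taken as 1 (messages are assumed to be already permuted), field elements of GF($2^p$) are indexed by their binary polynomial representation so that field addition is a bitwise XOR, and the merger is evaluated in a plain sequential loop rather than as parallel left and right mergers. All function names are hypothetical.

```python
from itertools import product

def ecs(x, y):
    """Min-max ECS with unit coefficients: out[a] = min over a' ^ a'' == a."""
    q = len(x)
    return [min(max(x[a ^ b], y[b]) for b in range(q)) for a in range(q)]

def check_node_fb(L):
    """Forward-backward-merger check node; L is a list of dc vectors (0-based)."""
    dc = len(L)
    F = [None] * dc
    B = [None] * dc
    F[0] = L[0]
    for i in range(1, dc - 1):            # forward recursion
        F[i] = ecs(F[i - 1], L[i])
    B[dc - 1] = L[dc - 1]
    for j in range(dc - 2, 0, -1):        # backward recursion
        B[j] = ecs(B[j + 1], L[j])
    M = [None] * dc
    M[0] = B[1]                           # first output is a backward vector
    M[dc - 1] = F[dc - 2]                 # last output is a forward vector
    for n in range(1, dc - 1):            # merger
        M[n] = ecs(B[n + 1], F[n - 1])
    return M

def check_node_brute(L):
    """Reference: enumerate every symbol combination of the other edges."""
    dc, q = len(L), len(L[0])
    out = []
    for n in range(dc):
        others = [L[i] for i in range(dc) if i != n]
        best = [float("inf")] * q
        for combo in product(range(q), repeat=dc - 1):
            a = 0
            for c in combo:
                a ^= c
            best[a] = min(best[a], max(v[c] for v, c in zip(others, combo)))
        out.append(best)
    return out

L = [[0, 3, 1, 2], [2, 0, 3, 1], [1, 2, 0, 3], [3, 1, 2, 0]]  # dc = 4, GF(4)
print(check_node_fb(L) == check_node_brute(L))
```

The recursion needs only a number of ECSs linear in $d_c$, whereas the brute-force reference grows as $q^{d_c - 1}$ per output vector, which is why the decomposition into ECSs matters for hardware.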

2.3. Elementary computation step of the forward-backward min-max algorithm

Because the check node processing in the NB-LDPC decoding algorithm is extremely complex, it is difficult to implement in hardware. Savin [10] presented a forward-backward recursive computation scheme, which decomposes the computation of C2V messages into several ECSs. In general, the ECS is defined as follows.

$$L(a) = \min_{\substack{a', a'' \in GF(q) \\ h'a' + h''a'' = ha}} \left(\max(L_1(a'), L_2(a''))\right) \tag{2}$$

where $L_1(a')$ and $L_2(a'')$ are the two input vectors, and $L(a)$ is the output vector of the ECS. Each of these vectors contains q LLR values. Let $a$, $a'$, and $a''$ be the indexes of vectors $L(a)$, $L_1(a')$ and $L_2(a'')$, respectively, whose values are in GF(q). Moreover, $h$, $h'$, and $h''$ are fixed nonzero elements of the Galois field GF(q). The forward-backward algorithm [10] indicates that the check node messages are generated by the recursive forward (FD), backward (BD) and merger (MG) calculations using the ECS, as shown in Eq. (2).

Algorithm 2. Forward-backward two-way merger min-max algorithm [15].

Input: $L_n(a) = L_{nm}(a)$, $n \in N(m)$; $1 \le n \le d_c$
Forward:
    $F_1(a) = L_1(a)$;
    For ($i = 2$; $i < d_c$; $i = i + 1$)
        $F_i(a) = \min_{a', a'' \in GF(q),\ h_{i-1}a' + h_i a'' = h_i a} (\max(F_{i-1}(a'), L_i(a'')))$;
Backward:
    $B_{d_c}(a) = L_{d_c}(a)$;
    For ($j = d_c - 1$; $j > 1$; $j = j - 1$)
        $B_j(a) = \min_{a', a'' \in GF(q),\ h_{j+1}a' + h_j a'' = h_j a} (\max(B_{j+1}(a'), L_j(a'')))$;
Merger:
    $M_1(a) = B_2(a)$; $M_{d_c}(a) = F_{d_c-1}(a)$;
    $\beta = \lfloor d_c/2 \rfloor$;
Left Merger:
    For ($l = \beta$; $l \ge 2$; $l = l - 1$)
        $M_l(a) = \min_{a', a'' \in GF(q),\ a' + a'' = a} (\max(B_{l+1}(a'), F_{l-1}(a'')))$;
Right Merger:
    For ($k = \beta + 1$; $k \le d_c - 1$; $k = k + 1$)
        $M_k(a) = \min_{a', a'' \in GF(q),\ a' + a'' = a} (\max(B_{k+1}(a'), F_{k-1}(a'')))$;
Output: $R_{mn}(a) = M_n(a)$; $1 \le n \le d_c$
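A direct, unoptimized evaluation of the ECS in Eq. (2) over GF(8) is sketched below. Several things here are assumptions for illustration only: field elements are stored as 3-bit integers, the primitive polynomial $x^3 + x + 1$ is assumed for building the power table, and the function names and LLR values are hypothetical. The triple loop simply tests the condition $h'a' + h''a'' = ha$ for every pair.

```python
# alpha^0 .. alpha^6 as 3-bit integers, assuming primitive polynomial x^3 + x + 1.
EXP = [1, 2, 4, 3, 6, 7, 5]
LOG = {v: e for e, v in enumerate(EXP)}

def gf_mul(a, b):
    """Multiplication in GF(8) via log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 7]

def ecs(L1, L2, h1, h2, h):
    """Eq. (2): L(a) = min over h1*a' + h2*a'' == h*a of max(L1[a'], L2[a''])."""
    q = 8
    out = [float("inf")] * q
    for a in range(q):
        target = gf_mul(h, a)
        for a1 in range(q):
            for a2 in range(q):
                # Addition in GF(2^p) is a bitwise XOR of the representations.
                if gf_mul(h1, a1) ^ gf_mul(h2, a2) == target:
                    out[a] = min(out[a], max(L1[a1], L2[a2]))
    return out

L1 = [0, 5, 3, 7, 2, 6, 4, 1]   # illustrative LLR vectors, indexed by element
L2 = [0, 2, 6, 1, 5, 3, 7, 4]
print(ecs(L1, L2, EXP[3], EXP[6], EXP[6]))   # h' = alpha^3, h'' = alpha^6, h = alpha^6
```

Each output index is reached by exactly q of the $q^2$ pairs, so a practical ECU only needs to organize which q pairs feed which output; the architectures discussed in the text differ in how that routing is done.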

In [14], the output vector $L(a)$ was directly computed from the conditional equation under the min in Eq. (2), as follows:

$$h'a' + h''a'' = ha \tag{3}$$

For $a' = 0$:

$$a = \frac{h''}{h} a'' \tag{4}$$

For $a' \neq 0$:

$$a = \frac{h'a'}{h}\left(1 + \frac{h''}{h'a'} a''\right) \tag{5}$$

The ECS is implemented by the elementary computation unit (ECU) architecture proposed in [14], as shown in Fig. 2. This ECU architecture is named the serial ECU (S-ECU) because the LLR values of vector $L_1(a')$ with index $a'$ are fed into the ECU serially, one per cycle, while the q LLR values of vector $L_2(a'')$ in the order $L_2(0), L_2(\alpha^0), \ldots, L_2(\alpha^{q-2})$ are fed into the ECU in parallel. Accordingly, the q LLR values of the output vector $L(a)$ appear in an arbitrary order in each cycle. To maintain the order $L(0), L(\alpha^0), \ldots, L(\alpha^{q-2})$ after each cycle, a serial switch network is required to permute the q LLR values of the output vector $L(a)$ in each cycle. The control signals for the serial switch network depend on the index $a'$ of the given LLR value $L_1(a')$. Eq. (4) gives the control signal $h''/h$ in the case of $a' = 0$, and Eq. (5) gives the control signals $h'a'/h$ and $h''/(h'a')$ in the case of $a' \neq 0$. In addition, q recursive minimum (RM) finders are applied to the output vector $L(a)$ during the q cycles to determine the final minimum values of the output vector. Note that the S-ECU negatively affects the timing and consequently the throughput of the decoder.
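The two cases of Eq. (4) and Eq. (5) can be verified exhaustively over GF(8): for every pair $(a', a'')$ the computed output index must satisfy Eq. (3). The sketch below is only an illustration; it assumes 3-bit integer representations with the primitive polynomial $x^3 + x + 1$, and the helper names are hypothetical.

```python
EXP = [1, 2, 4, 3, 6, 7, 5]          # alpha^0..alpha^6, assuming x^3 + x + 1
LOG = {v: e for e, v in enumerate(EXP)}

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 7]

def gf_div(a, b):
    """a / b in GF(8); b must be nonzero."""
    if a == 0:
        return 0
    return EXP[(LOG[a] - LOG[b]) % 7]

def output_index(a1, a2, h1, h2, h):
    """Output index a for the pair (a', a'') following Eq. (4) and Eq. (5)."""
    if a1 == 0:
        return gf_mul(gf_div(h2, h), a2)                         # Eq. (4)
    s = gf_mul(h1, a1)                                           # h' a'
    return gf_mul(gf_div(s, h), 1 ^ gf_mul(gf_div(h2, s), a2))   # Eq. (5)

h1, h2, h = EXP[3], EXP[6], EXP[6]
ok = all(
    gf_mul(h, output_index(a1, a2, h1, h2, h)) == gf_mul(h1, a1) ^ gf_mul(h2, a2)
    for a1 in range(8) for a2 in range(8)
)
print("Eq. (4)/(5) consistent with Eq. (3):", ok)
```

In the S-ECU, the two factors $h'a'/h$ and $h''/(h'a')$ play the role of shift amounts, which is why they change every cycle with the incoming index $a'$.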

A q-fold ECU architecture [13] was designed to calculate q output values simultaneously in one cycle using q ECU modules without a switch network, as shown in Fig. 3. In this architecture, the q elements of the two input vectors are fed into the q-fold ECU in parallel, and fixed wires are used to route each element of each input vector to the appropriate location in each of the q modules. Although the q-fold ECU can significantly improve the throughput, this architecture has high hardware complexity, and it is fairly impractical to implement a q-fold ECU for higher orders of GF(q).

3. Proposed forward-backward four-way merger min-max algorithm

3.1. Proposed parallel elementary computation unit

An example of the ECS in a q-fold ECU architecture [13] is given with the conditional equation $\alpha^3 a' + \alpha^6 a'' = \alpha^6 a$ over GF(8). To find one LLR value $L(\alpha^r)$ of the output vector $L(a)$, the q pairs $a'$–$a''$ that satisfy $\alpha^3 a' + \alpha^6 a'' = \alpha^6 \alpha^r$ are obtained. Then, q LLR values are obtained by taking the larger value in each of the q pairs $L_1(a')$–$L_2(a'')$. The minimum of these q LLR values is the value $L(\alpha^r)$ of the output vector $L(a)$. Fig. 4 shows the q pairs $a'$–$a''$ for the computation of each LLR value $L(\alpha^r)$ of the output vector $L(a)$ over GF(8). As discussed above, the q-fold ECU architecture becomes impractical for creating the q LLR values of vector $L(a)$ simultaneously as q increases. To overcome this disadvantage of the q-fold ECU, we propose a novel parallel switch network that permutes the input vector $L_1(a')$, with indexes $a'$ in the order $\{0, \alpha^0, \alpha^1, \ldots, \alpha^6\}$, into the orders shown in Fig. 4 for computing the $L(a)$ values with index $a$. A parallel ECU (P-ECU) architecture for the ECS based on the proposed parallel switch network is then designed to significantly reduce the area compared to the q-fold ECU architecture [13].

The proposed parallel switch network structure is obtained by analyzing the conditional equation Eq. (3) such that the indexes $a'$ of one input vector $L_1(a')$ are expressed in terms of the indexes $a''$ of the other input vector $L_2(a'')$ and the index $a$ of the desired value of the output vector $L(a)$. Assume that the q indexes of vectors $L_1(a')$ and $L_2(a'')$ are initialized in the order $\{0, \alpha^0, \alpha^1, \ldots, \alpha^{q-2}\}$, and that the q indexes of vector $L_2(a'')$ always remain in the order $\{0, \alpha^0, \alpha^1, \ldots, \alpha^{q-2}\}$. To compute an output value with index $a$, the q pairs $a'$–$a''$ that satisfy Eq. (3) are found. Given $a$ and $a''$, the indexes $a'$ of vector $L_1(a')$ are found by Eq. (6) and Eq. (7).

In the case of $a = 0$, the $a'$ indexes are given as follows:

$$a' = \frac{h''}{h'} a'' = \alpha^i a'' \tag{6}$$

where $\alpha^i = h''/h'$.

In the case of $a \neq 0$, the $a'$ indexes are given as follows:

$$a' = \frac{ha}{h'}\left(1 + \frac{h''}{ha} a''\right) = \alpha^i (1 + \alpha^j a'') \tag{7}$$

where $\alpha^i = ha/h'$ and $\alpha^j = h''/(ha)$.

It is clear that the proposed parallel switch network is required to permute the $a'$ indexes from the order $\{0, \alpha^0, \alpha^1, \ldots, \alpha^{q-2}\}$ to the desired order, namely $\alpha^i a''$ or $\alpha^i (1 + \alpha^j a'')$, for computing the LLR values of vector $L(a)$. The permuted order of $a'$ is shown as $a'_{new}$ in Eq. (8) for both cases.

$$a'_{new} = \begin{cases} \alpha^i a'_{initial} & \text{for } a = 0 \\ \alpha^i (1 + \alpha^j a'_{initial}) & \text{otherwise} \end{cases} \tag{8}$$

In the case of $a = 0$, the last (q – 1) nonzero indexes $a'$ of vector $L_1(a')$ are multiplied by $\alpha^i$ as shown in Eq. (6), which is equivalent to a cyclic shift up by i positions. This cyclic shift is performed by the permutation network, which is simply constructed from a barrel shifter of size (q – 1). The zero index ($a' = 0$) is unchanged.

Fig. 2. S-ECU architecture for GF(8) [14].

Fig. 3. q-fold ECU architecture for GF(8) [13].

In the case of $a \neq 0$, three steps are involved in changing the indexes for $L_1(a')$. First, the last (q – 1) nonzero indexes $a'$ are multiplied by $\alpha^i$ to generate the new indexes $\alpha^i a'$, which is equivalent to a cyclic shift up by i positions. In this step, the first index remains the same. Second, a fixed interconnection network is used to permute the indexes $\alpha^i a'$ to $\alpha^i (1 + a')$. This network is built from fixed wires, and it is independent of the values of $\alpha^i$ and $\alpha^j$. Note that the fixed network is specific to each GF(q). For example, the fixed network for GF(8) permutes the indexes $a'$ in the order $\{0, \alpha^0, \alpha^1, \alpha^2, \alpha^3, \alpha^4, \alpha^5, \alpha^6\}$ to the indexes $(1 + a')$ in the order $\{\alpha^0, 0, \alpha^3, \alpha^6, \alpha^1, \alpha^5, \alpha^4, \alpha^2\}$. Third, the last (q – 1) nonzero indexes generated by the second step are shifted up by j positions, and the first index of the second step output remains unchanged. After the third step, the indexes of the switch network output become $\alpha^i (1 + \alpha^j a')$. Fig. 5(a) shows the switching process of the indexes $a'$ of vector $L_1(a')$ to generate the output LLR value $L(\alpha^1)$ at index $a = \alpha^1$ over GF(8) under the conditional equation $\alpha^3 a' + \alpha^6 a'' = \alpha^6 \alpha^1$ (where $h' = \alpha^3$, $h'' = \alpha^6$, and $h = \alpha^6$). Using Eq. (7), the new indexes of $a'$ are expressed as $a' = \alpha^4 (1 + \alpha^6 a'')$, where $\alpha^i = \alpha^4$ and $\alpha^j = \alpha^6$. The output of the proposed parallel switch network for the indexes $a'$ in Fig. 5(a) is the same as the result computed in Fig. 4 for $a = \alpha^1$.
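The index permutation of Eq. (6) to Eq. (8) can be reproduced numerically for the GF(8) example above. This is a sketch under assumptions: elements are 3-bit integers built with the primitive polynomial $x^3 + x + 1$ (an assumption, although it reproduces the fixed-network order quoted in the text), and the function names are hypothetical. The script prints the $a'$ index paired with each $a''$ and checks Eq. (3) for every pair.

```python
EXP = [1, 2, 4, 3, 6, 7, 5]          # alpha^0..alpha^6, assuming x^3 + x + 1
LOG = {v: e for e, v in enumerate(EXP)}
ORDER = [0] + EXP                     # index order {0, alpha^0, ..., alpha^6}

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 7]

def gf_div(a, b):
    if a == 0:
        return 0
    return EXP[(LOG[a] - LOG[b]) % 7]

def label(x):
    """Human-readable name of a field element."""
    return "0" if x == 0 else f"a^{LOG[x]}"

def switch_order(a, h1, h2, h):
    """a' index paired with each a'' in ORDER, following Eq. (6) and Eq. (7)."""
    if a == 0:
        ai = gf_div(h2, h1)                                   # alpha^i = h''/h'
        return [gf_mul(ai, a2) for a2 in ORDER]
    ha = gf_mul(h, a)
    ai = gf_div(ha, h1)                                       # alpha^i = ha/h'
    aj = gf_div(h2, ha)                                       # alpha^j = h''/(ha)
    return [gf_mul(ai, 1 ^ gf_mul(aj, a2)) for a2 in ORDER]

fixed_network = [1 ^ x for x in ORDER]                        # a' -> 1 + a'
print("fixed network:", [label(x) for x in fixed_network])

h1, h2, h, a = EXP[3], EXP[6], EXP[6], EXP[1]
perm = switch_order(a, h1, h2, h)
print("a' order for a = alpha^1:", [label(x) for x in perm])
print(all(gf_mul(h1, p) ^ gf_mul(h2, a2) == gf_mul(h, a) for p, a2 in zip(perm, ORDER)))
```

Only the two shift amounts i and j depend on the coefficients and on the output index, so the data-dependent part of the network reduces to two barrel shifters around a hard-wired permutation.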

The proposed P-ECU architecture over GF(8), based on the parallel switch network, is illustrated in Fig. 5(b). In the P-ECU architecture, the q LLR values of both input vectors $L_1(a')$ and $L_2(a'')$ are given simultaneously, and the parallel switch network is used to permute the input vector $L_1(a')$ in each cycle. In contrast, in the S-ECU architecture in Fig. 2 [14], the q LLR values of $L_1(a')$ are given serially to the max block over q cycles, and the serial switch network is used to permute the output vector $L(a)$. In addition, the proposed P-ECU generates one valid output value $L(\alpha^r)$ of the output vector $L(a)$ per cycle, so that the q valid output values of the output vector are produced over q cycles; the S-ECU generates all q valid output values of the output vector simultaneously after q cycles. As a result, the S-ECU cannot generate any valid output value before the completion of the q cycles.
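The difference in output timing between the two units can be mimicked with a small behavioral model. This is a simplified sketch, not a description of the actual hardware: unit coefficients are assumed ($h = h' = h'' = 1$), indices are binary representations so that field addition is XOR, each outer loop iteration stands for one clock cycle, and the function names are hypothetical.

```python
def p_ecu(L1, L2):
    """Parallel ECU model: one finished output value per cycle."""
    q = len(L1)
    out = []
    for a in range(q):                                # cycle number = output index
        permuted = [L1[a ^ b] for b in range(q)]      # role of the parallel switch network
        out.append(min(max(x, y) for x, y in zip(permuted, L2)))
    return out

def s_ecu(L1, L2):
    """Serial ECU model: outputs become valid only after all q cycles."""
    q = len(L1)
    acc = [float("inf")] * q                          # recursive-minimum registers
    for a1 in range(q):                               # one L1 value enters per cycle
        for a2 in range(q):
            a = a1 ^ a2                               # role of the serial switch network
            acc[a] = min(acc[a], max(L1[a1], L2[a2]))
    return acc

L1 = [0, 5, 3, 7, 2, 6, 4, 1]
L2 = [0, 2, 6, 1, 5, 3, 7, 4]
print(p_ecu(L1, L2) == s_ecu(L1, L2))
```

Both models take q cycles and return the same vector, but `p_ecu` finalizes entry `a` in cycle `a`, while every entry of `acc` in `s_ecu` may still change until the last cycle. That ordering is what allows a P-ECU output stream to feed an S-ECU input stream directly.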

3.2. Proposed forward-backward four-way merger min-max algorithm

Savin [10] introduced the forward-backward recursive computation, in which the current forward (backward) vector is calculated from the previous forward (backward) vector and the current V2C vector. Our observation is that the proposed P-ECU generates one valid output value of the output forward (backward) vector per cycle, whereas the S-ECU requires one valid input value from the previous output forward (backward) vector per cycle. Moreover, both the P-ECU and the S-ECU need q cycles to finish the computation of one ECS. Therefore, we propose a combination of the P-ECU and the S-ECU to calculate two ECSs at the same time. The advantage of this approach is that it reduces the number of steps for the forward-backward computations from $(d_c - 1)$ steps [15,21] to $\lceil d_c/2 \rceil$ steps. The resulting forward-backward four-way merger min-max algorithm is presented in Algorithm 3.

For the forward computation, with the check node degree $d_c$, a total of $(d_c - 1)$ ECSs is required to calculate the $(d_c - 1)$ forward vectors. In the first step, the forward vector is equal to the V2C vector ($F_1(a) = L_1(a)$). From the second step onward, two ECSs are computed simultaneously to generate two consecutive forward vectors, $F_i(a)$ and $F_{i+1}(a)$. Vector $F_i(a)$ is computed by a parallel computation (PC) corresponding to the P-ECU, and vector $F_{i+1}(a)$ is computed by a serial computation (SC) corresponding to the S-ECU. Furthermore, the output values of vector $F_i(a)$ are fed directly into the input of the SC to compute vector $F_{i+1}(a)$. Therefore, the number of steps to calculate the $(d_c - 1)$ forward vectors is reduced to $\lceil d_c/2 \rceil$ steps, compared to [15,21].
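The step count of the paired forward computation can be tabulated with a short scheduling sketch. The code below is an assumption-laden illustration: the function name is hypothetical, the initial assignment $F_1 = L_1$ is counted as the first step as in the text, and a leftover single vector (when the number of remaining vectors is odd) is assumed to occupy a step of its own.

```python
import math

def forward_schedule(dc):
    """Group the forward vectors F1..F(dc-1) into steps of the paired PC/SC scheme."""
    steps = [("F1",)]                          # first step: F1 = L1
    i = 2
    while i <= dc - 1:
        if i + 1 <= dc - 1:
            steps.append((f"F{i}", f"F{i + 1}"))   # PC computes F_i, SC computes F_{i+1}
            i += 2
        else:
            steps.append((f"F{i}",))           # assumed handling of a leftover vector
            i += 1
    return steps

for dc in (4, 5, 27):
    sched = forward_schedule(dc)
    print(dc, len(sched), math.ceil(dc / 2), sched[:3])
```

Under these assumptions the number of steps equals $\lceil d_c/2 \rceil$ for every $d_c$, for example 14 steps for $d_c = 27$ instead of the 26 needed when one forward vector is produced per step.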

The forward and backward computations are implemented in parallel, and the backward computation is similar to the forward computation. After half of the forward and backward vectors are computed, the merger computation is started in two directions. The first direction is the left merger, with two ways to compute the two left merger vectors $M_l(a)$ and $M_{l-1}(a)$. The second direction is the right merger, with two ways to compute the two right merger vectors $M_k(a)$ and $M_{k-1}(a)$. The last forward vector $F_{d_c-1}(a)$ and the last backward vector $B_2(a)$ are the merger vectors $M_{d_c}$ and $M_1$, respectively. Note that the two merger vectors in each direction are computed in parallel because there are two consecutive forward vectors and two consecutive backward

Algorithm 3: Forward-backward four-way merger min-max algorithm

Input: $L_n(a) = L_{nm}(a),\ n \in N(m);\ 1 \le n \le d_c$

Forward:
  $F_1(a) = L_1(a)$;
  For ($i = 2$; $i < d_c - 1$; $i = i + 2$)
    PC: $F_i(a) = \min_{a', a'' \in GF(q):\ h_{i-1}a' + h_i a'' = h_i a} \big(\max(F_{i-1}(a'), L_i(a''))\big)$;
    SC: $F_{i+1}(a) = \min_{a', a'' \in GF(q):\ h_i a' + h_{i+1} a'' = h_{i+1} a} \big(\max(F_i(a'), L_{i+1}(a''))\big)$;

Backward:
  $B_{d_c}(a) = L_{d_c}(a)$;
  For ($j = d_c - 1$; $j > 2$; $j = j - 2$)
    PC: $B_j(a) = \min_{a', a'' \in GF(q):\ h_{j+1}a' + h_j a'' = h_j a} \big(\max(B_{j+1}(a'), L_j(a''))\big)$;
    SC: $B_{j-1}(a) = \min_{a', a'' \in GF(q):\ h_j a' + h_{j-1} a'' = h_{j-1} a} \big(\max(B_j(a'), L_{j-1}(a''))\big)$;

Merger:
  $M_1(a) = B_2(a)$; $M_{d_c}(a) = F_{d_c-1}(a)$; $\beta = \lfloor d_c/2 \rfloor$;

  Two-way left merger:
  For ($l = \beta$; $l \ge 2$; $l = l - 2$)
    PC: $M_l(a) = \min_{a', a'' \in GF(q):\ a' + a'' = a} \big(\max(B_{l+1}(a'), F_{l-1}(a''))\big)$;
    PC: $M_{l-1}(a) = \min_{a', a'' \in GF(q):\ a' + a'' = a} \big(\max(B_l(a'), F_{l-2}(a''))\big)$;

  Two-way right merger:
  If $d_c$ is even:
    For ($k = \beta + 1$; $k \le d_c - 1$; $k = k + 2$)
      PC: $M_k(a) = \min_{a', a'' \in GF(q):\ a' + a'' = a} \big(\max(B_{k+1}(a'), F_{k-1}(a''))\big)$;
      PC: $M_{k-1}(a) = \min_{a', a'' \in GF(q):\ a' + a'' = a} \big(\max(B_k(a'), F_{k-2}(a''))\big)$;
  If $d_c$ is odd:
    PC: $M_{\beta+1}(a) = \min_{a', a'' \in GF(q):\ a' + a'' = a} \big(\max(B_{\beta+2}(a'), F_{\beta}(a''))\big)$;
    For ($k = \beta + 2$; $k \le d_c - 1$; $k = k + 2$)
      PC: $M_k(a) = \min_{a', a'' \in GF(q):\ a' + a'' = a} \big(\max(B_{k+1}(a'), F_{k-1}(a''))\big)$;
      PC: $M_{k-1}(a) = \min_{a', a'' \in GF(q):\ a' + a'' = a} \big(\max(B_k(a'), F_{k-2}(a''))\big)$;

Output: $R_{nm}(a) = M_n(a);\ 1 \le n \le d_c$

Fig. 4. Pairs a′–a″ satisfying the conditional equation Eq. (3) with $h' = \alpha^3$, $h'' = \alpha^6$, $h = \alpha^6$ for computing indexes a.

vectors generated in each step. Thus, four merger vectors are computed at the same time, and all merger vectors are calculated during the last $\lceil d_c/4 \rceil$ steps. In addition, the merger computations have been described for both odd and even dc values. In the case of an odd dc value, the forward-backward computations and the merger computations finish at the same time, and it takes a total of $\lceil d_c/2 \rceil$ steps to complete the CNU. For an even dc value, after finishing the forward-backward computations, one more step is required to complete the merger computations. Thus, it takes ($\lceil d_c/2 \rceil$ + 1) steps to complete the CNU. In general, the CNU takes ($\lfloor d_c/2 \rfloor$ + 1) steps for both odd and even dc values. Meanwhile, the CNU architectures in [15,21] require (dc – 1) steps to complete the check node processing. Thus, the proposed FB4M-MM algorithm reduces the latency of the CNU by approximately 50%, compared to [15,21].
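The data flow of the forward, backward, and merger computations can be illustrated with a minimal Python sketch that computes all dc merger vectors of one check node and compares them with an exhaustive search. It is an illustration under stated assumptions, not the authors' implementation: the nonzero entries h are assumed to be already absorbed by the permute modules, so every ECS uses the constraint a′ + a″ = a; GF(2^p) symbols are represented as integers so that field addition is bitwise XOR; the ECSs run sequentially rather than in the two-per-step hardware schedule; and the names `ecs`, `check_node`, and `brute_force` are hypothetical.

```python
import itertools
import random


def ecs(x, y):
    """Elementary computation step (assumed form):
    out[a] = min over a1 ^ a2 == a of max(x[a1], y[a2])."""
    q = len(x)
    return [min(max(x[a1], y[a ^ a1]) for a1 in range(q)) for a in range(q)]


def check_node(L):
    """Return the d_c merger (C2V) vectors for V2C vectors L (0-based list)."""
    dc = len(L)
    # Forward recursion: F[i] combines L[0..i].
    F = [L[0]]
    for i in range(1, dc - 1):
        F.append(ecs(F[-1], L[i]))
    # Backward recursion: B[j] combines L[j..dc-1].
    B = [None] * dc
    B[dc - 1] = L[dc - 1]
    for j in range(dc - 2, 0, -1):
        B[j] = ecs(B[j + 1], L[j])
    # Merger: the two end vectors are copies, the rest need one ECS each.
    middle = [ecs(B[n + 1], F[n - 1]) for n in range(1, dc - 1)]
    return [B[1]] + middle + [F[dc - 2]]


def brute_force(L):
    """Exhaustive min-max over all symbol combinations of the other edges."""
    dc, q = len(L), len(L[0])
    out = [[float("inf")] * q for _ in range(dc)]
    for n in range(dc):
        others = [L[k] for k in range(dc) if k != n]
        for combo in itertools.product(range(q), repeat=dc - 1):
            a = 0
            for s in combo:
                a ^= s  # parity constraint in characteristic 2
            cost = max(v[s] for v, s in zip(others, combo))
            out[n][a] = min(out[n][a], cost)
    return out


if __name__ == "__main__":
    random.seed(0)
    msgs = [[random.randint(0, 15) for _ in range(4)] for _ in range(5)]
    print(check_node(msgs) == brute_force(msgs))
```

The sketch needs (dc – 2) ECSs in each recursion and (dc – 2) merger ECSs; the proposed schedule does not change this count but overlaps the ECSs so that fewer sequential steps are required.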

Fig. 5. (a) Proposed parallel switch network, (b) proposed P-ECU architecture for GF(8).

Fig. 6 demonstrates the timing diagram for the forward-backward four-way merger min-max algorithm with the check node degree dc = 27. In this figure, vectors F1 and B27 are the first forward vector and the first backward vector, respectively. Forward vectors from F2 to F26 are produced during the forward recursion, and backward vectors from B26 to B2 are produced during the backward recursion. Moreover, the forward and backward vectors are simultaneously computed, and two consecutive forward vectors $F_{i,i+1}$ (i = 2; i < 26; i = i + 2) as well as two consecutive backward vectors $B_{j,j-1}$ (j = 26; j > 2; j = j – 2) are calculated in parallel at each step. When forward vectors F12, F13 and backward vectors B15, B16 are completely calculated in step 7, the merger computations are started. Fig. 7 shows that the proposed FB4M-MM algorithm greatly reduces the latency in the CNU, compared to the forward-backward two-way merger (FB2M) algorithm [15] and bidirectional recursion [21].
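The step-by-step schedule described for dc = 27 can be reproduced with a short Python sketch. It only lists which forward and backward vectors become available in each step, following the prose description (F1 and B27 in the first step, then two forward and two backward vectors per step until F(dc–1) and B2 are reached). The function names `fb4m_schedule` and `cnu_steps` are hypothetical, and the sketch says nothing about cycle-level timing.

```python
def fb4m_schedule(dc):
    """List, per step, the 1-based indices of the forward (F) and backward (B)
    vectors produced by the forward-backward recursions."""
    steps = [{"F": [1], "B": [dc]}]  # step 1: copies of the V2C vectors
    i, j = 2, dc - 1
    while i <= dc - 1 or j >= 2:
        fwd = [n for n in (i, i + 1) if n <= dc - 1]  # PC then SC
        bwd = [n for n in (j, j - 1) if n >= 2]
        steps.append({"F": fwd, "B": bwd})
        i += 2
        j -= 2
    return steps


def cnu_steps(dc):
    """Total CNU steps according to the general formula floor(dc/2) + 1."""
    return dc // 2 + 1


if __name__ == "__main__":
    schedule = fb4m_schedule(27)
    for number, step in enumerate(schedule, start=1):
        print(number, step)
    print("forward-backward steps:", len(schedule), "CNU steps:", cnu_steps(27))
```

For dc = 27 the seventh entry contains F12, F13 and B16, B15, which matches the point at which the merger computations start in Fig. 6.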

4. Proposed check node unit architecture

4.1. Proposed parallel-serial elementary computation unit

To implement the ECS, several architectures have been introduced, such as the S-ECU [11,14] and the q-fold ECU [13], whose best latency in the forward-backward computations always remains fixed at (dc – 1) steps. As mentioned above, we propose a combination of P-ECU and S-ECU to calculate two ECSs at the same time, which reduces the latency from (dc – 1) to $\lceil d_c/2 \rceil$ steps for the forward-backward computations in the proposed FB4M-MM algorithm. Consequently, a novel parallel-serial ECU (PS-ECU) architecture, shown in Fig. 8, is designed to implement the forward-backward computations.

In the PS-ECU architecture, two consecutive output forward vectors of size q, $F_i(a)$ and $F_{i+1}(a)$, are computed at the same time. Forward vector $F_i(a)$ is generated from the previous forward vector $F_{i-1}(a')$ and the current V2C vector $L_i(a'')$ by the P-ECU module. In each cycle, one valid output value $F_i(\alpha^r)$ of output forward vector $F_i(a)$ is derived, and q valid output values of vector $F_i(a)$ are generated after q cycles. To create forward vector $F_{i+1}(a)$, the previous forward vector $F_i(a)$ and the current V2C vector $L_{i+1}(a'')$ are used as inputs for the S-ECU module. In each cycle, one valid output value $F_i(\alpha^r)$ is fed directly to the S-ECU module, and q valid output values of vector $F_{i+1}(a)$ are simultaneously generated after q cycles. The PS-ECU architecture for the backward computations is the same as that for the forward computations. Consequently, the proposed PS-ECU provides two output vectors for the forward-backward computations in q + 1 cycles, whereas the S-ECU [11,14], the q-fold ECU [13], and the proposed P-ECU can each provide only one output vector for the forward-backward computations in q cycles.

To reduce the critical path delay, the PS-ECU is pipelined between the two internal ECUs. The critical path delay of the PS-ECU for GF(8) is estimated to be $T_{SN} + T_{Max} + 3T_{Min}$, where $T_{SN}$, $T_{Max}$, and $T_{Min}$ are the processing times of the switch network block, max block, and min block, respectively, as shown in Fig. 5.

4.2. Proposed check node unit architecture

Fig. 9 shows the proposed CNU architecture for the FB4M-MM algorithm. The proposed CNU architecture consists of one forward module, one backward module, and four merger modules. In each step, the proposed CNU architecture accepts four V2C input vectors, of which two V2C vectors $L_i(a)$ and $L_{i+1}(a)$ are delivered to the PS-ECU for the forward computation, and the other vectors $L_j(a)$ and $L_{j-1}(a)$ are delivered to the PS-ECU for the backward computation. First, the forward-backward computations are executed during $\lceil d_c/2 \rceil$ steps to consecutively generate the forward vectors (F1, F2, ..., F26) and backward vectors (B27, B26, ..., B2).

Fig. 6. Timing diagram for computation of the forward-backward four-way merger min-max algorithm with dc = 27.

Fig. 7. Timing diagram for check node processing of different forward-backward schemes, (a) proposed forward-backward four-way merger min-max algorithm, (b) bidirectional recursive forward-backward algorithm [15,21], (c) forward-backward algorithm [14].


The computation results of the first seven steps, namely forward vectors (F1, F2, ..., F13) and backward vectors (B27, B26, ..., B15), are stored in a forward memory (FMEM) and a backward memory (BMEM), respectively. For this process, the multiplexer control signal is kept at logical high. The control signal is then returned to logical low during the last seven steps, such that the remaining outputs, namely forward vectors (F14, F15, ..., F26) and backward vectors (B14, B13, ..., B2), are directly given to the right and left merger modules, respectively. The other input vectors of the right and left merger modules come from the BMEM and the FMEM, respectively. During the last seven steps, four P-ECUs operate simultaneously to generate four merger vectors, $M_l(a)$, $M_{l-1}(a)$, $M_k(a)$, and $M_{k-1}(a)$, in each step.

Two merger vectors $M_l(a)$ and $M_{l-1}(a)$ are computed by two P-ECUs in the left merger, and the other vectors $M_k(a)$ and $M_{k-1}(a)$ are computed by two P-ECUs in the right merger. The output vectors of the left merger modules (M13, M12, ..., M1) and the right merger modules (M14, M15, ..., M27) are the C2V vectors, which are passed to the variable nodes for processing. Because only 13 forward and 13 backward vectors are stored in the memories, the memory depth of the FMEM and the BMEM is 13, and the memory width is the size of one vector of q LLR values.

5. Proposed partial-parallel block-layered decoder architecture

The proposed partial-parallel block-layered decoder, which exploits the FB4M-MM algorithm for the (837, 726) QC-NB-LDPC code over GF(32), is shown in Fig. 10. The proposed decoder architecture includes 31 CNUs corresponding to the 31 check nodes in one layer, which are executed in parallel. The channel messages or the variable node messages after each layer decoding are stored in variable node memories (VN-MEMs). Let (q − 1) columns of the H matrix be one block column. Then, (q − 1) variable node vectors in one block column are stored in one VN-MEM, and a total of dc VN-MEMs are required to store the channel messages or the variable node messages after each layer decoding. Assume that LLR values in the variable node processing are quantized to w bits. Each VN-MEM has a size of (q − 1) × q × w = 31 × 32 × 5 = 4960 bits, where w = 5 bits. For the C2V messages, since a layered decoder with dv layers is implemented, all the C2V messages in dv layers need to be stored. Therefore, dc check node memories (CN-MEMs) are required to store the C2V messages. Let wc be the number of quantization bits of the C2V messages. For wc = 3 bits, each CN-MEM has a depth of dv and a width of (q − 1) × q × wc = 31 × 32 × 3 = 2976 bits.
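The memory sizes quoted above follow directly from q, dc, dv, and the quantization widths. The short Python sketch below reproduces the arithmetic; the function names are hypothetical, and the value dv = 4 is an assumption taken from the (837, 726) implementation with four layers.

```python
def vn_mem_bits(q, w):
    """Bits in one VN-MEM: (q - 1) vectors of q LLRs, w bits each."""
    return (q - 1) * q * w


def cn_mem_width_bits(q, wc):
    """Width in bits of one CN-MEM word: (q - 1) vectors of q LLRs, wc bits each."""
    return (q - 1) * q * wc


def total_vn_bits(dc, q, w):
    """All dc VN-MEMs together."""
    return dc * vn_mem_bits(q, w)


def total_cn_bits(dc, dv, q, wc):
    """All dc CN-MEMs together, each with depth dv."""
    return dc * dv * cn_mem_width_bits(q, wc)


if __name__ == "__main__":
    q, dc, dv = 32, 27, 4  # dv = 4 is assumed for the (837, 726) code
    print(vn_mem_bits(q, 5))           # bits per VN-MEM
    print(cn_mem_width_bits(q, 3))     # width of one CN-MEM word
    print(total_vn_bits(dc, q, 5))     # all VN-MEMs
    print(total_cn_bits(dc, dv, q, 3)) # all CN-MEMs
```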

Fig. 8. Proposed PS-ECU architecture for forward and backward computation.

Fig. 9. Proposed CNU architecture for forward-backward four-way merger min-max algorithm.

As described in Section 4, four V2C input vectors are given to the CNU in each step, and during the last seven steps four C2V vectors are simultaneously generated in each step. From this observation, we propose a partial-parallel VNU architecture with a factor of four, in which the number of modules for the variable node processing, such as addition, subtraction, and normalization, is reduced from dc modules [14,15,21] to four modules, as shown in Fig. 10. As a result, this architecture achieves very low hardware complexity, compared to previous works [14,15,21]. A variable node scheduler (VN-scheduler) and a check node scheduler (CN-scheduler) are responsible for scheduling the input and output of the VN-MEMs and CN-MEMs in each step, respectively. With dc = 27, the scheduling for the outputs of the VN-MEMs is presented in Fig. 11. In the first step, two VN-MEMs, VN-MEM 1 and VN-MEM 27, are scheduled for the first forward and backward computations (F1, B27) in 31 CNUs. From the second step, four VN-MEMs are scheduled to compute the corresponding forward and backward vectors in each step for 31 CNUs. The CN-scheduler schedules the outputs of the CN-MEMs in a similar way.

After finishing the subtractions, normalization is performed to ensure that the smallest value in each V2C vector is zero before the check node processing. From the eighth step in the case of dc = 27, the C2V vectors generated in each step are immediately stored in the appropriate CN-MEMs. In addition, these C2V vectors are added to the V2C vectors stored in the temporary memory (implemented as registers) to perform the variable node updates. Finally, the updated variable node messages are stored as a posteriori messages in the appropriate VN-MEMs. The scheduling for the inputs of the VN-MEMs and CN-MEMs starts from the eighth step in the case of dc = 27, when the C2V messages and the updated variable node messages are available. Furthermore, the permute modules are necessary to route the a posteriori messages to the related CNUs based on the nonzero entries of matrix H. Generally, the a posteriori messages after each decoding layer need a reverse permutation to undo the permutation. In our work, the layered decoding scheme is applied, in which the a posteriori messages in the l-th layer are used to generate the V2C messages of the same columns of H in the (l + 1)-th layer. Thus, the reverse permutation of the l-th layer can be combined with the permutation of the (l + 1)-th layer, as described in [12]. For example, let the nonzero entries in one column of H in the l-th and (l + 1)-th layers be $h_l$ and $h_{l+1}$, respectively. Then, the constant value used for the permutation of that column in the (l + 1)-th layer is $h_l/h_{l+1}$. Hence, there are no reverse permutation blocks in Fig. 10.
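The merging of the reverse permutation of one layer with the permutation of the next layer can be checked numerically. The Python sketch below is an illustration under several assumptions that are not taken from the paper: it uses GF(8) with the primitive polynomial x^3 + x + 1, it models a permutation by h as moving the entry of symbol a to position h·a (so the output is gathered from index b/h), and all function names are hypothetical. Under this gather convention the single combined constant is h_l/h_{l+1}.

```python
def build_gf(p=3, prim_poly=0b1011):
    """Exp/log tables for GF(2^p); prim_poly = x^3 + x + 1 is an assumed choice."""
    q = 1 << p
    exp, log = [0] * (q - 1), [0] * q
    x = 1
    for k in range(q - 1):
        exp[k] = x
        log[x] = k
        x <<= 1
        if x & q:
            x ^= prim_poly
    return q, exp, log


Q, EXP, LOG = build_gf()


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % (Q - 1)]


def gf_div(a, b):
    """Division by a nonzero element b."""
    if a == 0:
        return 0
    return EXP[(LOG[a] - LOG[b]) % (Q - 1)]


def permute(v, h):
    """Forward permutation: the entry of symbol a moves to position h*a."""
    return [v[gf_div(b, h)] for b in range(Q)]


def reverse_permute(u, h):
    """Inverse of permute(., h)."""
    return [u[gf_mul(h, a)] for a in range(Q)]


def combined_permute(u, h_l, h_next):
    """Reverse permutation of layer l and permutation of layer l+1 in one pass,
    using the single constant h_l / h_next."""
    c = gf_div(h_l, h_next)
    return [u[gf_mul(c, b)] for b in range(Q)]


if __name__ == "__main__":
    u = [10, 11, 12, 13, 14, 15, 16, 17]
    two_pass = permute(reverse_permute(u, 3), 6)
    print(two_pass == combined_permute(u, 3, 6))
```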

Fig. 10. Proposed partial-parallel block-layered (837, 726) QC-NB-LDPC decoder architecture.

Fig. 11. Scheduling for the outputs of VN-scheduler with dc = 27.

The quantization scheme must be chosen before the hardware architecture is established, because the bit size affects both the decoding performance and the hardware performance, such as area efficiency, throughput rate, and power consumption. In other words, a small bit size makes the hardware implementation more efficient. However, the error-correcting performance, measured by the bit error rate (BER) and frame error rate (FER), is worse with a smaller bit size. Consequently, the quantization schemes are investigated to find the trade-off between the error-correcting performance and the hardware complexity. Fig. 12 shows the FER performance of the (837, 726) QC-NB-LDPC code over an additive white Gaussian noise (AWGN) channel with binary phase shift keying (BPSK) modulation. In this work, the decoding performance of the floating-point scheme is simulated by CUDA programming on an NVIDIA GTX TITAN graphics processing unit (GPU) [24]. General-purpose GPUs [24,25] can perform floating-point arithmetic operations and accelerate the decoding process by parallel computations in both the check node processing and the variable node processing. Therefore, the simulation on the GPU runs faster and can measure the FER accurately down to 10^−7, whereas previous works [17,21] reached an FER of approximately 10^−5 using C programming on a central processing unit (CPU).

For the quantization schemes, the simulations are implemented by C programming on the CPU. First, the same 5-bit quantization is used for the channel and a posteriori messages, the V2C messages, and the C2V messages. The results show that the decoding performance using 5-bit quantization decreases by approximately 0.1 dB, which is negligible, compared to the floating-point scheme. However, we observe that the C2V messages in each check node take the smallest values among the V2C messages. According to this observation, we can reduce the bit size of the C2V messages. Therefore, a quantization scheme in which each LLR value of the channel and a posteriori messages, the V2C messages, and the C2V messages is quantized with 5 bits, 5 bits, and 3 bits, respectively, is implemented. The simulation results show that this quantization scheme is the best trade-off between error-correcting performance and hardware complexity. Fig. 12 demonstrates that the decoding performance of this scheme is similar to that of the 5-bit quantization scheme for each LLR value in all types of messages. Reducing the bit size of the C2V messages reduces the number of bits stored in the C2V memory as well as the area of the decoder. The TMM algorithm [17] has been proposed to simplify the hardware implementation. However, compared to the forward-backward min-max algorithm, it degrades the error-correcting performance because only nm < q reliable messages are kept and only the 1.5nm most reliable V2C messages are considered for the check node processing. Moreover, a larger number of clock cycles is required for the check node processing, which limits the throughput of the TMM algorithm [17]. The FER performance with floating-point simulation of the max-log QSPA [21] at 5 iterations is almost the same as that of the FB4M-MM algorithm at 10 iterations. In [21], a quantization scheme of 7 bits for the channel and a posteriori messages, 5 bits for the V2C messages, and 4 bits for the C2V messages is chosen for the hardware implementation. The fixed-point simulation results show that the proposed decoder, which has lower hardware complexity than the max-log QSPA decoder, suffers only a very small degradation in FER performance.
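The interplay between subtraction, normalization, and the 5b5b3b quantization can be sketched in a few lines of Python. This is only an illustration under assumptions that the text does not specify: LLRs are modeled as unsigned integers where a smaller value means a more reliable symbol, quantization is modeled as saturation at 2^bits − 1, and the function names are hypothetical.

```python
def normalize(vec):
    """Shift a vector so that its smallest entry becomes zero."""
    smallest = min(vec)
    return [x - smallest for x in vec]


def saturate(vec, bits):
    """Clip unsigned values to the range representable with `bits` bits (assumed model)."""
    top = (1 << bits) - 1
    return [min(x, top) for x in vec]


def v2c_update(app, c2v_old, bits=5):
    """V2C vector: a posteriori minus old C2V, then normalization and clipping."""
    diff = [a - c for a, c in zip(app, c2v_old)]
    return saturate(normalize(diff), bits)


def app_update(v2c, c2v_new, bits=5):
    """A posteriori vector: V2C plus new C2V, clipped to the storage width."""
    return saturate([a + c for a, c in zip(v2c, c2v_new)], bits)


if __name__ == "__main__":
    app = [7, 9, 31, 12]
    c2v_old = [2, 1, 0, 7]
    v2c = v2c_update(app, c2v_old)      # 5-bit V2C messages
    c2v_new = saturate([0, 3, 9, 7], 3) # 3-bit C2V messages
    print(v2c, c2v_new, app_update(v2c, c2v_new))
```

With 3-bit C2V storage, any C2V value above 7 is clipped in this model, which is the effect whose impact on the FER the fixed-point simulations evaluate.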

6. Results and comparisons

6.1. Implementation results

The proposed partial-parallel block-layered (837, 726) QC-NB-LDPC decoder architecture using the forward-backward four-way merger min-max algorithm was modeled in Verilog HDL and simulated to verify its functionality using a test pattern generated from a C simulator. After functional verification was completed, the design was synthesized with appropriate timing and area constraints. Both the simulation and synthesis steps were performed using the Synopsys design tools and the TSMC 90-nm CMOS standard cell library.

The proposed parallel switch network is designed to facilitate the P-ECU architecture. This switch network includes two barrel shifters and one fixed-interconnection network, which contribute a small portion of the area. The proposed partial-parallel scheme in the variable node processing was applied to significantly reduce the overall area of the decoder. Moreover, block-layered decoding was used to improve the speed of convergence and reduce the memory requirements. The quantization scheme with 5 bits for the channel and a posteriori messages, 5 bits for the V2C messages, and 3 bits for the C2V messages is chosen.

Fig. 12. FER performance of (837, 726) QC-NB-LDPC code over GF(32). [Plot: FER (log scale, 10^−6 to 10^0) versus Eb/No (3.7 to 4.5 dB). Curves: FB-MM, 15it-fp; FB-MM, 10it-fp; TMM, 15it-fp; Max-log QSPA, 5it-fp; FB-MM, 10it-5b; FB-MM, 10it-5b5b3b; Max-log QSPA, 5it-7b5b4b.]

Table 1
Implementation results of the proposed (837, 726) QC-NB-LDPC decoder architecture.

| Parameter | Value |
|---|---|
| Algorithm | Forward-backward four-way merger min-max |
| Scheduling | Layered |
| Code length | 837 |
| Quantization bits | 5b5b3b |
| Process (nm) | 90 |
| Frequency (MHz) | 370 |
| Iterations | 10 |
| Total clock cycles | 16720 |
| Throughput (Mbps) | 92.6 |
| Gate count: 31 CNUs (each CNU) | 1658 K (53.48 K) |
| Gate count: VN-MEM | 201 K |
| Gate count: CN-MEM | 482 K |
| Gate count: FMEM | 97 K |
| Gate count: BMEM | 97 K |
| Gate count: VNU | 202 K |
| Gate count: Controller | 5 K |
| Total gate count | 2.74 M |

The synthesis results of the proposed decoder architecture for the (837, 726) QC-NB-LDPC code are shown in Table 1, and the comparison with other works is given in Table 2. The total equivalent gate count of the proposed NB-LDPC decoder architecture is almost 2.74 M gates (2-input NAND gate equivalents), which includes the estimated area of 584.288 Kbits of memory. In the proposed design, each bit of RAM is implemented as a D flip-flop. For a fair comparison, following [21,26], each memory bit is counted as 1.5 NAND2 gates (i.e., 6 transistors for an SRAM cell versus 4 transistors for a NAND2 gate). The CNU architectures require the FMEM and BMEM memories to store the forward and backward messages. The size of the FMEM memory for 31 CNUs in the case of dc = 27 is 160 × 13 × 31 = 64,480 bits. Therefore, a total of 64,480 bits × 2 = 128,960 bits are used for both the FMEM and BMEM memories. The total number of memory bits of the 27 VN-MEMs is 160 × 31 × 27 = 133,920 bits. Similarly, the bit size of the 27 CN-MEMs is 96 × 31 × 27 = 80,352 bits. Because the layered scheme with four layers is implemented in this work, the total number of memory bits in the CN-MEMs for four layers is 80,352 × 4 = 321,408 bits. As a result, the total number of memory bits for the proposed (837, 726) NB-LDPC decoder architecture is 128,960 + 133,920 + 321,408 = 584,288 bits. The equivalent gate count of the memories has been estimated by using a 90-nm CMOS library. The results of the pre-layout simulation show that the proposed NB-LDPC decoder architecture can operate at a clock rate of 370 MHz and a throughput of 92.6 Mbps at 10 iterations. The throughput of the proposed decoder can be calculated as follows:

$$\text{Throughput} = \frac{d_c \times (q - 1) \times \log_2 q \times f_{clk}}{N_{cycles} \times d_v \times N_{itr}} \qquad (9)$$

where Ncycles and Nitr are the number of cycles per layer and the number of iterations of the decoder, respectively. In this work, the check node processing requires 14 steps to execute a single layer. The first step takes only one cycle, whereas the other steps take 32 cycles, and one extra cycle is added because of the pipeline between the P-ECU and the S-ECU. Thus, the number of cycles per layer is Ncycles = 1 + 13 × 32 + 1 = 418 cycles. Because our decoder has four layers (dv = 4) per iteration, the total number of cycles per iteration is 418 × 4 = 1,672 cycles. It is noted that the modules in the VNU and the permute modules are combinational circuits, so they consume no clock cycles. In addition, the efficiency of the decoder is calculated as the throughput-to-gate-count ratio (Mbps/M gates). The synthesis results for the (744, 653) QC-NB-LDPC code are obtained in the same way, as shown in Table 2. When dc increases, the complexity of the computational modules in the decoder is unchanged, but both the complexity of the storage modules (VN-MEM and CN-MEM) and the number of clock cycles Ncycles increase. Consequently, it can be seen in Table 2 that the efficiency of the (744, 653) QC-NB-LDPC decoder is higher than that of the (837, 726) QC-NB-LDPC decoder.
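Eq. (9) and the cycle count can be evaluated with a few lines of Python. The sketch below uses hypothetical function names; the cycle model (one cycle for the first step, q cycles for each remaining step, plus one pipeline cycle) follows the description above. For the (744, 653) code, dc = 24 follows from 744/31, while dv = 3 is an assumption inferred from the total of 11580 clock cycles listed in Table 2 rather than a value stated in the text.

```python
def cycles_per_layer(dc, q):
    """N_cycles: 1 cycle for step 1, q cycles for each other step, 1 pipeline cycle."""
    steps = dc // 2 + 1
    return 1 + (steps - 1) * q + 1


def throughput_mbps(dc, q, f_clk_mhz, dv, n_itr):
    """Eq. (9) with f_clk in MHz, giving the throughput in Mbps."""
    bits_per_symbol = q.bit_length() - 1  # log2(q) for q a power of two
    coded_bits = dc * (q - 1) * bits_per_symbol
    return coded_bits * f_clk_mhz / (cycles_per_layer(dc, q) * dv * n_itr)


if __name__ == "__main__":
    # (837, 726) code: dc = 27, dv = 4, 10 iterations at 370 MHz
    print(cycles_per_layer(27, 32), round(throughput_mbps(27, 32, 370, 4, 10), 2))
    # (744, 653) code: dc = 24; dv = 3 is an inferred assumption
    print(cycles_per_layer(24, 32), round(throughput_mbps(24, 32, 370, 3, 10), 2))
```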

6.2. Comparison with other related works

In [14], the S-ECU architecture and the corresponding switch network using a barrel-shifter-based permutator for the selective-input decoder are used. In our decoder, by contrast, the P-ECU architecture and the corresponding parallel switch network are proposed to process two input vectors in parallel. Although Ueng et al. [14] used a selective-input decoder to reduce the ECS time to nm cycles and accepted a slight decline in the FER performance, the total number of ECSs per single layer is still kept at 3 × (dc – 2) steps, which corresponds to (3 × (dc – 2) × nm + e) cycles. In our work, a forward-backward four-way merger algorithm and a PS-ECU architecture are proposed to reduce the number of ECSs per single layer to ($\lfloor d_c/2 \rfloor$ + 1) steps, which is equivalent to ($\lfloor d_c/2 \rfloor \times q$ + 1 + 1) cycles. For example, with dc = 27 and nm = 16, the decoder in [14] requires 3 × 25 = 75 ECSs and 1200 cycles per layer (excluding e), whereas the proposed decoder requires only 14 ECSs and 418 cycles. The proposed algorithm thus reduces the number of ECSs and the number of decoding cycles by 81.3% and 65%, respectively. Table 2 shows that the throughput rises from 29 Mbps in [14] to 92.6 Mbps, i.e., the decoder in [14] delivers 68.68% less throughput than the proposed decoder. However, the CNU architecture requires eight ECUs for computing the forward, backward, and merger messages. As a result, the proposed decoder architecture using the FB4M-MM algorithm can significantly increase the decoding throughput at the cost of increased CNU complexity. Fortunately, a partial-parallel decoder architecture is proposed in this work, which reduces the number of processed blocks from dc to four in the variable node processing. Consequently, the overall area of the proposed decoder even decreases by 16.5%, and the efficiency is approximately four times that of [14].
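The percentages quoted in this comparison can be reproduced from the step formulas and the Table 2 entries. The Python sketch below uses hypothetical helper names, takes the throughput, gate-count, and efficiency figures directly from Table 2, and neglects the overhead term e of [14] in the cycle count.

```python
def reduction_pct(reference, proposed):
    """Relative reduction of `proposed` with respect to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference


dc, q, nm = 27, 32, 16

steps_ref = 3 * (dc - 2)           # ECSs per layer in [14]
steps_new = dc // 2 + 1            # ECSs per layer, proposed
cycles_ref = steps_ref * nm        # cycles per layer in [14], e neglected
cycles_new = (dc // 2) * q + 1 + 1 # cycles per layer, proposed

if __name__ == "__main__":
    print(round(reduction_pct(steps_ref, steps_new), 1))    # ECS reduction
    print(round(reduction_pct(cycles_ref, cycles_new), 1))  # cycle reduction
    print(round(reduction_pct(3.28, 2.74), 1))              # area, M gates (Table 2)
    print(round(92.6 / 29, 2))                              # throughput ratio (Table 2)
    print(round(33.79 / 8.84, 2))                           # efficiency ratio (Table 2)
```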

In [21], bidirectional recursion was proposed to reduce the number of ECSs per layer to (dc – 1) steps, and a selective-input decoder, which maintains only the nm < q most reliable symbols, was used. The max-log QSPA algorithm is well known to have the best error-correcting performance [5]. Therefore, the FER performance of the max-log QSPA at 5 iterations while keeping the nm < q most reliable symbols is almost the same as that of the FB4M-MM algorithm at 10 iterations. As a result, the decoder in [21] required fewer clock cycles for the decoding than the proposed decoder at the same FER performance, although the proposed decoder reduces the number of steps for the decoding by almost half. However, the hardware implementation of the decoder in [21] is highly complex for both the check node processing and the variable node processing, with filters for the nm < q most reliable symbols and compression and decompression modules. Thus, this decoder achieves high throughput at the cost of a larger area. Table 2 shows that the proposed decoder provides a significant improvement in efficiency over the decoder in [21] because of the partial-parallel block-layered decoder architecture exploited along with the simplified check node processing.

Table 2
Synthesized results and comparison with other NB-LDPC decoders.

| Design | Lin [27] | Ueng [21] | Zhang [17] | Chen [13] | Ueng [14] | Proposed | Proposed |
|---|---|---|---|---|---|---|---|
| Report | Synthesis | Post-layout | Synthesis | Synthesis | Synthesis | Synthesis | Synthesis |
| Algorithm | MM | Max-log QSPA | Trellis-MM | MM | MM | MM | MM |
| Schedule | MSS | Layered | Layered | Layered | Layered | Layered | Layered |
| Code size | (837, 726) | (837, 726) | (837, 726) | (837, 726) | (837, 726) | (837, 726) | (744, 653) |
| Process (nm) | 130 | 90 | N/A | 180 | 90 | 90 | 90 |
| Quantization | 5b | 7b5b4b | 5b | 5b | 7b5b | 5b5b3b | 5b5b3b |
| Frequency (MHz) | 500 | 250 | 150 | 200 | 260 | 370 | 370 |
| Iterations | 15 | 5 | 15 | 10 | 15 | 10 | 10 |
| Total clock cycles | 28215 | 4460 | 62240 | 53541 | 37500 | 16720 | 11580 |
| Throughput (Mbps) | 64.3 | 233.53 | 10 | 16 | 29 | 92.6 | 118.86 |
| FER (SNR = 4.3 dB) | N/A | 6.0E-04 | 9.8E-04 | N/A | 8.49E-04 | 1.0E-03 | 5.6E-04 |
| Gate count | 2.13 M | 8.51 M | 1.6 M | 1.37 M | 3.28 M | 2.74 M | 2.54 M |
| Efficiency (Mbps/M gates) | 30.2 | 27.44 | 6.25 | 4.67 | 8.84 | 33.79 | 46.79 |

Lin [27] carried out a modified shuffled scheduling (MSS) for the NB-LDPC decoder, whose error-correcting performance is similar to that of the flooding min-max algorithm. In our work, a layered decoder is applied, which reduces the number of decoding iterations by half compared to the MSS decoder at the same FER performance. The MSS decoder can reach an efficiency comparable to that of the proposed layered decoder. Compared to the (837, 726) QC-NB-LDPC decoders presented in [13,17], the proposed decoder achieves higher throughput and higher efficiency with similar error-correcting performance. Although the q-fold ECU [13] takes only one cycle for the ECS, the number of elementary steps required is higher, as explained previously. In addition, a non-layered decoding scheme, in which a group of g (g = 4) rows is processed in parallel, is implemented with three pipeline stages. Thus, the decoder in [13] needs many more cycles to finish one decoding iteration than the proposed decoder. Other approaches [28,29] have proposed algorithms that reduce the messages exchanged between the check nodes and the variable nodes. This decreases the wiring congestion and the required memory resources. However, these works introduce a non-negligible performance loss, which depends on the Galois field order and on the portion of the reliabilities kept, such as nm < q [29] and even nv < nm < q [28]. Moreover, the forward-backward scheme in [29] requires (dc – 1) steps for the forward-backward computations. After finishing the forward-backward computations, dc merger vectors are calculated by dc ECU modules in parallel. As a result, the latency of the proposed decoder architecture is almost half that of the conventional decoder architecture [28]. In addition, when dc increases, the area of the conventional decoder [28] increases significantly, because dc ECU modules are required for the merger computations instead of only 4 ECU modules in the proposed decoder. Therefore, the forward-backward scheme [29] is suitable for NB-LDPC codes with a small dc value, whereas the proposed FB4M-MM scheme offers significant improvements for small dc values and an even greater advantage for large dc values.

7. Conclusions

In this paper, a forward-backward four-way merger min-max algorithm is proposed to reduce the number of ECS steps to ($\lfloor d_c/2 \rfloor$ + 1) steps and improve the throughput rate for decoding. A parallel switch network and a PS-ECU architecture are designed to implement the forward-backward computations. Moreover, the partial-parallel block-layered decoder architecture using the proposed FB4M-MM algorithm is introduced to achieve a high efficiency in terms of throughput and hardware complexity. A layered scheme is applied to reduce the number of iterations for a given level of error-correcting performance. Consequently, the proposed decoder architecture has a high throughput while requiring low hardware complexity. Two QC-NB-LDPC decoder architectures, for the (837, 726) and (744, 653) codes, are implemented using the proposed algorithm. The synthesis results show that these two decoders achieve throughputs of 92.6 Mbps and 118.86 Mbps at 370 MHz, respectively, with error-correcting performance comparable to that of previous related works.

Acknowledgement

This research was supported by the Basic Science Research Program through the NRF funded by the Ministry of Science, ICT and Future Planning under Grant 2016R1A2B4015421.

References

[1] R.G. Gallager, Low-density parity-check codes, Inf. Theory IRE Trans. 8 (1962)21–28.

[2] S.-I. Hwang, H. Lee, Block-circulant RS-LDPC code: code construction and efficientdecoder design, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 21 (2013)1337–1341.

[3] S. Ajaz, H. Lee, An efficient radix-4 Quasi-cyclic shift network for QC-LDPCdecoders, IEICE Electron. Express 11 (2014) 1–6.

[4] S. Ajaz, H. Lee, Reduced-complexity local switch based multi-mode QC-LDPCdecoder architecture for Gbit wireless communication, Elect. Lett. 49 (2013)1246–1248.

[5] H.C. Davey, D.J. MacKay, Low density parity check codes over GF (q), in: IEEEInformation Theory Workshop, (1998), pp. 70-71.

[6] L. Barnault, D. Declercq, Fast decoding algorithm for LDPC over GF (2q),in: Proceedings IEEE Information Theory Workshop, (2003), pp. 70-73.

[7] H. Song, J. Cruz, Reduced-complexity decoding of Q-ary LDPC codes for magneticrecording, IEEE Trans. Magn. 39 (2003) 1081–1087.

[8] H. Wymeersch, H. Steendam, M. Moeneclaey, Log-domain decoding of LDPC codesover GF (q), in: IEEE International Conference on Comm, pp. 772–776, 2004.

[9] D. Declercq, M. Fossorier, Decoding algorithms for nonbinary LDPC codes over GF,IEEE Trans. Comm. 55 (2007) 633–643.

[10] V. Savin, Min-Max decoding for non binary LDPC codes, in: IEEE InternationalSymposium on Information Theory (ISIT), pp. 960–964, 2008.

[11] J. Lin, J. Sha, Z. Wang, L. Li, Efficient decoder design for nonbinary quasicyclicLDPC codes, IEEE Trans. Circuits Syst. I: Regul. Pap. 57 (2010) 1071–1082.

[12] X. Zhang, F. Cai, Efficient partial-parallel decoder architecture for quasi-cyclicnonbinary LDPC codes, IEEE Trans. Circuits Syst. I: Regul. Pap. 58 (2011)402–414.

[13] X. Chen, S. Lin, V. Akella, Efficient configurable decoder architecture for nonbinaryQuasi-Cyclic LDPC codes, IEEE Trans. Circuits Syst. I: Regul. Pap. 59 (2012)188–197.

[14] Y.-L. Ueng, C.-Y. Leong, C.-J. Yang, C.-C. Cheng, K.-H. Liao, S.-W. Chen, Anefficient layered decoding architecture for nonbinary QC-LDPC codes, IEEE Trans.Circuits Syst. I: Regul. Pap. 59 (2012) 385–398.

[15] C.-S. Choi, H. Lee, Block-layered decoder architecture for quasi-cyclic nonbinaryLDPC codes, J. Signal Process. Syst. 78 (2015) 209–222.

[16] F. Cai, X. Zhang, Efficient check node processing architectures for non-binaryLDPC decoding using power representation, J. Signal Process. Syst. 76 (2014)211–222.

[17] X. Zhang, F. Cai, Reduced-complexity decoder architecture for non-binary LDPCcodes, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 19 (2011) 1229–1238.

[18] J. Lacruz, F. Garcia-Herrero, M. Canet, J. Valls, A. Perez-Pascual, A 630 Mbps non-binary LDPC decoder for FPGA, in: IEEE International Symposium on Circuits andSystems (ISCAS), pp. 1989–1992, 2015.

[19] L. Zhou, J. Sha, Y. Chen, C. Zhang, Z. Wang, Efficient symbol reliability baseddecoding for QCNB-LDPC codes, in: IEEE International Symposium on Circuitsand Systems (ISCAS), pp. 405–408, 2014.

[20] C.-.W. Yang, X.-.R. Lee, C.-.L. Chen, H.-.C. Chang, C.-.Y. Lee, Area-efficient TFM-based stochastic decoder design for non-binary LDPC codes, in: IEEE InternationalSymposium on Circuits and Systems (ISCAS), pp. 409–412, 2014.

[21] Y.-L. Ueng, K.-H. Liao, H.-C. Chou, C.-J. Yang, A high-throughput trellis-basedlayered decoding architecture for non-binary LDPC codes using max-log-QSPA,IEEE Trans. Signal Process. 61 (2013) 2940–2951.

[22] C. Poulliat, M. Fossorier, D. Declercq, Design of regular (2, dc)-LDPC codes overGF(q) using their binary images, IEEE Trans. Comm. 56 (2008) 1626–1635.

[23] B. Zhou, L. Zhang, J. Kang, Q. Huang, S. Lin, K. Abdel-Ghaffar, Array dispersions ofmatrices and constructions of quasi-cyclic LDPC codes over non-binary fields, in:IEEE International Symposium on Information Theory (ISIT), pp. 1158–1162,2008.

[24] H.P. Thi, S. Ajaz, H. Lee, Efficient Min-Max nonbinary LDPC decoding on GPU, in:IEEE SoC Design Conference (ISOCC), pp. 266–267, 2014.

[25] B. Le Gal, C. Jego, J. Crenne, A high throughput efficient approach for decodingLDPC codes onto GPU devices, IEEE Embed. Syst. Lett. 6 (2014) 29–32.

[26] X. Chen, C.-L. Wang, High-throughput efficient non-binary LDPC decoder based onthe simplified min-sum algorithm, IEEE Trans. Circuits Syst. I: Regul. Pap. 59(2012) 2784–2794.

[27] J. Lin, Z. Yan, Efficient shuffled decoder architecture for nonbinary quasi-cyclicLDPC codes, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 21 (2013)1756–1761.

[28] J. Lin, Z. Yan, An efficient fully parallel decoder architecture for nonbinary LDPCcodes, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 22 (2014) 2649–2660.

[29] Y.S. Park, Y. Tao, Z. Zhang, A fully parallel nonbinary LDPC decoder with fine-grained dynamic clock gating, IEEE J. Solid-State Circuits 50 (2015) 464–475.

