

2488 IEICE TRANS. FUNDAMENTALS, VOL.E97–A, NO.12 DECEMBER 2014

PAPER Special Section on VLSI Design and CAD Algorithms

Fast SAO Estimation Algorithm and Its Implementation for 8 K × 4 K @ 120 FPS HEVC Encoding

Jiayi ZHU†a), Nonmember, Dajiang ZHOU†, Member, Shinji KIMURA†, Senior Member, and Satoshi GOTO†, Fellow

SUMMARY High efficiency video coding (HEVC) is the new generation video compression standard. Sample adaptive offset (SAO) is a new compression tool adopted in HEVC which reduces the distortion between original samples and reconstructed samples. SAO estimation is the process of determining SAO parameters in video encoding. It is divided into two phases: statistic collection and parameters determination. There are two difficulties for VLSI implementation of SAO estimation. The first is that a huge number of samples must be processed in the statistic collection phase. The other is that the complexity of Rate Distortion Optimization (RDO) in the parameters determination phase is very high. In this article, a fast SAO estimation algorithm and its corresponding VLSI architecture are proposed. For the first difficulty, we use bitmaps to collect statistics of all 16 samples in one 4 × 4 block simultaneously. For the second difficulty, we simplify a series of complicated procedures in HM to balance the algorithm's complexity and BD-rate performance. Experimental results show that the proposed algorithm maintains the picture quality improvement. The VLSI design based on this algorithm can be implemented using 156.32 K gates and 8,832 bits of single-port RAM for the 8-bit depth case. It can be synthesized to 400 MHz @ 65 nm technology and is capable of 8 K × 4 K @ 120 fps encoding.
key words: high efficiency video coding, sample adaptive offset, Rate Distortion Optimization (RDO), VLSI architecture

1. Introduction

With the rapid development of video compression technology, the resolution and frame rate of popular video formats have increased quickly in the past twenty years. Ultra HDTV (Ultra High Definition Television) [1], a new video format conceptualized by the Japanese public broadcasting network, NHK, supports video throughput as high as 8 K × 4 K @ 120 FPS [2]. It is therefore significant to work on VLSI technology for 8 K × 4 K @ 120 FPS video coding.

High efficiency video coding (HEVC) [3] is a video compression format, a successor to H.264/MPEG-4 AVC, that was jointly developed by the ISO/IEC moving picture experts group and ITU-T video coding experts group as ISO/IEC 23008-2 MPEG-H Part 2 and ITU-T H.265 [4]. Sample Adaptive Offset (SAO) [5] is a new in-loop filtering technique that reduces the distortion between original samples and reconstructed samples in HEVC. The concept of SAO is to reduce mean sample distortion of a region by first classifying the region samples into multiple categories with a selected classifier, obtaining an offset for each category, and then adding the offset to each sample of the category, where the classifier index and the offsets of the region are coded in the bitstream.

Fig. 1 Four consecutive bands among all of the 32 bands.

Fig. 2 Four classes of neighboring samples.

Fig. 3 Four categories of edge offset.

Manuscript received March 13, 2014.
Manuscript revised June 30, 2014.
†The authors are with the Graduate School of Information, Production and Systems, Waseda Univ., Kitakyushu-shi, 808-0135 Japan.
a) E-mail: [email protected]
DOI: 10.1587/transfun.E97.A.2488

In practice, SAO parameters are Coding Tree Block (CTB) based. There are three types of SAO: band offset (BO), edge offset (EO) and SAO not applied (NA). If the SAO type is NA, no samples are offset in the SAO process. If the SAO type is band offset, as shown in Fig. 1, the sample value range is equally divided into 32 ranges and each range is called a band. Among the 32 bands, four consecutive bands are selected as four categories. Four different offsets are determined, one for each of these categories.

If the SAO type is edge offset, there are four classes (directions) of neighboring samples. As shown in Fig. 2, they are horizontal, vertical, 135° diagonal and 45° diagonal. As shown in Fig. 3, the relationship between each sample and its two neighboring samples is divided into four categories. Four different offsets are determined, one for each of these categories.
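The two classifiers described above can be sketched as follows. This is a minimal illustration, not code from HM: the function names are hypothetical, 8-bit samples are assumed by default, and the numbering of the edge categories (1 for a local minimum, 4 for a local maximum, 0 for "no category") follows the usual HEVC convention, which is an assumption here because Fig. 3 is not reproduced in the text.

```python
def band_index(sample: int, bit_depth: int = 8) -> int:
    """Band offset: the sample range is split into 32 equal bands,
    so the band index is given by the top 5 bits of the sample."""
    return sample >> (bit_depth - 5)


def edge_category(a: int, c: int, b: int) -> int:
    """Edge offset: category of the current sample c given its two
    neighbors a and b along the chosen direction.
    Assumed numbering: 1 = local minimum, 2 and 3 = edge corners,
    4 = local maximum, 0 = none of these (no offset applied)."""
    if c < a and c < b:
        return 1
    if (c < a and c == b) or (c == a and c < b):
        return 2
    if (c > a and c == b) or (c == a and c > b):
        return 3
    if c > a and c > b:
        return 4
    return 0
```

For an 8-bit sample each band covers eight values, so band 0 is 0–7 and band 1 is 8–15. A sample that is larger than both of its neighbors falls into category 4.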

SAO parameters include five kinds of syntax elements, listed as follows:

Copyright © 2014 The Institute of Electronics, Information and Communication Engineers


• (1) SAO type (EO, BO or NA),
• (2) Class (i.e. direction) of EO,
• (3) Start band position (sbp) for BO,
• (4) Offset values for the four categories,
• (5) Left merge, upper merge or no merge.

Parameter 5 determines whether the current CTB is in left merge mode, upper merge mode or no merge mode. As shown in Fig. 4, if the current CTB is in left merge mode or upper merge mode, only parameter 5, the 1-bit syntax element sao_left_merge_flag or sao_upper_merge_flag, is transmitted. In these cases, parameters 1–4 of the current CTB are copied from parameters 1–4 of the left or upper CTB and do not need to be encoded. Parameters 1–4 are encoded only when the current CTB is in no merge mode. Hence the number of bits of the SAO parameters in left or upper merge mode is much lower than in no merge mode.

Since SAO is a new encoding tool in the video coding standard, related research on SAO estimation is limited. Zhu [6], Park [7] and Mihir [8] worked on VLSI design for SAO decoding rather than encoding. Praveen [9] worked on an SAO encoding algorithm but not on a hardware architecture. No publication on hardware implementation of SAO encoding has been found so far, and only one software implementation, the HEVC reference model HM, is well known. HM is software that aims to implement as many encoding tools as possible, which makes it perform well in compression efficiency; hence it is a good reference for evaluating the effect of the proposed encoding algorithm.

The SAO estimation algorithm in HM (we follow version 12.0) has good BD-rate performance, but it is not easy to implement in VLSI. The SAO estimation algorithm in HM is divided into two phases. The first is statistic collection and the second is parameters determination. In the first phase, the difficulty for VLSI design is that there are so many samples to process for statistic collection. The algorithm in HM processes the samples one by one without considering throughput, which is unacceptable for VLSI implementation. In the second phase, the difficulty for VLSI design is that the RDO (Rate Distortion Optimization) used frequently in HM to determine the various SAO parameters is unsuitable for VLSI implementation.

Fig. 4 Current CTB SAO parameters and neighboring CTB SAO parameters merge.

In this article, we propose a fast encoding algorithm based on the HM algorithm, together with its VLSI architecture. For the first difficulty, bitmaps are used to collect statistics of the 16 samples in one 4 × 4 block simultaneously, which improves the throughput. For the second difficulty, a series of complicated procedures in the HM algorithm are simplified to achieve a better balance between BD-rate and complexity.

Experimental results show that the proposed algorithm maintains the picture quality improvement. The VLSI design based on this algorithm can be implemented with 156.32 K gates and 8,832 bits of single-port RAM, runs at 400 MHz @ 65 nm technology and is capable of 8 K × 4 K @ 120 fps encoding.

The rest of this article is organized as follows. Sections 2 and 3 introduce the details of the SAO estimation algorithm in HM 12.0 and our proposed improvements, respectively. Section 4 describes the VLSI architecture in detail. The experimental results and implementation results are presented in Sect. 5. Finally, Sect. 6 concludes this article.

2. SAO Estimation Algorithm

The SAO estimation algorithm in HM 12.0 is divided into a statistic collection phase and a parameters determination phase. They are described in the following two sub-sections.

2.1 Statistic Collection

As introduced in Sect. 1, there are three types of SAO. One is SAO not applied and the other two, edge offset (EO) and band offset (BO), are the effective SAO types. The division and classification of the two effective SAO types are shown in Fig. 5. There are four classes (directions) for EO (EO 0: horizontal; EO 1: vertical; EO 2: diagonal 135; EO 3: diagonal 45) and four categories for each EO class. There are 32 bands for BO. Refer to [5] for details. The 16 EO categories and 32 BO bands are collectively called the 48 classifications.

Fig. 5 Categories for EO and bands for BO.

Table 1 List of variables in statistic collection phase.

For each classification, a count (C) and a sum (S) are collected. C is the number of samples within one CTB that belong to the specified classification. S is the sum of the differences between the original samples and the reconstructed samples within one CTB that belong to the specified classification. S and C together are called an S & C pair. For 4:2:0 video, there are 48 S & C pairs each for luma, cb and cr, and thus 144 S & C pairs in all. These variables are listed in Table 1. The left two columns give the abbreviation (Abbr.) and name of one variable category, for example C (Count). The right column lists all the variable instances of that category. C_[y|cb|cr]_cX denotes the variable instances for the three components luma, cb and cr: C_y_cX, C_cb_cX and C_cr_cX. The range of X (0...47) covers the 48 classifications of each component, so there are 144 variable instances in all. Some examples are C_y_c3, C_cb_c47 and C_cr_c20. The variables in Table 2 to Table 5 follow the same convention.
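As a rough illustration of what an S & C pair holds, the sketch below accumulates the 32 band offset pairs of one component sample by sample, in the sequential style attributed to HM above. It is not HM code: the function name and the flat-list input layout are assumptions, and the band is chosen from the reconstructed sample with 8-bit depth assumed by default.

```python
def collect_bo_stats(org, rec, bit_depth=8):
    """Accumulate S (sum of org - rec) and C (sample count) for the
    32 band offset classifications of one component.
    org, rec: equally long sequences of original and reconstructed samples."""
    total = [0] * 32   # S per band
    count = [0] * 32   # C per band
    for o, r in zip(org, rec):
        band = r >> (bit_depth - 5)   # band of the reconstructed sample
        count[band] += 1
        total[band] += o - r
    return total, count


# Bookkeeping from the text: 16 EO + 32 BO classifications, three components.
PAIRS_PER_COMPONENT = 16 + 32
TOTAL_PAIRS = 3 * PAIRS_PER_COMPONENT
```

The edge offset pairs are accumulated in the same way, with the classification taken from the neighbor comparison instead of the band index.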

2.2 Parameters Determination

There are four procedures in the parameters determination phase. Procedure 1 determines the offset (O), distortion (D) and cost (CO) for each classification of the three components within one CTB. Procedure 2 determines the start band position (sbp) of luma, cb and cr for band offset. Procedure 3 determines the type of SAO and the class (direction) of edge offset for luma and chroma. Procedure 4 determines whether left merge mode, upper merge mode or no merge mode is adopted. The four procedures are explained as follows.

Fig. 6 Offset iteration.

Table 2 List of variables in procedure 1 of parameter determination phase.

Fig. 7 Start band position determination.

Table 3 List of variables in procedure 2 of parameter determination phase.

• P1: Offset (O), distortion (D) and cost (CO) determination. Iterations for Rate Distortion Optimization (RDO) are needed to obtain O. All the values between the rounded quotient S/C (clipped to −7∼7) and 0 are iterated as O candidates. For example, as shown in Fig. 6, suppose the rounded quotient S/C is −2; then −2, −1 and 0 are the three candidates for O. For each candidate, Formulas (1) and (2) are used to calculate D and CO. The CO candidates f, g and h correspond to the O candidates −2, −1 and 0, as shown in Fig. 6. The O candidate whose corresponding CO (f, g or h) is the minimum is the determined O for the current classification and component. Since there are 144 variable instances for S and C in Table 1, the number of variable instances of O is also 144, as listed in the 1st line of Table 2.
D is obtained through Formula (1). The variables O, C, S and D in Formula (1) may be replaced by the 144 variable instances in Table 1 and Table 2.
CO is obtained from Formula (2). CO, D and R in Formula (2) can be replaced by the variable instances in Table 2. R (rate) in Formula (2) is obtained through rate estimation; in the HM algorithm it is a function of the value of O. L (lambda) in Formula (2) can be regarded as a known parameter in HEVC encoding. The lambda of luma is different from the lambda of chroma.

$$D = O \cdot O \cdot C - O \cdot S \cdot 2 \qquad (1)$$

$$CO = D + R \cdot L \qquad (2)$$
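The P1 iteration can be sketched as below. This is an illustration under stated assumptions rather than the HM implementation: the rate model `abs(o) + 1` is a hypothetical stand-in for HM's rate estimation, rounding half away from zero is assumed for the quotient S/C, and all function names are invented for the example.

```python
def round_half_away(x: float) -> int:
    """Round to nearest integer, halves away from zero (assumed rounding rule)."""
    return int(x + 0.5) if x >= 0 else -int(-x + 0.5)


def distortion(o: int, s: int, c: int) -> int:
    """Formula (1): D = O*O*C - O*S*2."""
    return o * o * c - o * s * 2


def best_offset(s, c, lam, rate=lambda o: abs(o) + 1):
    """Try every offset between the clipped rounded S/C and 0 and keep
    the one with the smallest cost CO = D + R*L (Formula (2)).
    `rate` is a hypothetical rate model, not the one used in HM."""
    if c == 0:
        return 0, 0, rate(0) * lam
    start = max(-7, min(7, round_half_away(s / c)))
    step = 1 if start < 0 else -1
    candidates = list(range(start, 0, step)) + [0]
    best = None
    for o in candidates:
        d = distortion(o, s, c)
        co = d + rate(o) * lam
        if best is None or co < best[2]:
            best = (o, d, co)
    return best  # (O, D, CO)
```

With S = −11 and C = 6 the candidates are −2, −1 and 0. Under this toy rate model a small lambda keeps −2, while a larger lambda moves the choice to −1, which shows how the rate term can pull the offset toward zero.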

Fig. 8 Types determination.

Table 4 List of variables in procedure 3 of parameter determination phase.

• P2: Start band position (sbp) determination. There are 32 bands for BO. Four consecutive bands form a band group. As shown in Formula (3), the CO of one band group, written as CO_bg, is the sum of the CO of the four bands within that band group.
CO_bg in Formula (3) can be replaced by the variable instances in Table 3. There are 29 band groups for each component in the HM algorithm and three components in all. X in Table 3 denotes the position of the first of the four bands in that band group. CO_cX in Formula (3) shall be replaced with the first 29 variable instances for BO listed in the 3rd line of Table 2.
The CO_bg values of the 29 band groups are compared, and the band group with the minimum CO_bg is the selected band group. The first of its four bands is the selected start band position (sbp). sbp_y, sbp_cb and sbp_cr for luma, cb and cr are generated in this way.

$$CO_{bg} = \sum_{X=0}^{3} CO_{cX} \qquad (3)$$
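The band group search amounts to a sliding window of width four over the 32 per-band costs, which gives 32 − 4 + 1 = 29 candidate groups. A minimal sketch follows; the function name is hypothetical and ties are resolved in favor of the lowest start position, which is an assumption.

```python
def select_start_band(co_bands):
    """Return (sbp, CO_bg) for the band group with the smallest summed cost.
    co_bands: the 32 per-band CO values of one component."""
    assert len(co_bands) == 32
    # Formula (3): CO_bg is the sum of CO over four consecutive bands.
    group_costs = [sum(co_bands[p:p + 4]) for p in range(29)]
    sbp = min(range(29), key=group_costs.__getitem__)
    return sbp, group_costs[sbp]
```

The search is run once per component, giving sbp_y, sbp_cb and sbp_cr.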

Fig. 9 Merge or not determination.

Table 5 List of variables in procedure 4 of parameter determination phase.

• P3: Types and edge offset classes (directions) determination. As introduced in Sect. 1, there are three types of SAO: edge offset, band offset and SAO not applied. For the edge offset type, there are four classes (directions). So there are actually six sub-type candidates, labeled from 0 to 5, as shown in Fig. 8. Each effective sub-type (sub-types 0–4 in Fig. 8), whether edge offset or band offset, contains four classifications. For edge offset, the four classifications of the sub-type are shown in Fig. 3. For band offset, the four classifications of the sub-type are the four consecutive bands starting from sbp (start band position).
The criterion for determining the sub-type is again Formula (2) in P1, except that the meanings of CO, D and R differ from those in P1. In this procedure, CO, D and R are those of one sub-type for luma or chroma, instead of one classification for luma, cb or cr. D, R and CO in Formula (2) shall be replaced with the variable instances listed in Table 4, where X = 0...5 denotes the 6 sub-types shown in Fig. 8. D for the EO or BO sub-types is calculated through Formula (4). Note that D for one sub-type is divided into only two components, luma and chroma, instead of three. D for an EO or BO sub-type of luma is the sum of D of the four luma classifications within that sub-type. D for an EO or BO sub-type of chroma is the sum of D of the four cb classifications and the four cr classifications within the sub-type. D for the NA sub-type is 0 for both luma and chroma. R (rate) in Formula (2) in this procedure is obtained through CABAC (Context-based Adaptive Binary Arithmetic Coding). L (lambda) in Formula (2) in this procedure is the same as in P1.
The sub-type with the minimum CO is the determined sub-type. The types and edge offset classes for both luma and chroma components are generated through this procedure.

$$D_{tp} = \sum_{X=1}^{4} D_{cX} \qquad (4)$$

• P4: Modes (left merge, upper merge and no merge) determination. As shown in Fig. 9, upper CTB merge mode, left CTB merge mode and no merge mode are compared and the best one is selected as the mode of the current CTB. The criterion used in this comparison is a transform of the cost (CO), named COT (cost transformed) and shown in Formula (5).
COT in Formula (5) is the COT for each mode of the luma or chroma component. It can be replaced by the variable instances listed in the 1st line of Table 5. COT_[y|c]_mdX (X = 0...2) denotes the three modes (left merge, upper merge and no merge) for the luma and chroma components. D and R are similar. D of no merge mode is the D of the sub-type selected in P3. D of upper and left merge mode is the D of the upper and left CTB. R of each mode is obtained through CABAC. For each mode, the COT of luma and the COT of chroma are added, and the mode with the smallest sum is the selected mode.

$$COT = D / L + R \qquad (5)$$

3. Proposals

Although the algorithm adopted in HM effectively improves the BD-rate performance of HEVC, it is difficult to implement in VLSI. In this section, proposals for the two phases of the SAO algorithm are presented in turn to reduce the complexity and make the algorithm suitable for hardware implementation.

Fig. 10 Example of bitmap generation in statistic collection.

In the statistic collection phase, we propose to use bitmaps to collect the statistics of the 16 samples in one 4 × 4 block simultaneously. This is efficient and suitable for hardware implementation. In the parameters determination phase, a series of modifications are adopted to balance complexity and BD-rate performance. The structure of this section matches that of Sect. 2.

3.1 Statistic Collection

In our proposal, the statistics of one 4 × 4 block (16 samples) are collected in one round (cycle). So for a 64 × 64 CTB, 256 rounds are needed to finish luma statistic collection and 64 rounds each are needed to finish cb and cr statistic collection. There are 48 4 × 4 bitmaps, which match the 48 classifications mentioned in Sect. 2.1. Each bit in a bitmap represents whether the corresponding sample in the 4 × 4 block belongs to the particular classification. The S & C pairs mentioned in Sect. 2.1 are easily collected by means of bitmaps.

An example of how bitmaps are generated is shown in Fig. 10. One 4 × 4 sample block together with its surrounding samples is input as one 6 × 6 block, which is shown in the top-left of Fig. 10. For edge offset, 16 bitmaps are generated. For each sample, there are four classes (directions) for its two neighboring samples, as shown in the left-middle of Fig. 10. For each class, the relationship between the current sample and its two neighboring samples falls into one of the four categories shown in the top-right of Fig. 10. For example, the top-left sample of the 4 × 4 block is 0x96. For class 0, its two neighboring samples are 0x93 and 0x8a. Since 0x96 is greater than both 0x93 and 0x8a, it belongs to category 4. The top-left bit in the bitmap for class 0 and category 4 is therefore set to 1. It means that the corresponding sample (the top-left sample in the 4 × 4 block) and its two horizontal neighboring samples belong to category 4. All the samples in the 4 × 4 block are analyzed for the 4 classes (directions), and 16 bitmaps are generated, as shown in the middle-right of Fig. 10.

Fig. 11 Example of generating S and C from bitmap.

For band offset, the sample value range is equally divided into 32 bands (classifications). They are labeled BO 0, BO 1, ..., BO 31. The range of BO 0 is 0–7, the range of BO 1 is 8–15, and so on. In the example of Fig. 10, the top-left sample is 0x96, so it belongs to classification BO 18. Hence, the top-left bit of the bitmap for BO 18 is set to 1. All 16 samples are processed to determine which band (classification) they belong to. In this example, since all samples in the 4 × 4 block are in the range from 0x8a to 0xa6, only four bitmaps are non-zero. The other 28 bitmaps are all-zero.
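The band offset bitmaps of one 4 × 4 block can be sketched as below. Each bitmap is held as a 16-bit integer whose bit i corresponds to sample i in raster order; this packing, the function name and the default 8-bit depth are assumptions made for illustration, and the sample block in the usage note is invented because the contents of Fig. 10 are not reproduced in the text.

```python
def bo_bitmaps(block, bit_depth=8):
    """Build the 32 band offset bitmaps for one 4x4 block.
    block: 16 reconstructed samples in raster order.
    Returns a list of 32 integers; bit i of entry k is 1 when
    sample i falls into band k."""
    assert len(block) == 16
    maps = [0] * 32
    for i, sample in enumerate(block):
        maps[sample >> (bit_depth - 5)] |= 1 << i
    return maps


# Hypothetical block whose samples all lie between 0x8a and 0xa6.
example_block = [0x96, 0x8A, 0xA6, 0x9F] + [0x90] * 12
example_maps = bo_bitmaps(example_block)
nonzero_bands = [k for k, m in enumerate(example_maps) if m]
```

In hardware the 16 comparisons happen in parallel within one cycle; the loop here only models the result. For the hypothetical block, `nonzero_bands` contains the four bands 17 to 20, matching the observation that samples between 0x8a and 0xa6 touch only four bitmaps.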

After the 48 bitmaps are generated, S and C can be obtained easily through bitmap operations. An example of how to use a bitmap to generate S and C of one 4 × 4 block is shown in Fig. 11. The sum of all 16 bits in one bitmap is C, as shown in the bottom-right of Fig. 11. To obtain S, the 4 × 4 original samples and the 4 × 4 reconstructed samples are first input and their clipped difference is output. The resulting 4 × 4 block of differences is ANDed with the bitmap, and the masked values are then added together to obtain S.
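Deriving one S & C pair from one bitmap can be sketched as follows. The bitmap is assumed to be a 16-bit integer with bit i for sample i in raster order, the function name is hypothetical, and the clipping of the difference is omitted because its range is not stated in the text.

```python
def stats_from_bitmap(bitmap, org, rec):
    """Return (S, C) of one 4x4 block for one classification.
    bitmap: 16-bit mask of the samples belonging to the classification.
    org, rec: 16 original and 16 reconstructed samples in raster order."""
    count = bin(bitmap).count("1")   # C: number of set bits
    total = sum(
        o - r
        for i, (o, r) in enumerate(zip(org, rec))
        if (bitmap >> i) & 1         # "AND" the difference with the bitmap
    )
    return total, count
```

Adding the per-block results of all 4 × 4 blocks of a CTB gives the S and C of that classification for the whole CTB.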

3.2 Parameter Determination

The parameter determination phase of HM is introduced in Sect. 2.2. It is not suitable for hardware implementation. In this section, a series of modifications to the original algorithm are proposed to reduce its complexity while keeping the BD-rate performance. There are four procedures in the HM algorithm, and our proposals are presented in the order of these four procedures.

• P1: Offset, distortion and cost determination. In Sect. 2.2, there is an iteration process for finding the offset. As shown in Fig. 6, suppose S is −11, C is 6 and the rounded quotient S/C is −2. In the original algorithm, the three offset (O) candidates −2, −1 and 0 are checked one by one and RDO is used to select the best offset according to Formula (1) and Formula (2) in Sect. 2.2. The iteration process and the rate estimation process are complicated. In our proposal, these two processes are removed and the offset is obtained directly as the rounded quotient S/C. In the example, −1 and 0 are not iterated; the rounded quotient S/C, which is −2, is directly selected as the offset. Given O, the process to obtain D and CO remains unchanged from the original algorithm.
• P2: Start band position (sbp) determination. In Sect. 2.2, CO_bg of the band group is used to determine the sbp. The band group with the smallest CO_bg is the selected band group.
In our proposal, D_bg of the band group is used in place of CO_bg. As shown in Formula (6), D_bg is the sum of D of the four bands within that band group. The band group with the smallest D_bg is the selected band group, and the first band of the selected band group is the start band position.

$$D_{bg} = \sum_{x=0}^{3} D_{bx} \qquad (6)$$
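The simplified P1 and P2 can be sketched together as below: the offset is taken directly from the rounded quotient S/C, D follows from Formula (1), and the start band position is chosen by the smallest D_bg. The function names, the rounding rule (half away from zero) and the handling of empty bands (C = 0 gives offset 0) are assumptions made for this illustration.

```python
def direct_offset(s: int, c: int) -> int:
    """Proposed P1: offset = rounded S/C clipped to [-7, 7], no iteration."""
    if c == 0:
        return 0                      # assumed handling of an empty classification
    q = s / c
    rounded = int(q + 0.5) if q >= 0 else -int(-q + 0.5)
    return max(-7, min(7, rounded))


def band_distortion(s: int, c: int) -> int:
    """Formula (1) evaluated at the directly chosen offset."""
    o = direct_offset(s, c)
    return o * o * c - o * s * 2


def select_sbp_by_distortion(sums, counts):
    """Proposed P2: pick the band group with the smallest D_bg (Formula (6)).
    sums, counts: the 32 per-band S and C values of one component."""
    d = [band_distortion(s, c) for s, c in zip(sums, counts)]
    d_bg = [sum(d[p:p + 4]) for p in range(29)]
    sbp = min(range(29), key=d_bg.__getitem__)
    return sbp, d_bg[sbp]
```

For S = −11 and C = 6 this gives an offset of −2 and D = −20 without evaluating the candidates −1 and 0 and without any rate estimate.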

Table 6 Value for rate estimation.

• P3: Types and edge offset classes (directions) determination. In Sect. 2.2 P3, the CO of one sub-type is used to determine the types and edge offset classes for luma and chroma respectively. As shown in Formula (2), the R of one sub-type for both the luma and chroma components is required to obtain the CO. These R values are obtained through CABAC.
Unfortunately, it is difficult for an SAO estimation hardware implementation to include a CABAC encoder for calculating R. The main reason is that a CABAC encoder is quite large [10]. It is even larger than the SAO estimation implementation itself, which is shown in Sect. 5 of this article. So the cost of using a CABAC encoder is high.
To avoid this issue, we use constant values for R in place of the values from the CABAC process. The rate values of the sub-types are listed in the top four lines of Table 6. For the SAO NA type, the rates for both luma and chroma are 3. For an edge offset or band offset sub-type, the rate for the luma component is 10 and the rate for the chroma component is 16.
The rate values in Table 6 are intended to be close to the values obtained through CABAC. Our basic approach in setting these values is to count the number of bits of the syntax elements and then apply a discount that emulates the compression of CABAC. The discount is a rough estimate based on experience and test results. These values have been tested and shown to give good performance.
In Table 6, NAL and NAC are 3 because under this sub-type (NA), only 3 syntax elements are transmitted. sao_type_luma or sao_type_chroma is 2 bits, and sao_left_merge_flag and sao_upper_merge_flag are 1 bit each. So 4 bits of syntax elements are transmitted. The number of bits after CABAC is less than the number of bits before CABAC, so we set the value to 3.
EBL is 10 because under these sub-types (EO or BO), sao_type_luma (2 bits), the EO class (2 bits) or start band position (5 bits), and 4 offsets (4 × 4 = 16 bits) are transmitted. So the number of bits of the syntax elements before CABAC is 20 bits (EO) or 23 bits (BO). We set the discounted number of bits to 10 based on rough estimation and experimental results.
EBC is similar to EBL except that 8 offsets (4 cb and 4 cr) and 2 sbp (1 cb and 1 cr) are needed. So the number of bits of the syntax elements before CABAC is 36 bits (EO) or 44 bits (BO). We set the discounted number of bits to 16 based on rough estimation and experimental results.
• P4: Modes (left merge, upper merge and no merge) determination. In the original algorithm, two points in this procedure are unsuitable for hardware implementation. Firstly, similar to the problem in the previous procedure, R in Formula (5) is obtained through CABAC. Secondly, the division in Formula (5) is unsuitable for VLSI implementation.
To avoid these two problems, we change the definition of COT from Formula (5) to Formula (7), which removes the division. R in Formula (7) is set to a constant instead of the value from CABAC. The related constants are listed in the bottom three lines of Table 6. For upper merge or left merge mode, R is set to 1, because under these two modes only the 1-bit syntax element sao_left_merge_flag or sao_upper_merge_flag is transmitted. For no merge mode, R is set to the sum of EBL and EBC, which is 26.

$$COT = (D_y + D_c) + R \cdot (L_y + L_c) / 2 \qquad (7)$$
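The proposed mode decision can be sketched as below, using the constant rates stated in the text (1 for either merge mode, EBL + EBC = 26 for no merge) and Formula (7). The function names, the dictionary layout and the numbers in the usage example are assumptions made for illustration only.

```python
# Constant rates from the text; EBL = 10 and EBC = 16 give 26 for no merge.
RATE = {"left_merge": 1, "upper_merge": 1, "no_merge": 10 + 16}


def cot(d_y, d_c, rate, lam_y, lam_c):
    """Formula (7): COT = (D_y + D_c) + R * (L_y + L_c) / 2."""
    return (d_y + d_c) + rate * (lam_y + lam_c) / 2


def select_mode(dist, lam_y, lam_c):
    """dist maps each mode name to its (D_y, D_c) pair.
    Returns the mode with the smallest COT and all COT values."""
    costs = {m: cot(*dist[m], RATE[m], lam_y, lam_c) for m in RATE}
    return min(costs, key=costs.get), costs


# Hypothetical distortions for one CTB.
example_dist = {
    "left_merge": (-100, -40),
    "upper_merge": (-90, -30),
    "no_merge": (-300, -100),
}
```

With these invented distortions, a small lambda favors no merge mode because its lower distortion outweighs the 26-bit rate, whereas a large lambda favors a merge mode.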

In summary, a series of modifications are made to simplify the original algorithm and make it suitable for hardware implementation. Despite these modifications, the BD-rate performance of the improved algorithm remains good. The details are given in Sect. 5.

4. VLSI Architecture

The whole SAO estimation architecture is divided into two modules, the statistic collection module and the parameter determination module, as shown in Fig. 12. pRec and pOrg denote sample blocks from reconstructed pictures and original pictures respectively. Info denotes S (sum) and C (count) for the 48 classifications of three components within one CTB, which are introduced in Sect. 2.1 and Sect. 3.1. Results are the SAO parameters, which are introduced in Sect. 1, Sect. 2.2 and Sect. 3.2. The results are also listed in Table 7.

Fig. 12 SAO estimation block diagram.

Table 7 SAO parameters.

Fig. 13 Pipeline of SAO estimation.

For each CTB, the statistic collection module takes 256 cycles for luma and 64 cycles each for cb and cr. The parameters determination module takes 64 cycles to process each component. The pipeline between the statistic collection module and the parameter determination module is shown in Fig. 13. The details of the two modules are explained in the following sub-sections.
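The cycle counts above can be turned into a rough clock budget. The sketch below is a back-of-the-envelope check, not part of the original design flow. It assumes a 7680 × 4320 picture (the storage calculation later in this section counts 128 CTBs per row, so the exact picture width used by the authors may differ), that partial CTBs at the picture boundary cost as much as full ones, and that statistic collection (256 + 64 + 64 cycles) rather than parameter determination (3 × 64 cycles) is the pipeline bottleneck. The function name is hypothetical.

```python
import math


def required_clock_hz(width, height, fps, ctb=64,
                      cycles_luma=256, cycles_cb=64, cycles_cr=64):
    """Clock rate needed if statistic collection limits the pipeline throughput."""
    ctbs_per_row = math.ceil(width / ctb)
    ctb_rows = math.ceil(height / ctb)
    cycles_per_ctb = cycles_luma + cycles_cb + cycles_cr  # 384 for 4:2:0
    return ctbs_per_row * ctb_rows * cycles_per_ctb * fps


hz = required_clock_hz(7680, 4320, 120)
print(f"{hz / 1e6:.1f} MHz")  # prints "376.0 MHz"
```

Under these assumptions the requirement is about 376 MHz, which is below the 400 MHz synthesis frequency quoted in the summary.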

4.1 Statistic Collection Module

The block diagram of the statistic collection module is shown in Fig. 14. On each cycle, one 4 × 4 reconstructed block is input to the bo classification sub-module. The block together with its surrounding samples is also input to the eo classification sub-module as one 6 × 6 block. The two sub-modules then generate 16 EO bitmaps and 32 BO bitmaps. For the EO case, the boundary samples of a CTB are excluded from the statistics, which avoids referencing samples of neighboring CTBs. This is achieved by the 16 mask sub-modules in Fig. 14. There are 48 b2n sub-modules (16 for EO bitmaps and 32 for BO bitmaps), which output 48 C values (count, shown in Fig. 11). The diff sub-module in Fig. 14 outputs 48 S values (sum, shown in Fig. 11). 48 unsigned accumulators (16 for EO and 32 for BO) are needed to store C for the whole 64 × 64 CTB, and another 48 signed accumulators (16 for EO and 32 for BO) are needed to store S for the whole 64 × 64 CTB.
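The bitmap idea can be illustrated in software for the BO case. The following Python sketch is a behavioral model only: it loops over the 16 samples sequentially, whereas the hardware handles them in the same cycle. The function names are hypothetical, the band index is assumed to be the top five bits of the sample, and S is assumed to be the sum of (original − reconstructed) differences.

```python
def bo_bitmaps(rec_block, bit_depth=8):
    """Return 32 bitmaps (16-bit ints), one per band, for the 16 samples of a 4x4 block.

    Bit i of bitmap b is set when sample i falls in band b.
    """
    shift = bit_depth - 5  # 32 equal bands: the top 5 bits select the band
    bitmaps = [0] * 32
    for i, sample in enumerate(rec_block):
        bitmaps[sample >> shift] |= 1 << i
    return bitmaps


def collect(rec_block, org_block, bit_depth=8):
    """Per-band count C (popcount of the bitmap) and sum S of (org - rec)."""
    diff = [o - r for o, r in zip(org_block, rec_block)]
    bitmaps = bo_bitmaps(rec_block, bit_depth)
    counts = [bin(bm).count("1") for bm in bitmaps]
    sums = [sum(d for i, d in enumerate(diff) if (bm >> i) & 1) for bm in bitmaps]
    return counts, sums


# Hypothetical 4x4 block given as 16 samples in raster order.
rec = [0, 8, 9, 255] + [16] * 12
org = [1, 10, 7, 250] + [17] * 12
counts, sums = collect(rec, org)
```

The popcount plays the role of the b2n sub-modules, and the per-block counts and sums would then be added to the CTB-level accumulators.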

For the luma component, one 64 × 64 CTB is divided into 256 4 × 4 blocks, so it takes 256 cycles to accumulate S and C for one 64 × 64 CTB. For the 4:2:0 cb and cr components, it takes 64 cycles each.

Fig. 14 Block diagram of statistic module.

Fig. 15 Block diagram of estimation module.

4.2 Parameters Determination Module

The block diagram of the parameters determination module is shown in Fig. 15. It contains two sub-modules and two storage devices. The two sub-modules are the dist & offset generation (DOG) sub-module and the cost generation & decision (CGD) sub-module.

4.3 Storages in Parameters Determination Module

Two storage devices are necessary: one is an SRAM holding partial SAO parameters of the upper CTB, and the other is a group of registers holding those of the left CTB. Both are used in the derivation of the SAO parameters of the current CTB. The content stored is all the parameters of the upper or left CTB listed in Table 7, except for the first line (merge left and merge upper). As shown in Table 7, each offset value takes four bits (its range is −7 to 7), so 12 offset values (4 categories by 3 components) take 48 bits. Start band position (sbp) is 5 bits (its range is 0–31), so the sbp of three components takes 15 bits. SAO type is 3 bits (BO, 4 EO classes and SAO not applied, 6 options in all), so luma and chroma together take 6 bits. Hence, for each CTB, there are 48 + 15 + 6 = 69 bits to store. For 8 K × 4 K video whose CTB is 64 × 64, there are 128 CTBs in one row of a picture. Hence the SRAM needed to store the upper CTB SAO parameters is 69 × 128 = 8,832 bits.

Fig. 16 DOG diagram.
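The 69-bit budget can be checked with a small packing sketch in Python. The field order, the two's complement encoding of offsets and the numbering of SAO types 0 to 5 are assumptions made for illustration; the text only gives the field widths.

```python
def pack_sao(offsets, sbps, types):
    """Pack 12 offsets (-7..7), 3 start band positions (0..31) and
    2 SAO types (0..5) into one integer of at most 69 bits."""
    assert len(offsets) == 12 and len(sbps) == 3 and len(types) == 2
    word = 0
    for o in offsets:
        assert -7 <= o <= 7
        word = (word << 4) | (o & 0xF)  # 4-bit two's complement
    for s in sbps:
        assert 0 <= s <= 31
        word = (word << 5) | s
    for t in types:
        assert 0 <= t <= 5
        word = (word << 3) | t
    return word


def unpack_sao(word):
    """Inverse of pack_sao."""
    types = [(word >> 3) & 7, word & 7]
    word >>= 6
    sbps = [(word >> 10) & 31, (word >> 5) & 31, word & 31]
    word >>= 15
    raw = [(word >> (4 * (11 - i))) & 0xF for i in range(12)]
    offsets = [r - 16 if r >= 8 else r for r in raw]
    return offsets, sbps, types


BITS_PER_CTB = 12 * 4 + 3 * 5 + 2 * 3  # 48 + 15 + 6 = 69
SRAM_BITS = BITS_PER_CTB * 128         # 8,832 bits for 128 CTBs per row
```

One such word per CTB of the upper row reproduces the 8,832-bit SRAM size stated above.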

4.4 DOG Sub-Module

As shown in Fig. 16, DOG is divided into four blocks: block A, block B, block C and block D. Block A holds the 48 S & C pairs of the current CTB from the statistic collection module in register groups and receives the SAO parameters of neighboring CTBs from the two storage devices mentioned above. Block B generates the offset (O) from each S & C pair of the current CTB. O is obtained as the rounded value of S/C. Since O is limited to −7 to 7, the division can be replaced by an absolute-value operation plus comparisons of abs(S) with C × n, where n is 0.5, 1.5, 2.5, ..., 6.5. The calculated O together with the S & C pair is input to block D. There are 48 S & C & O groups for the current CTB. Each group takes one cycle, so it takes 48 cycles to process each component of each CTB. Block C selects S & C pairs of the current CTB according to the SAO type, edge offset class and start band position of the neighboring CTBs. The selected S & C pairs together with the O of the neighboring CTBs are input to block D. There are four S & C & O groups for the left CTB and four groups for the upper CTB. Each group takes one cycle, so it takes 8 cycles to process each component for the two neighboring CTBs. Block D receives S & C & O from block B or block C and outputs D (distortion) according to Formula (1). Finally, the DOG module outputs D and O to the CGD module. It takes 48 cycles for the current CTB and four cycles each for the left CTB and the upper CTB. The operations in this module are the same for luma, cb and cr.
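The division-free offset derivation in block B can be sketched in Python as follows. This is an illustrative model with a hypothetical function name. It assumes that ties are rounded away from zero and that an empty category (C = 0) yields offset 0, neither of which is stated in the text. The thresholds n = 0.5, 1.5, ..., 6.5 are doubled so that all comparisons stay in integers.

```python
def offset_from_stats(s, c):
    """Rounded S/C clipped to [-7, 7], computed without a divider.

    abs(S) >= C * (k + 0.5) is tested as 2 * abs(S) >= C * (2k + 1).
    """
    if c == 0:
        return 0  # empty category: assumed to give offset 0
    # Each satisfied threshold raises the magnitude by one, up to 7.
    mag = sum(1 for k in range(7) if 2 * abs(s) >= c * (2 * k + 1))
    return mag if s >= 0 else -mag
```

For example, S = 10 and C = 4 give S/C = 2.5, which passes the thresholds 0.5, 1.5 and 2.5, so the offset is 3. In hardware the seven comparisons can be evaluated in parallel.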

4.5 CGD Sub-Module

As shown in Fig. 17, CGD is divided into three sub-sub-modules. Among them, offset storage is composed of register groups which hold all the offsets (O) from DOG. It finally outputs four Os according to the type, edge offset class and sbp of the current CTB SAO parameters from the cost compare sub-sub-module.

Fig. 17 CGD diagram.

Fig. 18 Diagram of type dist.

Fig. 19 Diagram of cost compare.

As shown in Fig. 18, the type dist sub-sub-module receives the Ds of the 48 classifications of the current CTB and the Ds of the 8 categories of the left and upper CTBs from DOG. It accumulates D for the four EO classes, BO, left merge mode and upper merge mode respectively. The band offset distortion accumulation is achieved by a four-stage shift register, a register holding the minimum band offset distortion and a comparator. The sum of the four shift register stages is compared with the register which holds the minimum band offset distortion. If the sum is smaller, the register is updated to the sum and the corresponding band position is stored. When all 32 band distortions have been input to the shift register, the band offset distortion is held in the register and the start band position is also obtained.
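The start band search described above amounts to a sliding-window minimum over the 32 per-band distortions. The Python sketch below models it behaviorally. It assumes that the comparison starts only once four valid distortions are in the window, that the window does not wrap around from band 31 to band 0, and that the first minimal window wins on ties. The function name is hypothetical.

```python
def best_start_band(band_dist):
    """Slide a 4-entry window over 32 per-band distortions and keep the
    smallest window sum together with its start band position."""
    assert len(band_dist) == 32
    window = []                      # models the four-stage shift register
    best_sum, best_sbp = None, None  # models the minimum register and stored position
    for band, d in enumerate(band_dist):
        window.append(d)
        if len(window) > 4:
            window.pop(0)
        if len(window) == 4:
            total = sum(window)
            if best_sum is None or total < best_sum:
                best_sum, best_sbp = total, band - 3
    return best_sbp, best_sum


# Hypothetical distortions: bands 5..8 are clearly the cheapest window.
dist = [10] * 32
dist[5:9] = [1, 1, 1, 1]
sbp, d_bo = best_start_band(dist)
```

One distortion is consumed per iteration, which matches the one-group-per-cycle behavior of the DOG output.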

The cost compare sub-sub-module is shown in Fig. 19. The distortions of the current CTB, left merge and upper merge are input to it. Its function is to obtain the costs of the current CTB SAO parameters, left merge mode and upper merge mode and to compare them. The one with the smallest cost is chosen as the determined type.


5. Experimental Results

Experiments are conducted to show that the modified SAO estimation algorithm keeps good BD-rate performance compared to the original HM algorithm.

The document [11] defines common test conditions and software reference configurations to be used in experiments. It defines 8 test conditions, from which we select the "Low delay, main, P slices only" condition. That is to say, encoder_lowdelay_P_main.cfg in the HM 12.0 package [12] is the basic configuration file for this study. This condition is chosen because the effect of SAO is most obvious under it [5].

Note that this condition implies that the encoder operates in 8-bit mode. Only the 8-bit mode encoder and 8-bit source sequences are tested in our study. However, no particular difficulties are foreseen in applying the proposed algorithm and VLSI architecture to higher bit depths.

The document [11] also defines the set of test sequences for the test conditions. There are six classes of sequences (class A to class F), with 3–5 sequences in each class. For each class from class B to class F, one sequence is selected as the test object, as shown in Table 8. These source sequences are all 8-bit. Ten frames of each sequence are tested in this study.

HM 12.0 [12] serves as the reference for BD-rate measurement and as the basis of our modified algorithm.

The evaluation criteria in [13] are adopted in this article. The Bjøntegaard measurement method [14] for calculating objective differences between rate-distortion curves is used to evaluate the performance of the proposed algorithm. In practice, BD-rate (piecewise cubic) is calculated with the Excel file published in the [11] package.

As shown in Table 8, separate rate-distortion curves are used for the luma and chroma components, resulting in three different average bit-rate differences, one for each component. The left three columns record the BD-rate reduction between SAO off and the original SAO estimation on. The right three columns record the BD-rate reduction between SAO off and the modified SAO estimation on.

The results show that the luma BD-rate has some degradation, while the chroma BD-rate is even better than with the original algorithm. This suggests that the original HM 12.0 algorithm is not perfect. For example, the rates obtained from CABAC in the parameters determination phase may not be accurate, because only a partial rather than a complete set of SAO parameters passes through CABAC in that procedure.

In addition to the BD-rate for the individual components, the BD-rate for combined components is also used in this study. Using the bit rate and the combined PSNR_yuv as the input to the Bjøntegaard measurement method gives a single average difference in bit rate that (at least partially) takes into account the tradeoffs between luma and chroma component fidelity [13]. The derivation of PSNR_yuv is shown in Formula (8) [13]. PSNR_y, PSNR_u and PSNR_v are calculated by the software (HM).

Table 8 BD-rate reduction comparison (three components).

Table 9 BD-rate reduction comparison (combined components).

PSNR_{yuv} = (PSNR_y \cdot 6 + PSNR_u + PSNR_v) / 8    (8)
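For reference, Formula (8) can be written as a one-line Python helper. The function name is hypothetical, and the input values in the example are made up to show the 6:1:1 weighting, not measured results.

```python
def psnr_yuv(psnr_y, psnr_u, psnr_v):
    """Formula (8): luma weighted 6, each chroma component weighted 1, divided by 8."""
    return (psnr_y * 6 + psnr_u + psnr_v) / 8


combined = psnr_yuv(36.0, 40.0, 40.0)  # (216 + 40 + 40) / 8 = 37.0
```

Because luma carries six of the eight weights, a chroma gain of 4 dB in this example raises the combined value by only 1 dB.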

BD-rate reduction for combined components is shown in Table 9. The 1st column records the BD-rate reduction of the HM 12.0 SAO estimation algorithm. The 2nd column records the BD-rate reduction of the proposed SAO estimation algorithm. Although much complexity is saved, the BD-rate reduction of the proposed SAO estimation algorithm is only slightly lower than that of the HM 12.0 SAO estimation.

Columns 3–6 of Table 9 show the BD-rate reduction of each of the four procedures of our algorithm. In column PX (X = 1, ..., 4), the data are collected with only PX (procedure X) modified according to our proposal in Sect. 3.2 and the other procedures left unmodified. The data show three points. Firstly, the BD-rate reductions of the individual procedures are better than the reduction of the four procedures combined. This meets the expectation, because each single procedure modifies the original algorithm less than all of them combined. Secondly, the same procedure has different effects on different sequences. Thirdly, the BD-rate reduction of our proposed algorithm (four procedures enabled together) is not the sum of the four BD-rate reductions obtained with each procedure enabled alone. This is because the four procedures are not independent: the impact of each procedure influences the impact of the others. In addition, the BD-rate calculation in [14] is non-linear.

The synthesis results of the proposed VLSI architecture are shown in Table 10. The VLSI architecture is intended to be suitable for all bit depths, but only the 8-bit case has actually been implemented, and the contents of Table 10 are for the 8-bit implementation. Although 10-bit and 12-bit VLSI implementations have not been verified, no particular difficulties are foreseen for them at this moment.


Table 10 Synthesis results.

6. Conclusion

In this article, we propose a fast SAO estimation algorithm and its corresponding VLSI architecture. Our proposals effectively address the two difficulties of the huge number of samples and the complex RDO. The proposed algorithm still keeps good BD-rate performance, and it is suitable for high performance VLSI implementation.

Acknowledgments

This research was partially supported by the regional innovation strategy support program of MEXT.

References

[1] http://www.ultrahdtv.net/what-is-ultra-hdtv/

[2] Recommendation ITU-R BT.2020, “Parameter values for ultra-high definition television systems for production and international programme exchange,” Aug. 2012.

[3] B. Bross, W.-J. Han, G.J. Sullivan, J.-R. Ohm, and T. Wiegand, “High Efficiency Video Coding (HEVC) Text Specification Draft 8,” document JCTVC-K1003, Sept. 2012.

[4] http://en.wikipedia.org/wiki/High_Efficiency_Video_Coding

[5] C.-M. Fu, E. Alshina, A. Alshin, Y.-W. Huang, C.-Y. Chen, C.-Y. Tsai, C.-W. Hsu, S.-M. Lei, J.-H. Park, and W.-J. Han, “Sample adaptive offset in the HEVC standard,” IEEE Trans. Circuits Syst. Video Technol., vol.22, no.12, pp.1755–1764, 2012.

[6] J. Zhu, D. Zhou, G. He, and S. Goto, “A combined SAO and de-blocking filter architecture for HEVC video decoder,” IEEE International Conference on Image Processing (ICIP), pp.1967–1971, 2013.

[7] S. Park and K. Ryoo, “The hardware design of effective SAO for HEVC decoder,” IEEE 2nd Global Conference on Consumer Electronics (GCCE), pp.303–304, 2013.

[8] M. Mody, N. Niraj, and H. Tamama, “High throughput VLSI architecture supporting HEVC loop filter for Ultra HDTV,” IEEE Third International Conference on ICCE-Berlin, pp.54–57, 2013.

[9] G.B. Praveen and A. Ramakrishna, “Analysis and approximation of SAO estimation for CTU-level HEVC encoder,” Visual Communications and Image Processing (VCIP), pp.1–5, 2013.

[10] J. Zhou, D. Zhou, and S. Goto, “A high-performance CABAC encoder architecture for HEVC and H.264/AVC,” IEEE International Conference on Image Processing (ICIP), pp.1568–1572, 2013.

[11] F. Bossen, “Common test conditions and software reference configurations,” Tech. Rep., Jan. 2013, Document of Joint Collaborative Team on Video Coding (JCT-VC), JCTVC-L1100.

[12] HEVC reference model HM-12.0, https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-12.0

[13] J.-R. Ohm, G.J. Sullivan, H. Schwarz, T.K. Tan, and T. Wiegand, “Comparison of the coding efficiency of video coding standards — Including High Efficiency Video Coding (HEVC),” IEEE Trans. Circuits Syst. Video Technol., vol.22, no.12, pp.1669–1684, 2012.

[14] G. Bjøntegaard, “Calculation of Average PSNR Differences between RD Curves,” document VCEG-M33, ITU-T SG 16/Q 6, Austin, TX, April 2001.

Jiayi Zhu received the B.E. and M.E. degrees from Shanghai Jiao Tong University, Shanghai, China. He is currently pursuing a doctoral degree at Waseda University, Kitakyushu, Japan. His interests are in algorithms and VLSI architectures for multimedia and communication signal processing.

Dajiang Zhou received the B.E. and M.E. degrees from Shanghai Jiao Tong University, China. He received the Ph.D. degree in engineering from Waseda University, Japan, in 2010, where he is currently an assistant professor. His interests are in algorithms and VLSI architectures for multimedia and communications signal processing.

Shinji Kimura received the B.E., M.E. and Dr. of Eng. degrees in information science from Kyoto University, Kyoto, Japan in 1982, 1984, and 1989 respectively. He was an Assistant Professor at Kobe University from 1985, was an Associate Professor at Nara Institute of Science and Technology from 1993, and has been a Professor of Waseda University since 2002. His interests include design verification of VLSI and low power design. He is a member of ACM, IEEE, IEICE and IPSJ. He has served as an executive committee member of ICCAD 2011 and 2012 and as the general chair of ASP-DAC 2013.

Satoshi Goto received the B.E. and M.E. degrees in Electronics and Communication Engineering from Waseda University in 1968 and 1970 respectively. He also received the Dr. of Engineering from the same university in 1978. He joined NEC Laboratories in 1970, where he worked on LSI design, multimedia systems and software as GM and Vice President. Since 2002, he has been a Professor at the Graduate School of Information, Production and Systems of Waseda University at Kitakyushu. He served as GC of ICCAD, ASPDAC, VLSI-SOC, ASICON and ISOCC and was a board member of the IEEE CAS society. He is an IEEE Life Fellow and an IEICE Fellow. He is a Visiting Professor at Shanghai Jiao Tong University and Tsinghua University of China and a Member of the Science Council of Japan.