Advances on Transforms for High Efficiency Video Coding
Miguel Lobato de Faria Pereira Capelo
Dissertation submitted for obtaining the degree of
Master in Electrical and Computer Engineering
Jury
President: Prof. José Bioucas Dias
Supervisor: Prof. Fernando Pereira
Co-Supervisor: Dr. Matteo Naccari
Members: Prof. Luís Ducla Soares
April 2011
Acknowledgments
First, I would like to thank Prof. Fernando Pereira for giving me this opportunity and for supervising my Thesis. The constant availability he showed in addressing my questions and the great amount of time he spent helping me improve this work were essential to its conclusion. His effective working methodology and organization really helped raise my working standards and will serve as a reference for the rest of my life.
I would also like to express my gratitude to Dr. Matteo Naccari for sharing his vast technical knowledge and experience. I thank him for always showing interest in my work, providing new input and valuable advice, even when that meant sacrificing his own working schedule.
A special thanks to all the Image Group members for providing such a great work environment and for always
being available to help.
I would also like to thank my Mother and my Father for giving me every possible condition to reach this stage and for always trusting my decisions. A special word to my Brother for challenging me, by his example, to become a better person. I would also like to show my gratitude to all my family for their support and motivation.
Finally, I would like to thank all my friends, especially those who helped me get this far in my academic life and those who kept me motivated and high-spirited during this period. A final word of thanks to my friend André Martins for his companionship over the last months of work.
Abstract
Nowadays, digital video is ubiquitous in multimedia applications. Digital video coding plays a key role in this phenomenon, as it provides the data compression necessary to transmit and store digital video content over the currently available networks and storage media. However, with the increasing presence of high and ultra high definition video content resulting from the continuous advances in video capturing and display technologies, the current state-of-the-art video coding standard, H.264/AVC, does not seem to provide the compression ratios required for transmission and storage with the currently available resources. This has created the need for new video coding tools that can provide further compression efficiency beyond the H.264/AVC state-of-the-art. As an answer to these needs, the ITU-T VCEG and ISO/IEC MPEG standardization bodies have started a new video coding standardization project, called High Efficiency Video Coding (HEVC), targeting a 50% reduction in coding rate for the same quality.
In this context, this Thesis focuses on the study, implementation and assessment of a novel coding technique related to the important transform coding module, which is present in all predictive video coding architectures. With this objective in mind, the state-of-the-art on transform coding is reviewed and the adopted transform coding technique is presented. Since the adopted transform coding technique is intended for integration in the emerging HEVC standard, the new coding tools introduced by this video coding standard are also studied. Finally, a video coding solution combining the adopted transform coding technique with the HEVC framework is developed, implemented and evaluated.
The performance results obtained with the adopted transform coding technique are encouraging in terms of bitrate savings and quality gains when compared to the usual DCT, particularly for high definition video content.
The main innovations presented in this Thesis are the integration of the adopted transform coding technique into the HEVC standard and its performance evaluation for high definition video content.
Keywords – Transform coding, discrete cosine transform, Karhunen-Loève transform, adaptive transform, High
Efficiency Video Coding standard.
Resumo
Nowadays, we are witnessing the widespread use of digital video in many multimedia applications. Digital video coding plays a central role in this phenomenon, enabling the transmission and storage of this type of data through its efficient compression. However, with the growing presence of high and ultra high definition video content resulting from the continuous advances in video capture and display technologies, the current state-of-the-art video coding standard, H.264/AVC, does not seem able to reach the compression factors needed to transmit and store this type of content with the transmission and storage resources available today. In this context, there is a need to develop new video coding tools that can increase the compression factors currently achieved with the H.264/AVC standard. In response to this need, ITU-T VCEG and ISO/IEC MPEG have started a new project to develop a new video coding standard, called High Efficiency Video Coding (HEVC), with the goal of achieving bitrate reductions of 50% for the same quality.
In this context, the work developed in this Thesis concerns the design, implementation and evaluation of a new coding technique for the transform coding module that is essential in predictive video coding architectures. With this objective in mind, the state-of-the-art on transform coding is reviewed and the adopted coding technique is presented. Since this technique is intended to be combined with the emerging HEVC standard, the new coding tools introduced by this video coding standard are also studied. Finally, a video coding solution using the adopted transform coding technique in the context of the HEVC standard is designed, implemented and evaluated.
The performance tests carried out with this coding technique reveal encouraging results in terms of bitrate savings and quality gains when compared to the commonly used DCT. This is especially true for high definition video content.
The main innovations presented in this Thesis concern the combination of the adopted transform coding technique with the HEVC standard and the performance evaluation carried out for high definition video content.
Keywords – Transform coding, discrete cosine transform, Karhunen-Loève transform, adaptive transform, High Efficiency Video Coding standard.
Table of Contents
Chapter 1 – Introduction
1.1. Context and Emerging Problem
1.2. Objectives
1.3. Thesis Structure
Chapter 2 – Reviewing the State-of-the-Art on Transform Coding
2.1. Basics on Transform Coding
2.1.1. Unitary Transforms
2.1.2. One-Dimensional Transforms
2.1.3. Two-Dimensional Transforms
2.1.4. Three-Dimensional Transforms
2.1.5. Directional Transforms
2.2. Most Important Transforms
2.2.1. Karhunen-Loève Transform
2.2.2. Discrete Fourier Transform
2.2.3. Discrete Cosine Transform
2.2.4. Walsh-Hadamard Transform
2.2.5. Discrete Wavelet Transform
2.3. Final Remarks
Chapter 3 – Main Background Technologies: Adaptive Transform and Early HEVC
3.1. An Adaptive Transform for Improved H.264/AVC-Based Video Coding
3.1.1. Objectives
3.1.2. Architecture and Walkthrough
3.1.3. Details on the Adaptive Transform
3.1.4. Performance Evaluation
3.1.5. Summary
3.2. Introduction to the High Efficiency Video Coding Standard
3.2.1. Objectives
3.2.2. Technical Approach
3.2.3. Transform and Quantization
3.2.4. Summary
3.3. Final Remarks
Chapter 4 – Adopted Coding Solution Functional Description and Implementation Details
4.1. Objectives
4.2. Architecture and Walkthrough
4.3. HEVC Framework Functional Description and Implementation Details
4.4. AT Encoder Functional Description and Implementation Details
4.4.1. Reference Frame Upsampling
4.4.2. Frame Partitioning
4.4.3. Motion Compensation Prediction Block Computation
4.4.4. Forward Adaptive Transform
4.4.5. Quantization
4.4.6. Entropy Encoder
4.5. AT Decoder Functional Description and Implementation Details
4.5.1. Entropy Decoder
4.5.2. Inverse Quantization
4.5.3. Inverse Adaptive Transform
4.5.4. Frame Reconstruction
4.6. Summary
Chapter 5 – Performance Evaluation
5.1. Test Conditions
5.1.1. Video Sequences
5.1.2. Coding Conditions
5.1.3. Performance Evaluation Metrics
5.1.4. Coding Benchmarks
5.2. Results and Analysis
5.2.1. Performance for CIF Resolution Video Sequences
5.2.2. Performance for HD Resolution Video Sequences
5.3. Summary
Chapter 6 – Conclusion
6.1. Summary and Conclusions
6.2. Future Work
Appendix A – Transforms in Available Image/Video Coding Standards
Appendix B – Recent Advances on Transform Coding
References
Index of Figures
Figure 1.1 – Digital video on a mobile phone, on a computer and on a television set [1,2,3].
Figure 1.2 – Ultra high definition television set [4].
Figure 2.1 – Typical transform-based image coding architecture.
Figure 2.2 – Example of block artifacts in a highly compressed image [5].
Figure 2.3 – 8×8×8 video cube [7].
Figure 2.4 – Example of image block with diagonal edges.
Figure 2.5 – Samples rearrangement for a diagonal down-left edge.
Figure 2.6 – 8×8 DFT basis functions [9].
Figure 2.7 – 8×8 DCT basis functions [9].
Figure 2.8 – Example of DFT versus DCT reconstruction periodicity effects.
Figure 2.9 – Analysis filter architecture [13].
Figure 2.10 – Example of a three-level 1D-DWT decomposition architecture [13].
Figure 2.11 – Example of a two-level 2D-DWT decomposition [14].
Figure 2.12 – Example of a three-level 2D-DWT decomposition [14].
Figure 3.1 – General architecture of the adaptive transform video coding solution [17].
Figure 3.2 – Forward adaptive transform architecture.
Figure 3.3 – Inverse adaptive transform architecture.
Figure 3.4 – (a) Original block. (b) MCP block. (c) Corresponding prediction error block [15].
Figure 3.5 – (a) Shifted and rotated MCP block (shift: -0.25 pixels vertically; rotation: -0.5°). (b) Difference between the MCP block and the shifted and rotated MCP block [15].
Figure 3.6 – Set of estimated prediction error blocks (shift: -0.5 to 0.5 pixels, horizontally and vertically; rotation: -0.5°) [15].
Figure 3.7 – Covariance matrix for a set of estimated prediction error blocks [18].
Figure 3.8 – Block of covariance values for the pixel in row 3, column 0, with the pixels in all other positions [18].
Figure 3.9 – MKLT basis functions for the example in Figure 3.7 [18].
Figure 3.10 – MKLT and DCT coefficients for the previous example [18].
Figure 3.11 – MKLT and DCT coefficients amplitude versus scan position [18].
Figure 3.12 – RD performance for the H.264 Standard and H.264 AT video coding solutions [15].
Figure 3.13 – Basic HEVC encoder architecture [24].
Figure 3.14 – Illustration of a recursive CTB structure with LCTB size = 128 and maximum hierarchical depth = 5 [22].
Figure 3.15 – Parameters defining the geometric partitioning of a PU [22].
Figure 3.16 – Signal flow graph of Chen's fast factorization for an order-16 DCT [22].
Figure 4.1 – Architecture of the developed coding solution.
Figure 4.2 – Example of (a) PU partitioning and (b) TU partitioning of a 32×32 CTB.
Figure 4.3 – TU depths for the CTB in Figure 4.2 (b).
Figure 4.4 – Coding modes (intra-coding = '0' and inter-coding = '1') for the CTB in Figure 4.2 (b).
Figure 4.5 – (a) Horizontal and (b) vertical motion vector values for the CTB in Figure 4.2 (a).
Figure 4.6 – Half and quarter-pixel motion positions illustration [22].
Figure 4.7 – Upsampled reference frame illustration.
Figure 4.8 – Example of MCP block computation for a 4×4 TU.
Figure 4.9 – MCP block for the example in Figure 4.8 after the downsampling operation.
Figure 4.10 – Architecture of the forward adaptive transform module.
Figure 4.11 – Adopted coordinate system for a 4×4 block.
Figure 4.12 – Rotation of a 4×4 UMCP block by an angle θ around its origin.
Figure 4.13 – Two vectors, v1 and v2, connecting the same point D to two different points, P1 and P2, respectively.
Figure 4.14 – Block positions (blue) converted to the Euclidean space (red) for the block in Figure 4.11.
Figure 4.15 – Shifts applied to a rotated UMCP block with a shift parameter equal to δ for the horizontal and vertical directions.
Figure 4.16 – Set of shifted and rotated UMCP blocks for all possible δ combinations (for each θ).
Figure 4.17 – Architecture of the entropy encoder module.
Figure 4.18 – LZ77 terminology considering the coding of the third character in the input symbol stream.
Figure 4.19 – Architecture of the entropy decoder module.
Figure 4.20 – Architecture of the inverse adaptive transform module.
Figure 5.1 – First frame of the selected CIF video sequences.
Figure 5.2 – First frame of the selected HD video sequence: Kimono sequence.
Figure 5.3 – Container sequence RD performance for the DCT, MKLT HRS and MKLT FRS.
Figure 5.4 – Container sequence RD performance for the DCT, AT HRS, AT FRS and AT HFRS codecs.
Figure 5.5 – Foreman sequence RD performance for the DCT, MKLT HRS and MKLT FRS.
Figure 5.6 – Foreman sequence RD performance for the DCT, AT HRS, AT FRS and AT HFRS.
Figure 5.7 – Mobile sequence RD performance for the DCT, MKLT HRS and MKLT FRS.
Figure 5.8 – Mobile sequence RD performance for the DCT, AT HRS, AT FRS and AT HFRS.
Figure 5.9 – Kimono sequence RD performance for the DCT, MKLT HRS and MKLT FRS.
Figure 5.10 – Kimono sequence RD performance for the DCT, AT HRS, AT FRS and AT HFRS.
Index of Tables
Table 3.1 – Approximated constants for an order-16 DCT [22].
Table 4.1 – 12-tap DCT-based interpolation filter coefficients [22].
Table 4.2 – Reference QPs with the corresponding Qstep [34].
Table 5.1 – Selected QPs and their corresponding Qstep values.
Table 5.2 – Container sequence average PSNR improvements and average bitrate savings for each AT mode against the DCT.
Table 5.3 – Container sequence percentage of inter-coded TUs for each QP and TU block size.
Table 5.4 – Container sequence percentage of TUs coded with the available transforms for each AT codec, QP and TU block size.
Table 5.5 – Foreman sequence average PSNR improvements and average bitrate savings for each AT mode against the DCT.
Table 5.6 – Foreman sequence percentage of inter-coded TUs for each QP and TU block size.
Table 5.7 – Foreman sequence percentage of TUs coded with the available transforms for each AT codec, QP and TU block size.
Table 5.8 – Mobile sequence average PSNR improvements and average bitrate savings for each AT mode against the DCT.
Table 5.9 – Mobile sequence percentage of inter-coded TUs for each QP and TU block size.
Table 5.10 – Mobile sequence percentage of TUs coded with the available transforms for each AT codec, QP and TU block size.
Table 5.11 – Kimono sequence average PSNR improvements and average bitrate savings for each AT mode against the DCT.
Table 5.12 – Kimono sequence percentage of inter-coded TUs for each QP and TU block size.
Table 5.13 – Kimono sequence percentage of TUs coded with the available transforms for each AT codec, QP and TU block size.
List of Acronyms
AT Adaptive Transform
CABAC Context-Adaptive Binary Arithmetic Coding
CAVLC Context-Adaptive Variable-Length Coding
CD Compact Disc
CfP Call for Proposals
CIF Common Intermediate Format
CTB Coding Tree Block
DCT Discrete Cosine Transform
DFT Discrete Fourier Transform
DVD Digital Versatile Disc
DWT Discrete Wavelet Transform
FRExt Fidelity Range Extensions
FRS Full Range shift and rotation parameters Set
GOP Group Of Pictures
HEVC High Efficiency Video Coding
HM HEVC test Model
HRS Half Range shift and rotation parameters Set
HVS Human Visual System
ICT Integer discrete Cosine Transform
ISDN Integrated Services Digital Network
ITU-T International Telecommunication Union – Telecommunication Standardization Sector
JCT-VC Joint Collaborative Team on Video Coding
JPEG Joint Photographic Experts Group
KLT Karhunen-Loève Transform
LCTB Largest Coding Tree Block
LF Loop Filter
MB Macroblock
MCP Motion Compensated Prediction
MDDT Mode-Dependent Directional Transform
MICT Modified Integer discrete Cosine Transform
MKLT Modified Karhunen-Loève Transform
MPEG Moving Picture Experts Group
MV Motion Vector
NICT Non-orthogonal Integer discrete Cosine Transform
PSNR Peak Signal-to-Noise Ratio
PSTN Public Switched Telephone Network
PU Prediction Unit
QCIF Quarter Common Intermediate Format
QVGA Quarter Video Graphics Array
RD Rate-Distortion
ROT Rotational Transform
SCTB Smallest Coding Tree Block
SIF Source Input Format
SVD Singular Value Decomposition
TMuC Test Model under Consideration
TU Transform Unit
UMCP Upsampled Motion Compensated Prediction
VCEG Video Coding Experts Group
VLC Variable-Length Coding
VLI Variable Length Integer
VOP Video Object Plane
WHT Walsh-Hadamard Transform
Chapter 1
Introduction
This first chapter introduces the motivation behind this Thesis. To do this, the relevant context is presented first, followed by the emerging problem that calls for an efficient solution. In this context, the main objectives of this work are defined. Finally, the Thesis structure is described.
1.1. Context and Emerging Problem
Digital video has been a regular presence in our lives for many years now. Whether used in digital television, personal computers, handheld devices or other multimedia applications (see Figure 1.1), its use has grown tremendously in recent years and this growth shows no signs of slowing down.
Figure 1.1 – Digital video on a mobile phone, on a computer and on a television set [1,2,3].
With the currently available transmission and storage capacities, this growth is only possible with the use of powerful compression tools that reduce the number of bits needed to represent the video content; these tools exploit the data correlation to remove redundant data and the limitations of the Human Visual System (HVS) to discard irrelevant data. Such compression tools have been included in several video coding standards defined by the International Telecommunication Union – Telecommunication Standardization Sector (ITU-T) and the Moving Picture Experts Group (MPEG) over the last two decades. Currently, the H.264/AVC coding standard, developed by the Joint Video Team (JVT) formed by the ITU-T Video Coding Experts Group (VCEG) and ISO/IEC MPEG bodies, is considered the state-of-the-art in video coding.
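At the heart of all these standards is a block transform, typically the DCT, that concentrates the energy of a correlated image block into a few coefficients. The sketch below is purely illustrative (it builds a floating-point orthonormal DCT-II from scratch, not the integer transform any standard actually specifies) and shows this energy compaction on a smooth 8×8 block:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)   # frequency index (rows)
    i = np.arange(n).reshape(1, -1)   # sample index (columns)
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] *= 1 / np.sqrt(2)         # DC row scaling for orthonormality
    return m * np.sqrt(2 / n)

def dct2(block):
    """Separable 2D DCT: transform along both dimensions."""
    c = dct_matrix(block.shape[0])
    return c @ block @ c.T

# A smooth 8x8 block (horizontal luminance ramp), typical of natural images.
block = np.tile(np.linspace(100, 140, 8), (8, 1))
coeffs = dct2(block)

# Energy is compacted into very few low-frequency coefficients.
energy = coeffs ** 2
top_two = np.sort(energy.ravel())[-2:].sum()
print(top_two / energy.sum())  # close to 1.0: two coefficients carry almost all energy
```

Because the transform is orthonormal, the total energy of the block is preserved; quantizing away the many near-zero high-frequency coefficients is what yields compression with little visible loss.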
However, with the recent advances in video capturing and display technologies (see Figure 1.2), the presence of
High Definition (HD) and Ultra High Definition (UHD) video contents in various multimedia applications is
quickly increasing. Clearly, these video resolutions require higher bandwidth for their transmission and
larger storage capacities. In this way, the compression ratios achieved by the current state-of-the-art video
coding standard for HD and UHD content do not seem sufficient taking into account the available transmission and
storage supports. With this in mind, the ITU-T VCEG and ISO/IEC MPEG bodies created the Joint
Collaborative Team on Video Coding (JCT-VC) which is currently developing a new video coding standard, the
High Efficiency Video Coding (HEVC) standard, with the objective of increasing the highest available
compression ratios, particularly for very high resolution video contents. To do this, new coding techniques have
to be developed that can guarantee better compression over the current ones even if at the price of some
additional complexity.
Figure 1.2 – Ultra high definition television set [4].
1.2. Objectives
In this context, this Thesis focuses on the design, implementation and assessment of a novel coding technique for
a particular data compression module: transform coding. Transform coding has been used since the first image and
video coding standards and it is still present in the current state-of-the-art video coding standard. The main
objective of transform coding is to remove the spatial redundancy present in a particular image or video frame by
transforming it from the spatial to the frequency domain. Since the HVS is less sensitive to the higher
frequencies than to the lower frequencies, this may also be an effective way to discard irrelevant data contained
in the higher frequency bands. Currently, all the available video coding standards make use of the Discrete
Cosine Transform (DCT), but in this work a novel coding solution is developed using a different transform
technique. This coding solution is intended to be used in the context of the emerging HEVC standard for high
and ultra high resolution video contents. In this way, this Thesis targets the following objectives:
Detailed review of the state-of-the-art on transform coding – First, it is necessary to make a detailed
review on the state-of-the-art on transform coding by studying its basic principles and concepts.
Study and implementation of the adopted transform coding technique – The second objective of
this work is to study and implement the adopted transform in order to allow its integration in a more
general video coding solution.
Study of the recent advances introduced in the HEVC standard – Then, the recent advances present
in the initial version of the emerging HEVC standard must be studied to allow the combination of the
adopted transform coding technique with this video codec.
Integration of the adopted transform coding technique in the HEVC context – With the previously
referred objectives achieved, it is then desirable to integrate (as much as possible) the adopted transform
coding technique in the HEVC codec, defining in this way the video coding solution adopted in this
Thesis.
Performance evaluation of the adopted video coding solution – Finally, the performance of the
developed coding solution is assessed to check its utility in the video coding context. Taking into account
the emerging problem considered and the target resolution of the HEVC standard, this evaluation is
intended to be made for high resolution video contents.
With the achievement of these objectives, it is possible to evaluate if the adopted video coding solution can be
useful for future application in the video coding context.
1.3. Thesis Structure
This Thesis is organized in six chapters and two appendixes, including this first chapter that is used to introduce
the work developed in this Thesis.
After this introductory chapter, Chapter 2 contains a review of the state-of-the-art on this Thesis' main object of
study: transform coding. In this review, the reader is introduced to the basic principles and concepts on transform
coding. Additionally, the most important transforms are introduced, and their basic principles and features are
presented.
In Chapter 3, the two main technical elements behind the studies and implementations performed in this Thesis
are presented. First, a video coding solution making use of the transform coding technique adopted in this work
is reviewed, with natural emphasis on the proposed transform. Then, the currently under development HEVC
standard is presented.
Chapter 4 introduces the reader to the combined coding solution developed in this Thesis. To do this, the general
architecture of the adopted coding solution is presented and the functional description and implementation
details of its main modules are explained.
After describing the adopted coding solution in detail, Chapter 5 reports its performance evaluation. To do this,
the used test conditions are first defined. Then, the performance results obtained with these conditions are
presented and analyzed.
The last chapter of this work, Chapter 6, identifies the conclusions taken from the work developed in this Thesis
and provides some details on future work that can be done in its context.
In Appendix A, the details on the transform coding usage in the context of the available image and video coding
standards are presented.
Appendix B presents a review of some of the most relevant advances on transform coding.
Chapter 2
Reviewing the State-of-the-Art on
Transform Coding
This chapter contains a brief review of the state-of-the-art on transform coding. The chapter starts by reviewing
the basic concepts and principles on transforms. Then, the most important transforms in the context of this Thesis
are presented in detail.
2.1. Basics on Transform Coding
Transform coding is one of the basic tools used in digital compression, notably image, video and also audio data.
In image and video compression, the transforms are mainly used to reduce the spatial redundancy by
representing the pixels in a frequency domain prior to data reduction through compaction and quantization.
Although this chapter will concentrate on reviewing transforms when applied with a coding/compression
purpose, transforms are a basic signal processing tool and, thus, they may be applied with other functional
purposes.
To achieve data compression, the original signal is decorrelated by using an appropriate transform, redistributing
its energy to a typically small number of transform coefficients, usually located in the low frequency region.
These coefficients can then be quantized with the aim of discarding perceptually irrelevant information, without
significantly affecting the subjective quality of the reconstructed/decoded image and video. Although the
transform process does not theoretically involve data losses, the closely associated quantization process is lossy,
since the original values cannot be recovered due to the associated quantization error. It may also happen that the
transform 'becomes' lossy due to the numerical limitations associated with the transform implementation, e.g.
roundings and truncations. The transform operation in the context of a typical image codec is illustrated in
Figure 2.1.
Figure 2.1 – Typical transform-based image coding architecture.
As shown in Figure 2.1, the original signal is usually segmented into square blocks, typically with 8×8 samples.
Each block is then individually transformed, an operation known as block transform. With this block based
processing, it is possible to reduce the computational and storage requirements (i.e. the transform complexity)
when compared to transforming the whole image simultaneously. Transforming each block independently can
also capture local information better, exploiting the correlation between block samples in a more efficient way;
however, the correlation between blocks is typically poorly (or not) exploited. Moreover, this approach can
cause noticeable reconstruction errors at the block boundaries resulting in blocking artifacts, i.e., the boundaries
between adjacent blocks are highly visible (see Figure 2.2). This phenomenon occurs when the higher frequency
components required to reconstruct the sharp boundaries of each block are discarded or highly quantized. Thus,
the higher the compression ratio, the more noticeable the blocking artifacts become.
Figure 2.2 – Example of block artifacts in a highly compressed image [5].
From the compression point of view, an 'ideal' transform should have the following characteristics:
Reversibility – A transform is reversible if the input signal can be recovered in its original domain after
applying the transform and its associated inverse transform without error (if no numerical constraints
exist). In image and video compression, this is an essential feature since the original data has to be
recovered in the spatial domain to be visualized.
Energy compaction – Energy compaction regards the capability to concentrate the signal energy in a few
coefficients without any loss of information by removing existing redundancy. This means that the ideal
transform must concentrate the original signal energy in the smallest number of coefficients possible.
Decorrelation – Decorrelated coefficients are coefficients that do not transmit the same information;
this assures that each coefficient carries additional information with no or small repetition and, thus, it
always adds value by itself.
Data-independent – A data-independent transform is a transform that is independent of the input signal;
ideally, the transform should achieve good compression efficiency for most image types. While it is
natural that the optimal transform depends on the input signal properties, the computational complexity
to find this optimal transform and the overhead required to transmit it to the decoder are typically
neither practical nor desirable.
Low complexity – The complexity of a transform is related with the computational resources required to
perform it, e.g., the number of operations required; it is naturally desirable that a transform can be
performed with the lowest possible computational complexity and this may require the development of
fast transform implementations.
These characteristics have been largely adopted as the requirements for the choice of the transform adopted in
existing image and video compression standards. Next, the most important properties regarding transforms used
for image and video compression are identified.
2.1.1. Unitary Transforms
A unitary transform of an input data vector x is defined by

y = B x   (2.1)

where B is a unitary square matrix and y is the vector with the transform coefficients. A square matrix is unitary
if its inverse is equal to its conjugate transpose, i.e., B^-1 = B^*T. If a unitary matrix only has real entries, i.e., B =
B^*, then its inverse is equal to its transpose, B^-1 = B^T, and it is known as an orthogonal matrix.
The column and row vectors of a unitary matrix are orthogonal (perpendicular to each other) and normalized (of
unit length), i.e., orthonormal. This can be defined by
b_k^{*T} b_l = 1 if k = l, and 0 otherwise   (2.2)

where b_k is the k-th column of the unitary matrix B.
The vectors bk constitute a set of orthonormal basis vectors. Basis vectors are a set of vectors which can be
linearly combined to represent any vector in a given vector space. In a similar way, basis functions are a set of
functions that can be linearly combined to represent any function in the function space. In this case, the unitary
matrix B represents the unitary transform basis functions.
Unitary transforms have very interesting properties, notably in terms of image compression:
Reversibility – As mentioned above, the unitary transform basis functions assure the reversibility of
these transforms (B^-1 = B^*T).
Energy conservation – All the energy from the input signal is preserved in the transform coefficients,
i.e.,

Σ_k |y(k)|² = Σ_n |x(n)|²   (2.3)
Energy compaction – Unitary transforms tend to pack a large fraction of the signal energy into just a
few transform coefficients.
Decorrelation – Most unitary transforms assure the decomposition of the initial signal into reasonably
uncorrelated transform coefficients.
Following these properties and the ideal characteristics for a compression transform as described above, unitary
transforms are the usual choice for the transforms used in image and video compression standards.
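As a concrete illustration of these properties, the following Python sketch (assuming the NumPy library is available; the 4-point DCT-II matrix, anticipated from Section 2.2.3, is just one convenient example of a real unitary, i.e., orthogonal, matrix) numerically verifies reversibility and energy conservation:

```python
import numpy as np

# Orthonormal 4-point DCT-II matrix, used here as an example of a real
# unitary (orthogonal) transform matrix B.
N = 4
k = np.arange(N)[:, None]          # row index (frequency)
n = np.arange(N)[None, :]          # column index (sample)
B = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
B[0, :] /= np.sqrt(2.0)            # DC row normalization

# Reversibility: for an orthogonal matrix, B^-1 = B^T.
assert np.allclose(B.T @ B, np.eye(N))

# Forward transform y = Bx and inverse x = B^T y.
x = np.array([3.0, 1.0, 4.0, 1.0])
y = B @ x
assert np.allclose(B.T @ y, x)

# Energy conservation: sum |y(k)|^2 == sum |x(n)|^2, as in Eq. (2.3).
assert np.isclose(np.sum(y ** 2), np.sum(x ** 2))
```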
2.1.2. One-Dimensional Transforms
Considering x(n) a block of N input samples (spatial-domain), like in a speech or audio signal, and y(k) a set of N
transform coefficients (frequency-domain), a one-dimensional (1-D) transform is given by

y(k) = Σ_{n=0}^{N−1} a(k,n) x(n),   k = 0, 1, …, N−1   (2.4)

where a(k,n) are the forward transform basis functions. The inverse transform used to recover the original signal
is defined by

x(n) = Σ_{k=0}^{N−1} b(k,n) y(k),   n = 0, 1, …, N−1   (2.5)

where b(k,n) are the inverse transform basis functions.
Taking into consideration that the first basis vector typically corresponds to the 'zero' frequency component, it
corresponds to a constant function and, thus, y(0) is known as the DC coefficient, which represents the mean
value of the waveform under transform (for the block transformed). This is the most important transform
coefficient since it is associated to the lowest frequency, to which the human perception systems are typically
very sensitive. All the other transform coefficients are known as AC coefficients.
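The role of the DC coefficient can be checked numerically. In this sketch (NumPy assumed; the orthonormal DCT-II is used as one possible choice for the basis functions a(k,n)), a constant 8-sample block is transformed and all the energy lands in y(0):

```python
import numpy as np

def transform_matrix(N):
    # Orthonormal DCT-II basis functions, one possible choice for a(k, n).
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    A = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    A[0, :] /= np.sqrt(2.0)
    return A

N = 8
x = np.full(N, 10.0)              # a constant ('zero'-frequency) block
y = transform_matrix(N) @ x

# Only the DC coefficient is non-zero; y[0] / sqrt(N) is the block mean.
assert np.allclose(y[1:], 0.0)
assert np.isclose(y[0] / np.sqrt(N), x.mean())
```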
2.1.3. Two-Dimensional Transforms
Considering now x(m,n) a two-dimensional (2-D) N×N array of samples, like in an image signal, and y(k,l) an
N×N array of transform coefficients, the forward and inverse 2-D transforms are given by

y(k,l) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} a(k,l,m,n) x(m,n)   (2.6)

x(m,n) = Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} b(k,l,m,n) y(k,l)   (2.7)

where a(k,l,m,n) and b(k,l,m,n) are the forward and inverse transform basis functions, respectively.
There are two important classes of 2-D transforms: non-separable 2-D transforms and separable 2-D transforms.
A non-separable 2-D transform is performed by simply stacking the N columns (or rows) of the input array end to
end to form a single column vector of length N², and then performing the transform in Eq. (2.1). Non-separable
2-D transforms exploit both the horizontal and vertical correlations in the input signal and typically require N⁴
arithmetic operations [6].
In a separable 2-D transform, the transform basis functions are separated into two independent, horizontal (row) and
vertical (column), operations:

a(k,l,m,n) = a_v(k,m) · a_h(l,n)   (2.8)

b(k,l,m,n) = b_v(k,m) · b_h(l,n)   (2.9)

With these operations, a separable 2-D transform can be performed in two independent steps, applied one after
the other (and not jointly to both directions). The first step uses the horizontal basis function, a_h(l,n), exploiting
the horizontal correlation in the data, while the second step uses the vertical basis function, a_v(k,m), exploiting
the vertical correlation in the data.
The separable 2-D transforms are implemented as two consecutive 1-D transform operations given by

y(k,l) = Σ_{m=0}^{N−1} a_v(k,m) [ Σ_{n=0}^{N−1} x(m,n) a_h(l,n) ]   (2.10)

In matrix notation

Y = A_v X A_h^T   (2.11)

For symmetrical basis functions, this means basis functions which are identical in the vertical and horizontal
directions, A_v = A_h = A, it follows that

Y = A X A^T   (2.12)

X = A^T Y A   (2.13)
The multiplication of two N×N matrices requires N³ arithmetic operations (N arithmetic operations for each
entry of the final result matrix). Therefore, a separable 2-D transform, which involves two matrix multiplications,
requires 2N³ arithmetic operations [6] (against the N⁴ operations of non-separable transforms). Thus, for the
usual case where N ≥ 2, a separable 2-D transform is normally preferred in terms of implementation complexity.
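Eqs. (2.12) and (2.13) translate directly into two matrix products. The sketch below (NumPy assumed; the orthonormal DCT-II is again a convenient example basis) applies a separable 2-D transform to a random 8×8 block and verifies perfect reconstruction:

```python
import numpy as np

def transform_matrix(N):
    # Orthonormal DCT-II matrix, used here for both A_v and A_h.
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    A = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    A[0, :] /= np.sqrt(2.0)
    return A

N = 8
A = transform_matrix(N)
X = np.random.default_rng(0).standard_normal((N, N))

Y = A @ X @ A.T          # forward separable 2-D transform, Eq. (2.12)
X_rec = A.T @ Y @ A      # inverse, Eq. (2.13)
assert np.allclose(X_rec, X)
```

Each matrix product costs N³ multiply-accumulate operations, so the 2N³ operation count quoted above is directly visible in the two products.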
2.1.4. Three-dimensional Transforms
Consider now x(m,n,p) a three-dimensional (3-D) N×N×N input signal. This signal has two spatial components
and one temporal component, forming an N×N×N cube. Figure 2.3 shows an illustration of an 8×8×8 video cube
formed by 8 frames, each providing an 8×8 data block.
Figure 2.3 – 8×8×8 video cube [7].
Considering y(k,l,q) the transform coefficients, the forward and inverse 3-D transforms are given by

y(k,l,q) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} Σ_{p=0}^{N−1} a(k,l,q,m,n,p) x(m,n,p)   (2.14)

x(m,n,p) = Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} Σ_{q=0}^{N−1} b(k,l,q,m,n,p) y(k,l,q)   (2.15)

where a(k,l,q,m,n,p) and b(k,l,q,m,n,p) are the forward and inverse transform basis functions, respectively.
With a 3-D transform, it is possible to exploit the correlation between the samples in the three main dimensions,
two in space and one in time. Particularly for video compression, it is possible to remove not only the spatial
redundancy (intra-frame coding), but also simultaneously the temporal redundancy (inter-frame coding).
Naturally, using this type of transform for a video sequence will cause a coding delay depending on the number
of frames accumulated to perform the 3-D transform. A particular 3-D transform is presented with more detail in
Section B.3.
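The general 3-D transform of Eqs. (2.14) and (2.15) becomes separable when the basis functions factor along the three axes. The following sketch (NumPy assumed; reusing the DCT-II basis along every axis is an illustrative choice, not the particular 3-D transform of Section B.3) transforms an 8×8×8 cube and recovers it exactly:

```python
import numpy as np

def transform_matrix(N):
    # Orthonormal DCT-II basis, reused along all three axes.
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    A = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    A[0, :] /= np.sqrt(2.0)
    return A

N = 8
A = transform_matrix(N)
cube = np.random.default_rng(1).standard_normal((N, N, N))  # 8x8x8 video cube

# Separable forward 3-D transform:
# y(k,l,q) = sum_{m,n,p} A[k,m] A[l,n] A[q,p] x(m,n,p)
Y = np.einsum('km,ln,qp,mnp->klq', A, A, A, cube)

# Inverse: the transposed (orthogonal) basis recovers the cube exactly.
X_rec = np.einsum('km,ln,qp,klq->mnp', A, A, A, Y)
assert np.allclose(X_rec, cube)
```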
2.1.5. Directional Transforms
A directional transform is a transform that uses information about the edges present in the input data to
better exploit the correlation between the samples. The objective of these transforms is to improve the coding
performance by detecting and removing more spatial redundancy than non-directional transforms, increasing the
compression ratio for the target quality.
As shown in the previous sections, a separable 2-D transform is independently implemented through two 1-D
transforms: one along the vertical direction (along the input data columns) and another along the horizontal
direction (along the input data rows). This approach is very useful since both vertical and horizontal directions
are important according to the HVS; it is also useful in cases where the input data has important horizontal
and/or vertical edges. However, for data containing other directional edges - a typical situation in many image
signals - a separable 2-D transform may not be the best choice. As an example, consider the image block
presented in Figure 2.4, where a diagonal line divides two uniform regions. In this case, a separable 2-D
transform would generate a rather high number of non-zero AC coefficients, deteriorating the transform
compression performance in terms of energy compaction.
Figure 2.4 – Example of image block with diagonal edges.
In these situations, directional transforms may be used and useful. There are various kinds of directional
transforms, notably:
Mode-dependent directional transform – One approach is to store a set of different basis functions,
each one suitable for a specific edge direction. After detecting the edge direction or the most relevant
edge direction of a given input data, the corresponding basis functions are used to perform the 2-D
transform.
1-D directional transform, followed by a 1-D horizontal transform – With this approach, the first
step is to perform a 1-D transform along the direction of the input data edge. The second step is to
perform a 1-D horizontal transform, since the first row contains all DC coefficients and each of the other
rows contains all AC coefficients with the same index.
Directional ordering of the data block, followed by a separable 2-D transform - Another approach is
to rearrange the samples in the input data according to its directional edge (see Figure 2.5). Afterwards, a 1-D
transform is performed along the columns and the rows of the rearranged data, similarly to the separable
2-D transform process.
Figure 2.5 – Samples rearrangement for a diagonal down-left edge.
This subject is addressed with more detail in Section B.2, for a particular directional transform solution.
2.2. Most Important Transforms
In this section, some unitary transforms of interest are presented, notably the Karhunen-Loève (KLT),
the Discrete Fourier (DFT), the Discrete Cosine (DCT), the Walsh-Hadamard (WHT) and the Discrete Wavelet
(DWT) transforms.
2.2.1. Karhunen-Loève Transform
The Karhunen-Loève Transform (KLT) is a unitary and orthogonal transform. It is non-separable and the
forward and inverse 1-D KLT for a vector x are defined by

y = Φ^T x,   x = Φ y   (2.16)

where the matrix Φ represents the KLT basis functions. The KLT does not have a fixed set of basis functions
since they depend on the original data. The KLT basis functions are determined with the following steps:
1. Computation of the covariance matrix of the input data – The covariance matrix Σ is defined as

Σ = cov(x) = E[(x − μ)(x − μ)^T]   (2.17)

where

μ_i = E[x_i]   (2.18)

is the expected value of the i-th entry in the vector x.
2. Computation of the eigenvectors¹ and eigenvalues² of the covariance matrix – Compute the matrix Φ of
eigenvectors of the covariance matrix Σ

Σ Φ = Φ Λ   (2.19)

where Λ is the diagonal matrix of eigenvalues of the covariance matrix Σ, i.e.,

Λ = diag(λ_0, λ_1, …, λ_{N−1})   (2.20)

where λ_m is the m-th eigenvalue of the covariance matrix Σ. The columns of matrix Φ correspond to the
eigenvectors of the covariance matrix Σ, representing the KLT basis functions.
The main KLT advantage is:
Best energy compaction – The KLT is theoretically the best transform in terms of energy compaction
when compared to other transforms. The KLT is able to pack more signal energy in the same fraction of
coefficients or to pack a certain fraction of the total energy in the smallest number of coefficients.
The main KLT drawbacks are:
Data-dependent – The KLT uses data-dependent basis functions; this implies the continuous
computation of the input signal covariance matrix as well as its storage and transmission.
High complexity – The high number of operations required to determine the KLT basis functions
significantly increases its complexity.
The use of the KLT for image and video compression is rather uncommon as it fails to fulfill two of the
characteristics typically asked of an ideal, efficient transform: data-independent basis functions and low
complexity. The KLT is also known as Principal Component Analysis (PCA) and it may be used as a tool in
exploratory data analysis and predictive models, where it is essential to have the best performance possible in an
energy-packing sense [8].
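The two-step basis derivation above can be sketched numerically (NumPy assumed; the AR(1)-style training model is a hypothetical choice used only to produce correlated data, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8

# Hypothetical training data: correlated 8-sample vectors drawn from an
# AR(1)-style model with correlation coefficient 0.95.
Sigma_model = 0.95 ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
X = rng.multivariate_normal(np.zeros(N), Sigma_model, size=2000)

# Step 1: covariance matrix of the input data.
Sigma = np.cov(X, rowvar=False)

# Step 2: eigendecomposition; the eigenvectors are the KLT basis functions.
eigvals, Phi = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]        # decreasing eigenvalue order
eigvals, Phi = eigvals[order], Phi[:, order]

# Forward KLT y = Phi^T x; inverse x = Phi y (reversibility).
x = X[0]
y = Phi.T @ x
assert np.allclose(Phi @ y, x)

# Energy compaction: for highly correlated data, the first eigenvalue
# (i.e., the expected energy of the first coefficient) dominates.
assert eigvals[0] / eigvals.sum() > 0.5
```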
2.2.2. Discrete Fourier Transform
The Discrete Fourier Transform (DFT) is a unitary and orthogonal transform that is used to decompose the
original data into its sine and cosine components. Its 2-D unitary forward and inverse versions are defined by

y(k,l) = (1/N) Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} x(m,n) e^{−j2π(km+ln)/N}   (2.21)

x(m,n) = (1/N) Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} y(k,l) e^{j2π(km+ln)/N}   (2.22)

for an N×N block of data samples.
The DFT basis functions correspond to sine and cosine waves with increasing frequencies. As noted before, the
first coefficient, y(0,0), represents the DC-component of the corresponding data; for example, for the image
luminance, it corresponds to its average brightness/luminance.
As the DFT is a separable transform, its basis functions can be represented as the product of two 1-D transforms,
given by

a(k,l,m,n) = [(1/√N) e^{−j2πkm/N}] · [(1/√N) e^{−j2πln/N}]   (2.23)

¹ Formally, if A is a linear transformation, a non-null vector x is an eigenvector of A if there is a scalar λ such
that Ax = λx.
² The scalar λ is said to be an eigenvalue of A corresponding to the eigenvector x.
The 8×8 DFT basis functions are visually shown in Figure 2.6.
Figure 2.6 – 8×8 DFT basis functions [9].
For an N-length vector, computing a 1-D DFT requires N² arithmetic operations. To reduce the complexity of this
transform, a fast DFT implementation is often used, well known as the Fast Fourier Transform (FFT). An
FFT algorithm can reduce the number of arithmetic operations to only N log N for a 1-D DFT [10].
The main DFT advantage is:
Fast implementation – Using an FFT algorithm, it is possible to significantly reduce the DFT
complexity; this is a great advantage in comparison to the KLT. When compared to the transforms
presented next, this is not very significant as all of them also have fast algorithms to simplify their
implementation.
The main DFT drawback is:
Complex coefficients – The DFT produces complex coefficients, with real and imaginary parts, i.e.,
with magnitude and phase; the storage and manipulation of these complex values may be a
disadvantage when compared to other available transforms, e.g. the DCT, which uses real (and not
complex) numbers.
The DFT is not usually used for image and video compression, as there are other transforms considered to be
more appropriate, e.g. the DCT. Instead, it is widely used for spectrum analysis, to solve partial differential
equations and to perform other operations such as convolutions or multiplying large integers.
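A quick numerical illustration of these points, using NumPy's FFT implementation (assumed available):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.fft.fft(x)          # DFT computed via an FFT algorithm

# The coefficients are complex-valued (the DFT's main drawback here).
assert np.iscomplexobj(y)

# y[0] is the DC component: for NumPy's unnormalized convention,
# it equals the sum of the samples (the mean, up to scaling).
assert np.isclose(y[0].real, x.sum()) and np.isclose(y[0].imag, 0.0)

# Reversibility: the inverse DFT recovers the original real samples.
assert np.allclose(np.fft.ifft(y).real, x)
```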
2.2.3. Discrete Cosine Transform
The Discrete Cosine Transform (DCT) is a unitary and orthogonal transform, conceptually rather similar to the
DFT but only using real numbers (and not complex ones anymore). For an N×N block of samples, the forward 2-D
DCT is defined by

y(k,l) = (2/N) c(k) c(l) Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} x(m,n) cos[(2m+1)kπ / 2N] cos[(2n+1)lπ / 2N]   (2.24)

and the inverse 2-D DCT is defined by

x(m,n) = (2/N) Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} c(k) c(l) y(k,l) cos[(2m+1)kπ / 2N] cos[(2n+1)lπ / 2N]   (2.25)

with

c(k) = 1/√2 for k = 0, and c(k) = 1 otherwise   (2.26)
Like the DFT, since the DCT is also a separable transform, it can be represented as the product of two 1-D
DCTs. The 8×8 2-D DCT basis functions are visually shown in Figure 2.7.
Figure 2.7 – 8×8 DCT basis functions [9].
Since the cosine function is real and even, i.e., cos(x) = cos(-x), and the input signal is also real, the inverse DCT
generates a function that is even and periodic in 2N, considering N the length of the original signal sequence. In
contrast, the inverse DFT produces a reconstruction signal that is periodic in N; these effects are illustrated in
Figure 2.8. In Figure 2.8, the original sequence in (a) is transformed and reconstructed in (b) by using a forward-
inverse DFT pair and in (c) by using a forward-inverse DCT pair. The periodicity of the inverse DCT is 10
samples long, twice as long as the periodicity of the inverse DFT. It can be noted that the DCT reconstruction
introduces less severe discontinuities at the end of the sequence than the DFT reconstruction. The importance of
this DCT property is that reconstruction errors at the block boundaries, and the consequent blocking artifacts, are
less severe in comparison to those of the DFT.
Figure 2.8 – Example of DFT versus DCT reconstruction periodicity effects.
For highly correlated signals, the DCT compaction performance comes very close to the KLT performance.
However, unlike the KLT, the DCT basis functions are not data-dependent, avoiding the computation of the data
covariance matrix, along with its storage and transmission.
There are also many fast DCT implementation algorithms, notably the Fast Cosine Transform (FCT)
algorithm. These algorithms can perform a 1-D DCT for a vector with length N with N log N arithmetic operations
[11].
The main DCT advantages are:
Fast implementation with only real computations – Like the DFT, the DCT can be implemented
using fast algorithms which can greatly reduce the number of operations and, thus, its computational
complexity. In addition to this, the DCT only requires real computations, avoiding the manipulation of
complex numbers as in the DFT.
Reduced blocking artifacts – The DCT properties in terms of its periodicity help avoiding border
discontinuities; this may considerably reduce the appearance of blocking artifacts.
With these advantages and no significant drawbacks, the DCT is by far the most widely used transform for
image (e.g. JPEG standard) and video compression (e.g. ITU-T H.26x recommendations and MPEG standards).
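The energy compaction advantage for highly correlated signals can be checked with SciPy's DCT-II implementation (assuming SciPy is available; any orthonormal DCT implementation would do):

```python
import numpy as np
from scipy.fft import dct, idct

x = np.arange(8, dtype=float)       # a smooth ramp: highly correlated
y = dct(x, norm='ortho')            # orthonormal forward DCT-II

# Energy conservation under the orthonormal DCT.
assert np.isclose(np.sum(y ** 2), np.sum(x ** 2))

# Keep only the two lowest-frequency coefficients and reconstruct.
y_trunc = np.where(np.arange(8) < 2, y, 0.0)
x_rec = idct(y_trunc, norm='ortho')

# Over 95% of the signal energy survives in just two coefficients.
assert np.sum(x_rec ** 2) / np.sum(x ** 2) > 0.95
```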
2.2.4. Walsh-Hadamard Transform
The Walsh-Hadamard Transform (WHT) is a unitary and orthogonal transform. It is separable and its forward
and inverse 1-D transforms for a vector x with length 2^m are defined by

y = H_m x,   x = H_m y   (2.27)

where the matrix H_m represents the WHT basis functions. The matrix H_m is a 2^m × 2^m Hadamard matrix, i.e., a
square matrix whose entries are either +1 or −1 (up to normalization) and whose rows are mutually orthogonal, given by

H_m = (1/√2) [ H_{m−1}  H_{m−1} ; H_{m−1}  −H_{m−1} ]   (2.28)

where 1/√2 is a normalization factor.
Some examples of these matrices for various block sizes are

H_0 = [ 1 ]   (2.29)

H_1 = (1/√2) [ 1  1 ; 1  −1 ]   (2.30)

H_2 = (1/2) [ 1  1  1  1 ; 1  −1  1  −1 ; 1  1  −1  −1 ; 1  −1  −1  1 ]   (2.31)
The Fast Walsh-Hadamard Transform (FWHT) is an efficient algorithm to compute the WHT. An FWHT
algorithm can reduce the number of arithmetic operations required to compute a 1-D WHT from N² to N log N
[12].
The main WHT advantage is:
Fast and simple implementation – The Hadamard transform matrices are purely real, containing
values that are either +1 or −1. In this way, the WHT only has to perform very simple real operations,
significantly reducing the transform's complexity. Moreover, with the usage of an FWHT algorithm, the
WHT is considered the best transform from a complexity point of view.
The main WHT drawback is:
Modest energy compaction – From an energy compaction perspective, the WHT is not as efficient as
alternative transforms like the DCT; in fact, compared to all the other transforms presented in this
chapter, the WHT has the worst compaction performance [6].
The WHT is used in many signal processing and data compression algorithms, mainly because of its fast
implementation. In video compression, it may be used as a secondary transform, e.g. applied on the primary
transform DC coefficients to obtain even more compression in smooth regions, like in the H.264/AVC video
compression standard.
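The recursive construction of Eq. (2.28) is a few lines of code. This sketch (NumPy assumed) builds the unnormalized matrix, checks the ±1 structure and row orthogonality, and verifies that the normalized transform is its own inverse:

```python
import numpy as np

def hadamard(m):
    # Unnormalized Hadamard matrix of size 2^m x 2^m, built recursively:
    # H_m = [[H_{m-1}, H_{m-1}], [H_{m-1}, -H_{m-1}]].
    H = np.array([[1.0]])
    for _ in range(m):
        H = np.block([[H, H], [H, -H]])
    return H

m = 2
H = hadamard(m)
N = 2 ** m

# Every entry is +1 or -1, and the rows are mutually orthogonal.
assert set(np.unique(H)) == {-1.0, 1.0}
assert np.allclose(H @ H.T, N * np.eye(N))

# Normalized WHT: forward y = Hx/sqrt(N); the same operation inverts it.
x = np.array([4.0, 2.0, 2.0, 0.0])
y = H @ x / np.sqrt(N)
assert np.allclose(H @ y / np.sqrt(N), x)
```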
2.2.5. Discrete Wavelet Transform
The Discrete Wavelet Transform (DWT) is a unitary, orthogonal and separable transform that is usually applied
to the whole input data (or large parts of it called tiles) but typically not to small data blocks like all the
previously reviewed transforms. The DWT of an input signal x is computed by passing it through a series of
filters. First, the input samples are decomposed using a low-pass filter, g, i.e., a filter that passes low-frequency
signals but attenuates the high-frequency ones, and a high-pass filter, h, i.e., a filter that passes high-frequency
signals but attenuates the low-frequency ones. This operation is given by

y_g[n] = Σ_k x[k] g[n − k],   y_h[n] = Σ_k x[k] h[n − k]   (2.32)

where y_g and y_h are the low-pass and high-pass band coefficients, respectively.
The filters g and h must be closely related to each other in order to split the input signal into two bands, forming
a quadrature mirror filter, i.e.,
(2.33)
where f is the frequency. This property assures there is no information loss in the decomposition process.
With the operation in Eq. (2.32), half the signal frequencies are removed in both bands. In this way, according to
the sampling theorem³, half the samples can also be discarded and the outputs of the two filters, g and h, can be
subsampled by 2. This operation is given by

y_g[n] = Σ_k x[k] g[2n − k],   y_h[n] = Σ_k x[k] h[2n − k]   (2.34)
The filter analysis process described above is illustrated in Figure 2.9.
Figure 2.9 – Analysis filter architecture [13].
After this process, most of the energy is usually located in the low-pass band. To increase the frequency resolution
in this band, further decompositions can be performed, repeating the operation in Eq. (2.34); this is illustrated in
Figure 2.10.
³ If a function x(t) contains no frequencies higher than B hertz, it is completely determined by giving its ordinates
at a series of points spaced 1/(2B) seconds apart.
Figure 2.10 – Example of a three-level 1D-DWT decomposition architecture [13].
Thus, for 1-D input signals, the successive application of the filters on the low-pass outputs results in a dyadic
decomposition, i.e., the number of coefficients for each novel lower band is half the number for the previous
decomposition.
As for 2-D input signals, the number of coefficients for each novel lower band is a quarter of the number for the
previous decomposition (2-D dyadic decomposition). In Figure 2.11, an explanation of a two-level DWT
decomposition for a 2-D input signal is shown; moreover, Figure 2.12 shows an example of a three-level DWT
decomposition for a real image.
Figure 2.11 – Example of a two-level 2D-DWT decomposition [14].
Figure 2.12 – Example of a three-level 2D-DWT decomposition [14].
As shown in Figure 2.11, the 2-D DWT results from applying a 1-D DWT first to the rows and after to the
columns of the input signal, which is typical of a separable 2-D transform.
The DWT is resolution-scalable, i.e., its coefficients allow the reconstruction of multiple spatial resolutions. For
example, in Figure 2.11, considering an NN image as the input signal, there are 3 different spatial resolutions
that can be recovered by the decoder:
LL2, with resolution N/4 × N/4;
LL1 = LL2 + LH2 + HL2 + HH2, with resolution N/2 × N/2; and
LL0 = LL1 + LH1 + HL1 + HH1, with resolution N × N.
Like other transforms, there are also many algorithms to perform a DWT in a more computationally efficient
way. These algorithms, known as Fast Wavelet Transform (FWT) algorithms, can compute a 1-D DWT for a
vector with length N with only N arithmetic operations [13].
The main DWT advantages are:
No blocking artifacts – Since the transform is applied to the full image and no block partitioning is
used, there are naturally no blocking artifacts in the decoded image.
Higher compression ratio – Transforming the whole input signal allows exploiting the correlation
between all neighbor samples and not only between samples of the same data block; this typically
allows reaching higher compression ratios.
Resolution-scalable – With the dyadic decompositions used in the DWT, it is possible to increase or
decrease the spatial resolution of the recovered data by simply increasing or decreasing the number of
coefficients decoded; these quality and spatial resolution scalability features are very useful for image
(and video) compression.
The main DWT drawback is:
High complexity – Performing a transform on the whole input signal, instead of dividing it into smaller
blocks, has a higher cost in terms of complexity: with a larger number of input samples, the number of
operations required to perform the transform also increases. This makes the DWT complexity
considerably high, even though its fast algorithms are the most efficient among all the other transforms
presented in this chapter [13].
The DWT is mainly used for signal compression, particularly image and video compression (e.g. JPEG 2000
standard); it is also used for signal analysis, e.g., voice or even seismic data.
2.3. Final Remarks
In this chapter the basics of transform coding were presented. For details on the transform coding usage in the
context of the available image and video coding standards refer to Appendix A. Moreover, Appendix B presents
a review of some of the most relevant recent advances on transform coding.
The next chapter introduces two essential technical elements for the development of the adopted coding solution:
the adopted transform coding solution and the HEVC standard.
Chapter 3
Main Background Technologies: Adaptive
Transform and Early HEVC
The main purpose of this chapter is to present the two main technical elements which are behind the
implementation and studies presented in the next chapters. The first main background technical element is an
adaptive transform (AT) proposed in 2010 by Biswas et al. [15] to improve the video coding performance in the
context of the H.264/AVC standard. This adaptive transform is based on the KLT applied to prediction error
blocks and does not require its associated basis functions to be encoded and then transmitted to the decoder, as
they are also estimated there. The main concepts and algorithms behind this technique are explained with more
detail in the first section of this chapter. The second main background technical element is the High Efficiency
Video Coding (HEVC) standard, currently under development by the JCT-VC group which was jointly
created by MPEG (ISO/IEC) and VCEG (ITU-T); the main objective of this recent standardization initiative is to
develop a new video codec for high and ultra high definition content with around 50% better compression
efficiency than the best H.264/AVC profile, the High profile.
3.1. An Adaptive Transform for Improved H.264/AVC-Based Video Coding
The spatial transform has always been a basic coding tool in all video coding standards developed in the past
decades. For most cases, a DCT has been adopted meaning that both the encoder and decoder know, since the
very beginning, which transform basis functions should be used. A main drawback of this type of solution is that
the transform basis functions do not consider the specific content to be coded and thus do not adapt to it,
reducing the energy compaction capabilities. However, there is the advantage that the transform basis functions
have to be neither computed nor transmitted.
An alternative solution is to adopt an adaptive transform which basis functions change depending on the content.
A solution following this principle is presented in [15] and will be adopted in this Thesis considering the
demonstrated compression performance. The authors propose a video coding solution allowing the adaptive selection of
the usual DCT or a modified KLT (MKLT), depending on the block to be coded. This solution allows adapting to
the block content without the burden of transmitting the KLT basis functions as they are equally estimated at
both the encoder and decoder sides. This method has been integrated in the H.264/AVC video coding
architecture to assess its performance against the standard H.264/AVC codec.
3.1.1. Objectives
As noted in Chapter 2, the KLT is the optimal transform in terms of energy compaction. Still, all the currently
available video coding standards make use of the DCT to represent the video information in the frequency
domain. This choice is due to the fact that the DCT, unlike the KLT, does not require the computation, coding
and transmission to the decoder of its basis functions for each block (i.e. it is data-independent) and can achieve
a near-optimal compression efficiency for highly correlated signals. A study presented in [16] reports that the
KLT improvements from an energy packing perspective when compared to the DCT are virtually lost by the
extra bits needed to represent its basis functions.
In [15], Biswas et al. propose a coding solution using an adaptive transform which allows a dynamic selection
between the DCT and a MKLT, depending on the block content. This solution does not require coding and
transmitting the MKLT basis functions. Instead, they are estimated in both the encoder and decoder using the
same technique, thus assuring equivalent transform basis at both ends of the coding chain. In this way, it is
possible to exploit the optimal behavior of the KLT, particularly for blocks which are hard to code using the
DCT (e.g. blocks with diagonal edges).
As the proposed KLT-based technique is only applicable to prediction error blocks, this adaptive transform
solution can only bring compression improvements for inter-coded blocks. This limitation is imposed by the
characteristics of the technique which will be later described.
3.1.2. Architecture and Walkthrough
As referred above, the AT solution was designed to be integrated in the standard H.264/AVC codec and improve
its compression efficiency. In this context, the main architecture of the proposed video codec is basically the
same as the H.264/AVC architecture (see Section A.8), with the exception of the forward and inverse transform
modules; however, as the bitstream syntax and semantics and the decoding behavior change, there is no
compatibility with H.264/AVC. The general architecture of this solution is shown in Figure 3.1.
Figure 3.1 – General architecture of the adaptive transform video coding solution [17].
A step-by-step walkthrough of the encoding process is presented next:
Macroblock splitting – First, the input video is split into 16×16 macroblocks as usual in H.264/AVC.
For the proposed coding solution, the authors use the FRExt extensions, implying that 8×8 blocks are
also available. However, in this case, only 8×8 blocks are used for the transform operation (the authors
provide no motivation for this choice).
Transform – To transform each input block, the encoder decides whether to use the standard
H.264/AVC DCT (Integer DCT) or the proposed MKLT. This choice is made in a rate-distortion
optimized manner and is signaled to the decoder using only 1 bit for each coded block. In the next
section, the proposed adaptive transform is explained in detail.
Quantization – The transform coefficients (DCT or MKLT) for each block are then quantized in the
standard H.264/AVC way.
Entropy encoder – Finally, the quantized coefficients are entropy coded using CAVLC. For the DCT
transformed blocks, the standard H.264/AVC scanning orders are used (i.e. zigzag and alternate scans),
while for the MKLT blocks the coefficients are arranged from the highest to the lowest variance into
four 4×4 blocks which are then passed to the entropy encoder.
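The rate-distortion optimized choice between the DCT and the MKLT described above can be sketched as follows. The Lagrangian cost J = D + λR and the λ(QP) formula below are common H.264-era heuristics, not values specified in [15], and the distortion/rate figures in the example are hypothetical:

```python
# Common H.264-era heuristic for the Lagrangian multiplier at QP = 28
# (an assumption here; [15] does not state the lambda model used).
LAMBDA = 0.85 * (2 ** ((28 - 12) / 3.0))

def rd_cost(distortion, rate_bits):
    """Lagrangian cost J = D + lambda * R used for the per-block decision."""
    return distortion + LAMBDA * rate_bits

def select_transform(dct_dist, dct_bits, mklt_dist, mklt_bits):
    """Return ('DCT' or 'MKLT', flag_bit); each candidate pays 1 extra bit for the flag."""
    j_dct = rd_cost(dct_dist, dct_bits + 1)
    j_mklt = rd_cost(mklt_dist, mklt_bits + 1)
    return ('MKLT', 1) if j_mklt < j_dct else ('DCT', 0)

# Hypothetical measured values for one 8x8 block:
print(select_transform(dct_dist=520.0, dct_bits=46, mklt_dist=430.0, mklt_bits=39))
```

The one-bit flag cost is charged to both candidates, so the decision reduces to comparing the two Lagrangian costs directly.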
As for the decoder, the only difference regarding the H.264/AVC video coding standard is related to the
transform module, which will be described in the next section. The choice between the inverse DCT and the
inverse MKLT is made according to the information included in the bitstream by the encoder for each block.
3.1.3. Details on the Adaptive Transform
As referred before, the proposed adaptive transform video coding solution makes use of one of two transforms:
the DCT and a novel MKLT [15]. This adaptive transform is only applicable to inter-coded blocks, where only
the prediction error is transformed and quantized. The architectures for the forward and inverse adaptive
transforms are shown in Figure 3.2 and Figure 3.3, respectively.
Figure 3.2 – Forward adaptive transform architecture.
As shown in Figure 3.2, the forward adaptive transform basically consists of the computation of both the
forward DCT and the forward MKLT, followed by the selection of the transform offering the best rate-
distortion performance. To compute the MKLT, the prediction error has to be estimated and the MKLT basis
functions computed based on the estimated prediction error, as described below.
Figure 3.3 – Inverse adaptive transform architecture.
At the decoder side, the inverse adaptive transform consists of the computation of an inverse DCT or an inverse
MKLT, depending on which transform was selected in the encoding process, as shown in Figure 3.3. Again, to
compute the MKLT, the prediction error has to be estimated and the MKLT basis functions computed based on
the estimated prediction error, as described below.
While the DCT is the same already used in H.264/AVC for 8×8 blocks, i.e. an order-8 (8×8) ICT
(see Section A.8.3), the MKLT, although similar to a standard KLT (Section 2.2.1), has some special features
which will be explained below. With this in mind, the next section is dedicated to that transform.
From the observation of both the forward and inverse adaptive transform architectures, it is possible to conclude
that the Modified Karhunen-Loève Transform (MKLT) process includes 3 main modules (see colored blocks in
Figure 3.2 and Figure 3.3): prediction error estimation, MKLT basis functions computation and MKLT
computation (whether it is a forward KLT, in the forward AT case, or an inverse KLT, for the inverse AT case).
Therefore, these 3 modules are described in the following.
1) Prediction error estimation module
The prediction error is the difference between the original block and the Motion Compensated Prediction (MCP)
block, which in H.264/AVC is coded using motion vectors associated with one or multiple reference frames. In
this context, the prediction error (in the spatial domain) is indispensable to compute the standard KLT basis
functions. However, although the actual prediction error is available at the encoder, it is not available at the decoder.
This makes it impossible to compute the prediction error basis functions at the decoder side and thus to use
the standard KLT. In this context, the only possibility to avoid the transmission of the basis functions to the
decoder, and the consequent bitrate cost, is to estimate the prediction error. To do that, Biswas et al. [15] assume
that the prediction error is caused by errors in the motion estimation process, particularly:
Interpolation errors – In the motion compensation process, some errors can occur when
interpolating the reference frame pixels for quarter-pixel accuracy.
Imprecise edge prediction – In blocks with strong diagonal edges, the motion vectors may not be
estimated with full accuracy, thus causing small shifts in the location of the edges between the original
and the MCP block.
Following these assumptions, Biswas et al. [15] propose to estimate the prediction error by simulating these
conditions. This is done by subtracting shifted and rotated versions of the MCP block from the MCP block itself,
which plays here the role of the 'original' data. The use of the MCP block for this purpose is natural as it is the
only piece of information that is simultaneously available at both the encoder and decoder. To exemplify this
operation, Figure 3.4 shows an original 8×8 block (a), its corresponding MCP block (b) and the prediction error
block (c). The MCP block is then shifted vertically by -0.25 pixels and rotated by -0.5°, resulting in the block
shown in Figure 3.5 (a). To complete the operation, the shifted and rotated MCP block is then subtracted from
the MCP block, Figure 3.5 (b).
Figure 3.4 – (a) Original block. (b) MCP block. (c) Corresponding prediction error block [15].
Figure 3.5 – (a) Shifted and rotated MCP block (shift: -0.25 pixels vertically; rotation: -0.5°). (b) Difference
between the MCP block and the shifted and rotated MCP block [15].
Despite the sign change when compared to the actual prediction error, see Figure 3.4 (c), the correlation between
the pixels in the estimated prediction error, Figure 3.5 (b), seems similar to the actual inter-pixel correlation in the
'true' prediction error. This is useful since the KLT basis functions are computed from the covariance
matrix of the input (error) block. To allow the exploitation of the above described prediction error properties in
the various directions, Biswas et al. [15] propose the following shifts and rotations of the MCP block for the
prediction error estimation:
Shifts – The MCP block is shifted horizontally and vertically by 0.0, ±0.25 and ±0.5 pixels.
Rotations – The MCP block is rotated by 0.0° and ±0.5°.
In [15], Biswas et al. do not explain what criterion was used to define the maximum shift and rotation parameters
(0.5 pixels and 0.5°, respectively). This is one of the reasons why other maximum parameters will be tested later
in this Thesis. The combination of all 5 shift parameters along the horizontal and vertical directions results in
25 shifted MCP blocks (5×5=25). These shifted MCP blocks can then be rotated with 3 different rotation
parameters (-0.5°, 0.0° and 0.5°), resulting in a set of 75 shifted and rotated MCP blocks (25×3=75). Then, the
difference between the actual MCP block and the set of shifted and rotated MCP blocks is computed in order to
obtain a set of 75 estimated prediction error blocks. As an example, consider Figure 3.6 where a set of 25
estimated prediction error blocks is shown; in this case, only the results for a -0.5° rotation are shown.
Figure 3.6 – Set of estimated prediction error blocks (shift: -0.5 to 0.5 pixels, horizontally and vertically;
rotation: -0.5°) [15].
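The shift part of this estimation procedure can be sketched in Python as follows (rotation is omitted for brevity; in [15] each shifted block is additionally rotated by 0.0° and ±0.5° to reach the full set of 75 blocks). The bilinear interpolation and the edge padding used here are implementation assumptions, as [15] does not detail the interpolation employed:

```python
import numpy as np

def bilinear_shift(block, dy, dx):
    """Shift a block by a fractional amount (dy, dx) using bilinear interpolation.
    Samples falling outside the block are taken from the nearest border (edge padding)."""
    n = block.shape[0]
    padded = np.pad(block.astype(float), 1, mode='edge')
    ys = np.arange(n)[:, None] + 1 - dy   # source coordinates in the padded block
    xs = np.arange(n)[None, :] + 1 - dx
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    wy, wx = ys - y0, xs - x0
    return ((1 - wy) * (1 - wx) * padded[y0, x0] + (1 - wy) * wx * padded[y0, x0 + 1]
            + wy * (1 - wx) * padded[y0 + 1, x0] + wy * wx * padded[y0 + 1, x0 + 1])

mcp = np.random.default_rng(0).integers(0, 256, (8, 8)).astype(float)  # stand-in MCP block
shifts = [-0.5, -0.25, 0.0, 0.25, 0.5]
errors = [mcp - bilinear_shift(mcp, dy, dx) for dy in shifts for dx in shifts]
print(len(errors))   # 25 estimated prediction error blocks (75 once rotations are added)
```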
With a set of estimated prediction error blocks, it is then possible to compute the MKLT basis functions.
2) MKLT basis functions computation module
As previously referred, the KLT is a unitary and orthogonal transform; however, unlike the DCT, it is non-
separable. Thus, to transform a two-dimensional block, it is necessary to first convert the given block into a
column or row vector. Then, the covariance matrix of the vector must be computed and its eigenvectors
determined; the columns of the eigenvector matrix represent the basis functions of the transform. This process
was described in more detail in Section 2.2.1.
The MKLT proposed by Biswas et al. [15] inherits the KLT characteristics referred above; however, in this case,
there are multiple input blocks representing a set of estimated prediction error blocks (as the 'true' prediction error
is not available at the decoder). To determine the covariance matrix of this set, it is necessary to define the
covariance between each pair of pixel positions. Thus, the covariance between a pixel in position (u,v) and a pixel in
position (r,s) for a set of n×n blocks is given by

Σ(j,k) = (1/N) · ∑ᵢ₌₁ᴺ [Eᵢ(u,v) − μ(u,v)] · [Eᵢ(r,s) − μ(r,s)]                (3.1)

where u, v, r, s = 0…(n−1), j = u + n·v, k = r + n·s, Eᵢ(u,v) is the estimated prediction error in position (u,v) of the i-th
block, μ(·) is the mean value at the given position over the set and N is the number of blocks in the set.
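Eq. (3.1) amounts to the sample covariance of the flattened error blocks, which can be computed directly, e.g. (the blocks here are random stand-ins for the estimated prediction errors):

```python
import numpy as np

def covariance_matrix(error_blocks):
    """Covariance matrix of a set of n x n estimated prediction error blocks,
    each flattened to a vector of length n*n, following Eq. (3.1)."""
    data = np.stack([b.reshape(-1) for b in error_blocks])  # shape (N, n*n)
    mean = data.mean(axis=0)                                # per-position mean over the set
    centered = data - mean
    return centered.T @ centered / len(error_blocks)        # (n*n) x (n*n) matrix

rng = np.random.default_rng(1)
blocks = [rng.standard_normal((8, 8)) for _ in range(75)]   # stand-in error set
cov = covariance_matrix(blocks)
print(cov.shape)   # (64, 64)
```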
Returning to the example shown above in Figure 3.4, Figure 3.5 and Figure 3.6, it is possible to determine the
covariance matrix for a set of estimated prediction blocks using Eq. (3.1), which is shown in Figure 3.7.
Figure 3.7 – Covariance matrix for a set of estimated prediction error blocks [18].
The row outlined in Figure 3.7 shows the covariance of the pixel in row 3, column 0, (considered here as the
reference pixel) with the pixels in all other positions. Rearranging this row to its original two-dimensional form
results in the covariance values block shown in Figure 3.8 where the red asterisk signals the reference pixel
position.
Figure 3.8 – Block of covariance values for the pixel in row 3, column 0, with the pixels in all other positions
[18].
Observing Figure 3.8, it is possible to conclude that the magnitude of the covariance with the reference pixel is
higher for the pixels along the direction of the edge, whether as a positive covariance (around the reference pixel
location) or as a negative covariance (on the other edge of the block).
With the covariance matrix for a particular set of estimated prediction error blocks available, it is then possible
to determine the associated eigenvectors and eigenvalues,

Σ Φ = Φ Λ                (3.2)

where Φ is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues of the covariance matrix Σ.
Subsequently, the transpose of the eigenvector matrix is computed, resulting in the MKLT basis
functions. For the example above, the basis functions are illustrated in Figure 3.9.
Figure 3.9 – MKLT basis functions for the example in Figure 3.7 [18].
In Figure 3.9, the set of basis functions is arranged in a horizontal raster scan order where it is possible to see
that the first basis functions (upper-left corner) show a subjective similarity to the actual prediction error.
3) MKLT computation module
After the determination of the MKLT basis functions, it is then possible to actually compute the MKLT both at
the encoder and decoder. Thus, the forward MKLT and the inverse MKLT are given by

c_MKLT = T_MKLT · x    and    x′ = T_MKLTᵀ · ĉ_MKLT                (3.3)

where x is the actual prediction error (arranged as a column vector), T_MKLT is the matrix of MKLT basis
functions, c_MKLT are the MKLT coefficients, ĉ_MKLT are the quantized MKLT coefficients and x′ is the
reconstructed prediction error.
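The whole chain, from covariance matrix to eigendecomposition to forward and inverse transform, can be sketched with numpy; the quantization step is omitted here, so the reconstruction is exact (the covariance comes from random stand-in data, not real prediction errors):

```python
import numpy as np

rng = np.random.default_rng(2)
# Covariance matrix of a set of 75 flattened 8x8 estimated error blocks (64 x 64)
data = rng.standard_normal((75, 64))
cov = np.cov(data, rowvar=False)

# Eigendecomposition of the covariance matrix; eigh returns ascending eigenvalues
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort basis functions by decreasing variance
T = eigvecs[:, order].T                    # MKLT basis functions as the rows of T

# Forward and inverse MKLT on the actual prediction error x
x = rng.standard_normal(64)                # prediction error block as a column vector
c = T @ x                                  # forward MKLT coefficients
x_rec = T.T @ c                            # inverse transform (T orthogonal: T^-1 = T^T)
print(np.allclose(x, x_rec))               # True: perfect reconstruction w/o quantization
```

The orthogonality of T is what makes the inverse transform a simple transposition, so the decoder needs only the (re-estimated) basis functions, never their inverse.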
Returning to the previous example, the MKLT coefficients of the actual prediction error block are shown in
Figure 3.10 alongside the DCT coefficients for the same block. It has to be noted that this block has strong
diagonal edges, for which the DCT (because of its separable nature) does not perform so well.
Figure 3.10 – MKLT and DCT coefficients for the previous example [18].
From Figure 3.10, it is possible to see not only the high energy compaction achieved by the MKLT (with almost
all the energy concentrated in the top-left coefficients), but it is also possible to compare it with the
corresponding DCT performance for this particular block, which distributes the same input signal energy along a
greater number of transform coefficients.
A further analysis can be made regarding the scan order for each transform. Considering a zigzag scan for the
DCT coefficients and ordering the MKLT coefficients by decreasing variance (as referred in the previous
section), it is possible to plot the coefficients in Figure 3.10 in terms of their amplitude and scan position, as
shown in Figure 3.11.
Figure 3.11 – MKLT and DCT coefficients amplitude versus scan position [18].
The chart in Figure 3.11 shows that, for this example, the MKLT not only compacts the input signal energy into
fewer coefficients, but these coefficients are also the first to be scanned. On the other hand, the input signal
energy is distributed along a larger number of DCT coefficients and, additionally, the zigzag scan does not seem
to efficiently arrange them by decreasing amplitude. As a consequence, it should be possible to entropy code the
MKLT coefficients with fewer bits than those required to code the DCT coefficients.
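A zigzag scan order of the kind used for the DCT coefficients can be generated as follows; this reproduces the standard 4×4 zigzag scan used in H.264/AVC (the alternate scan for interlaced content is not covered):

```python
def zigzag_order(n):
    """Flattened scan positions (row*n + col) of the zigzag scan for an n x n block:
    coefficients are visited diagonal by diagonal, alternating the traversal direction."""
    cells = [(i, j) for i in range(n) for j in range(n)]
    # Even diagonals run bottom-left -> top-right, odd ones top-right -> bottom-left
    cells.sort(key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else -p[0]))
    return [i * n + j for i, j in cells]

print(zigzag_order(4))
# [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]
```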
3.1.4. Performance Evaluation
To evaluate the performance of the proposed adaptive transform solution (dynamically selecting between the
DCT and the MKLT), Biswas et al. have integrated it into the H.264/AVC video coding standard, notably in the JM
reference software, version 10.1 [19]. The experimental tests were conducted with QCIF and CIF resolution
video sequences at a frame rate of 30 fps, namely Foreman, Mobile, Garden and Husky.
The tests were made by encoding 50 frames of each video sequence and measuring the resulting PSNR as

PSNR = 10 · log₁₀(255² / MSE)                (3.4)

where MSE is the mean square error between the original and the reconstructed video frames.
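The PSNR computation, assuming the usual 8-bit peak value of 255, can be implemented directly:

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """PSNR in dB from the mean square error, for 8-bit video by default."""
    mse = np.mean((original.astype(float) - reconstructed.astype(float)) ** 2)
    if mse == 0:
        return float('inf')        # identical frames: PSNR is unbounded
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.full((16, 16), 128.0)
b = a + 2.0                        # every pixel off by 2 -> MSE = 4
print(round(psnr(a, b), 2))        # 42.11
```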
To assess the benefits of the proposed solution (H.264 AT), its performance has been compared with the standard
H.264/AVC video codec (H.264 Standard) performance (the precise profile is not specified). The coded video
sequences use a regular pattern of one I-frame followed by P-frames for every group-of-pictures (GOP). The
test video sequences have been selected as especially difficult to code with the DCT, notably including high
detail and/or block areas with high variances and diagonal edges. Figure 3.12 shows the rate-distortion
performances for the proposed (H.264 AT) and benchmark codecs (H.264 Standard).
Figure 3.12 – RD performance for the H.264 Standard and H.264 AT video coding solutions [15].
Figure 3.12 shows an average PSNR gain of 0.5 dB for the proposed H.264 AT solution regarding the standard
H.264/AVC solution [15]. For the sequence Mobile, PSNR gains of about 0.9 dB or, alternatively, bitrate
savings of about 20% (for the same quality) can even be achieved [15].
3.1.5. Summary
In this section, the video coding solution proposed by Biswas et al. [15] including an adaptive transform has
been reviewed in detail as it will play a central role in this Thesis. This coding solution makes use of a modified
KLT which allows the exploitation of the KLT optimal properties without requiring the coding and transmission
of its basis functions to the decoder. In this way, and in conjunction with the usual DCT, this video coding
solution can achieve a significant improvement in terms of compression performance when compared to the
current state-of-the-art video coding standard, the H.264/AVC codec.
As already mentioned, the KLT-based technique proposed in this video coding solution is adopted for the video
coding solution to be studied in this Thesis. However, as this Thesis intends to address highly efficient solutions
for HD video content, it is more appropriate to integrate it in the HEVC standard, currently under development.
In this context, the next section is dedicated to the description of this emerging standard, focusing on its new
features but also on the main differences regarding the state-of-the-art H.264/AVC standard.
3.2. Introduction to the High Efficiency Video Coding Standard
The High Efficiency Video Coding (HEVC) standard is currently under development by the JCT-VC group
(jointly created by ISO/IEC MPEG and ITU-T VCEG) and it is planned to be ratified as a standard by January
2013 [20]. This new video coding standard targets providing 50% improved compression efficiency regarding
the state-of-the-art H.264/AVC video coding standard, notably for high and ultra high resolution video.
Officially, the HEVC standard development started in January 2010 with the publication of a Call for Proposals
(CfP) [21] asking for the submission of advanced video coding tools, specially targeting high and ultra high
resolution video. This CfP received 27 submissions with new coding tools and techniques providing encouraging
results in terms of coding efficiency when compared to H.264/AVC. These results led to the combination of the
most promising submitted coding tools into a new video codec called Test Model under Consideration (TMuC)
[22]. As the first available preview of the upcoming video coding standard, this test model will be used
throughout this Thesis as the target video codec to be improved.
3.2.1. Objectives
The development of technologies allowing the capture and display of high definition video content has led to
an increasing presence of these resolutions in emerging multimedia applications. In the coming years, this
growth will not be limited to HD but will also evolve toward ultra high definition video content (e.g. 7680×4320
pixels, which is 16 times the HD resolution). Undoubtedly, this type of content requires higher bandwidth and
storage capacities, which do not seem to be attainable with the currently available transmission and storage
solutions. This problem can only be overcome by a significant improvement over the compression efficiency
provided by the current video coding state-of-the-art, the H.264/AVC standard. Bearing this in
mind, the JCT-VC started the standardization of a new video coding standard with the objective of reducing by
half the bitrate needed to code a video sequence when compared to the H.264/AVC High profile, while maintaining
the same video quality. Clearly, this objective has the potential to increase the final codec
complexity. This standard targets the coding of progressively scanned content with video resolutions from QVGA
(320×240 pixels) to UHD (7680×4320 pixels).
3.2.2. Technical Approach
Since all the proposals submitted to HEVC Call for Proposals made use of the basic video coding architecture
used in previous video coding standards, particularly H.264/AVC, the HEVC coding architecture is also based
on intra and inter coding modes using motion compensated prediction and transform coding [23]. The basic
HEVC encoder architecture is presented in Figure 3.13.
Figure 3.13 – Basic HEVC encoder architecture [24].
Taking into account that a major difference between the HEVC and H.264/AVC standards relates to their target
resolutions, the submitted proposals focused their efforts on the exploitation of the higher spatial and temporal
redundancies available on high and ultra high definition video contents. These efforts resulted in various new
coding tools that change some of the main architectural modules as highlighted in Figure 3.13. These new
coding tools are described in the following, with the exception of those related to transform coding which are
explained later in more detail considering the topic of this Thesis.
Picture partitioning
First, the HEVC standard introduces a new picture partitioning scheme based on a novel coding unit definition,
no longer relying on the usual macroblocks. The previous macroblock concept is replaced by a more flexible
structure composed of Coding Tree Blocks (CTB). With this structure, each CTB can have various sizes (from
8×8 to 128×128, always powers of 2) and can be recursively split according to a quad-tree partitioning.
The maximum size of a CTB and the maximum depth of the quad-tree partitioning are defined at the sequence
level.
The largest coding unit is denominated LCTB and the smallest SCTB. Each picture is divided into non-
overlapping LCTBs, and each CTB is characterized by its LCTB size and its hierarchical depth relative to its
corresponding LCTB. To better understand this structure, Figure 3.14 illustrates an example where the LCTB
size is 128 and the maximum hierarchical depth is 5.
As shown in Figure 3.14, the recursive structure is represented by a series of split flags. If the split flag is equal
to 1, then the current CTB is split into four independent CTBs, characterized by an incremented depth and half
the size of the previous CTB. The picture partitioning is stopped when the split flag equals 0 or when the
maximum depth is reached, thus achieving the SCTB.
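The recursive split-flag parsing described above can be sketched as follows; the flag ordering (depth-first, quadrant by quadrant) is an assumption for illustration, as the exact coding order is defined in [22]:

```python
def parse_ctb(flags, size, min_size, origin=(0, 0)):
    """Consume 0/1 split flags depth-first and return the list of leaf CTBs as
    ((y, x), size) tuples; no flag is coded once the minimum (SCTB) size is reached."""
    if size > min_size and next(flags) == 1:
        half = size // 2
        y, x = origin
        leaves = []
        for dy, dx in ((0, 0), (0, half), (half, 0), (half, half)):
            leaves += parse_ctb(flags, half, min_size, (y + dy, x + dx))
        return leaves
    return [(origin, size)]

# LCTB size 64, SCTB size 8: split the root once, then split only its first quadrant
bits = iter([1, 1, 0, 0, 0, 0, 0, 0, 0])
leaves = parse_ctb(bits, 64, 8)
print(len(leaves), leaves[0])   # 7 ((0, 0), 16)
```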
With this new partitioning structure, it is possible to code large homogeneous areas with larger coding blocks
than the previous 16×16 macroblocks used in H.264/AVC, allowing a better exploitation of the spatial
redundancy. Additionally, this coding structure allows a more flexible choice of the block sizes for a more
efficient coding of various contents, targeting multiple applications and devices.
Figure 3.14 – Illustration of a recursive CTB structure with LCTB size = 128 and maximum hierarchical depth
= 5 [22].
When the splitting process is finalized, the leaf nodes of the CTB hierarchical tree become Prediction Units (PU)
and can be split in the following ways:
Intra PUs – The intra PUs are not split or are split into 4 equal partitions.
Inter PUs – The inter PUs can have 4 symmetric splittings, 4 asymmetric splittings or can be split with
a geometric partitioning mode. In this last mode, the block is divided into two regions by a straight line
which is characterized by two parameters: the distance between the partition line and the block origin
(ρ), which is measured by a line perpendicular to the partition line, and the angle subtended by this
perpendicular line and the x axis (θ); for an example, see Figure 3.15.
Figure 3.15 – Parameters defining the geometric partitioning of a PU [22].
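A sketch of the (ρ, θ) geometric partitioning: each pixel is classified by the signed distance to the partition line, x·cos θ + y·sin θ − ρ. Taking the block origin as the top-left corner is an assumption here; the exact convention is defined in [22]:

```python
import numpy as np

def geometric_mask(size, rho, theta_deg):
    """Partition a size x size PU by the line whose normal from the block origin
    has length rho and angle theta with the x axis: pixels on each side of the
    line (sign of x*cos(theta) + y*sin(theta) - rho) form the two regions."""
    theta = np.radians(theta_deg)
    y, x = np.mgrid[0:size, 0:size]
    return (x * np.cos(theta) + y * np.sin(theta) - rho) >= 0

mask = geometric_mask(8, rho=4.0, theta_deg=0.0)   # theta = 0: vertical split at x = 4
print(mask[0])   # [False False False False  True  True  True  True]
```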
Besides the CTBs and PUs, the HEVC standard also introduces the Transform Units (TU). These units are
defined for transform and quantization purposes and can be as large as the size of the corresponding CTB leaf,
i.e., the corresponding PU. The partitioning of TUs is also represented by quad-trees, with their maximum size
and hierarchical depth being signaled in the bitstream. The transform block sizes are constrained between the
minimum and maximum transform sizes, 4×4 and 64×64, respectively. These characteristics are reviewed in
more detail in the following section dedicated to the transforms and quantization.
Intra prediction
For intra-coded blocks, the HEVC standard supports up to 33 spatial prediction directions for 8×8 to 64×64
blocks; in addition, a planar prediction mode is available. For 4×4 blocks, the 9 prediction modes already
present in H.264/AVC are used.
Motion compensation
To allow the exploitation of quarter-pixel accuracy motion vectors, the reference frame has to be upsampled and
be able to provide quarter-pixel accuracy interpolation. In H.264/AVC, to achieve this interpolation, a 6-tap
fixed Wiener filter is first used for half-pixel accuracy interpolation, followed by a bilinear combination of
integer and half-pixel values. With HEVC, it is possible to use a 12-tap DCT-based interpolation filter to provide
the same quarter-pixel accuracy interpolation. In this way, only one filtering procedure is needed, allowing a
simplification of the implementation and a complexity reduction of the filtering process.
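For reference, the H.264/AVC 6-tap half-pel filter mentioned above has coefficients (1, −5, 20, 20, −5, 1)/32, which can be applied as follows (a sketch of the filtering step only; the surrounding padding and quarter-pel bilinear stage are not shown):

```python
import numpy as np

# H.264/AVC 6-tap half-pel interpolation filter (coefficients sum to 32)
HALF_PEL = np.array([1, -5, 20, 20, -5, 1])

def half_pel_interp(samples):
    """Half-pixel values between samples[2] and samples[3], [3] and [4], ...;
    the input must already include the 2 left and 3 right border samples."""
    out = np.convolve(samples, HALF_PEL[::-1], mode='valid')
    return np.clip((out + 16) >> 5, 0, 255)      # round, divide by 32, clip to 8 bits

flat = np.full(8, 100, dtype=np.int64)
print(half_pel_interp(flat))   # a flat area interpolates to the same value: [100 100 100]
```

Since the coefficients sum to 32, constant regions pass through unchanged, while the negative outer taps sharpen the response around edges.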
Deblocking filter and In Loop filter
The H.264/AVC deblocking filter has been adapted in the HEVC codec to support the new larger block sizes.
Moreover a symmetric Wiener filter has been added to allow a reduction of the quantization distortion in the
reconstructed blocks.
Entropy coding
The HEVC standard offers two kinds of entropy coding methods:
Low-complexity entropy coding – For low-complexity, 10 pre-determined VLC tables designed for
different probability distributions are used; each syntax element uses one of these 10 tables. For the
entropy coding of the transform coefficients, an improved CAVLC method is used.
High-complexity entropy coding – For high-complexity, a variation of the CABAC solution defined
in H.264/AVC is employed. The basis of this coder is similar, but the parallelization of the entropy
encoding and decoding is introduced.
These are the main technical novelties introduced in the emerging HEVC standard. In the following section, a
more comprehensive description of the adopted transforms is made, as this is the main topic of this Thesis.
3.2.3. Transform and Quantization
A larger transform can bring high performance improvements in terms of energy compaction and reduced
quantization error for large homogeneous areas (this is studied in more detail in Section B.1). HD sequences
tend to have more spatial correlation, i.e., correlation extending over larger areas of the picture. Thus, HEVC
introduces three additional transform sizes besides those already supported by H.264/AVC (4×4 and 8×8):
16×16, 32×32 and 64×64. With the increase of the transform size, the complexity also tends to increase. To
minimize this complexity, HEVC makes use of the fast DCT algorithm proposed by Chen in [25]. This type of
algorithm is used due to its reduced implementation complexity and its ready extension to larger transform sizes.
In Figure 3.16, the signal flow graph of Chen's fast factorization for an order-16 DCT is presented.
Figure 3.16 - Signal flow graph of Chen’s fast factorization for an order-16 DCT [22].
In Figure 3.16, the multiplication constants are represented by sinusoidal functions of particular angles, which
can result in floating point operations; to overcome this drawback, pre-defined values are used (see Table 3.1).
With this approximation, the transform loses its orthogonality property, but the associated errors are considered
less significant than the complexity increase that floating point operations would entail.
Table 3.1 - Approximated constants for an order-16 DCT [22].
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15
63/64 62/64 61/64 59/64 56/64 53/64 49/64 45/64 40/64 35/64 30/64 24/64 18/64 12/64 6/64
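The values in Table 3.1 coincide with a truncation of the exact cosine constants to 6 fractional bits. The following sketch reproduces them; it is an illustrative reconstruction, and the floor-based truncation rule is inferred from the table values, not taken from [22]:

```python
import math

# Reconstruct the Table 3.1 constants: a_k = floor(64 * cos(k*pi/32)) / 64
# for k = 1..15, i.e. the exact cosines truncated to 6 fractional bits.
# (The truncation rule is an inference from the table, not from [22].)
def approx_constants(frac_bits=6):
    scale = 1 << frac_bits
    return [math.floor(scale * math.cos(k * math.pi / 32)) / scale
            for k in range(1, 16)]

numerators = [round(a * 64) for a in approx_constants()]
# numerators: 63, 62, 61, 59, 56, 53, 49, 45, 40, 35, 30, 24, 18, 12, 6
```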
Besides the DCT, two types of directional transforms are adopted in the HEVC standard. These transforms are
used when the DCT basis functions do not offer a good transform performance, e.g. uncorrelated signals or
blocks with strong diagonal edges. The first directional transform is a Rotational Transform (ROT), which is
applied as a second transform after the DCT for blocks of 16×16 and larger sizes. The basic principle behind
this directional transform is the rotation of the transform basis coordinate system, instead of the rotation of the
input data. The used rotation matrices (for vertical and horizontal rotations) are [22]:

$$R_{vertical}(\alpha_1,\alpha_2,\alpha_3) = \begin{bmatrix}
\cos\alpha_1\cos\alpha_3 - \sin\alpha_1\cos\alpha_2\sin\alpha_3 & -\sin\alpha_1\cos\alpha_3 - \cos\alpha_1\cos\alpha_2\sin\alpha_3 & \sin\alpha_2\sin\alpha_3 & 0 \\
\cos\alpha_1\sin\alpha_3 + \sin\alpha_1\cos\alpha_2\cos\alpha_3 & -\sin\alpha_1\sin\alpha_3 + \cos\alpha_1\cos\alpha_2\cos\alpha_3 & -\sin\alpha_2\cos\alpha_3 & 0 \\
\sin\alpha_1\sin\alpha_2 & \cos\alpha_1\sin\alpha_2 & \cos\alpha_2 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}$$

$$R_{horizontal}(\alpha_4,\alpha_5,\alpha_6) = \begin{bmatrix}
\cos\alpha_4\cos\alpha_6 - \sin\alpha_4\cos\alpha_5\sin\alpha_6 & -\sin\alpha_4\cos\alpha_6 - \cos\alpha_4\cos\alpha_5\sin\alpha_6 & \sin\alpha_5\sin\alpha_6 & 0 \\
\cos\alpha_4\sin\alpha_6 + \sin\alpha_4\cos\alpha_5\cos\alpha_6 & -\sin\alpha_4\sin\alpha_6 + \cos\alpha_4\cos\alpha_5\cos\alpha_6 & -\sin\alpha_5\cos\alpha_6 & 0 \\
\sin\alpha_4\sin\alpha_5 & \cos\alpha_4\sin\alpha_5 & \cos\alpha_5 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}$$

The α angles represent the six possible rotation angles. From these six angles, only four rotation angles can be
quantized and used to minimize the complexity of the encoder. In this context, it has also to be noted that, for
TUs larger than 8×8, the ROT is only applied to the 8×8 lowest-frequency DCT coefficients.
The second type of directional transform is the Mode-Dependent Direction Transform (MDDT) which is used to
encode 4×4 and 8×8 intra prediction residuals and is paired with the selected intra prediction mode. The 33 intra
prediction modes for the 8×8 block size are grouped into nine separate directions; the MDDT is designed with
nine separate basis functions, one for each direction. These basis functions are estimated from the statistics of the
intra prediction residuals for each mode, using a separable transform based on the KLT, the Singular Value
Decomposition (SVD). This transform is used to better exploit the spatial redundancy (versus the DCT) without
excessively increasing the transform complexity (versus the KLT). In this way, the SVD is used first in the
vertical and then in the horizontal directions. Once again, to save computational effort, the transform matrices
are fixed-point approximated.
After the transform operation, the resulting coefficients are quantized in the same way as in H.264/AVC and
rearranged in a 1-D vector for entropy encoding. Besides the zigzag scanning order used for the DCT
transformed coefficients (including those using the ROT), a new scanning order is used for the MDDT
transformed coefficients, based on the directional mode used in the intra prediction coding. With this, it is
possible to compact the non-zero coefficients to the beginning of the resulting 1-D vector.
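For illustration, the zigzag rearrangement into a 1-D vector can be sketched as follows; this is the classic anti-diagonal scan, and the codec's exact scan tables may differ in detail:

```python
def zigzag_indices(n):
    """Visiting order (row, col) for an n-by-n zigzag scan, starting at
    the DC (top-left) coefficient."""
    coords = [(i, j) for i in range(n) for j in range(n)]
    # Coefficients on the same anti-diagonal share i + j; the traversal
    # direction alternates from one diagonal to the next.
    return sorted(coords,
                  key=lambda c: (c[0] + c[1],
                                 c[0] if (c[0] + c[1]) % 2 else -c[0]))

def scan(block):
    """Flatten a 2-D coefficient block into a 1-D vector in zigzag order,
    compacting the low-frequency coefficients to the front."""
    return [block[i][j] for i, j in zigzag_indices(len(block))]
```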
3.2.4. Summary
In this section, the HEVC standard under development was introduced. As this standard is not yet fully
developed, the previously reviewed coding tools correspond to the TMuC (Test Model under Consideration)
codec to be used later in this Thesis (more specifically TMuC software version 0.9). In the meantime, some tools
have been removed from the most recent versions of this codec - now called HEVC Test Model (HM) -
particularly the MDDT transform and the geometric partitioning mode.
The HEVC is being developed with the purpose of replacing H.264/AVC as the state-of-the-art video coding
standard. Moreover, it is designed taking into account that some main emerging applications will soon use
high and ultra high definition video contents. The new set of coding tools reflects this concern, as it focuses on
exploiting the higher spatial and temporal redundancies present in this type of sequences.
3.3. Final Remarks
In this chapter, the two most important background technologies for the solution to be implemented and studied
in this Thesis have been introduced. First, the adaptive transform technique proposed by Biswas [15] was
described. This transform uses the standard H.264/AVC DCT and a modified KLT which uses the MCP blocks
to estimate the prediction error and to subsequently calculate its basis functions. This technique is integrated in
both the H.264/AVC encoder and decoder and can bring improvements to the overall transform
performance when compared to the DCT alone, particularly for signals that are hard to compact using the DCT.
Secondly, the currently under development HEVC standard has been introduced; this is the codec adopted in this
Thesis as it is the most advanced. The HEVC standard, or at least its test model, introduces some new coding
tools that were designed to better exploit the special characteristics of high and ultra high definition video
contents, notably higher spatial and temporal correlations. Amongst the new coding tools, the main differences
in comparison to the H.264/AVC video coding standard associated to the topic of this Thesis (i.e. transform
coding) are related to the type of picture partitioning, notably with a more flexible partitioning allowing various
block sizes (from 8×8 to 128×128), and the transform sizes, notably with transform sizes from 4×4 to 64×64.
Unrelated to the type of video content, but related to transform coding, directional transforms are introduced to
better exploit the directional edges present in many blocks.
In the next chapter, the implementation details of the adopted transform coding solution will be presented.
Chapter 4
Adopted Coding Solution Functional Description and Implementation Details
After the introduction of the two most important background technologies in Chapter 3, this chapter intends to
describe in detail the coding solution adopted in this Thesis, notably a functional description of each module and
its main associated features and a detailed explanation of its implementation.
The adopted solution central technical element, the transform coding block, is based on the adaptive transform
proposed in [15]. To better understand the reasons for the development, implementation and evaluation of this
coding solution, its objectives are first defined. Afterwards, its general architecture is presented, followed by a brief
walkthrough. Finally, each module in the presented architecture is individually described, analyzed and
explained, both from the functional and implementation points of view.
4.1. Objectives
As reviewed in Section 3.1, the solution proposed in [15] can achieve significantly better compression
performance than the standard H.264/AVC codec. This is achieved by means of an adaptive transform that can
switch between the standard H.264/AVC DCT and a modified KLT, whose basis functions are computed using
the same estimation technique at both sides of the coding process, thus not requiring their transmission along with
the remaining bitstream. Additionally, as referred in Section 3.2, the JCT-VC team is currently developing a new
video coding standard, the HEVC standard, which is intended to double the H.264/AVC compression efficiency,
particularly for high and ultra high definition video contents. With this in mind, the coding solution adopted and
implemented in this work uses the adaptive transform proposed in [15] with three main goals:
Adaptive transform performance evaluation in the context of the HEVC standard – As noted
before, the solution proposed in [15] was integrated and evaluated in the context of the H.264/AVC
standard. To evaluate the coding performance of the referred adaptive transform in the context of the
emerging HEVC standard, this tool must be, at least partly, integrated in this new video coding
standard.
Adaptive transform performance evaluation for high definition video content – In [15], the
proposed adaptive transform was only evaluated for QCIF and CIF resolution video sequences.
However, as noted in Section 3.2, the use of high definition video contents in various multimedia
applications is growing quickly. Thus, it is very relevant to assess the performance of the adaptive
transform for HD video contents, using the HEVC codec, to understand if the performance gains
obtained for the lower resolution contents still persist [15].
Adaptive transform performance evaluation for larger shift and rotation parameters – Finally, it
was noted in Section 3.1 that the motivation behind the specific choice of the used maximum shift and
rotation parameters is not explained in [15]. In this context, it is relevant to assess the performance with
increasing parameter values to check if this change can bring further compression performance
improvements.
To achieve these goals, a new coding solution is designed, developed and then evaluated using the same
concepts of the adaptive transform proposed in [15], although with some implementation changes. The technical
aspects of this new coding solution are presented in the following sections.
4.2. Architecture and Walkthrough
As referred before, the adopted video coding solution is based on the tool proposed, in 2010, by Biswas et al.
[15]. Thus, it also uses a similar adaptive transform technique to code the prediction error associated to the inter-
coded blocks. In this solution, the adaptive transform can switch between the standard H.264/AVC DCT
(Section A.8.3) and a modified KLT (very similar to the MKLT presented in Chapter 3) to obtain a better
compression performance, depending on the particular details of the image area being coded. It was also referred
above that this coding solution is based on the new HEVC codec as a replacement for the H.264/AVC codec
used in [15]. However, it has to be noted that the proposed adaptive transform is not integrated in the codec
reference software that is usually made available by the standardization groups, in this case the JCT-VC team.
The full integration was not made because it would not only require detailed knowledge of the software structure
and organization, which in this case would involve significant extra time since this is new software still under
development, but would also require major software development and testing, which is not the main objective
of this Thesis. As a reasonable compromise, HEVC encoded and decoded data is obtained/extracted (using the
HEVC reference software) and used externally to simulate a large portion of the actual coding framework; for
example, the HEVC entropy coding tool is not used. In this way, the developed coding solution is only
applicable at the frame level, since the reference frames used for the inter-coded frames are always extracted
from the HEVC codec and are not decoded from previous codings using the developed coding solution.
The general architecture of the solution designed and implemented in this Thesis is presented in Figure 4.1. This
solution is only used to code the prediction error block; thus, it uses only the inter-coded frames as input.
Additionally, the bitstream generated by its encoding process and the reconstruction made by the decoding
process only contain information about the prediction error.
Figure 4.1 – Architecture of the developed coding solution.
The architecture presented in Figure 4.1 includes three main processes which are described next and clearly
identified in the figure with different colors:
HEVC framework – This process is used to extract data from the HEVC to be used in the encoding
and decoding processes of the adopted coding solution. In order to do this, the original frame is inter-
coded with the HEVC codec and the following data is extracted:
o Transform Units (TUs) split flags and coding modes.
o Reference frame.
o Prediction Units (PUs) motion vectors.
The extracted data is then provided to both the encoder and the decoder processes.
AT encoder – This process is used to encode each TU prediction error block using the proposed
adaptive transform. Additionally, the coefficients generated by this transform are also quantized and
entropy encoded. The modules of this process are processed in the following steps:
o Reference frame upsampling – First, the reference frame extracted from the HEVC
framework is upsampled to provide quarter-pixel accuracy.
o Frame partitioning – To process each TU individually, the original frame to be inter-coded is
first partitioned in its TUs using the HEVC defined partitioning method. This partition is made
with the split flags extracted from the HEVC framework. After the partitioning in TUs of the
full frame, only the inter-coded TUs continue the coding process. To verify which TUs were
HEVC inter or intra-coded, the coding modes extracted from the HEVC framework are used.
o MCP block computation – Then, the MCP block associated to each TU is computed using
the extracted motion vector and the upsampled reference frame. This MCP block is then
subtracted from the original TU, resulting in the prediction error block.
o Forward adaptive transform – The prediction error block is then transformed using the two
available transforms, the DCT and the MKLT. To compute the MKLT basis functions as in
[15], the motion vectors, the upsampled reference frame and the MCP block obtained in the
previous steps are used.
o Quantization – Each transform coefficient is then quantized using a uniform quantizer
described later.
o Entropy encoder – To finalize the encoding process, the quantized coefficients are entropy
encoded. Then, the resulting bitstreams for each transform (DCT and MKLT) are compared,
and the one corresponding to fewer bits is selected and sent to the decoder. This bitstream
corresponds to the adaptive transform bitstream in Figure 4.1.
With the adaptive transform bitstream sent to the decoder side, the encoding process is concluded. Finally,
the decoder process performs as follows:
AT decoder – This process is used to decode the adaptive transform bitstream sent by the encoder. To
do this, each inter-coded TU bitstream is entropy decoded, inverse quantized and inverse transformed
and the resulting reconstructed prediction error blocks are then rearranged to form the reconstructed
prediction error frame. The modules of this process are processed in the following steps:
o Reference frame upsampling – At the beginning of the decoding process, the reference frame
is once again upsampled to provide quarter-pixel accuracy.
o MCP block computation – Like the encoder, the MCP block is computed for each TU. In this
case it is only used for the MKLT basis functions computation.
o Entropy decoder – To decode the adaptive transform bitstream, it is first entropy decoded,
resulting in the adaptive transform quantized coefficients.
o Inverse quantization – The quantized coefficients obtained in the previous step are then
inverse quantized, resulting in the reconstructed adaptive transform coefficients.
o Inverse adaptive transform – These reconstructed coefficients are then inverse transformed
using the basis functions of the transform selected in the encoding process. To compute the
MKLT basis functions, in case they are needed, the motion vector, the upsampled reference
frame and the MCP block are once again used. With this operation, the reconstructed
prediction error block is obtained for each inter-coded TU.
o Frame reconstruction – To conclude the decoding process, all the reconstructed prediction
error blocks are arranged in the reconstructed prediction error frame, using the split flags and
the coding modes data extracted from the HEVC framework. This frame comprises only
the inter-coded TUs, since the intra-coded ones are not coded with the developed coding
solution.
With this, the decoding process of the adopted coding solution is concluded.
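The encoder-side competition between the two transforms can be sketched in simplified form. This is illustrative Python, not the implemented MATLAB code: an orthonormal DCT and the identity stand in for the DCT/MKLT pair, a uniform quantizer is assumed, and the number of non-zero quantized coefficients is a crude stand-in for the entropy coder's bit count.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; rows are the transform basis functions."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] /= np.sqrt(2)
    return m * np.sqrt(2.0 / n)

def encode_tu(residual, transforms, step=8.0):
    """Apply every candidate transform to the prediction error block,
    quantize uniformly, and keep the cheapest result.  The non-zero
    coefficient count is a crude proxy for the entropy-coded bit cost."""
    best = None
    for name, t in transforms.items():
        coef = t @ residual @ t.T                  # separable 2-D transform
        quant = np.round(coef / step).astype(int)  # uniform quantizer
        cost = np.count_nonzero(quant)
        if best is None or cost < best[2]:
            best = (name, quant, cost)
    return best

# A smooth (constant) block: the DCT compacts it into a single coefficient.
residual = np.full((4, 4), 10.0)
choice = encode_tu(residual, {"DCT": dct_matrix(4), "identity": np.eye(4)})
```

Here the DCT wins with a single non-zero coefficient, mirroring the selection of the bitstream corresponding to fewer bits.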
To implement the adopted coding solution, two programming environments were used: the TMuC software
(version 0.9 [20]) and the MATLAB numerical computing environment [26]. These environments were used
with the following purpose:
TMuC software (version 0.9) – This software was used to process the HEVC encoder and decoder
modules. This software code (programmed in C++) was changed in order to provide the necessary data
to the encoding and decoding processes.
MATLAB numerical computing environment – To implement the encoding and decoding processes, a
MATLAB script was programmed. This script comprises some functions already present in the
MATLAB toolboxes and others designed and programmed specifically for this solution by the author
of this Thesis. It has to be noted that the decoding process was not implemented independently from the
encoder. Thus, some of the modules present in the AT decoder architecture shown in Figure 4.1 were
not actually implemented, as the information provided by them was already known.
Following this short walkthrough of the developed video coding solution, the next sections will individually
explain each of the modules presented in Figure 4.1. To avoid repeating the more conceptual description made in
Chapter 3, this explanation will concentrate on the work developed by the author of this Thesis, notably focusing
on the implementation aspects. This explanation will be made first for the HEVC framework, then for the AT
encoder and, finally, for the AT decoder.
4.3. HEVC Framework Functional Description and Implementation Details
As referred before, the adopted coding solution uses a HEVC framework to provide the data needed by both the
AT encoder and the AT decoder in the right conditions. To do this, the original frame is coded using a slightly
modified version of the TMuC software (version 0.9). This software's code has been changed to provide the data
corresponding to each TU split flags and coding mode, used in the frame partitioning module, and the motion
data (i.e. each PU motion vector and the reference frame), used in the MKLT basis functions computation. To
understand how this data is extracted and stored, consider the CTB presented in Figure 4.2, with its PU
partitioning (a) and TU partitioning (b). As referred in Chapter 3, each CTB leaf represents a PU for motion
prediction coding and a TU for transform coding. These TUs can be further partitioned in smaller TUs using the
quad-tree partitioning method.
Figure 4.2 – Example of (a) PU partitioning and (b) TU partitioning of a 32×32 CTB.
The CTB in Figure 4.2 is a 32×32 block that is partitioned using the quad-tree technique employed in the HEVC
coding process. In this example, the CTB is partitioned in (a) PUs and in (b) TUs. The PU partitioning results in
7 PUs (3 of size 16×16 and 4 of size 8×8). The TU partitioning results in 10 TUs (3 of size 16×16, 3 of size 8×8
and 4 of size 4×4). With this example in mind, the extraction and storage procedures used in the HEVC decoder
are explained next:
TUs split flags – The partitioning split flags are extracted to provide the frame partitioning module
information on how to partition each frame in its corresponding TUs. Thus, the depth value of each TU
in relation to its corresponding LCTB is stored at the frame level with the granularity of the SCTB size.
This means that, for an R×C frame and an s×s SCTB, there will be a total of (R/s)×(C/s) depth values. Figure 4.3
shows how this data is stored considering that the CTB in Figure 4.2 represents an LCTB.
Figure 4.3 – TU depths for the CTB in Figure 4.2 (b).
From Figure 4.3, it is possible to see how the TU partitioning is signaled; to each TU corresponds a depth in
relation to the LCTB. In this way, three different TU sizes can be identified: 16×16 (depth = 1), 8×8 (depth
= 2) and 4×4 (depth = 3). The dashed grid identifies the SCTB size, in this case, 4×4. Each of these SCTB
sized blocks has a number corresponding to the depth of the TU where it is contained. The precision of the
saved data is ANSI-C int (32 bits).
TUs coding modes – Only the inter-coded TUs are processed with the adopted coding solution, since
the motion data is essential for the MKLT basis functions computation. Thus, only the TUs that were
inter-coded in the HEVC coding process can be coded with the proposed adaptive transform. With this
purpose, each TU coding mode must be identified. To do this, each TU must have a flag identifying its
coding mode: if it is a '0', then intra-coding was performed; otherwise ('1'), inter-coding was
performed. This information is also extracted for each TU and stored at the frame level with the
granularity of the SCTB size, resulting in (R/s)×(C/s) values for each frame. For the CTB in Figure 4.2, an
example result is shown in Figure 4.4.
Figure 4.4 – Coding modes (intra-coding = ‘0’ and inter-coding = ‘1’) for the CTB in Figure 4.2 (b).
By observation of Figure 4.4, it is possible to conclude that there are only 2 intra-coded TUs in the
considered CTB (both identified with a '0'). The values are stored in ANSI-C short precision (16 bits).
Motion vectors – To compute the MCP block and the prediction error estimations necessary for the
computation of the MKLT basis functions, also the motion vectors of each PU have to be extracted.
Thus, for each PU, the horizontal and the vertical motion vectors values are provided and stored. As for
both previously described cases, this extraction also uses the granularity of the SCTB size, resulting in
(R/s)×(C/s) values for each frame and for each direction (horizontal and vertical). Figure 4.5 shows the (a)
horizontal and (b) vertical motion vectors values for the CTB in Figure 4.2.
Figure 4.5 – (a) Horizontal and (b) vertical motion vectors values for the CTB in Figure 4.2 (a).
As expected, the TUs that were intra-coded (see Figure 4.4) do not have any motion vector value associated
to them. The motion vectors values are stored in ANSI-C int precision (32 bits).
Reference frame – The reference frame is essential to process the motion compensation module, since
it is the source referenced by the motion vectors values. In this way, the reference frame for each inter-
coded frame is stored in a R×C file with ANSI-C short precision (16 bits).
All the extracted data is saved in binary files (.bin) to be read by the developed MATLAB script. At this stage,
the developed coding solution implementation passes from the TMuC environment to the MATLAB
environment, and the developed MATLAB script starts its execution by reading the saved binary file values and
copying them to pre-allocated matrices with the sizes referred before. This results in the
following matrices:
Split flags matrix – A (R/s)×(C/s) matrix containing the split flag of each TU.
Coding modes matrix – A (R/s)×(C/s) matrix containing the coding mode of each TU.
Motion vectors matrices – A (R/s)×(C/s) matrix containing each PU horizontal motion vector component
and a (R/s)×(C/s) matrix containing each PU vertical motion vector component.
Reference frame matrix – A R×C matrix containing the pixel values of the current reference frame.
Besides these matrices, which are read from files saved with the TMuC software, the original frame available in
the original sequence file is also read and copied to a R×C matrix, denominated original frame matrix.
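The reading of the extracted binary files into matrices can be sketched as follows. This is illustrative Python rather than the actual MATLAB script, and the file names and `prefix` argument are hypothetical:

```python
import numpy as np

def load_extracted_data(rows, cols, sctb, prefix="frame0"):
    """Read the binary files exported by the modified TMuC software into
    matrices.  File names and the `prefix` argument are hypothetical;
    the data types mirror the precisions stated above."""
    g = (rows // sctb, cols // sctb)  # SCTB-granularity matrix shape
    split_flags = np.fromfile(prefix + "_split.bin", dtype=np.int32).reshape(g)
    modes = np.fromfile(prefix + "_modes.bin", dtype=np.int16).reshape(g)
    mv_x = np.fromfile(prefix + "_mvx.bin", dtype=np.int32).reshape(g)
    mv_y = np.fromfile(prefix + "_mvy.bin", dtype=np.int32).reshape(g)
    ref = np.fromfile(prefix + "_ref.bin", dtype=np.int16).reshape(rows, cols)
    return split_flags, modes, mv_x, mv_y, ref
```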
4.4. AT Encoder Functional Description and Implementation Details
In the adopted coding solution, the encoder is basically used to code the prediction error block associated to each
TU. The key tool to code this type of data is the transform coding, which is exactly the tool under
study in this work. This process includes the following modules (as shown in Figure 4.1): reference frame
upsampling, frame partitioning, MCP block computation, forward adaptive transform, quantization and entropy
encoder. These modules are explained in detail in the following.
4.4.1. Reference Frame Upsampling
To provide the half and quarter-pixel prediction accuracy associated to the motion vector and perform the
prediction error estimation technique used in the adaptive transform computation, the relevant reference frame
needs to be upsampled with an upsampling factor of 4 (L = 4). This operation is still done at the frame-level and
performed by means of the 12-tap DCT-based interpolation filter described in Section 3.2.2, whose coefficients
are listed in Table 4.1.
Table 4.1 – 12-tap DCT-based interpolation filter coefficients [22].
Interpolation | Filter coefficients
Quarter-pixel | {-1, 5, -12, 20, -40, 229, 76, -32, 16, -8, 4, -1} (18 additions, 6 shifts)
Half-pixel | {-1, 8, -16, 24, -48, 161, 161, -48, 24, -16, 8, -1} (15 additions, 4 shifts)
3 quarter-pixel | {-1, 4, -8, 16, -32, 76, 229, -40, 20, -12, 5, -1} (18 additions, 6 shifts)
To better understand the reference frame upsampling operation, consider the half and quarter-pixel motion
positions illustrated in Figure 4.6.
Figure 4.6 – Half and quarter-pixel motion positions illustration [22].
As shown in Figure 4.6, the interpolation of the integer pixels A, B, C and D results in the half and quarter-pixels
identified from a to o. These last pixels are determined using the above referred filter with the following
approach:
First, the half-pixel interpolations are computed using the integer pixels A, B and C. With this, it is
possible to obtain the pixels b (half-pixel horizontal interpolation of A and B) and h (half-pixel vertical
interpolation of A and C).
Then, the quarter-pixel interpolations are computed using the same integer pixels. With this, the pixels
a (quarter-pixel horizontal interpolation of A and B), c (3 quarter-pixel horizontal interpolation of A
and B), d (quarter-pixel vertical interpolation of A and C) and l (3 quarter-pixel vertical interpolation of
A and C) are obtained. These interpolations are computed for all the integer pixels.
After this, the pixels f, j and n are obtained by computing a half-pixel horizontal interpolation of d, h
and l and their corresponding pixels in relation to the integer pixel B, respectively.
Finally, the pixels e, i and m are obtained by computing a quarter-pixel horizontal interpolation and the
pixels g, k and o are obtained by computing 3 quarter-pixel horizontal interpolations of d, h and l and
their corresponding pixels in relation to the integer pixel B, respectively.
By applying these interpolations to all pixels in the reference frame, the upsampled reference frame is obtained
as shown in Figure 4.7.
Figure 4.7 – Upsampled reference frame illustration.
Figure 4.7 shows an illustration of an upsampled reference frame, where each darker blue square represents an
integer pixel, present in the actual reference frame. To implement this reference frame upsampling computation
in the developed MATLAB script, a function developed and provided by Dr. Matteo Naccari has been used. This
function receives the reference frame matrix as input and, by applying the interpolations described before,
returns the upsampled reference frame matrix.
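The application of the Table 4.1 filters along one dimension can be sketched as follows. This is illustrative Python, not the provided MATLAB function; the normalization by 256 with a rounding offset is an assumption consistent with the fact that each tap set sums to 256, and border padding is omitted:

```python
import numpy as np

# Table 4.1 tap sets; each sums to 256, so the filtered value is assumed
# to be normalized with a rounding offset and an 8-bit right shift.
FILTERS = {
    "quarter":   [-1, 5, -12, 20, -40, 229, 76, -32, 16, -8, 4, -1],
    "half":      [-1, 8, -16, 24, -48, 161, 161, -48, 24, -16, 8, -1],
    "3-quarter": [-1, 4, -8, 16, -32, 76, 229, -40, 20, -12, 5, -1],
}

def interpolate(samples, phase, pos):
    """Fractional-pel value between integer samples pos and pos + 1.
    `samples` must supply 5 pixels to the left and 6 to the right of
    `pos` (border padding is omitted from this sketch)."""
    window = samples[pos - 5: pos + 7]               # 12 integer samples
    acc = int(np.dot(FILTERS[phase], window))
    return min(255, max(0, (acc + 128) >> 8))        # normalize and clip
```

On a constant signal the filters leave the value unchanged, and on a linear ramp the half-pel output lands at the midpoint, as expected of a DCT-based interpolator.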
4.4.2. Frame Partitioning
This partition is performed to allow the transformation and quantization of each TU individually, as made in the
HEVC standard. With this in mind, the frame is divided into CTBs with the largest possible size, i.e. LCTBs,
defined at the sequence level. Then, the HEVC quad-tree partitioning is replicated, with each CTB recursively
split into 4 blocks with half the height and width of their parent CTB, until the maximum depth is reached. Each
leaf CTB in this operation is considered a TU and is processed individually. At this stage, the processing moves
from the frame level to the TU level, as intended.
After the frame partitioning, a verification has to be made to check if the processed TU was inter or intra-coded in
the HEVC encoder module. This verification is performed with the help of the coding modes data. If the TU was
intra-coded, its processing stops here; otherwise, if it is an inter-coded TU, its processing continues to the
following steps.
In terms of the actual implementation, this partitioning is made with a MATLAB function especially designed
and programmed for this purpose. This function uses the following steps:
Partitioning in LCTBs – First, the numbers of LCTBs in a row and in a column are computed by
dividing the numbers of rows and columns in the frame by the LCTB width (or height), respectively.
With this information, it is then possible to go through all the LCTBs' first pixel positions, using a
combination of two nested loops: one with a number of iterations equal to the number of LCTBs
in a row and the other with a number of iterations equal to the number of LCTBs in a column.
Partitioning in TUs – For each LCTB, a recursive function is then used. This function basically starts
with a reference depth value of 0, a reference width value equal to the LCTB width (the LCTB height
could also be used here) and the first pixel position of the current LCTB (current pixel position) as
inputs. Then, the reference depth value is compared to the depth value present in the split flags matrix
position corresponding to the current pixel position. If the reference depth value is smaller than the
depth value of the current pixel position, its value is incremented and the reference width value is
divided by 2. Then, the recursive function is computed again with the newly computed reference values
and with the following 4 pixel positions as inputs (corresponding to the first pixel positions of the 4 new
block partitions):
o The current pixel position.
o The pixel position distanced reference width pixels away from the current pixel position
horizontally.
o The pixel position distanced reference width pixels away from the current pixel position
vertically.
o The pixel position distanced reference width pixels away from the current pixel position both
horizontally and vertically.
This is done until the reference depth value is equal to the depth value of the pixel position being processed.
When this happens, the TU level has been reached.
TU coding mode – Next, to verify if a particular TU was inter or intra-coded, the coding modes matrix
value corresponding to the pixel position being processed is checked: if it is '0' then the TU was intra-
coded and thus its processing stops here; otherwise (if it is '1'), the TU was inter-coded and thus its
processing continues to the next step.
As a result of this partitioning, each inter-coded TU is clearly identified by its first pixel position and by its size
in the remaining steps of the adopted coding solution.
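The recursive TU collection described above can be sketched as follows. This is illustrative Python rather than the actual MATLAB function, and the example depth map is one layout consistent with the TU counts of Figure 4.2 (b); the exact arrangement is assumed:

```python
def collect_tus(depths, y, x, size, depth, sctb):
    """Recursively split a block according to the stored TU depths (one
    value per SCTB-sized cell) and yield (y, x, size) for each leaf TU."""
    if depths[y // sctb][x // sctb] > depth:
        half = size // 2
        for dy, dx in ((0, 0), (0, half), (half, 0), (half, half)):
            yield from collect_tus(depths, y + dy, x + dx, half,
                                   depth + 1, sctb)
    else:
        yield (y, x, size)

# A depth map for a 32x32 LCTB with a 4x4 SCTB, consistent with the TU
# counts of Figure 4.2 (b): three 16x16 (depth 1), three 8x8 (depth 2)
# and four 4x4 (depth 3) TUs.  The exact layout is assumed.
depth_map = [[1] * 8 for _ in range(8)]
for r in range(4, 8):
    for c in range(4, 8):
        depth_map[r][c] = 3 if (r >= 6 and c >= 6) else 2

tus = list(collect_tus(depth_map, 0, 0, 32, 0, 4))  # 10 TUs in total
```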
4.4.3. Motion Compensation Prediction Block Computation
As referred before, the MCP block is needed to compute the MKLT basis functions as proposed in [15]. Besides
this, it is also necessary to obtain the prediction error block at the encoder side, which is essential as this is the
data that is going to be transformed at the encoder and reconstructed at the decoder. Thus, the MCP block has to
be determined. To do this, the motion vector of a particular TU is used. This motion vector points to the position
of the MCP block at the reference frame. In reality, to provide half and quarter-pixel accuracy, it points to the
position of the Upsampled MCP (UMCP) block at the upsampled reference frame. Thus, to compute the MCP
block, the UMCP is first obtained and it is after downsampled at the end of this process.
Considering that a particular TU first pixel position (always considered the top-left corner pixel position) and
size are known (as a result of the frame partitioning module process which was described before), the
determination of its MCP block in the developed script is performed by the following sequence of steps:
1. Scaling of the first pixel position - First, the position of the first pixel of the currently being coded TU
is scaled by a factor equal to the upsampling factor used in the reference frame upsampling (L=4). To
do this, the variables representing the first pixel position (x and y) are multiplied by 4.
2. MCP first pixel position – Then, the motion vector values corresponding to the currently being coded
TU are obtained from the motion vectors matrix and added to the scaled pixel position obtained in step
1. This new position points to the first pixel position of the MCP block in the upsampled reference
frame.
3. Upsampled MCP block – It is then possible to crop the UMCP block from the upsampled reference
frame using the position obtained in step 2 as the first pixel of this block and considering that its size is
4 times the size of the currently being coded TU.
4. MCP block - Finally, to obtain the MCP block, the UMCP block is downsampled. To do this, all the
integer positions in the UMCP are arranged in a new block with the same size of the currently being
coded TU.
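The four steps above can be sketched as follows. This is an illustrative Python/NumPy rendering of the described processing, not the thesis's MATLAB script; all names are assumed:

```python
import numpy as np

L = 4  # upsampling factor used for the reference frame, as in the text

def mcp_block(up_ref, x, y, size, mv):
    """up_ref: upsampled reference frame; (x, y): TU top-left position in the
    original frame; size: TU width; mv: (mvx, mvy) in quarter-pel units."""
    # 1. scale the first pixel position by the upsampling factor
    ux, uy = x * L, y * L
    # 2. add the motion vector to locate the UMCP first pixel position
    ux, uy = ux + mv[0], uy + mv[1]
    # 3. crop the UMCP block (L times the TU size in each dimension)
    umcp = up_ref[uy:uy + L * size, ux:ux + L * size]
    # 4. downsample: keep only the integer pixel positions
    return umcp[::L, ::L]
```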
This operation is illustrated in Figure 4.8, for a particular 4×4 TU.
Figure 4.8 – Example of MCP block computation for a 4×4 TU.
The grid presented in Figure 4.8 represents the pixels of a portion of the upsampled reference frame. Pixel U
corresponds to the scaled position of the first pixel position of this particular TU (step 1). Adding the motion
vector values to pixel U, which are (3,-2) in this case, results in the position of pixel R, which is the first pixel of
the MCP block (step 2). Then, the UMCP block (delimited by a blue line in Figure 4.8) can be obtained by
cropping the resulting 16×16 block that starts on pixel R (step 3). Finally, to downsample this UMCP block, the
R and r pixels (representing the integer pixel positions of the UMCP block) are arranged in a 4×4 block, forming
the MCP block as in Figure 4.9 (step 4).
Figure 4.9 – MCP block for the example in Figure 4.8 after the downsampling operation.
After obtaining the MCP block, this module's job is concluded and the encoding process continues to the forward
adaptive transform computation.
4.4.4. Forward Adaptive Transform
This module implements the main tool of the developed coding solution: the adaptive transform. The adaptive
transform is used to convert the prediction error block from the spatial-domain to the frequency-domain. As
referred before, the proposed adaptive transform uses two transforms, the DCT and the MKLT; thus, both these
transforms are computed. The decision about which transform coefficients are coded is only made after the
entropy encoder module, as this decision requires knowing the number of bits required for each transform's
resulting bitstream. The architecture of the forward adaptive transform is shown in Figure 4.10.
Figure 4.10 – Architecture of the forward adaptive transform module.
A detailed walkthrough of the architecture presented in Figure 4.10 is now presented to better understand each
processing block.
1) Forward DCT
The first transform to be computed is the DCT. This transform is a standard floating point 2-D DCT, already
described in Chapter 2. Thus, the forward DCT of an n×n prediction error block X is given by
C_DCT = T_DCT · X · T_DCT^T    (4.1)
where C_DCT is the n×n DCT coefficients block and T_DCT is the n×n DCT basis functions matrix defined as
t_DCT(j, i) = c_j · cos((2i + 1)·j·π / (2n)),  with c_0 = √(1/n) and c_j = √(2/n) for j > 0    (4.2)
where tDCT (j, i) represents the value of the DCT basis functions matrix at position (j, i). In the developed
MATLAB script, this transform is computed using the MATLAB function dctmtx [27], which returns the n×n
DCT basis functions. This basis functions matrix is then used to compute the DCT coefficients according to Eq.
(4.1).
With the DCT coefficients for the prediction error obtained, it is then possible to quantize and entropy encode
them.
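The computation above can be sketched in a few lines. This illustrative Python/NumPy version builds the basis functions matrix the way MATLAB's dctmtx does and applies Eq. (4.1); names are assumed:

```python
import numpy as np

# Illustrative sketch of Eqs. (4.1)-(4.2), not the thesis script: build the
# n-by-n DCT basis functions matrix (as MATLAB's dctmtx returns it) and apply
# C_DCT = T_DCT * X * T_DCT' to a prediction error block X.

def dct_matrix(n):
    T = np.zeros((n, n))
    for j in range(n):
        scale = np.sqrt(1.0 / n) if j == 0 else np.sqrt(2.0 / n)
        for i in range(n):
            T[j, i] = scale * np.cos(np.pi * (2 * i + 1) * j / (2 * n))
    return T

def forward_dct(X):
    T = dct_matrix(X.shape[0])
    return T @ X @ T.T    # Eq. (4.1); T is orthonormal, so X = T' C T
```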
2) Forward MKLT
Besides the DCT, the adaptive transform may also use a modified KLT. This MKLT is similar to the one
proposed in [15], with the only difference being related to the used shift (δ) and rotation (θ) parameters. The
MKLT computation involves three main steps: prediction error estimation computation, basis functions
computation and the MKLT transform computation itself. After the more conceptual description made in
Chapter 3, these three steps are described next with more focus on the implementation aspects.
a. Prediction error estimation computation
The implemented solution uses the same prediction error estimation technique described in Chapter 3, meaning
that it also uses the MCP block to estimate the prediction error by subtracting rotated and shifted MCP blocks
from the actual MCP block, resulting in a set of estimated prediction error blocks. The block rotations and shifts are
explained in the following, noting that both the shifts and rotations are applied over the upsampled MCP block
(UMCP) (with L=4) as both operations require quarter-pixel accuracy. To obtain the UMCP, the process used in
the MCP block computation is repeated, but now without the final downsampling step.
Rotations processing
First, the UMCP block is rotated by an angle θ using the following steps:
Coordinate system definition – First of all, the block positions need to be converted to a new
coordinate system. This is done to use a rotation matrix R allowing the rotation of points in the xy-
Cartesian plane by an angle θ around the origin of the Cartesian coordinate system with a simple matrix
multiplication [28]. In this case, the block needs to be rotated around its centre; thus, the origin of the
Cartesian coordinate system must be the centre of the block. As all processed blocks have even width
and height (e.g. 4×4, 8×8, etc.), this centre is not a pixel position, but the intersection of the four central
pixel positions. With this in mind, the adopted coordinate system used to process the rotations is shown
in Figure 4.11 for a 4×4 block.
Figure 4.11 – Adopted coordinate system for a 4×4 block.
As shown in Figure 4.11, each pixel has size 2 in this coordinate system, so that integer coordinate values
can reference the centres of the pixel positions. The odd coordinates are located at the middle of the pixel
positions, while the even ones are located at their intersections. In this way, only odd coordinates refer to
actual pixel positions, in this case corresponding to the centre of the pixel. Consequently, converting the
block pixel positions to the adopted coordinate system yields only odd coordinates.
Rotation matrix definition – With the adopted coordinate system defined, it is now possible to
perform the rotation by an angle θ around the block origin by means of a matrix R given by [28]
R = [ cos θ   −sin θ ;  sin θ   cos θ ]    (4.3)
Rotated coordinates – With this matrix, it is then possible to rotate a block using the following matrix
multiplication [28]
[ x′ ; y′ ] = R · [ x ; y ]    (4.4)
where (x’,y’) are the coordinates of the point (x,y) after rotation. Clearly, this operation can result in values
that are not odd for the rotated coordinates (x’,y’). Thus, all the values are rounded to the nearest odd value,
so they can reference an actual pixel position.
With these definitions, it is possible to rotate the UMCP block. To this end, consider the UMCP block as part
of the upsampled reference frame, with the rotation axis centred at the centre of the UMCP block. To better
understand this, Figure 4.12 shows the rotation of an upsampled 4×4 MCP block (corresponding to a 16×16
block in reality) by an angle θ around its origin.
Figure 4.12 – Rotation of a 4×4 UMCP block by an angle θ around its origin.
Figure 4.12 shows the UMCP block (green coloured area) as part of the upsampled reference frame (blue
coloured area) before rotation. After an angle θ rotation is computed, the rotated UMCP block is identified by
the darker green and darker blue pixel positions. To better understand this operation, consider a window that
initially just shows the UMCP block, hiding the rest of the upsampled reference frame. Then, by rotating this
window, some previously shown pixel positions disappear and some previously hidden pixel positions appear;
this rotated window represents the rotated UMCP block.
As already referred, the rotations are computed for upsampled blocks. Thus, the angle by which the blocks are
rotated needs to be scaled by a convenient factor. To explain how this scaling factor is defined, consider Figure
4.13, where two vectors are displayed: a vector v1 connecting the point D (located on the x-axis and distanced d
from the origin) to point P1 (located on the y-axis and distanced h from the origin) and making an angle θ1 with
the x-axis, and a vector v2 connecting the same point D to point P2 (located on the y-axis and distanced L·h from
the origin, where L represents the upsampling factor) and making an angle θ2 with the x-axis.
Figure 4.13 – Two vectors, v1 and v2, connecting the same point D to two different points, P1 and P2,
respectively.
Considering that D can be the centre of a particular block, it is possible to consider that P1 is a pixel position and
P2 is the corresponding pixel position in the upsampled block (with an upsampling factor L). With this in mind,
it is also possible to consider that the angles θ1 and θ2 represent exactly the same angle before and after the
upsampling process, respectively. Thus, to determine the convenient scaling factor for the rotation angle, it is only necessary
to find the relation between θ1 and θ2. In this way, taking into account the definition of the tangent, the
tangents of θ1 and θ2 are given by
tan θ1 = h / d,    tan θ2 = (L·h) / d    (4.5)
Combining the two equations in Eq. (4.5), it is then possible to obtain the following relation between θ1 and θ2
θ2 = arctan(L · tan θ1)    (4.6)
Using Eq. (4.6), and knowing that L=4, it is simple to obtain the scaled rotation angle for any θ value.
Concerning the θ values used to perform the rotations, besides the 0.0° and 0.5° rotation angles already used in
[15] (both clockwise and counter-clockwise), the developed coding solution also considers rotations up to a 1.0°
angle (in both directions). This results in a total of 5 possible rotations for each TU, namely 0.0°, ±0.5° and ±1.0°.
Applying the scaling factor in Eq. (4.6) to the previously mentioned θ values, results in 0.0°, 1.99° and 3.99°
rotation angles for the UMCP blocks, which are approximated to 0.0°, 2.0° and 4.0°, respectively.
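The scaling relation of Eq. (4.6) can be checked numerically with a short sketch (illustrative, not the thesis script):

```python
import math

# Numerical check of Eq. (4.6): the angle applied to the upsampled block is
# theta2 = arctan(L * tan(theta1)), with L = 4 as in the thesis.

def scaled_angle(theta1_deg, L=4):
    return math.degrees(math.atan(L * math.tan(math.radians(theta1_deg))))
```

Evaluating this function at 0.5° and 1.0° gives approximately 1.999° and 3.994°, matching the 2.0° and 4.0° approximations used above.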
In the developed script, the rotations are performed by a purpose-written MATLAB function. This function
receives as inputs the position of the UMCP first pixel in the context of the upsampled reference frame
(determined as described in the MCP block computation module), the UMCP size, the upsampled reference
frame matrix and the rotation angle to be applied. Then the following steps are performed:
Rotation matrix definition – After the conversion from degrees to radians, the sine and the cosine of
the input angle are determined using the sin [29] and cos [30] MATLAB functions. With these values, it
is possible to define the rotation matrix as in Eq. (4.3).
New coordinate system definition – Then, each block position is converted to the coordinate system
defined before. To do this, each block position is arranged in a column vector and all these column
vectors (one for each block position) are arranged sequentially in a 3-D variable. With this, all these
positions are then multiplied by a factor which basically converts the block positions to the previously
defined coordinate system, centering the coordinates origin at the block centre. This is shown in Figure
4.14 for the 4×4 block used as example in Figure 4.11.
Figure 4.14 – Block positions (blue) converted to the adopted coordinate system (red) for the block in Figure
4.11.
Rotation computation – With the new coordinate system defined, each 3-D variable column vector is
then multiplied by the rotation matrix as done in Eq. (4.4) and the obtained values are rounded to the
nearest odd value. This results in a 3-D variable containing column vectors representing the rotated
coordinates.
Rotated UMCP block computation – These rotated coordinates are then converted to the
corresponding block positions, performing the inverse operation of the operation illustrated in Figure
4.14. Additionally, the obtained rotated block positions are incremented by the value of the UMCP
block first pixel position. With this, the resulting positions are collected from the upsampled reference
frame to form the rotated UMCP block.
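The coordinate conversion, rotation and odd-value rounding above can be sketched as follows. This is an illustrative Python version (not the thesis's MATLAB function) that maps each block position to its rotated position; names and the exact rounding helper are assumptions:

```python
import numpy as np

# Illustrative sketch of the rotation steps above for an n-by-n block: convert
# each position to the centred coordinate system of Figure 4.11 (pixel centres
# at odd coordinates), rotate with Eq. (4.4), round to the nearest odd value,
# and map back to block indices.

def nearest_odd(v):
    return 2 * int(np.floor(v / 2.0)) + 1

def rotate_positions(n, theta_deg):
    th = np.radians(theta_deg)
    R = np.array([[np.cos(th), -np.sin(th)],
                  [np.sin(th),  np.cos(th)]])    # rotation matrix, Eq. (4.3)
    mapping = {}
    for row in range(n):
        for col in range(n):
            x = 2 * col - (n - 1)                # centred odd coordinates
            y = (n - 1) - 2 * row
            xr, yr = R @ np.array([x, y])        # Eq. (4.4)
            xo, yo = nearest_odd(xr), nearest_odd(yr)
            mapping[(row, col)] = ((n - 1 - yo) // 2, (xo + n - 1) // 2)
    return mapping
```

A 0° rotation maps every position to itself, as expected for the identity case.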
With this rotated UMCP block, it is then possible to process the corresponding shifts.
Shifts processing
After each rotation, the resulting rotated UMCP block can then be shifted according to a parameter δ, expressed
in pixels. With the rotated UMCP block still considered as part of the upsampled reference frame, these shifts are
made by simply incrementing and decrementing the coordinates of each rotated UMCP block pixel position by δ pixels.
This operation is illustrated in Figure 4.15, where a rotated UMCP block is shifted in all possible directions with
a shift parameter equal to δ.
Figure 4.15 – Shifts applied to a rotated UMCP block with a shift parameter equal to δ for the horizontal and
vertical directions.
From Figure 4.15, it is possible to conclude that the combination of all possible shifts results in 8 shifted UMCP
blocks for each considered rotation. Reusing the window analogy, consider now that the rotated window
(representing the rotated UMCP block) is then displaced δ pixels in all possible directions, leading to 8 new
dispositions, representing the 8 possible shifted and rotated UMCP blocks.
Concerning the available δ values, besides the 0.00, 0.25 and 0.50 pixel shifts used in [15], the developed coding
solution also considers the δ values of 0.75 and 1.00 pixels. The combination of all these shift parameters results
in the set of blocks shown in Figure 4.16.
Figure 4.16 – Set of shifted and rotated UMCP blocks for all possible δ combinations (for each θ).
The set in Figure 4.16 includes 80 shifted and rotated UMCP blocks (from block 2 to block 81) and 1 purely
rotated UMCP block (block 1). In Figure 4.16, the blue coloured blocks correspond to those blocks already used
in the solution proposed in [15] (25 in total), while the green coloured blocks correspond to the newly introduced
shift and rotations combinations.
At this stage, with both the rotation and shift computations concluded, the rotated and shifted UMCP blocks can
be downsampled, using the same technique adopted for the MCP block computation. To obtain the set of
estimated prediction error blocks, the set of rotated and shifted MCP blocks is then subtracted from the actual
MCP block. In the developed MATLAB script, these estimated prediction error blocks are stored in a 4-D
variable, with the third dimension corresponding to the available rotations and the fourth dimension
corresponding to the available shifts.
Considering that each TU can have a set of 81 estimated prediction error blocks for each rotation and there can
be a maximum of 5 different rotations, it is possible to obtain a maximum of 405 estimated prediction error
blocks (in the solution presented in [15], a total of 75 estimated prediction error blocks are used).
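The candidate set can be enumerated with a short sketch. Assuming (as Figure 4.16 suggests) that the horizontal and vertical shift components are chosen independently from {0, ±0.25, ±0.5, ±0.75, ±1.0} pixels, each rotation yields 9×9 = 81 combinations (80 shifted blocks plus the purely rotated one), for 405 in total:

```python
# Illustrative enumeration of the candidate (theta, dx, dy) combinations
# described above; the independent per-axis shift values are an assumption
# consistent with the 81-blocks-per-rotation count of Figure 4.16.

thetas = (-1.0, -0.5, 0.0, 0.5, 1.0)                            # degrees
shifts = (0.0, 0.25, -0.25, 0.5, -0.5, 0.75, -0.75, 1.0, -1.0)  # pixels
candidates = [(t, dx, dy) for t in thetas for dx in shifts for dy in shifts]
```

Restricting the shift values to {0, ±0.25, ±0.5} reproduces the 25 combinations per rotation of the solution in [15].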
b. MKLT basis functions computation
With the set of estimated prediction error blocks determined, it is then possible to compute the MKLT basis
functions exactly as done in [15]. Thus, first, the covariance matrix Σ of the set of estimated prediction
error blocks needs to be determined. This is achieved by using the equation already presented in Section 3.1,
which defines the covariance between a pixel in position (u,v) and a pixel in position (r,s) for a set of n×n
estimated prediction error blocks as
Σ(j, k) = (1/N) · Σ_{i=1..N} [E_i(u, v) − Ē(u, v)]·[E_i(r, s) − Ē(r, s)]    (4.7)
where u, v, r, s = 0…(n−1), j = u + n·v, k = r + n·s, E_i(u, v) is the estimated prediction error in position (u, v) of the i-th
block, Ē(u, v) is the corresponding mean value over the set and N is the number of blocks in the set. To implement Eq. (4.7) in the developed
MATLAB script, a function was programmed whose processing steps are described next:
First, the 4-D variable used to store the set of prediction error blocks is converted to a n2×N matrix, with
each column containing the pixel values of each block arranged in a vector. This conversion is done
with the reshape function [31] included in the MATLAB toolbox.
Then, each row of the obtained matrix, representing the pixel values of a particular position for all the
set blocks, is fixed (working as a pivot) and multiplied element-by-element to all the matrix rows
individually. Each of these multiplications results in an N-element row vector whose elements are summed and
then divided by N².
As there are n² rows in the matrix, each pivot row produces n² results from the previous operation.
These results are arranged in a row vector representing the covariance of the pixel position
corresponding to a particular pivot row with all the other pixel positions.
Doing this for all the n² rows of the matrix results in an n²×n² matrix representing the covariance matrix.
With the covariance matrix Σ of size n²×n² determined, it is then straightforward to compute the eigenvalues and
eigenvectors of this matrix, given by
Σ·Φ = Φ·Λ    (4.8)
where Φ is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues of the covariance matrix Σ,
and both have size n²×n². To implement Eq. (4.8) in the developed MATLAB script, the eig function [32]
included in the MATLAB toolbox was used. This function returns the diagonal matrix of eigenvalues and the
matrix of eigenvectors for a particular input matrix, as intended. The transpose of the eigenvectors matrix
represents the MKLT basis functions and can be used to compute the actual transform.
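The covariance and eigendecomposition steps above can be sketched as follows. This is an illustrative Python/NumPy version operating on random stand-in data (not real prediction error blocks); eigh is used instead of MATLAB's eig because the covariance matrix is symmetric, and all names are assumed:

```python
import numpy as np

# Illustrative sketch of Eqs. (4.7)-(4.8): the N estimated prediction error
# blocks are arranged as columns of an n^2-by-N matrix, the n^2-by-n^2
# covariance matrix is formed, and its transposed eigenvectors give the MKLT
# basis functions. Random data stands in for the real block set.

rng = np.random.default_rng(1)
n, N = 4, 405
E = rng.standard_normal((n * n, N))       # columns: flattened error blocks
mean = E.mean(axis=1, keepdims=True)
Sigma = (E - mean) @ (E - mean).T / N     # Eq. (4.7) in outer-product form
eigvals, eigvecs = np.linalg.eigh(Sigma)  # Eq. (4.8); eigh suits symmetric Sigma
T_mklt = eigvecs.T                        # MKLT basis functions matrix
```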
c. Forward MKLT computation
As noted before, the KLT is non-separable and the MKLT inherits this property. In this way, the actual
prediction error has to be arranged in a column vector before its transformation. This is done by laying all the
prediction error block pixels end to end, resulting in an n²-element vector for an n×n prediction error block. The forward
MKLT for an input vector x is given by
c_MKLT = T_MKLT · x    (4.9)
where c_MKLT is the n²-element MKLT coefficients vector and T_MKLT is the n²×n² MKLT basis functions matrix. To have
some conformity with the DCT, the MKLT coefficients are then rearranged in an n×n block, denominated C_MKLT,
using once again the MATLAB reshape function.
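The flatten-transform-reshape sequence above can be sketched as follows (illustrative Python, not the thesis script; column-major reshaping mimics MATLAB's reshape, and the inverse is included only to show the round trip):

```python
import numpy as np

# Illustrative sketch of Eq. (4.9): the MKLT is non-separable, so the n-by-n
# block is flattened to an n^2 vector, multiplied by the basis functions
# matrix, and the coefficients are reshaped back to n-by-n.

def forward_mklt(X, T_mklt):
    n = X.shape[0]
    x = X.reshape(n * n, order="F")        # column-major, as MATLAB reshape
    c = T_mklt @ x                         # c_MKLT = T_MKLT * x
    return c.reshape(n, n, order="F")      # C_MKLT

def inverse_mklt(C, T_mklt):
    n = C.shape[0]
    c = C.reshape(n * n, order="F")
    return (T_mklt.T @ c).reshape(n, n, order="F")  # basis is orthonormal
```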
With both transforms performed, the following step is the quantization of the obtained coefficients.
4.4.5. Quantization
The quantization of both the DCT and MKLT coefficients is performed by means of a uniform quantizer, i.e., a
quantizer with fixed size for both the input decision intervals and the output reconstruction level differences
[33]. Thus, the quantized coefficients C_Q are given by
C_Q = round(C / Q_step)    (4.10)
where C are the coefficients (either DCT or MKLT coefficients) and Q_step is the adopted quantization step. This
quantization step is obtained as in the H.264/AVC standard using the following formula [34]
Q_step(QP) = Q_step(QP % 6) · 2^⌊QP/6⌋    (4.11)
where QP is the quantization parameter and x%y defines the remainder of the division of x by y. The necessary
reference QP values with their corresponding Q_step values are shown in Table 4.2.
Table 4.2 – Reference QPs with the corresponding Qstep [34].
QP    Qstep
0     0.625
1     0.702
2     0.787
3     0.884
4     0.992
5     1.114
6     1.250
With both the DCT and the MKLT coefficients quantized, it is then possible to entropy encode them.
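The quantization rule above and its inverse can be sketched as follows (illustrative Python, not the thesis script; function names are assumptions):

```python
import numpy as np

# Illustrative sketch of the uniform quantizer above: the H.264/AVC-style
# step size doubles every 6 QP values from the Table 4.2 reference steps.

QSTEP_REF = (0.625, 0.702, 0.787, 0.884, 0.992, 1.114)

def qstep(qp):
    return QSTEP_REF[qp % 6] * 2 ** (qp // 6)   # Eq. (4.11)

def quantize(C, qp):
    return np.round(C / qstep(qp))              # Eq. (4.10)

def dequantize(CQ, qp):
    return CQ * qstep(qp)                       # inverse quantization
```

Note that qstep(6) = 1.25 = 2 × qstep(0), consistent with the doubling every 6 QP values.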
4.4.6. Entropy Encoder
The entropy encoder module is the last module of the encoding process. Besides coding the quantized
coefficients into their corresponding bitstreams, this module is also used to decide which of the available
transforms must be used to code a particular TU. This decision is made based on the number of bits necessary to
represent each transform's coefficients. In this way, the transform which can be entropy encoded using
fewer bits is selected, and its bitstream is sent to the decoder side. The entropy encoder module architecture is
presented in Figure 4.17.
Figure 4.17 – Architecture of the entropy encoder module.
The entropy encoder module includes the following steps:
1) Transform coefficients scanning
To entropy encode the quantized coefficients, they have to be first arranged in a vector. To do this, the DCT
coefficients are scanned in zigzag order and the MKLT coefficients are rearranged into their original vector
form. In terms of implementation, this is done, for the DCT case, with a function programmed by the author of
this Thesis. This function receives a matrix containing the DCT coefficients and rearranges them according to
the zigzag scanning order, returning the corresponding vector. For the MKLT case, the conversion from a 2-D to
a 1-D representation is performed with the basic MATLAB manipulation tools.
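The zigzag scanning step can be sketched as follows. This is an illustrative Python version, not the MATLAB function written for the thesis; it walks the anti-diagonals of the block in the usual JPEG order:

```python
# Illustrative zigzag scan for an n-by-n coefficient block (names assumed).

def zigzag_indices(n):
    order = []
    for s in range(2 * n - 1):               # walk the anti-diagonals
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        order.extend(diag if s % 2 else diag[::-1])
    return order

def zigzag_scan(block):
    return [block[i][j] for i, j in zigzag_indices(len(block))]
```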
2) Run-level encoder
Then, both coefficient vectors are coded using the run-level coding method used in JPEG [35]. In this method,
the encoder basically organizes the quantized coefficients vector in (run, level) pairs, where the run indicates the
number of null coefficients between the last and the current non-null coefficient and the level indicates the
quantized amplitude of the current coefficient.
To implement this encoder, a MATLAB function was programmed. This function uses an auxiliary variable to
store the number of null coefficients, which is initialized with the value 0. Then, each coefficient can be coded in
two different ways:
Null coefficient – If the currently being coded coefficient is null, the auxiliary variable is incremented
and the coding proceeds to the next coefficient.
Non-null coefficient – If the currently being coded coefficient is non-null, the number of null
coefficients since the last non-null coefficient (stored in the auxiliary variable) and the coefficient's
amplitude are added to the output string as a (run, level) pair, and the auxiliary variable is
re-initialized to 0.
Doing this for all the coefficients in both transform vectors results in two strings with the corresponding (run,
level) pairs, one for each available transform.
3) LZ77 encoder
Finally, these strings comprised by (run, level) pairs are entropy encoded using the LZ77 lossless data
compression algorithm [36]. This algorithm is used mainly because of its simple implementation. Since the
entropy coding is not the object of study in this work, the author of this Thesis tried to find a solution that could
exploit the data statistical redundancy in the best possible way, without requiring too much time in its
development and implementation.
The LZ77 algorithm exploits the character redundancy in an input stream by replacing portions of the data with
references to matching data previously processed. To do this, a sliding window is used that is comprised by a
search buffer and a lookahead buffer. The search buffer goes from the beginning of the sliding window to the
character immediately before the current coding position. This buffer is used to search for data matches within
the lookahead buffer, which goes from the current coding position to the end of the sliding window. To better
understand this algorithm, consider the input stream in Figure 4.18 where the third character ('B') is being
coded.
Figure 4.18 – LZ77 terminology considering the coding of the third character in the input symbol stream.
The output of this encoder is a sequence of (length, distance) pairs followed by the explicit character that was
not found in the search buffer. The length indicates the number of characters that the decoder has to go back in
order to find the beginning of the match, while the distance indicates the number of characters that the decoder
has to copy to its output. In this way, the encoder‟s output for the input stream in Figure 4.18 would be:
(0,0) A; (1,1) B; (0,0) C; (2,1) B; (5,2) End
After some tests performed with blocks of size 4×4, 8×8, 16×16 and 32×32, it was noted that the LZ77 encoder
provides higher compression factors for sliding windows of size n² − 2 (considering an n×n block). In this way,
for 4×4 TUs, the sliding window size is fixed at 14 characters and, for 8×8 TUs, the sliding window size is fixed
at 62 characters. However, for 16×16 and 32×32 block sizes, the sliding windows are also defined with size 62,
since the use of larger window sizes does not bring a sufficient compression ratio improvement to compensate
the complexity increase. These tests were not performed for 64×64 blocks (another possible transform size in the
HEVC codec) because, at this stage, it was already decided that the maximum transform size used to test the
adopted coding solution would be 32×32 due to the very significant complexity increase caused by the
computation of 64×64 transforms.
To implement this entropy encoder, a MATLAB function was programmed with the following steps:
While the lookahead buffer is not empty, the search buffer is searched in order to find the longest match
with the lookahead buffer characters. One of two things can then happen:
o If a match is found, the function adds the corresponding (length, distance) pair and the next
character in the lookahead buffer (referred before as the explicit character) to the function
output.
o If no match is found in the search buffer, the function adds a (length, distance) pair with values
(0,0) and the next character in the lookahead buffer to the function output.
In either case, after the previous step, the sliding window is shifted by the number of coded characters
(if a match was found, this number is equal to distance + 1; otherwise, it is just 1). Then, the next
coding position is processed in the same way.
This method is processed until there are no more characters to code in the input stream. It has to be noted that
each (length, distance) pair element is coded with the same number of bits necessary to represent half the sliding
window size value in binary representation. In this way, the entropy decoder can unambiguously identify each
(length, distance) pair.
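The LZ77 round trip can be sketched as follows. This is an illustrative Python version, not the thesis's MATLAB function; it follows the pair convention used above, where length is how far the decoder steps back and distance is how many characters it copies, each token carrying the explicit character as a third element. The default window of 14 matches the 4×4 TU case:

```python
# Illustrative LZ77 sketch (names assumed): tokens are (length, distance,
# explicit char), with length = backward offset and distance = copy count,
# following the naming convention used in the text above.

def lz77_encode(data, window=14):
    tokens, pos = [], 0
    while pos < len(data):
        start = max(0, pos - window)
        best_back, best_count = 0, 0
        for back in range(1, pos - start + 1):
            count = 0
            # overlapping matches are allowed, as in LZ77
            while (pos + count < len(data) - 1 and
                   data[pos - back + (count % back)] == data[pos + count]):
                count += 1
            if count > best_count:
                best_back, best_count = back, count
        tokens.append((best_back, best_count, data[pos + best_count]))
        pos += best_count + 1
    return tokens

def lz77_decode(tokens):
    out = []
    for back, count, ch in tokens:
        start = len(out) - back
        for i in range(count):        # copy from earlier output (may overlap)
            out.append(out[start + i])
        out.append(ch)                # the explicit character
    return "".join(out)
```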
4) Decision module
With both transform bitstreams generated, it is then possible to select the one to be transmitted to the decoder
side. To do this, the size of the DCT and MKLT bitstreams are compared, and the one represented with fewer
bits is selected. For the decoder to recognize which transform was selected, an extra bit is included in the coded
bitstream; more precisely, a '0' is used for the DCT and a '1' is used for the MKLT.
With this, the adaptive transform bitstream is sent to the decoder and the encoding process is concluded. In the
next section, the decoding process is analyzed.
4.5. AT Decoder Functional Description and Implementation Details
After the functional description and the explanation of the implementation details of the HEVC framework and
the AT encoder processes, this section is dedicated to the explanation of the AT decoding process. From the
observation of Figure 4.1, it is possible to conclude that the decoding process includes six modules: the
reference frame upsampling, the MCP block computation, the entropy decoder, the inverse quantization, the
inverse adaptive transform and the frame reconstruction modules. These modules will be described and analyzed
in the following sections, with exception of the reference frame upsampling and the MCP block computation
modules that were already explained in the AT encoder section.
It has to be noted at this stage that the decoding process was not implemented independently from the encoder.
Instead, the inverse quantization and the inverse transform of each TU coefficients are performed immediately
after their transform and quantization in the developed MATLAB script. Thus, after the transform selection
made in the entropy encoder, the reconstructed prediction error block for the selected transform is copied to the
reconstructed prediction error frame, using the TU information obtained in the frame partitioning module. In this
way, some of the modules present in the decoding process were not really implemented, since the information
provided by them was already available as no real transmission was performed. This is the case of the entropy
decoder module, that decodes the adaptive transform quantized coefficients, and the frame reconstruction
module, that uses each TU split flag and coding mode to obtain the corresponding TU location and size in
relation to the frame. This approach does not influence the coding performance of the adopted coding solution,
but it just takes advantage of the fact that, in reality, both the encoding and decoding processes were
implemented in the same platform.
4.5.1. Entropy Decoder
To decode the AT bitstream created by the entropy encoder, an entropy decoder is used that basically performs
the inverse operation performed at the encoder. Once again, it has to be noted that this module was not
implemented in the developed MATLAB script, since the data provided by it was already known to the
implementation. The architecture of the entropy decoder module is presented in Figure 4.19.
Figure 4.19 – Architecture of the entropy decoder module.
A walkthrough of the processing blocks present in Figure 4.19 is now presented.
1) Selected transform bit extraction
The first operation to be performed is to read the extra bit that indicates the transform selected for the encoding
process of each TU ('0' for the DCT and '1' for the MKLT).
2) LZ77 decoder
Then, a LZ77 decoder is used to process the (length, distance) pairs contained in the bitstream. This decoder
uses the same sliding window size used in the encoding process. A LZ77 decoder is implemented with the
following steps [37]:
For each (length, distance) pair followed by an explicit character, the value of the variable length is
verified and one of the following steps is taken:
o If length has a value equal to 0, the explicit character is printed to the decoder output.
o Otherwise, a distance number of characters are copied from the current output (before the
process of this step) starting from the character distanced by length positions from the last
position of this output. The copied characters are then added to the output, along with the
explicit character.
This process is done for the whole bitstream, resulting in a number of (run, level) pairs, as defined in the forward
adaptive transform module.
3) Run-level decoder
These pairs are then processed to be arranged in a vector. To implement this decoder, considering an n×n TU, an n²-element
vector must first be pre-allocated with 0s. Then, starting with a pointer to the first position of the vector, for each
(run, level) pair, the number of null coefficients (run) must be added to this pointer and the coefficient amplitude
(level) must be copied to the position indicated by the pointer. This is performed until all the (run, level) pairs are
processed.
4) Coefficients arrangement
Depending on the selected transform, the vector obtained with the run-level decoder is then arranged in an n×n
block using inverse zigzag scanning (for the DCT) or sequential column order (for the MKLT). For the DCT case, if this
module was implemented, a MATLAB function would be developed performing exactly the opposite of the
function developed for the encoder process. Thus, this function would receive the vector with the DCT
coefficients arranged in the zigzag order, and would rearrange them returning a block of DCT coefficients. Once
again, for the MKLT, this operation is trivial using basic MATLAB functions; in this case, the reshape function
57
would be the ideal choice. In both cases, this operation results in a n×n block of reconstructed quantized
coefficients, C’Q.
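The inverse zigzag rearrangement that such a function would perform can be sketched as follows (illustrative Python with NumPy; the exact orientation of the diagonals is an assumption, since the encoder and decoder only need to agree on the same scan order):

```python
import numpy as np

def zigzag_order(n):
    """(row, col) positions of an n-by-n block in zigzag scan order:
    anti-diagonals of constant row+col, traversed in alternating direction."""
    coords = []
    for s in range(2 * n - 1):
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            diag.reverse()   # even diagonals run bottom-left to top-right
        coords.extend(diag)
    return coords

def inverse_zigzag(vec, n):
    """Rearrange a zigzag-ordered coefficient vector into an n-by-n block,
    the exact opposite of the encoder-side scan."""
    block = np.zeros((n, n))
    for k, (i, j) in enumerate(zigzag_order(n)):
        block[i, j] = vec[k]
    return block
```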
The decoding process is then continued with the inverse quantization and the inverse adaptive transform
modules.
4.5.2. Inverse Quantization
Having the quantized coefficient levels, it is then necessary to scale them back to their original amplitude range.
In this way, the inverse quantization of the n×n block of reconstructed quantized coefficients C'Q is given by

C' = C'Q × Qstep (4.12)
where C’ is the n×n block of reconstructed coefficients and Qstep is the quantization step, obtained in the same
way as in the forward adaptive transform module.
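For illustration, this element-wise rescaling amounts to the following Python sketch (assuming uniform scalar quantization without a rounding offset, matching the forward module description):

```python
import numpy as np

def inverse_quantize(cq_block, qstep):
    """Inverse quantization: C' = C'Q * Qstep, applied element-wise to the
    n-by-n block of reconstructed quantized coefficient levels."""
    return np.asarray(cq_block, dtype=float) * qstep
```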
4.5.3. Inverse Adaptive Transform
The inverse adaptive transform module is used to reconstruct the prediction error block for each TU. To do this,
the coefficients received from the inverse quantizer are inverse transformed using the transform indicated by the
selected transform bit, received from the entropy decoder. The architecture of this module is presented in Figure
4.20.
Figure 4.20 – Architecture of the inverse adaptive transform module.
The architecture of Figure 4.20 includes the following steps:
1) Selection module
At this stage, it is necessary to determine which inverse transform must be computed. This selection is
made according to the selected transform bit available from the entropy decoding process. If this bit is equal to
'0', then the inverse DCT is computed; if, on the other hand, the selected transform bit is equal to '1', then the
inverse MKLT is computed.
2) Inverse DCT
The inverse DCT of the n×n reconstructed coefficients block C' is given by

X' = TDCT^T × C' × TDCT (4.13)

where X' is the n×n reconstructed prediction error block and TDCT is the n×n DCT basis functions matrix given
once again by

TDCT(p, q) = √(1/n), for p = 0 and 0 ≤ q ≤ n−1
TDCT(p, q) = √(2/n) × cos[(2q + 1)pπ / (2n)], for 1 ≤ p ≤ n−1 and 0 ≤ q ≤ n−1 (4.14)

In practice, this computation is once again made by obtaining the DCT basis functions matrix with the
MATLAB dctmtx function and then computing the matrix multiplication in Eq. (4.13).
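This computation can be sketched in Python/NumPy as follows (replicating the dctmtx convention, and assuming the forward transform was C = T·X·T^T so that, by orthonormality of T, the inverse is X' = T^T·C'·T):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT basis matrix, equivalent to MATLAB's dctmtx(n):
    row p samples the p-th cosine basis function at the n positions q."""
    p = np.arange(n)[:, None]
    q = np.arange(n)[None, :]
    t = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * q + 1) * p / (2 * n))
    t[0, :] = np.sqrt(1.0 / n)   # the DC row has a different normalization
    return t

def inverse_dct(c_block):
    """2-D inverse DCT of a reconstructed coefficient block: X' = T^T C' T."""
    t = dct_matrix(c_block.shape[0])
    return t.T @ c_block @ t
```

A quick sanity check is that the basis is orthonormal (T·T^T = I), which guarantees perfect reconstruction of the forward transform in the absence of quantization.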
3) Inverse MKLT
If the forward adaptive transform module selected the MKLT, its basis functions have to be once again
computed as they were computed in the forward transform module. Thus, a set of estimated prediction error
blocks is first computed using the same technique described in the forward adaptive transform module. Then, the
estimated prediction error blocks set covariance matrix and the corresponding eigenvectors matrix are
determined. By computing the transpose of the eigenvectors matrix, it is possible to obtain the MKLT basis
functions.
With the MKLT basis functions available, the n×n reconstructed coefficients block C' has to be arranged in an n²
vector of reconstructed coefficients c'. Once again, this is done because the MKLT inherits the non-separable
property of the standard KLT. Then, the inverse MKLT of the n² reconstructed coefficients vector c' is given by

x' = TMKLT^T × c' (4.15)

where x' is the n² reconstructed prediction error vector and TMKLT is the n²×n² MKLT basis functions matrix.
After the inverse MKLT computation, the reconstructed prediction error vector is rearranged in an n×n block,
representing the prediction error block. In terms of the implementation details, there is nothing to add to the
forward MKLT details explained earlier.
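The basis recomputation and inverse transform can be sketched as follows (illustrative Python/NumPy; the orthonormality of the eigenvector matrix is what makes the transpose act as the inverse, and the column-order vectorization mirrors MATLAB's reshape):

```python
import numpy as np

def mklt_basis(estimated_blocks):
    """Compute KLT-style basis functions from a set of estimated prediction
    error blocks: vectorize the blocks, form their covariance matrix and
    take the transpose of its eigenvector matrix as the basis."""
    data = np.stack([b.reshape(-1, order="F") for b in estimated_blocks])
    cov = np.cov(data, rowvar=False)          # n^2 x n^2 covariance matrix
    _, eigvecs = np.linalg.eigh(cov)          # columns are eigenvectors
    return eigvecs.T                          # rows are the basis functions

def inverse_mklt(c_block, basis):
    """Inverse MKLT: vectorize C' in column order, apply x' = T^T c'
    (valid because the basis is orthonormal) and reshape back to n-by-n."""
    n = c_block.shape[0]
    x_vec = basis.T @ c_block.reshape(-1, order="F")
    return x_vec.reshape(n, n, order="F")
```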
At this stage, independently of the selected transform, this module ends its processing, having obtained the
reconstructed prediction error block. Then, all TUs prediction error blocks are sent to the frame reconstruction
module to be arranged in the final prediction error frame.
4.5.4. Frame Reconstruction
As referred before, this module is used to arrange the various prediction error blocks, corresponding to the inter-
coded TUs into which the frame was partitioned, in a single frame. In the developed implementation, this is done
by simply copying the prediction error blocks pixel values to their corresponding location in a matrix with the
size of the original frame. The location of each prediction error block is the same as the location of the
corresponding TU. Thus, using the information about the first pixel position and the size of each TU, it is
straightforward to obtain the final reconstructed prediction error frame. However, considering two separate
encoding and decoding platforms, the frame partitioning module described in the encoding process would have
to be computed once again, using the TU split flags and coding modes, to obtain the necessary TU information
(i.e. first pixel position and size).
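A minimal sketch of this copy operation follows (illustrative Python/NumPy; each TU's first pixel position and size are assumed to be available from the frame partitioning information):

```python
import numpy as np

def reconstruct_frame(tu_blocks, height, width):
    """Arrange the TU prediction error blocks in a single frame: each entry
    of tu_blocks is a (top, left, block) tuple, where (top, left) is the
    TU's first pixel position within the frame."""
    frame = np.zeros((height, width))
    for top, left, block in tu_blocks:
        h, w = block.shape
        frame[top:top + h, left:left + w] = block
    return frame
```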
With this module, the decoding process is concluded.
4.6. Summary
In this chapter, the coding solution developed and implemented in this Thesis was presented to the reader. This
solution is based on the solution proposed in [15] as it uses an adaptive transform that can switch between the
DCT and a modified KLT, depending on the content being coded. The main differences regarding the
solution in [15] are related to the codec with which the adaptive transform is combined, the HEVC standard
instead of the H.264/AVC standard, and the set of shift and rotation parameters used for the MKLT prediction
error estimation process. These differences were explained in detail in this chapter.
With this chapter, the reader was introduced not only to the main concepts behind the adopted solution, some
already presented in Section 3.1, but also to its implementation details. At this stage, it is possible to proceed to
the evaluation of the performance of the adopted coding solution. This is the objective of the next chapter, which
includes a detailed performance assessment of the implemented solution.
Chapter 5
Performance Evaluation
The main purpose of this chapter is to evaluate the performance of the video coding solution designed in Chapter
4, combining the HEVC codec under development and the adaptive transform described in [15]. This assessment
is the natural final step to check the utility and effectiveness of this solution in the current video coding
landscape, that is, also in comparison with the relevant, already available benchmarks. For this, a number of
experiments have been conducted with the proposed video coding solution, notably in terms of the adopted
transform. To achieve meaningful results, appropriate test conditions have to be adopted; these conditions are
presented in the first section of this chapter, including the video sequence details and the coding parameters, as
well as the considered benchmarks and the metrics used to assess the coding performance. After, the test results
are presented, followed by their analysis.
5.1. Test Conditions
To evaluate the performance of the proposed video coding solution in a solid and reliable way, appropriate test
conditions have to be first defined. This is also done to avoid differences in the testing methodology from one
experiment to another, which may lead to misleading results and conclusions. With this in mind, the next
subsections will first present the video sequence details and the coding parameters; after, the assessment metrics
and the benchmarks selected to evaluate the proposed coding solution performance are described.
5.1.1. Video Sequences
To obtain the results needed to evaluate the performance of the designed video coding solution, it is first
necessary to select the video sequences to be coded. These video sequences will play a major role in the obtained
results and derived conclusions, since their characteristics can heavily influence the behavior of the video codec
under test.
Spatial and temporal resolutions
For this study, two types of video resolutions have been used:
CIF resolution corresponding to 352×288 samples for the luminance and half this resolution in each
direction for the chrominances (4:2:0 content); for this spatial resolution, the adopted frame rate has been
30 fps. This resolution is used to allow the comparison of the adopted coding solution performance with
the results obtained in [15], which also used CIF resolution video sequences.
HD resolution corresponding to 1920×1080 samples for the luminance and half this resolution in each
direction for the chrominances (4:2:0 content); for this spatial resolution, the adopted frame rate has been
24 fps as this is the combination adopted by the JCT-VC team. This resolution has been selected as this
is one of the main target resolutions for the HEVC standard currently under development.
The selected video sequences corresponding to these resolutions are presented next.
CIF video sequences
Three CIF resolution video sequences have been selected: Container, Foreman and Mobile. All selected CIF
video sequences include 300 frames, and the full sequences have been coded to obtain the performance
results. The first frames of these video sequences are presented in Figure 5.1.
Figure 5.1 – First frame of the selected CIF video sequences.
Figure 5.1 (a) shows the first frame of the Container video sequence. In this sequence, the video camera
basically follows a container ship movement (i.e. panning movement); this results in small motion activity. In
terms of spatial complexity, this video sequence includes rather homogenous areas, with a large portion of the
frame dominated by the sea. For this type of content, it is possible to use larger coding blocks, which provide
similar quality to the use of various smaller coding blocks, but using a smaller number of bits. The only spatial
detail requiring smaller coding blocks is the waving flag, whose movement cannot be too well predicted.
In Figure 5.1 (b), the first frame of the Foreman sequence is presented. The first frames of this sequence have
almost no motion, just including small movements of the speaking person's head. At approximately frame 160, a
fast camera panning is done, showing a completely new scenario with a building under construction. With this
panning, this sequence can be considered to have high motion activity. In terms of spatial details, it has to be
noted that the background building has some strong directional edges at the separations between the various
floors, which can be harder to code with the DCT.
The first frame of the Mobile sequence is presented in Figure 5.1 (c). This sequence has consistent and medium
motion activity. However, it has a large number of spatial details, particularly in the calendar illustration; these
spatial details may cause the encoder to use smaller TU sizes in order to code them in a very efficient way.
HD video sequence
To test the adopted coding solution performance for HD video sequences, the Kimono sequence was selected.
Only one HD sequence was tested since the developed MATLAB script takes a long time to fully process this
type of video resolution, and so this was the only sequence whose coding finished before this Thesis submission.
This video sequence includes 240 frames but, in this case, only the first 50 frames were coded with the adopted
coding solution. This decision is related, once again, to the large amount of time needed to
code each of these high resolution sequences. The first frame of the Kimono sequence is presented in Figure 5.2.
Figure 5.2 – First frame of the selected HD video sequence: Kimono sequence.
The motion activity present in the first 50 frames of the Kimono sequence, which are the frames coded for this
sequence, is similar to the motion activity of the Container sequence. In this case, a smooth panning is made to
follow a woman's movement across a field with trees. With this, there are only small changes in the woman's
details, e.g. facial expression and clothes. The background changes slowly with the panning movement, but not
abruptly, since it is always dominated by trees and leaves.
5.1.2. Coding Conditions
After the definition of the test video sequences, it is then necessary to define the conditions and parameters used
in the coding process. Thus, the HEVC encoder configuration and the adaptive transform parameters are
specified in the following.
HEVC encoder configuration
To encode the selected test video sequences using the HEVC codec (TMuC software, version 0.9 [20]), the
“Random access, high-efficiency setting” defined by the JCT-VC team has been used (described in Section 4.3 of
[38]). This configuration was used since the objective of this work is to study the coding efficiency of the
developed coding solution and not its complexity, and this is the appropriate JCT-VC-defined configuration for
this purpose. This configuration is used with the following parameters:
Largest and smallest CTB size – To perform the tests, 32×32 and 4×4 sizes are used for the LCTB
and SCTB sizes, respectively. The HEVC encoder uses a rate-distortion optimization method to decide
how to partition the frame in CTBs. Thus, all the possible partitioning solutions are tested before this
decision is made. Since the maximum transform size was already defined to be 32×32 (as referred in
Chapter 4), there is no need to waste encoding time with CTBs bigger than 32×32.
Maximum and minimum transform size – The TU maximum and minimum sizes are also defined as
32×32 and 4×4, respectively. The maximum transform size is limited to 32×32 due to the high
computational time required by the application of 64×64 transforms, both for the HEVC codec and for
the developed MATLAB script, as already referred.
GOP structure and size – Each GOP starts with an intra-coded frame (I-frame) and is followed by P
inter-coded frames (P-frames) until its end (i.e. IPPP…P). The GOP size, corresponding to the period
between two intra-coded frames, is equal to 24 for both CIF and HD sequences.
Single reference frame – To simplify the implementation of the adopted coding solution, the motion
prediction is always based on only one reference frame, the previously coded frame.
Quantization parameters – To allow the performance evaluation of the various coding solutions for
several quality levels, five quantization parameters have been adopted: 16, 22, 27, 32 and 37. The last
four quantization parameters were selected according to the recommendation made by JCT-VC in [38];
the QP of 16 was added to extend the performance evaluation to higher bitrates. To determine the Qstep
values for the selected QPs, Eq. (4.11) was used, resulting in the Qstep values presented in Table 5.1.
Table 5.1 – Selected QPs and their corresponding Qstep values.
QP Qstep
16 63.4880
22 126.9760
27 226.3040
32 402.9440
37 718.8480
Finally, it has to be noted that the deblocking filter and the rotational transform were disabled in the TMuC
software. For the deblocking filter, this was done to allow the correct verification of the extracted data without
the need to recreate this process. The rotational transform was disabled to save the required computation effort as
its use was not essential in the context of this study. With the HEVC codec configuration defined, the adaptive
transform parameters are described next.
Adaptive transform parameters
To compare the adopted coding solution performance with the benchmarks defined in the following, three
coding modes for the proposed adaptive transform have been defined with the following parameters:
Adaptive transform with half range shift and rotation parameters – This mode of the adaptive
transform uses a Half Range shift and rotation parameters Set (HRS) to compute the MKLT basis
functions. This means that the maximum shift parameter is δ = 0.5 pixels and the maximum rotation
parameter is θ = 0.5°; this AT mode is basically the same as used in [15].
Adaptive transform with full range shift and rotation parameters – In this mode, the used MKLT
basis functions are computed with a Full Range shift and rotation parameters Set (FRS). This means
that the maximum shift parameter is δ = 1.0 pixels and the maximum rotation parameter is θ = 1.0°; this
AT mode uses more shifts and rotations to estimate the prediction error than those used in [15].
Adaptive transform with HRS and FRS – With this adaptive transform coding mode, the MKLT is
basically divided into two MKLTs: one using the HRS mode and another using the FRS mode. In this
way, at the decision module, the selection is made between 3 transforms: the DCT, the MKLT with
HRS and the MKLT with FRS. It has to be noted that the inclusion of this AT mode requires 2 bits to
signal the selected transform, instead of just 1 bit as used for the AT solution scheme proposed in [15].
With the coding conditions clearly defined, the metrics used to evaluate each solution performance are presented
next.
5.1.3. Performance Evaluation Metrics
To evaluate the performance of the various coding solutions, their Rate-Distortion (RD) curves are obtained.
These curves are obtained by plotting the objective quality metric value for the reconstructed prediction error as
a function of the amount of bits per second needed to code it. In the following, the adopted objective quality
metric is the PSNR, as commonly done in the literature. The PSNR and the bitrate metrics are defined as:
PSNR – The Peak Signal-to-Noise Ratio is a metric used to measure the ratio between the maximum
possible power of a signal and the power of the corrupting noise affecting the fidelity of its
representation [39]. In the video coding context, the PSNR is used to measure the objective quality of
the decoded signal in comparison to the original input signal. It is commonly defined via the Mean
Squared Error (MSE), which, for a m×n frame, is given by
MSE = (1/(m×n)) × Σi Σj [O(i, j) − D(i, j)]² (5.1)
where O(i, j) is the original input signal at position (i, j) and D(i, j) is the decoded signal at position (i, j).
With this, the PSNR is given by

PSNR = 10 × log10(MAX² / MSE) (5.2)
where MAX is the maximum value of the input signal (255 for 8 bits samples). In the context of this study,
the PSNR is used to measure the objective quality of the reconstructed prediction error in comparison to the
actual prediction error.
Bitrate – The bitrate metric is typically used to measure the amount of data needed to code the original
input signal. However, in the context of this work, it represents the number of bits per second (bits/s)
needed to code a particular video sequence's prediction error. In this case, the bitrate is not defined and
controlled directly through a rate control tool, but rather results from the selection of a particular QP.
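As an illustration, the MSE/PSNR computation of Eqs. (5.1) and (5.2) can be written as the following Python sketch:

```python
import numpy as np

def psnr(original, decoded, max_value=255.0):
    """PSNR of a decoded frame against the original: the mean squared
    error of Eq. (5.1) plugged into PSNR = 10*log10(MAX^2 / MSE)."""
    o = np.asarray(original, dtype=float)
    d = np.asarray(decoded, dtype=float)
    mse = np.mean((o - d) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)
```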
Naturally, the RD curves are expected to show lower PSNR values for the lower bitrates and higher PSNR
values for the higher bitrates. With the RD curves, it is possible to evaluate the average PSNR improvements and
the average bitrate savings of one coding solution against another. This may be performed by means of the
widely used Bjontegaard metric [40], which is described next:
Average PSNR improvement of one solution versus another – To measure the average PSNR
improvement of a particular coding solution over another, the bitrate axes of both RD curves first
have to be converted to a logarithmic scale. Then, the resulting RD curves are approximated by
cubic functions, in what is called a fitting process. With the cubic functions for both coding solutions
available, it is then simple to compute the integral of both functions over a given interval (ranging from
the minimum to the maximum available bitrate values). The difference between these two integral
values, normalized by the integration interval, results in the average PSNR improvement between the
two coding solutions.
Average bitrate savings of one solution versus another – To evaluate the average bitrate savings
between two coding solutions, the RD curves first have to be inverted, in order to provide the bitrate
as a function of the PSNR. Then, the bitrate axes are again converted to the logarithmic scale and the
resulting RD curves are approximated by cubic functions. With this, it is then possible to compute the
difference between the integrals of both coding solutions' cubic functions over the same interval
(ranging from the minimum to the maximum available PSNR values); this difference, again normalized
by the integration interval, gives the average bitrate saving between them.
The Bjontegaard metric described above has been computed by means of a MATLAB script developed by
Valenzise, available in [41].
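The BD-PSNR computation described above can be sketched as follows (illustrative Python/NumPy, not Valenzise's script; the interval-normalized integral difference gives the average vertical gap between the two fitted curves):

```python
import numpy as np

def bd_psnr(rate_a, psnr_a, rate_b, psnr_b):
    """Average PSNR improvement of solution B over solution A: fit cubic
    polynomials of PSNR versus log10(bitrate) to both RD curves and
    average the difference of their integrals over the overlapping range."""
    la = np.log10(np.asarray(rate_a, dtype=float))
    lb = np.log10(np.asarray(rate_b, dtype=float))
    pa = np.polyfit(la, np.asarray(psnr_a, dtype=float), 3)
    pb = np.polyfit(lb, np.asarray(psnr_b, dtype=float), 3)
    lo, hi = max(la.min(), lb.min()), min(la.max(), lb.max())
    int_a = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    int_b = np.polyval(np.polyint(pb), hi) - np.polyval(np.polyint(pb), lo)
    return (int_b - int_a) / (hi - lo)
```

The average bitrate saving (BD-rate) follows the same recipe with the roles of the axes swapped, fitting log-rate as a function of PSNR.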
5.1.4. Coding Benchmarks
Clearly, the main feature of the developed video coding solution is the adaptive transform. In this way, the
performance evaluation reported in this chapter had to focus on the coding performance changes related to the
use of this transform coding tool, notably in comparison with the usual DCT. To assess these changes, the
following coding solutions are benchmarked:
HEVC with DCT (DCT) – In this codec, relevant coding data is extracted from the HEVC codec, as
explained in Chapter 4, and after, instead of the proposed adaptive transform, the DCT is used for all
inter-coded TUs.
HEVC with the adaptive transform with HRS (AT HRS) – With this codec, the relevant coding data
is once again extracted from the HEVC codec and then the inter-coded TUs are transformed using the
proposed adaptive transform with HRS.
HEVC with adaptive transform with FRS (AT FRS) – To test the performance changes related to the
introduction of the new shift and rotation parameters, a codec using the proposed adaptive transform
with FRS is also used. Once again, this codec uses the relevant coding data extracted from the HEVC
codec.
HEVC with adaptive transform with HRS and FRS (AT HFRS) – Finally, the performance of a
codec using the adaptive transform with both HRS and FRS is assessed. This is a special case of the
proposed adaptive transform, which may choose between 3 transforms as explained before. To perform
the coding of the inter-coded TUs, this codec also uses the relevant coding data extracted from the
HEVC framework.
With these benchmarks, it is possible to evaluate the relative coding performance of the proposed adaptive
transform with the following objectives in mind:
To assess if the proposed HEVC with adaptive transform solution can obtain similar performance gains
to those achieved in [15] (where the adaptive transform was integrated in the H.264/AVC codec), but
now in the context of the emerging HEVC standard.
To evaluate the coding performance of the proposed HEVC with adaptive transform solution for high
definition video content, whose adoption is growing quickly, when compared to its coding performance
for lower resolution video sequences (i.e. CIF).
To evaluate the coding performance of the proposed HEVC with adaptive transform solution when
using larger shift and rotation parameters than those used in [15].
Having clearly defined the test contents, conditions and benchmarks, the results of the experiments performed
are presented next, followed by their analysis.
5.2. Results and Analysis
To analyze the obtained experimental results, the RD performance results obtained for the various tested video
sequences are presented first for the CIF video sequences and after for the HD sequence. These results include
RD curves for the three individual transforms that can be selected by the adaptive transform - DCT, MKLT HRS
and MKLT FRS - and also for the RD curves corresponding to the four codecs defined as benchmarks: the
HEVC codecs using the proposed adaptive transform – AT HRS, AT FRS and AT HFRS – and the HEVC codec
using the DCT. Naturally, the DCT transform RD curve corresponding to the individual transform comparison
and the HEVC with DCT RD curve are going to be the same. Additionally, the Bjontegaard metric is applied to
each adaptive transform mode versus the DCT, to measure the average PSNR improvement and the average
bitrate saving for each one of these codecs.
Besides the RD performance-based results, the statistics about the used TU sizes and the selected transforms for
each adaptive transform codec are also presented. These results are used to better understand the proposed
adaptive transform selection process.
5.2.1. Performance for CIF Resolution Video Sequences
As mentioned before, three CIF resolution video sequences have been coded to assess the adopted coding
solution: Container, Foreman and Mobile. The RD performance results obtained with these video sequences are
presented in the following.
Container video sequence
The Container sequence RD performance obtained for the DCT, the MKLT HRS and the MKLT FRS is shown
in Figure 5.3. From this figure, it is possible to draw the following conclusions:
The first observation that has to be made is related to the achieved bitrate values, for this and for all the
tests performed in this work. It is clear that these values are much higher than the ones achieved
with the current state-of-the-art video coding standard. However, the author is fully aware of this
fact and, thus, these values are never used in their absolute form to draw any type of conclusion about
the adopted coding solution performance. Instead, these values are always analyzed in a relative way.
The reason behind these high bitrate values is principally related to the entropy coder used in this
solution. As explained in Chapter 4, it is a very simple entropy coder which is not the object of study of
this work. Additionally, it is also true that the HEVC encodings are always performed using only P-frames
and a single reference frame, which does not allow the exploitation of all the motion prediction tools.
Comparing the MKLTs performance to the DCT, it is possible to see that both MKLTs can only
outperform the DCT for the low bitrates (corresponding approximately to the first two QPs). For
bitrates larger than 2 Mbit/s, the DCT starts to offer better prediction error PSNR than the MKLTs
(considering the same bitrate). This improvement becomes even more evident for the higher bitrates.
Comparing now the performance of both MKLTs, it seems that the use of the FRS versus the HRS
(used in [15]) can bring slight improvements in terms of RD performance. This improvement is almost
imperceptible for the lower bitrates, where the RD curves for the MKLT FRS and the MKLT HRS are
basically the same.
Figure 5.3 – Container sequence RD performance for the DCT, MKLT HRS and MKLT FRS.
To see how these results can influence the adaptive transform performance for the Container sequence, Figure
5.4 shows the RD performance for the DCT, AT HRS, AT FRS and AT HFRS codecs, while Table 5.2 shows
the results of the Bjontegaard metric for the three adaptive transform modes against the DCT.
Figure 5.4 – Container sequence RD performance for the DCT, AT HRS, AT FRS and AT HFRS codecs.
Table 5.2 – Container sequence average PSNR improvements and average bitrate savings for each AT mode
against the DCT.
Benchmark Average PSNR improvement
(dB) Average bitrate saving (%)
AT HRS versus DCT 0.54 6.5
AT FRS versus DCT 0.54 6.5
AT HFRS versus DCT 0.71 8.4
From the analysis of Figure 5.4 and Table 5.2, it is possible to conclude the following:
First, it is possible to observe that all adaptive transform modes (AT HRS, AT FRS and AT HFRS)
bring performance improvements in comparison to the codec only using the DCT. As expected from the
observation of Figure 5.3, these improvements are more noticeable for lower bitrates, approaching the
DCT RD curve for higher bitrates. For the entire bitrate range, the AT HFRS is always the best
performing codec in terms of RD performance, as proven by the average PSNR improvement (0.71
dB) and the average bitrate saving (8.4%) when compared to the DCT.
It can also be noted, by visual inspection, that the slight performance improvement offered by the
MKLT FRS in relation to the MKLT HRS does not materialize in any noticeable coding gain when
using the adaptive transform. In this way, both the AT HRS and the AT FRS have the same
average PSNR improvement (0.54 dB) and average bitrate saving (6.5%) in comparison to the DCT.
To better understand how these performance gains are achieved, the transform selection made by the adaptive
transform is now analyzed in detail. For this effect, Table 5.3 shows the percentage of inter-coded TUs for each
QP and TU block size and Table 5.4 shows the percentage of TUs coded with each available transform for each
AT codec, QP and TU block size, all this for the Container sequence.
Table 5.3 – Container sequence percentage of inter-coded TUs for each QP and TU block size.
QP TU sizes
4×4 8×8 16×16 32×32
16 54% 36% 9% 2%
22 44% 38% 14% 4%
27 33% 35% 20% 13%
32 20% 28% 23% 30%
37 10% 18% 21% 51%
Starting with the analysis of Table 5.3, it is possible to observe that, for lower QP values (thus higher bitrate
values), the HEVC codec tends to select a higher percentage of smaller TU sizes (54% of the TUs having size
4×4, 36% size 8×8, 9% size 16×16 and only 2% size 32×32), since these blocks can achieve better performance
in a RD optimization sense by offering a better coding efficiency. By increasing the QP values (thus reducing
the bitrate), the TU partition becomes more balanced in terms of the percentage of block sizes selected, with a
gradual increase of the larger blocks selection (and a consequent reduction of the smaller blocks selection). For
the higher QP values (thus lower bitrates), the selection pattern observed for lower QPs is completely reversed
with the larger TU sizes selected for the majority of the cases (10% of the TUs having size 4×4, 18% size 8×8,
21% size 16×16 and 51% size 32×32). Although the TU partitioning pattern depends greatly on the video
sequence motion activity and spatial details, it can be said that this trend, i.e., the reduction of the selection of the
smaller blocks and the increment of the use of the larger blocks with a QP value increase, is observed for all the
studied cases.
Focusing now on the results shown in Table 5.4, the following conclusions may be drawn:
For all TU sizes and adaptive transform codecs, an increase of the QP value results into an increase of
the percentage of TUs coded with the MKLTs. This increase is more noticeable for the larger TU block
sizes, e.g. the MKLT HRS used in the AT HRS is selected for only 26% of the 32×32 TUs for a QP of
16, but, for a QP of 37, it is selected 95% of the times for the same TU size.
For smaller block sizes (4×4 and 8×8 TUs), the choice between the DCT and the MKLT in the AT HRS
and the AT FRS codecs is fairly balanced, with the maximum difference occurring for the 8×8 TUs and
QP of 16 in the AT HRS case, where the DCT is selected 69% of the times. For the AT HFRS case, the
MKLTs are selected, on average, for 64% of the 4×4 TUs and for 61% of the 8×8 TUs.
For larger block sizes (16×16 and 32×32 TUs), the disparity between the DCT and the MKLTs
selection is, in general, very high. For the lower QP values (16 and 22), the DCT is selected in the
majority of the cases for all the available AT codecs, on average, 68% of the times. It has to be noted
that, for these QP values, these larger blocks are not used very often (as referred before). For higher QP
values (27, 32 and 37), the MKLTs become the most used transforms for all the AT codecs, being
selected, on average, 83% of the times.
Comparing the AT HRS with the AT FRS, it is possible to conclude that the percentage of TUs coded
with the MKLT HRS is practically the same as the percentage of TUs coded with the MKLT FRS. On
the other hand, from the AT HFRS results, it is possible to observe that the MKLT FRS is selected
more times than the MKLT HRS. This was expected since the decision module selects the MKLT FRS
for the cases where both MKLTs' bitstreams have the same number of bits. With this option, the author
intends to use the novel shift and rotation parameters in the maximum number of opportunities
possible, to evaluate the performance changes introduced by their utilization. If the complexity of the
developed solution were the major requirement, instead of its coding performance, clearly the MKLT
HRS should be the one selected in these cases.
It can also be noted that the use of both MKLTs in the same codec increases the MKLT percentage use
in comparison with the situation where they are used independently. This happens for all the studied
cases.
These results show that, for a CIF video sequence with low motion activity and few spatial details like the
Container sequence, the best performing adaptive transform (AT HFRS) can bring coding improvements over
the DCT of 0.71 dB in terms of objective prediction error quality and of 8.4% in terms of bitrate savings. They
also show that the use of the FRS alone cannot bring significant performance gains to the adaptive transform
(i.e. relative to the HRS). However, when used in an adaptive transform combining both available
shift and rotation parameter sets, it can provide a considerable performance improvement in comparison to the
adaptive transform with only the HRS (as used in [15]).
Table 5.4 – Container sequence percentage of TUs coded with the available transforms for each AT codec, QP
and TU block size.
Codec        Selected transform      4×4     8×8     16×16   32×32
QP = 16
AT HRS DCT 53% 69% 86% 74%
MKLT HRS 47% 31% 14% 26%
AT FRS DCT 52% 67% 85% 74%
MKLT FRS 48% 33% 15% 26%
AT HFRS
DCT 38% 54% 78% 70%
MKLT HRS 27% 21% 10% 14%
MKLT FRS 35% 25% 12% 15%
QP = 22
AT HRS DCT 50% 60% 68% 64%
MKLT HRS 50% 40% 32% 36%
AT FRS DCT 50% 59% 67% 64%
MKLT FRS 50% 41% 33% 36%
AT HFRS
DCT 36% 45% 57% 59%
MKLT HRS 26% 25% 20% 18%
MKLT FRS 38% 30% 23% 23%
QP = 27
AT HRS DCT 49% 52% 49% 37%
MKLT HRS 51% 48% 51% 63%
AT FRS DCT 49% 51% 47% 36%
MKLT FRS 51% 49% 53% 64%
AT HFRS
DCT 36% 37% 37% 30%
MKLT HRS 25% 28% 29% 31%
MKLT FRS 40% 35% 34% 39%
QP = 32
AT HRS DCT 47% 46% 34% 15%
MKLT HRS 53% 54% 66% 85%
AT FRS DCT 47% 46% 34% 14%
MKLT FRS 53% 54% 66% 86%
AT HFRS
DCT 35% 32% 24% 10%
MKLT HRS 23% 28% 31% 38%
MKLT FRS 42% 40% 45% 52%
QP = 37
AT HRS DCT 49% 42% 25% 5%
MKLT HRS 51% 58% 75% 95%
AT FRS DCT 48% 41% 25% 5%
MKLT FRS 52% 59% 75% 95%
AT HFRS
DCT 36% 29% 16% 2%
MKLT HRS 22% 27% 30% 28%
MKLT FRS 42% 44% 54% 70%
Foreman video sequence
After the presentation and analysis of the performance results for the Container sequence, Figure 5.5 shows the
obtained RD performance for the DCT, the MKLT HRS and the MKLT FRS transforms for the Foreman video
sequence.
Figure 5.5 – Foreman sequence RD performance for the DCT, MKLT HRS and MKLT FRS.
From Figure 5.5, it is possible to conclude:
First, the DCT clearly outperforms both MKLTs when used for all the video sequence TUs, thus always
providing better objective quality for the same bitrate. This does not mean that the MKLTs are not
useful at all in a more adaptive coding solution, as there can be a number of TUs which might be coded
more efficiently using a MKLT than a DCT; naturally, this will require a more complex, adaptive
transform.
It is also possible to observe that the use of the extended range of shift and rotation parameters (FRS)
can provide slightly better RD performance when compared to the HRS approach used in [15]. It
remains to be seen if this RD performance improvement is also reflected in the adaptive transform
performance. Moreover, since these RD performance gains are rather small, the associated complexity
increase may not be worthwhile.
To evaluate the proposed adaptive transform performance for the Foreman sequence, Figure 5.6 shows the RD
performance for the DCT and the three previously defined adaptive transform modes. Table 5.5 shows the
Bjontegaard metric results for the same adaptive transform modes versus the DCT.
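Throughout this chapter, the Bjontegaard metric is obtained by fitting a third-order polynomial to each RD curve (PSNR as a function of the logarithm of the bitrate) and averaging the gap between the two fits over the overlapping bitrate range. The sketch below illustrates the BD-PSNR side of this computation; it is a NumPy illustration for reference, not the MATLAB code used in the thesis:

```python
import numpy as np

def bd_psnr(rate_ref, psnr_ref, rate_test, psnr_test):
    """Average PSNR difference (test minus reference) following
    Bjontegaard's method: fit PSNR as a cubic polynomial of
    log10(rate), then integrate over the overlapping rate range."""
    lr_ref, lr_test = np.log10(rate_ref), np.log10(rate_test)
    p_ref = np.polyfit(lr_ref, psnr_ref, 3)
    p_test = np.polyfit(lr_test, psnr_test, 3)
    lo = max(lr_ref.min(), lr_test.min())
    hi = min(lr_ref.max(), lr_test.max())
    # Integrate each fitted polynomial over [lo, hi], then take the mean gap
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    return (int_test - int_ref) / (hi - lo)
```

This fitting step is also the reason, noted later in this chapter, why the averaged figures can exceed what the raw RD curves visually suggest at very low bitrates.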
Figure 5.6 – Foreman sequence RD performance for the DCT, AT HRS, AT FRS and AT HFRS.
Table 5.5 – Foreman sequence average PSNR improvements and average bitrate savings for each AT mode
against the DCT.
Benchmark            Average PSNR improvement (dB)   Average bitrate saving (%)
AT HRS vs. DCT       0.31                            4.6
AT FRS vs. DCT       0.32                            4.6
AT HFRS vs. DCT      0.44                            6.4
From Figure 5.6 and Table 5.5, the following analysis can be made:
First, it is possible to conclude that all three adaptive transform modes achieve better RD
performance than the DCT alone over the whole tested bitrate range, with average PSNR improvements
varying from 0.31 to 0.44 dB and average bitrate savings from 4.6 to 6.4%.
It can also be confirmed that the use of the FRS mode brings performance improvements to the adaptive
transform in comparison to the HRS mode, but these improvements are almost meaningless in the video
coding context (only 0.01 dB of average PSNR improvement). On the other hand, the use of the FRS
mode in combination with the HRS mode can bring a more significant improvement in relation to the
use of the HRS mode alone, despite the need for one extra bit of transform signalling (0.13 dB of
average PSNR improvement and 1.8% of average bitrate reduction). Again, this implies an encoding
and decoding complexity increase that needs to be assessed against the RD performance gains.
Following the RD performance results of the Foreman sequence, Table 5.6 shows the percentage of inter-coded
TUs for each QP and TU block size; moreover, Table 5.7 shows the percentage of TUs coded with the available
transforms for each AT codec, QP and TU block size, for the Foreman video sequence.
Table 5.6 – Foreman sequence percentage of inter-coded TUs for each QP and TU block size.
QP      4×4     8×8     16×16   32×32
16      69%     27%      4%      0%
22      55%     35%      8%      1%
27      34%     45%     17%      4%
32      15%     42%     30%     13%
37       4%     34%     34%     29%
Comparing the results in Table 5.6 with those obtained with the Container sequence, it is possible to see that the
sequence Foreman tends to use a higher percentage of smaller TU blocks. This was expected since the Foreman
sequence has higher motion activity and more spatial details than the Container sequence. However, it can still
be verified that the use of larger TU blocks grows with the QP value. In this particular case, only the 4×4 TUs
show a decreasing use as the QP rises. From Table 5.6, it is also possible to conclude that, for QP
values of 16 and 22, the use of 32×32 TUs is almost nonexistent.
Table 5.7 shows very similar results to those obtained for the sequence Container. Still, the following
conclusions may be taken:
Once again, the percentage of TUs coded with the MKLTs increases with the QP value. This is
especially noticeable for the larger TUs, as for the Container sequence, but also for the 8×8 TUs which
show a similar behaviour to the larger blocks in this case (e.g. for the AT HRS codec, the MKLT HRS
is selected only for 28% of 8×8 TUs for a QP of 16, while it is selected for 66% of the same TUs for a
QP of 37).
In this case, the smaller TUs (4×4 and 8×8) carry more weight, since there are more TUs of these
sizes than in the previous sequence. The MKLTs are selected, on average, for 54% of the 4×4 TUs,
while the 8×8 TUs (the most used TU size for this sequence) are coded with the MKLTs, on average,
38% of the time.
The larger TUs (16×16 and 32×32), which are rarely used for the first three QP values (16, 22 and 27),
show once again a large disparity between the first and the last QP (i.e. 16 and 37). For 16×16 TUs,
49% of the blocks are coded with the MKLTs, on average, while for the 32×32 TUs this number rises
to 58%.
Once again, the MKLT FRS is not selected more often than the MKLT HRS when the two operate
independently in the AT FRS and AT HRS codecs. However, as for the Container sequence, the
MKLT FRS is, in general, selected more often than the MKLT HRS in the AT HFRS.
These results show that, for a CIF video sequence with high motion activity and medium spatial
detail complexity such as the Foreman video sequence, the best adaptive transform (MKLT with HRS and
FRS) can bring performance improvements over the DCT of 0.44 dB in terms of objective reconstructed
prediction error quality and 6.4% in terms of bitrate savings (always on average). Once again, it was verified that
the use of FRS can only bring coding gains when combined with the HRS mode.
Table 5.7 – Foreman sequence percentage of TUs coded with the available transforms for each AT codec, QP
and TU block size.
Codec        Selected transform      4×4     8×8     16×16   32×32
QP = 16
AT HRS DCT 56% 72% 88% 88%
MKLT HRS 44% 28% 12% 12%
AT FRS DCT 55% 71% 88% 87%
MKLT FRS 45% 29% 12% 13%
AT HFRS
DCT 41% 59% 82% 85%
MKLT HRS 27% 19% 8% 7%
MKLT FRS 32% 22% 10% 8%
QP = 22
AT HRS DCT 53% 63% 70% 63%
MKLT HRS 47% 37% 30% 37%
AT FRS DCT 53% 62% 69% 64%
MKLT FRS 47% 38% 31% 36%
AT HFRS
DCT 39% 49% 59% 56%
MKLT HRS 25% 23% 18% 18%
MKLT FRS 36% 28% 23% 25%
QP = 27
AT HRS DCT 52% 54% 53% 37%
MKLT HRS 48% 46% 47% 63%
AT FRS DCT 51% 54% 52% 37%
MKLT FRS 49% 46% 48% 63%
AT HFRS
DCT 38% 41% 42% 30%
MKLT HRS 23% 25% 25% 30%
MKLT FRS 39% 34% 33% 40%
QP = 32
AT HRS DCT 49% 44% 39% 21%
MKLT HRS 51% 56% 61% 79%
AT FRS DCT 48% 44% 37% 21%
MKLT FRS 52% 56% 63% 79%
AT HFRS
DCT 37% 33% 29% 16%
MKLT HRS 20% 24% 28% 36%
MKLT FRS 43% 43% 43% 48%
QP = 37
AT HRS DCT 44% 34% 21% 10%
MKLT HRS 56% 66% 79% 90%
AT FRS DCT 43% 35% 21% 9%
MKLT FRS 57% 65% 79% 91%
AT HFRS
DCT 34% 25% 15% 7%
MKLT HRS 15% 23% 29% 35%
MKLT FRS 51% 52% 56% 58%
Mobile video sequence
After the analysis of the results for the Foreman sequence, the results obtained for the Mobile video sequence are
presented next. Figure 5.7 shows the RD performance of the individual transforms later used by the
adaptive transform.
Figure 5.7 – Mobile sequence RD performance for the DCT, MKLT HRS and MKLT FRS.
From the results in Figure 5.7, the following conclusions can be taken:
Once again, the DCT shows a better RD performance than the two available MKLTs, particularly for
the higher bitrates.
The use of the FRS mode brings, once again, marginal RD performance benefits in comparison to the
HRS mode. Taking into consideration the two previously studied cases (the Container and Foreman
sequences), it is expected that this improvement will not be reflected in the RD performance of the
AT FRS versus the AT HRS solution. However, again from the previous results, it is predictable that
the use of both MKLTs in the AT HFRS will bring some RD performance improvement over the other
two adaptive transform modes.
Next, the RD performance for the DCT and the three available adaptive transform modes is presented in Figure
5.8, followed by the corresponding Bjontegaard metric results for these three adaptive transforms versus the
DCT in Table 5.8.
Figure 5.8 – Mobile sequence RD performance for the DCT, AT HRS, AT FRS and AT HFRS.
Table 5.8 – Mobile sequence average PSNR improvements and average bitrate savings for each AT mode
against the DCT.
Benchmark            Average PSNR improvement (dB)   Average bitrate saving (%)
AT HRS vs. DCT       0.47                            4.0
AT FRS vs. DCT       0.49                            4.2
AT HFRS vs. DCT      0.68                            5.8
The obtained results lead to the following conclusions:
As for the previous video sequences, all the adaptive transform modes bring performance
improvements when compared to the DCT used alone. In this case, these improvements range from
0.47 to 0.68 dB in average PSNR improvement and from 4.0 to 5.8% in average bitrate
reduction.
As expected, both prediction error estimation modes (FRS and HRS) provide similar RD performance
when used independently, with the FRS mode performing slightly better (0.02 dB of average PSNR
improvement and 0.2% of average bitrate reduction). The combination of these two prediction error
estimation modes provides, once again, the best solution in terms of RD performance (with a 0.21 dB
improvement of the average PSNR and 1.8% of average bitrate saving in comparison to the adaptive
transform with the HRS mode).
Consider now the percentage of inter-coded TUs for each QP and TU block size for the Mobile sequence in
Table 5.9. The percentage of TUs coded with the available transforms for each AT codec, QP and TU block size
for the same sequence are presented in Table 5.10.
Table 5.9 – Mobile sequence percentage of inter-coded TUs for each QP and TU block size.
QP      4×4     8×8     16×16   32×32
16      82%     16%      2%      0%
22      77%     20%      2%      0%
27      69%     27%      3%      0%
32      53%     39%      7%      1%
37      32%     45%     19%      4%
As referred in Section 5.1.1, the Mobile sequence has a large amount of spatial detail. Thus, as expected, the TU
partitioning for this sequence is largely dominated by smaller TU blocks, with 4×4 and 8×8 sizes. Despite this,
the 16×16 TUs still see significant use for a QP of 37 (with 19% of the blocks). On the other hand, the TUs
with size 32×32 are almost nonexistent, with a maximum of 4% of the blocks for a QP of 37. From Table
5.10, the following conclusions may be taken:
Like the two previous cases (Container and Foreman sequences), the percentage of selected MKLTs
for every AT codec and TU block size grows with the QP increase. Once again, this growth is more
evident for the larger sized TUs.
For the smaller TUs, the selection between the DCT and the MKLTs is very balanced, with the MKLTs
selected, on average, for 54% and 47% of the coded 4×4 and 8×8 TUs, respectively.
For the larger TUs, it is only important to analyze the results for the 16×16 TUs for a QP of 37, since
all the other results are obtained for an insignificant number of TUs. Thus, the MKLTs are selected for
the 16×16 TU coding, on average, 61% of the times, for a QP of 37.
In terms of the comparison between the MKLT HRS and the MKLT FRS, both transforms show similar
selection percentages when operated individually. However, once again, for the AT HFRS codec, the
MKLT FRS is selected more often than the MKLT HRS.
The previous results show that, for a video sequence with medium amount of motion activity and high spatial
detail complexity such as the Mobile video sequence, the proposed coding solution can bring an average PSNR
improvement of 0.68 dB and an average bitrate saving of 5.8% for the reconstructed prediction error over the
DCT. With all results of the three selected CIF resolution video sequences presented and analyzed, the following
conclusions may be taken regarding the adopted coding solution for this low resolution:
The codec using the adaptive transform with HRS can achieve an average objective prediction error
quality improvement of 0.44 dB and average bitrate savings of 5% over the codec only using the DCT.
These results cannot be directly compared to those in [15], since for this case the PSNR is only
measured for the prediction error and the adaptive transform is not fully integrated in a video coding
standard. Still, for the Mobile sequence, which was also coded in [15], the results obtained with the
HEVC data are inferior to those obtained in [15] with the H.264/AVC codec, with 4% of bitrate
savings obtained in the test made with the adopted HEVC based coding solution against the 20%
obtained in [15].
From the performance results for the CIF resolution sequences, it is possible to conclude that the use of
a MKLT with a FRS does not bring any significant improvement to the adaptive transform
performance, while increasing the complexity.
The third adaptive transform mode, the AT HFRS, which combines the HRS and FRS modes, can
achieve an average prediction error PSNR improvement of 0.61 dB and average bitrate savings of 7%
over the DCT. With these results, it is possible to state that the introduction of a FRS mode can
improve the adaptive transform performance when both the available shift and rotation
parameter sets are used.
Finally, it has to be noted that the TU partitioning used in all the performed tests is decided by the HEVC
encoder in a RD optimization sense. In this way, the decision is made based on the performance obtained with
the HEVC DCT solution. Thus, it remains to be seen if the effective integration of the adaptive transform in the
HEVC codec would cause a different TU partitioning influenced by the use of the MKLT.
Table 5.10 – Mobile sequence percentage of TUs coded with the available transforms for each AT codec, QP
and TU block size.
Codec        Selected transform      4×4     8×8     16×16   32×32
QP = 16
AT HRS DCT 53% 71% 90% 99%
MKLT HRS 47% 29% 10% 1%
AT FRS DCT 53% 70% 89% 99%
MKLT FRS 47% 30% 11% 1%
AT HFRS
DCT 38% 58% 85% 98%
MKLT HRS 28% 19% 7% 1%
MKLT FRS 34% 23% 8% 1%
QP = 22
AT HRS DCT 52% 63% 80% 95%
MKLT HRS 48% 37% 20% 5%
AT FRS DCT 51% 62% 79% 94%
MKLT FRS 49% 38% 21% 6%
AT HFRS
DCT 37% 50% 73% 93%
MKLT HRS 27% 22% 11% 3%
MKLT FRS 36% 28% 16% 4%
QP = 27
AT HRS DCT 51% 57% 69% 81%
MKLT HRS 49% 43% 31% 19%
AT FRS DCT 50% 57% 68% 80%
MKLT FRS 50% 43% 32% 20%
AT HFRS
DCT 36% 44% 59% 74%
MKLT HRS 26% 24% 18% 11%
MKLT FRS 38% 32% 24% 14%
QP = 32
AT HRS DCT 50% 52% 58% 56%
MKLT HRS 50% 48% 42% 44%
AT FRS DCT 50% 52% 56% 55%
MKLT FRS 50% 48% 44% 45%
AT HFRS
DCT 37% 38% 45% 44%
MKLT HRS 24% 25% 23% 26%
MKLT FRS 39% 36% 31% 30%
QP = 37
AT HRS DCT 49% 45% 45% 29%
MKLT HRS 51% 55% 55% 71%
AT FRS DCT 48% 45% 42% 28%
MKLT FRS 52% 55% 58% 72%
AT HFRS
DCT 36% 32% 31% 19%
MKLT HRS 21% 27% 29% 37%
MKLT FRS 43% 41% 40% 44%
5.2.2. Performance for HD Resolution Video Sequences
After the presentation and analysis of the performance results obtained for the CIF resolution video sequences,
the performance for the selected HD resolution video sequence (the Kimono sequence) is now analyzed. To start
with, Figure 5.9 presents the RD performance of the Kimono sequence for the DCT, MKLT HRS and MKLT FRS
individual transform solutions.
Figure 5.9 – Kimono sequence RD performance for the DCT, MKLT HRS and MKLT FRS.
From Figure 5.9, the following analysis can be made:
In comparison to the DCT RD curve, both MKLTs clearly offer worse RD performance. This
performance difference is more noticeable for the bitrates defined by a QP of 22 and is attenuated for
the lower bitrates (i.e. QP values of 32 and 37). In comparison to the CIF resolution results, the
performance losses shown here appear to be larger.
Comparing the RD performance of both MKLTs, it is possible to observe that the MKLT FRS achieves
a better coding performance, especially for the higher bitrates. In comparison to the CIF resolution
results, the difference between the MKLT FRS and the MKLT HRS performance seems to be
significantly higher. It remains to be seen if this difference can bring coding improvements to the AT
FRS over the AT HRS solutions, something that was not achieved with the CIF resolution sequences.
Following the method used to present the CIF resolution results, Figure 5.10 shows the RD performance for the
DCT, AT HRS, AT FRS and AT HFRS codecs with the Kimono sequence while Table 5.11 shows the
Bjontegaard metric results of the three adaptive transform modes against the DCT for the same HD sequence.
Figure 5.10 – Kimono sequence RD performance for the DCT, AT HRS, AT FRS and AT HFRS.
Table 5.11 – Kimono sequence average PSNR improvements and average bitrate savings for each AT mode
against the DCT.
Benchmark            Average PSNR improvement (dB)   Average bitrate saving (%)
AT HRS vs. DCT       1.67                            14.7
AT FRS vs. DCT       1.12                            11.9
AT HFRS vs. DCT      1.89                            16.0
From Figure 5.10 and Table 5.11, it is possible to derive the following conclusions:
In comparison to the codec only making use of the DCT, the AT codecs only achieve RD performance
gains for the lower bitrates (approximately below 20 Mbit/s). For the rest of the bitrate range, the AT
codecs have a very similar behaviour to the DCT based codec, being outperformed between the QP
values of 27 and 22, but achieving small coding gains for the higher bitrates. However, in terms of the
Bjontegaard metric, this is the sequence achieving the best average results in terms of prediction error
PSNR improvement (1.89 dB for the AT HFRS) and bitrate savings (16.0% for the AT HFRS) over the
DCT. Although the RD curves do not seem to show these substantial gains, they result from the fitting
process performed by the Bjontegaard metric computation for the very low bitrates.
Once again, the AT HFRS is the best adaptive transform solution available. However, in this case, the
AT HRS outperforms the AT FRS (with 0.55 dB of average PSNR improvement and 2.8% of average
bitrate savings). This comes as a surprise, notably taking into account the results in Figure 5.9;
however, as referred before, the presented MKLTs RD curves represent the average PSNR using the
same transform for all the inter-coded TUs. Clearly, the MKLT HRS, despite having a worse behaviour
in general when compared to the MKLT FRS, can provide better coding efficiency for some particular
TUs, and this is what determines the adaptive transform performance.
The Kimono sequence percentage of inter-coded TUs for each QP and TU block size results are presented in
Table 5.12, while the percentage of TUs coded with the available transforms for each AT codec, QP and TU
block size for the same sequence are presented in Table 5.13.
Table 5.12 – Kimono sequence percentage of inter-coded TUs for each QP and TU block size.
QP      4×4     8×8     16×16   32×32
16      49%     35%     13%      3%
22       5%     38%     39%     18%
27       2%     30%     44%     24%
32       0%     22%     47%     30%
37       0%     15%     45%     40%
From Table 5.12, it is possible to observe that, for this particular HD resolution video sequence, the percentage
of 4×4 TUs used for QP values of 22, 27, 32 and 37 is almost insignificant. On the other hand, the percentage of
large TUs (16×16 and 32×32) used for these QPs is considerably higher than the corresponding average results
for the CIF resolution video sequences. This shows that, for this type of sequence, the HEVC takes advantage
of the larger homogeneous areas existing in HD video sequences, partitioning each frame using larger coding
blocks.
From Table 5.13, the following conclusions can be taken:
Like for the CIF resolution sequences, the selection of the MKLTs increases with the QP for all TU
block sizes. In this case, it is possible to observe that the MKLT becomes the most selected transform
from the QP of 27 until the last QP value (37).
Regarding the 4×4 TUs, it is only important to analyze the results obtained for a QP of 16, since this
is the only QP value where this TU size shows a significant utilization percentage. In this case, the
DCT is selected in 57%, 56% and 42% of the cases for the AT HRS, AT FRS and AT HFRS,
respectively. For the 8×8 TUs, the DCT is selected, on average, 67% of the cases, for all the codecs in
the first two QP values (16 and 22); however, for the remaining QPs (27, 32 and 37), an average of
76% of the cases are coded with the MKLT.
Once again, the larger blocks seem to be better coded with the DCT for lower QPs (with 95% of the
16×16 and 32×32 TUs being coded with this transform for a QP of 16); however, for the higher QPs,
this pattern is totally reversed, with the majority of the larger TUs being coded with the MKLT (with
92% of the 16×16 and 32×32 TUs being coded with the MKLTs for a QP of 37).
The RD performance differences between the AT HRS and the AT FRS noted before (occurring for
the lower bitrates, i.e. higher QP values) seem to be due to the slightly higher MKLT HRS use in the
AT HRS codec in relation to the MKLT FRS use in the AT FRS codec for the 8×8 and 16×16 TUs
and for QP values of 32 and 37.
In conclusion, for the HD resolution video sequence Kimono, which has motion activity and spatial detail
characteristics similar to those of the CIF resolution sequence offering the best adaptive transform RD
performance over the DCT (the Container sequence), the proposed adaptive transform can achieve an average
prediction error PSNR improvement of 1.89 dB and an average bitrate saving of 16.0% over the DCT using the
mode making use of the three available transforms (DCT, MKLT HRS and MKLT FRS). In this particular case, the performance gain of
this last codec in relation to the AT codec using only the HRS mode is not as significant as observed for the CIF
resolution sequences. In this way, it remains to be seen if the complexity increase caused by the computation of
the MKLT basis functions with a set of 405 prediction error blocks, instead of the 75 blocks used with the HRS,
is worthwhile in relation to the performance gains achieved. This issue is even more relevant for the HD
sequences, as they tend to use larger TUs which require a larger computational effort.
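The 75 and 405 block counts are consistent with, for example, sampling the shift in each spatial direction in steps of 0.25 pixel and the rotation in steps of 0.5°; the grids below are an assumption used only to illustrate how the FRS multiplies the number of estimated prediction error blocks (the exact parameter steps are defined in the thesis and in [15], not here):

```python
from itertools import product

def parameter_set(max_shift, max_angle, shift_step=0.25, angle_step=0.5):
    """Enumerate (dx, dy, theta) triples over symmetric grids.

    The step sizes are assumed values chosen to reproduce the 75 (HRS)
    and 405 (FRS) estimated block counts quoted in the text."""
    n_shift = int(round(2 * max_shift / shift_step)) + 1
    n_angle = int(round(2 * max_angle / angle_step)) + 1
    shifts = [-max_shift + i * shift_step for i in range(n_shift)]
    angles = [-max_angle + i * angle_step for i in range(n_angle)]
    return list(product(shifts, shifts, angles))

hrs = parameter_set(max_shift=0.5, max_angle=0.5)  # 5 x 5 x 3 = 75 blocks
frs = parameter_set(max_shift=1.0, max_angle=1.0)  # 9 x 9 x 5 = 405 blocks
```

Under this assumed grid, the FRS uses 405/75 = 5.4 times more estimated blocks than the HRS, matching the complexity factor quoted in the summary below.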
Table 5.13 – Kimono sequence percentage of TUs coded with the available transforms for each AT codec, QP
and TU block size.
Codec        Selected transform      4×4     8×8     16×16   32×32
QP = 16
AT HRS DCT 57% 76% 94% 98%
MKLT HRS 43% 24% 6% 2%
AT FRS DCT 56% 75% 94% 97%
MKLT FRS 44% 25% 6% 3%
AT HFRS
DCT 42% 65% 90% 96%
MKLT HRS 25% 16% 5% 2%
MKLT FRS 32% 19% 5% 2%
QP = 22
AT HRS DCT 57% 67% 83% 86%
MKLT HRS 43% 33% 17% 14%
AT FRS DCT 55% 66% 82% 86%
MKLT FRS 45% 34% 18% 14%
AT HFRS
DCT 44% 55% 76% 82%
MKLT HRS 19% 19% 11% 8%
MKLT FRS 37% 26% 13% 9%
QP = 27
AT HRS DCT 49% 41% 51% 55%
MKLT HRS 51% 59% 49% 45%
AT FRS DCT 49% 42% 51% 55%
MKLT FRS 51% 58% 49% 45%
AT HFRS
DCT 39% 32% 43% 52%
MKLT HRS 17% 20% 25% 22%
MKLT FRS 43% 47% 32% 27%
QP = 32
AT HRS DCT 47% 24% 23% 22%
MKLT HRS 53% 76% 77% 78%
AT FRS DCT 43% 25% 25% 22%
MKLT FRS 57% 75% 75% 78%
AT HFRS
DCT 35% 18% 19% 20%
MKLT HRS 13% 15% 29% 34%
MKLT FRS 52% 66% 52% 46%
QP = 37
AT HRS DCT 39% 13% 10% 6%
MKLT HRS 61% 87% 90% 94%
AT FRS DCT 39% 15% 13% 6%
MKLT FRS 61% 85% 87% 94%
AT HFRS
DCT 30% 10% 8% 5%
MKLT HRS 9% 11% 27% 35%
MKLT FRS 61% 79% 66% 60%
5.3. Summary
In this chapter, the proposed adaptive transform has been tested to evaluate its performance against the DCT. To
do this, three different adaptive transforms were used besides the usual DCT: one using a Half Range shift and
rotation parameters Set (HRS) to compute the MKLT basis functions (with maximum δ = 0.5 pixel and θ = 0.5°
as used in [15]), another using a Full Range shift and rotation parameters Set (FRS) to compute the MKLT basis
functions (introducing a maximum δ = 1 pixel and θ = 1°), and a final adaptive transform that can use both the
HRS and FRS modes to compute the MKLT basis functions.
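In all three modes, the MKLT basis functions are computed as the KLT of the set of estimated prediction error blocks generated with the corresponding shift and rotation parameters. A minimal sketch of this step, assuming each estimated block is vectorized and the basis is taken as the eigenvectors of the sample covariance matrix (function names and the exact normalization are illustrative; the thesis implementation is in MATLAB):

```python
import numpy as np

def mklt_basis(estimated_blocks):
    """Compute a KLT basis from N estimated prediction error blocks.

    estimated_blocks: array of shape (N, B, B) holding, e.g., the 75 (HRS)
    or 405 (FRS) shifted/rotated versions of a BxB prediction error block.
    Returns a (B*B, B*B) matrix whose rows are the basis functions,
    ordered by decreasing eigenvalue (i.e. decreasing ensemble energy)."""
    n, b, _ = estimated_blocks.shape
    x = estimated_blocks.reshape(n, b * b).astype(float)
    x -= x.mean(axis=0)                      # zero-mean the ensemble
    cov = (x.T @ x) / n                      # sample covariance, (B*B, B*B)
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]        # strongest components first
    return eigvecs[:, order].T

# Transforming a block: coefficients = basis @ block.reshape(-1)
```

Because the basis depends on the (shifted and rotated) prediction, the decoder can recompute the same basis without any transmitted basis coefficients, which is what makes this adaptive transform viable.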
The performance tests were made using two types of video sequence resolutions: CIF and HD. The first type was
used to test the proposed adaptive transform using similar sequences to those tested in [15], which used the
H.264/AVC codec. As the HEVC codec is being developed with the high and ultra high definition video
contents in mind, one HD resolution video sequence was tested, to assess its performance benefits in comparison
to the lower resolution video sequences.
The obtained results have shown that the proposed adaptive transform using a combination of the HRS and FRS
modes can achieve a 0.61 dB objective prediction error quality gain and 7% bitrate savings for the CIF sequences,
always on average and over the DCT. For the other two adaptive transforms, the average results are very similar,
with a prediction error PSNR improvement of 0.44 dB and bitrate savings of 5% over the DCT. These results
show that the use of an additional FRS mode does not bring any compression improvement when used alone,
but can bring approximately 0.2 dB of average PSNR improvement and 2% of average bitrate savings when used
in combination with the HRS mode.
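The 0.61 dB and roughly 7% figures quoted above are the plain averages of the per-sequence AT HFRS results reported earlier in this chapter (0.71 dB / 8.4% for Container, 0.44 dB / 6.4% for Foreman, 0.68 dB / 5.8% for Mobile):

```python
# AT HFRS vs. DCT, per CIF sequence (Container, Foreman, Mobile)
psnr_gains = [0.71, 0.44, 0.68]   # dB
rate_savings = [8.4, 6.4, 5.8]    # %

avg_psnr = sum(psnr_gains) / len(psnr_gains)       # 0.61 dB
avg_rate = sum(rate_savings) / len(rate_savings)   # ~6.9%, quoted as 7%
```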
For the HD resolution video sequence, the obtained results revealed considerably higher coding gains than those
obtained for the CIF sequences, although these gains are only verified for the low bitrate values. The
adaptive transform using both the HRS and FRS modes was able to achieve 1.89 dB better prediction error
objective quality and a 16.0% bitrate saving in relation to the DCT, always on average. In this case, the adaptive
transform only using the HRS mode to compute the MKLT basis functions clearly outperformed the adaptive
transform using the FRS mode. Thus, the adopted coding solution with an adaptive transform as proposed
in [15] could achieve a prediction error PSNR improvement of 1.67 dB and bitrate savings of 14.7% over the
DCT, always on average. Since the use of a FRS mode introduces a significant complexity increase in the video
codec (as it uses 5.4 times more estimated prediction error blocks), the similarity between the results with only
the HRS mode and with both the HRS and FRS modes indicates that the use of the FRS may not be useful for HD
resolution video sequences.
Chapter 6
Conclusions and Future Work
This chapter concludes this Thesis report by presenting a brief summary of what was presented in each of the
previous chapters. Additionally, some conclusions are drawn regarding the objectives initially defined for this
Thesis. Finally, some ideas for future work are presented.
6.1. Summary and Conclusions
The first chapter of this report introduced the reader to the context in which this work is relevant, as well as to
the emerging problem that calls for a solution: the efficient compression of HD and UHD content. Besides this,
the objectives of this Thesis were also defined.
Chapter 2 introduced the basic principles and concepts about transform coding. Additionally, the most important
transforms in the signal processing context were reviewed, namely the DCT and the KLT used in the developed
solution.
Prior to the actual presentation of the adopted coding solution, the two main background technical elements were
introduced. First, the video coding solution proposed by Biswas et al. in [15] was described in detail,
with a natural focus on the proposed adaptive transform, as it serves as the basis for the adaptive transform
used in this Thesis. Then, the emerging HEVC standard was presented. This standard, still under development,
intends to become the next state-of-the-art video coding standard and targets halving the bitrate
currently needed by the H.264/AVC standard to code a video sequence at a given quality.
Chapter 4 presented the adopted coding solution. This presentation included a functional description of each of
its coding processes (encoder and decoder) and of the HEVC framework used to extract the data processed by
the HEVC codec. Besides these functional descriptions, the implementation details were also described,
focusing on the MATLAB script developed to implement the proposed adaptive transform.
Finally, Chapter 5 presented a performance evaluation of the adopted coding solution. This evaluation was
performed by coding three CIF sequences and one HD sequence with the adopted coding solution and
comparing the obtained RD results with those obtained using the popular DCT. With this, it was possible to
conclude that the adaptive transform can achieve encouraging bitrate savings over the DCT, particularly for the
tested HD sequence.
In summary, it can be said that the objectives defined in Chapter 1 were achieved. Thus, a recent advance related
to transform coding was studied, implemented and assessed in the context of the HEVC standard. Although the
integration of the studied adaptive transform in the HEVC standard was not fully accomplished (for the reasons
explained in Chapter 4), it was possible to extract the necessary data to simulate as much as possible a full
integration scenario. With this, a performance evaluation of a video coding solution including the adopted
adaptive transform was successfully made, showing positive results when compared to the currently used
transform coding tools.
6.2. Future Work
Clearly, the first improvement that can be made to the coding solution developed in this Thesis is related to the
full integration of the used adaptive transform in the HEVC codec. Future software releases of this codec are
expected to become more legible and better organized from a programmer's point of view. The
proposed adaptive transform should then be fully integrated in HEVC to allow a more complete and accurate
evaluation of the performance changes introduced by the adaptive transform. A full integration of the adaptive
transform in the HEVC codec would allow the following evaluation improvements regarding the work
developed in this Thesis:
Frame partitioning – By integrating the adaptive transform in the HEVC codec, it would be possible
for the encoder to perform the frame partitioning in a RD optimization sense using not only the DCT (as
done in this Thesis), but also the proposed MKLT.
Reference frame – All the reference frames used in the coding solution developed in this Thesis are
obtained from previous codings made with the HEVC codec. With a fully integrated model, these
reference frames would also reflect previous codings using the proposed adaptive transform.
Quantization and entropy coding – As mentioned in Chapter 4, the quantization and the entropy coder
used in the adopted coding solution are not the same as those currently present in the HEVC codec.
In this way, using the actual HEVC coding tools would allow a more accurate evaluation of the
performance results.
Other improvements – With the integration of the adaptive transform in the HEVC codec, it would be
possible to use other test conditions not used in this Thesis due to the necessary implementation
simplifications. For example, it would be possible to use B-frames and multiple reference frames.
If positive RD performance gains are obtained with the fully integrated coding solution proposed above,
then the next step should target the study of the computational complexity associated with this solution,
which was not considered in this Thesis. This would be important to evaluate the trade-off between the
additional complexity and the coding gains associated with the adaptive transform and to possibly develop
new algorithms allowing faster encoding and decoding.
Appendix A
Transforms in Available Image/Video
Coding Standards
All the available image and video coding standards make use of transform tools in their coding architecture. To
give an idea of the used transforms and their details, the available coding standards are briefly reviewed in the
following with particular emphasis on the transform related aspects. Besides the transform details, this appendix
also contains a brief review of the objectives, main features, technical improvements and performance of each
standard. The first two standards reviewed – JPEG and JPEG 2000 – are image coding standards; the following
standards – H.261, MPEG-1 Video, MPEG-2 Video, H.263, MPEG-4 Visual and H.264/AVC – are all video
coding standards.
A.1. JPEG Standard
The JPEG image coding standard was defined in 1992 by the Joint Photographic Experts Group (JPEG) [42]. It
is formally known as Recommendation ITU-T T.81 and ISO/IEC 10918-1 standard. This standard specifies two
classes of encoding and decoding processes: lossy and lossless. For this review, only the lossy class is
considered, since it is the only one using transform coding. This class is known as the JPEG Baseline Sequential
process and it is the most used JPEG coding solution.
A.1.1. Objectives
The objective of this standard is to define a generic compression standard for multilevel photographic images. Its
main requirements are:
Efficiency – It must be based on the most efficient compression techniques available, in order to use the
smallest possible amount of bits for a particular target quality.
Adjustable compression/quality – The level of compression must be adjustable, allowing a selectable
trade-off between number of bits used and image quality obtained.
Generic – It must be applicable to all kinds of multilevel photographic images, independently of their
resolution, aspect ratio, etc.
Low complexity – It must be implemented with reasonably low complexity, in order to allow its
implementation on a wide range of platforms and applications.
With these requirements, JPEG is designed to be used in a wide range of applications, e.g., digital photography,
color facsimile, medical and scientific images, etc.
A.1.2. Technical Approach and Architecture
The JPEG coding process adopted a DCT-based image coding architecture, which is presented in Figure A.1.
Figure A.1 – JPEG encoder architecture [42].
A short walkthrough of the encoding process is presented next:
1. Block splitting – The original image is divided into blocks of 8×8 samples. If the input data does not
represent an integer number of blocks, then the encoder must fill the incomplete blocks with some
dummy data.
2. Forward DCT – Each 8×8 block is then transformed using a 2-D forward DCT, resulting in a set of
8×8 (64) DCT coefficients.
3. Quantization – Each of the 64 coefficients is then quantized using a specific quantization matrix.
4. Entropy encoder – After quantization, the quantized DCT coefficients are arranged into a one-
dimensional zigzag sequence (see Figure A.2). Using this sequence ensures that the encoder will
encounter all non-zero DCT coefficients in the block as early as possible. Moreover, since this zigzag
ordering roughly corresponds to the coefficients' perceptual relevance, its usage guarantees that more
perceptually important coefficients are always transmitted before less perceptually important ones. The
next step is to create a (run, level) pair for each non-zero coefficient. The run is the number of
null DCT coefficients preceding the coefficient being coded in the zigzag sequence. The level is the
quantized amplitude of the coefficient to be coded. The run and the number of bits used to code the
level (size) are then encoded using Huffman tables and the level is encoded using a Variable Length
Integer (VLI) code. To better exploit the spatial correlation, the DC coefficient of each block is coded
as the difference with respect to the DC coefficient of the previous neighbor block.
Figure A.2 – Zigzag sequencing for the DCT coefficients within a block in JPEG [42].
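The zigzag scanning and (run, level) pairing described in step 4 can be sketched in a few lines of code. The following Python sketch is purely illustrative (the function names are choices made here, and the separate DC handling and the end-of-block symbol of the real JPEG entropy coder are omitted):

```python
def zigzag_order(n=8):
    """Return the (row, col) coordinates of an n x n block in zigzag
    scan order: anti-diagonals of increasing index, with the scan
    direction alternating on each diagonal."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def run_level_pairs(block):
    """Convert a quantized block into (run, level) pairs, where run
    counts the zero coefficients preceding each non-zero coefficient
    in zigzag order and level is the quantized amplitude."""
    pairs, run = [], 0
    for r, c in zigzag_order(len(block)):
        level = block[r][c]
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))
            run = 0
    return pairs
```

For instance, a block whose only non-zero quantized coefficients are 12 at (0,0), -3 at (0,1) and 5 at (1,1) would yield the pairs (0,12), (0,-3) and (2,5), the last run of 2 reflecting the two zeros crossed at (1,0) and (2,0).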
The decoding process is essentially the inverse of the encoding process. The entropy decoder decodes the zigzag
sequence of quantized DCT coefficients and then, after the inverse quantization process, the DCT coefficients
are transformed to an 8×8 block of samples by the inverse DCT. Since the inverse DCT implementation is not
fully specified, there may exist some mismatches with respect to the original image due to truncations and
roundings in the finite arithmetic implementations.
A.1.3. Transform and Quantization
As mentioned above, the JPEG Baseline Sequential mode uses a 2-D DCT. This transform is unitary (and
orthogonal) and separable and is given by
y(k,l) = (1/4) C(k) C(l) Σ_{m=0}^{7} Σ_{n=0}^{7} x(m,n) cos[(2m+1)kπ/16] cos[(2n+1)lπ/16] (A.1)
where:
y(k,l) is the DCT coefficient at coordinates (k,l)
x(m,n) is the sample value – luminance or chrominance – at coordinates (m,n)
C(0) = 1/√2 and C(k) = C(l) = 1 for all other indices
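As a complement, the 8×8 2-D DCT of Equation (A.1) can be computed directly as in the following Python sketch. This brute-force version (function name chosen here for illustration) is only meant to make the formula concrete; practical codecs exploit separability and fast factorizations instead:

```python
import math

def dct2_8x8(x):
    """Direct, non-optimized 8x8 2-D DCT following Equation (A.1):
    y(k,l) = 1/4 C(k) C(l) sum_m sum_n x(m,n) cos((2m+1)k*pi/16)
    cos((2n+1)l*pi/16), with C(0) = 1/sqrt(2) and C(k) = 1 otherwise."""
    def c(k):
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0
    y = [[0.0] * 8 for _ in range(8)]
    for k in range(8):
        for l in range(8):
            acc = 0.0
            for m in range(8):
                for n in range(8):
                    acc += (x[m][n]
                            * math.cos((2 * m + 1) * k * math.pi / 16.0)
                            * math.cos((2 * n + 1) * l * math.pi / 16.0))
            y[k][l] = 0.25 * c(k) * c(l) * acc
    return y
```

As a sanity check, a constant block of value 128 transforms to a single DC coefficient of 1024 (i.e. 8×128), with all AC coefficients equal to zero, as expected for a signal with no spatial variation.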
The quantization matrices are not standardized, but JPEG suggests a quantization matrix using values
corresponding to the minimum perceptual differences for each DCT coefficient; this basic quantization matrix
may be used to generate 'lower quality' quantization matrices by multiplying this matrix by a certain integer
quantization factor. Considering the HVS characteristics, the quantization steps used are typically lower for the
lower frequencies and higher for the higher frequencies. In this way, more quantization noise is injected in the
less perceptually relevant frequencies, the higher frequency coefficients; this is very important to exploit the
signal irrelevance, this means avoiding the transmission of image information that cannot be visually perceived.
The quantization matrices have to be transmitted, or simply signaled in case the suggested quantization matrix is used.
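The matrix scaling and the quantization itself can be sketched as follows. This is a simplified illustration (function names and the plain integer-factor scaling scheme are assumptions made here in the spirit of the text, not the exact scaling used by common JPEG implementations):

```python
def scale_matrix(base_q, factor):
    """Derive a 'lower quality' quantization matrix by multiplying the
    base perceptual matrix by an integer quantization factor."""
    return [[q * factor for q in row] for row in base_q]

def quantize(coeffs, qmat):
    """Uniform quantization: map each DCT coefficient to the nearest
    multiple of its (position-dependent) quantization step."""
    return [[round(c / q) for c, q in zip(crow, qrow)]
            for crow, qrow in zip(coeffs, qmat)]
```

Doubling the factor doubles every step, so more coefficients fall to zero and more quantization noise is injected, which is precisely the compression/quality trade-off discussed above.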
A.1.4. Performance Evaluation
The quality of the JPEG decoded images greatly depends on the quantization steps used for the encoding
process. For higher quantization steps, the compression ratio will increase, but the quality of the reconstructed
image will suffer from the data reduction; this means fewer coefficients are coded or the same coefficients are
coded but with more quantization noise. It is important to understand that the compression performance of a
JPEG encoder will strongly depend on the choices made by the encoder in terms of which coefficients are coded
and which quantization steps are used for each coded coefficient. For example, in Figure A.3, the same image is
encoded with a small quantization step (left side) and with a large quantization step (right side). Despite the
greater compression ratio achieved for the right side image in Figure A.3, it has very low quality when compared
to the left side image, with extreme loss of color and detail. The image coded using large quantization steps
shows very well the typical coding artifact resulting from a block based transform coding solution like JPEG: the
block effect. Since the image is coded as (artificially) independent blocks, with the exception of the DC
coefficient prediction, when the number of bits per block is reduced, fewer coefficients are sent and more
quantization noise is inserted, boosting the impact of the block boundaries; this is very evident for some blocks
of the right side image in Figure A.3 where only the DC coefficient is transmitted.
Figure A.3 – Image coded with JPEG using small quantization steps (compression ratio is 2.6:1) on the left side
and using large quantization steps (compression ratio is 144:1) on the right side [43].
The compression ratio achieved for a specific image will also depend on its particular characteristics, e.g., for
highly detailed images there isn't much spatial redundancy to exploit; thus, the amount of data required to
represent these images can't be as reduced as for smoother, lower frequency images. For example in [14], it is
stated that transparent quality may be typically reached at about 1.5-2 bit/pixel while a medium to good quality,
enough for some applications, may be reached at about 0.25-0.5 bit/pixel.
A.2. JPEG 2000 Standard
JPEG 2000 is another image coding standard created by the JPEG committee around 2000 [44], that is, more
than 10 years after the JPEG standard. Officially, JPEG 2000 corresponds to the ISO/IEC International Standard
15444-1.
A.2.1. Objectives
The JPEG 2000 standard was created with the objective of providing improved compression performance and
subjective image quality when compared to the existing standard from the same standardization body, the JPEG
standard. It was also intended to be more flexible than the JPEG standard, being suitable for different types of
still images (e.g. bilevel, grayscale, color, etc), with different characteristics (e.g. natural, computer generated,
medical, text, etc) and with different imaging models (e.g. real-time transmission, image library archival, limited
bandwidth resources, etc), that is, suitable for a wide number of applications, e.g. Internet, color facsimile,
printing, scanning, digital photography, medical imagery, E-commerce, etc. To fulfill these goals, the JPEG
2000 was created with a number of requirements in mind, mainly:
Good compression performance at low bitrates;
Lossless and lossy compression;
Progressive transmission by quality, resolution, component and spatial locality (i.e. scalability);
Random (spatial) access to the bitstream;
Robustness to bit-errors.
Besides the improvement of the compression performance and quality when compared to the JPEG standard,
JPEG 2000 defined a very important new objective: scalability. Thus, JPEG 2000 is defined in such a way to
allow the extraction of different resolutions, pixel fidelities, SNR and visual quality, and more, all from a single
compressed bit-stream. With this feature, it is possible to use this standard for any target device, transmitting only
the essential or possible data.
A.2.2. Technical Approach and Architecture
The JPEG 2000 encoder architecture is illustrated in Figure A.4. Before proceeding with the walkthrough of the
encoding process illustrated in Figure A.4, it should be noted that each image may be coded as a whole or
divided into tiles. Tiles are rectangular non-overlapping areas that are compressed independently, as if they were
entirely distinct images; most often there is a single tile, meaning the full image is one tile.
Figure A.4 – JPEG 2000 encoder architecture [45].
A short walkthrough of the encoding process is presented next:
1. Forward DWT – First, each tile (or the whole image) is transformed using a 2-D DWT. With this
transform, the image components, e.g. typically luminance and chrominances, are decomposed into
different resolution levels. These decomposition levels are made up of sub-bands populated with DWT
coefficients describing the frequency characteristics of local areas of each image component.
2. Quantization – The DWT coefficients are then quantized. This quantization process is described with
more detail in the next section.
3. Entropy encoder – Then, each sub-band of the DWT decomposition is divided up into regular non-
overlapping rectangular blocks, called code-blocks. Entropy coding is performed independently on each
code-block, bitplane by bitplane. Bitplanes are binary arrays representing a code-block from its Most
Significant Bit (MSB) to its Least Significant Bit (LSB), as shown in Figure A.5. Each individual
bitplane is coded with Context-based Adaptive Binary Arithmetic Coding (CABAC), resulting in
compressed bit-streams for each code-block.
Figure A.5 – Example of a bitplane from a particular code-block [14].
4. Bit-stream organization – In this step, the compressed bit-streams are organized in packets. Each
packet can be interpreted as one quality increment, for one resolution level, at one spatial location.
These packets can also be grouped in layers where each layer can be interpreted as one quality
increment for the entire image at full resolution.
With the utilization of a wavelet transform and the organization of the codestream as described above, JPEG
2000 assures quality and spatial resolution scalability.
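The decomposition of a code-block into bitplanes in step 3 can be illustrated with the following Python sketch. It is deliberately simplified (the function name is an assumption here, and the sign handling and the actual EBCOT coding passes of JPEG 2000 are omitted):

```python
def bitplanes(codeblock):
    """Split the magnitudes of a code-block into binary bitplanes,
    most significant plane first. Only the magnitude bits are shown;
    signs and the JPEG 2000 coding passes are omitted in this sketch."""
    mags = [[abs(v) for v in row] for row in codeblock]
    max_mag = max(max(row) for row in mags)
    nplanes = max(1, max_mag.bit_length())
    planes = []
    for p in range(nplanes - 1, -1, -1):  # from MSB plane down to LSB plane
        planes.append([[(v >> p) & 1 for v in row] for row in mags])
    return planes
```

Truncating the bitstream after a given number of planes yields a coarser (but still decodable) version of the coefficients, which is exactly what enables the quality scalability described above.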
A.2.3. Transform and Quantization
As noted above, the JPEG 2000 standard uses a 2-D DWT. This transform can be:
Irreversible – The default irreversible transform is implemented by means of the Daubechies 9/7 filter;
this filter is used for lossy coding. The analysis and the corresponding synthesis filter coefficients are
given in Table A.1.
Table A.1 – Irreversible Daubechies 9/7 analysis and synthesis filter coefficients [45].
Reversible – The default reversible transform is implemented by means of the 5/3 filter, whose
coefficients are given in Table A.2; this filter is used for lossless coding.
Table A.2 – Reversible 5/3 analysis and synthesis filter coefficients [45].
Figure A.6 shows an example of the DWT used in JPEG 2000. In this case, a three-level DWT decomposition
using the Daubechies 9/7 filter is shown.
Figure A.6 – Example of a 3-levels DWT decomposition as used in JPEG 2000 [46].
From the observation of Figure A.6, it is possible to identify the various DWT decomposition levels, with each
level providing more data to the final image, thus allowing a higher resolution.
After the transformation, the DWT coefficients are subject to uniform scalar quantization, employing a fixed
dead-zone around the origin. This is accomplished by dividing the magnitude of each coefficient by a
quantization step size and rounding down. One quantization step size is allowed per sub-band. The standard does
not define any method for the step size selection, so several methods can be used at will. A possible way to
select the quantization steps is related to the visual importance of each sub-band's coefficients for the final image
quality, selecting larger step sizes for the less important coefficients and vice-versa.
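The dead-zone scalar quantizer described above (divide the magnitude by the step size and round down, keeping the sign) can be sketched as follows; the function name is a choice made here for illustration:

```python
import math

def deadzone_quantize(coeff, step):
    """Uniform scalar quantization with a dead-zone around the origin,
    as applied to JPEG 2000 sub-band coefficients: the magnitude is
    divided by the quantization step and rounded down (truncated),
    and the sign is kept."""
    sign = -1 if coeff < 0 else 1
    return sign * math.floor(abs(coeff) / step)
```

Because the magnitude is always rounded down, the interval around zero that maps to index 0 is twice as wide as the other intervals, which is what gives the quantizer its 'dead-zone' and helps discard perceptually negligible coefficients.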
A.2.4. Performance Evaluation
To check if the initially defined goals were achieved, the JPEG 2000 performance is compared here with that of
the earlier JPEG standard. The superiority of JPEG 2000 can be subjectively judged with the help of Figure
A.7, where part of the reconstructed image Woman is shown after compression at 0.125 bpp (bits per pixel), and
Figure A.8, which shows the same result after compression at 0.25 bpp.
Figure A.7 – Reconstructed images compressed at 0.125 bpp by means of (a) JPEG and (b) JPEG 2000 [47].
Figure A.8 – Reconstructed images compressed at 0.25 bpp by means of (a) JPEG and (b) JPEG 2000 [47].
For the lower bitrates, the quality of the reconstructed images using JPEG 2000 is clearly better than using
JPEG, as shown in Figure A.7 and Figure A.8, since JPEG 2000 does not suffer from the block effect. As the
bitrate increases, the JPEG 2000 performance superiority decreases also because the block effect tends to
disappear. Visual comparisons of JPEG compressed images and JPEG 2000 compressed images show that, for a
large category of images, JPEG 2000 file sizes are on average 11% smaller than JPEG at 1.0 bpp, 18% smaller at
0.75 bpp, 36% smaller at 0.5 bpp and 53% smaller at 0.25 bpp [47]. However, even though JPEG 2000 can
achieve higher compression ratios for the same quality when compared to JPEG, this comes at the price of
additional complexity [48], which can be perceived as a drawback for some applications requiring low
complexity coding. For these applications, JPEG may still be the best solution.
A.3. H.261 Recommendation
H.261 is a 1990 video coding standard developed by the VCEG of the ITU-T [49]. It is officially known as
Recommendation ITU-T H.261 and was the first international video coding standard with relevant market adoption.
A.3.1. Objectives
This standard was designed for videotelephony and videoconference applications over Integrated Services
Digital Network (ISDN) telephone lines. The ISDN lines typically have bitrates that are multiples of 64 kbit/s
(p×64 kbit/s). H.261 operates at bitrates between 40 kbit/s and 2 Mbit/s and supports QCIF (176×144 pixels)
and, optionally, CIF (352×288 pixels) spatial resolutions at 4:2:0 subsampling (each chrominance is subsampled
by a factor of 2, both horizontally and vertically). The coding algorithm operates over progressive content at 30
frames/s but this frame-rate can be reduced by skipping 1, 2 or 3 frames for each transmitted one.
Because of its target applications, this standard has critical delay requirements, in order to allow a normal
bidirectional conversation. On the other hand, its quality requirements are not so critical since, in this case, a
lower or intermediate quality may be enough for a good personal communication.
4 Formally speaking, ITU issues recommendations and ISO/IEC issues standards.
A.3.2. Technical Approach and Architecture
To achieve high compression efficiency, video coding solutions have to exploit the spatial redundancy, typically
using a transform, the temporal redundancy, typically making some prediction in time, and the statistical
redundancy, typically through entropy coding. This would result in a lossless video coding solution. However,
since a lossless video coding solution would not achieve the necessary compression factors, video coding
solutions also exploit the visual irrelevancy to eliminate, through quantization, all the information which is not
perceptually relevant; this would result in transparent quality (perceptually similar to the original quality). If
higher compression factors are necessary, the encoder may also eliminate relevant information, thus implying
there is some quality degradation regarding the original quality (although this should happen in the most
graceful way possible).
The basic units for H.261 video coding are the macroblocks (MBs). Each macroblock corresponds to 16×16
luminance samples. In H.261, there are two main ways of coding each macroblock:
Intra-coding – These macroblocks are basically coded using the same techniques used in JPEG, which
are applied to the macroblock. In this case, no temporal redundancy is exploited. Intra-coding is
mainly used for the first picture, for later pictures after a change of scene and also for the macroblocks
corresponding to novel 'objects' in the scene. For the intra-coded macroblocks, the encoding process
has the following steps (illustrated in Figure A.9):
o Forward DCT – The macroblock is divided in 8×8 blocks, which are transformed using a 2-
D forward DCT.
o Quantization – The resulting DCT coefficients are then quantized.
o Entropy encoder - All quantized coefficients are then ordered in a 1-D zigzag sequence.
Each coefficient is represented using a bi-dimensional symbol, (run, level), where its position
and quantization level are indicated. To exploit the statistical redundancy, these symbols are
then coded using Huffman coding.
Figure A.9 – Basic H.261 intra-encoding architecture [50].
Inter-coding – With this coding mode, it is possible to use information from previous frames to code
the current frame, taking advantage of the temporal redundancy between neighbor frames. Moreover,
this coding mode can also detect, estimate and compensate the motion in the sequence, making much
improved temporal predictions, thus reducing the prediction error. Inter-coding is used in sequences of
similar pictures, including those containing moving objects. For the inter-coded macroblocks, the
encoding process considers the following steps (illustrated in Figure A.10):
o Motion estimation – To assess the existence of motion, the current macroblock is compared
with the macroblocks in the neighborhood of the corresponding macroblock in the previous
frame. If motion is detected, its horizontal and vertical directions are stored in two integers,
the motion vector components. The motion vectors (MV) are then entropy encoded. Although
very important to increase the compression efficiency, motion estimation implies a very
high computational effort.
o Sending the differences – If there is motion estimated in the previous step, the difference
between the current macroblock and the prediction macroblock is computed performing the
so-called motion compensation. Otherwise, the difference (prediction error) is computed
between the current macroblock and the corresponding macroblock in the previous frame.
These differences, which should ideally be as small as possible, are then transformed,
quantized and entropy encoded.
Figure A.10 – Basic H.261 inter-encoding architecture [50].
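The motion estimation step above can be illustrated with a full-search block matching sketch using the Sum of Absolute Differences (SAD) as matching criterion. H.261 does not standardize the search strategy, so the exhaustive search and the function name below are assumptions for illustration only:

```python
def full_search(cur_block, ref_frame, bx, by, search_range, n=16):
    """Exhaustive block-matching motion estimation: compare the current
    n x n block (top-left at (bx, by)) against every candidate block in
    the reference frame within +/- search_range, and return the motion
    vector (dx, dy) of the candidate with the smallest SAD."""
    h, w = len(ref_frame), len(ref_frame[0])
    best, best_sad = (0, 0), float('inf')
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > w or y0 + n > h:
                continue  # candidate block falls outside the frame
            sad = sum(abs(cur_block[r][c] - ref_frame[y0 + r][x0 + c])
                      for r in range(n) for c in range(n))
            if sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best, best_sad
```

The quadratic number of candidates per block makes clear why motion estimation dominates the encoder's computational effort, and why fast (sub-optimal) search algorithms are used in practice.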
It is important to note that for both coding modes – intra and inter – the encoder has to perform the corresponding
decoding process in order to store the decoded information for future inter-coding. The prediction process may
be modified by a loop filter (LF) that can be switched on and off to improve the picture quality by removing
high-frequency noise when needed.
A.3.3. Transform and Quantization
The transform used in the H.261 standard is very similar to the one used in JPEG. It is a 2-D separable DCT of
size 8×8. Before the computation of the transform, the data range is also arranged to be centered on zero; this
means a subtraction of 128 is applied to the samples in the 0-255 range for 8-bit samples.
H.261 can use as quantization steps all even values between 2 and 62. Within each macroblock, all DCT
coefficients are quantized with the same quantization step, with the exception of the DC coefficient of intra-
coded macroblocks, which is always quantized with step 8, due to its critical perceptual relevance.
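These H.261 quantization rules can be sketched as follows. The function name and the simple truncation used for reconstruction are illustrative choices; the exact reconstruction rules of the standard are omitted:

```python
def h261_quantize_intra(coeffs, step):
    """Sketch of H.261-style intra-block quantization: the DC
    coefficient (position 0,0) always uses step 8, while every AC
    coefficient uses the macroblock quantization step, which must be
    an even value between 2 and 62."""
    assert step % 2 == 0 and 2 <= step <= 62, "invalid H.261 step"
    out = [row[:] for row in coeffs]
    for r in range(8):
        for c in range(8):
            q = 8 if (r, c) == (0, 0) else step
            out[r][c] = int(coeffs[r][c] / q)  # truncate toward zero
    return out
```

Keeping the DC step fixed at 8 whatever the macroblock step protects the perceptually critical average brightness of each block even at coarse quantization.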
A.3.4. Performance Evaluation
As the first video coding standard, H.261 does not have any previous video coding standard to be compared
with. Still, it is possible to evaluate its performance depending on the available bitrate, the characteristics of the
video sequences and, very importantly, the used encoding tools. For example, Figure A.11 shows the image
quality (using the PSNR as quality metric) versus the bitrate for the well-known videotelephony sequence Miss
America with QCIF resolution; the chart shows RD performance results for the sequence coded at 30 frames/s
and at 10 frames/s; moreover, results are shown with and without motion vectors, and with and without a low-
pass loop filter when motion vectors are used (+MV+LF).
Figure A.11 – Average PSNR (dB) versus bitrate (kbit/s) for various H.261 combinations of tools for the Miss
America sequence [51].
Observing Figure A.11, it is clear that the image quality depends greatly on the available bitrate; as expected, for
the lower bitrates, the average PSNR is lower than for higher bitrates. For a certain bitrate, the video sequence at
10 frames/s has, on average, more bits per frame than the video sequence at 30 frames/s. Thus, the average
PSNR for the video sequence with the lower frame rate is typically higher, although the motion impression may
not be as good if the sequence has more intense motion. Since the motion estimation process is lossless, using it
reduces the prediction error and increases the average PSNR: for a certain fixed bitrate, the saved bits allow
reducing the quantization step applied to the coefficients of the inter-coded and intra-coded macroblocks.
In Figure A.12, the PSNR variation against the compression ratio is shown for the same video sequence. With
the increase of the compression ratio, the number of bits available to represent the video sequence decreases; this
results in a reduction of the average PSNR value and a consequent degradation of the image quality.
Figure A.12 – Average PSNR (dB) versus compression ratio for various H.261 combinations of tools for the
Miss America sequence [51].
Analyzing both charts, it is possible to conclude that the introduction of motion compensation and a loop filter
always improves the quality of the reconstructed video sequence for all the bitrates and compression ratios,
naturally at the price of some additional computational complexity. The improvements are more noticeable for
the lower bitrates and higher compression ratios.
A.4. MPEG-1 Video Standard
The MPEG-1 Video standard was the first video coding standard defined by the MPEG. It was finalized around
1993 and it is formally known as ISO/IEC 11172-2 [52].
A.4.1. Objectives
The main target of the MPEG-1 standard was to efficiently compress audiovisual information for digital storage,
notably to digitally store a VHS quality audiovisual sequence on a Compact Disc (CD). Thus, the MPEG-1
standard defines video and audio codecs in its associated Video and Audio parts.
For MPEG-1 Video, the target bitrate is around 1.2 Mbit/s to compress CIF resolution video at 25 Hz. Unlike
H.261, MPEG-1 Video does not have critical real-time requirements since the main target is not real-time
applications; however, it has some other critical requirements related to digital video storage, such as random
access, to provide the typical storage functionalities, such as fast forward and reverse playback, edition, etc. This
standard was originally optimized for the SIF, which has 352×288 pixels at 25 Hz and 352×240 pixels at 30 Hz, with
4:2:0 subsampling.
A.4.2. Technical Approach and Architecture
Besides the prediction methods used in H.261, where a macroblock can be predicted from a macroblock in the
previous frame (forward prediction), MPEG-1 Video also adopts backward prediction, based on the principle
that a macroblock can be predicted also taking as reference a future frame macroblock. This type of temporal
prediction has its costs, especially in terms of coding delay and complexity, which may be acceptable considering
that real-time applications are not the main target and offline coding is the main application scenario.
Because of the required storage facilities referred before, MPEG-1 Video defines three types of frames
depending on the coding tools used:
Intra-frames (I-frames) – The I-frames include only intra-coded macroblocks. These frames are
mainly used to provide random access since they do not depend on any other frames. They also prevent
error propagation associated with channel errors, since all the other frame types depend on other
frames and, thus, may propagate their errors.
Inter-frames – The inter-frames may include intra and inter-coded macroblocks. There are two
classes of inter-frames in MPEG-1 Video:
o P-frames – In these frames, the inter-coded macroblocks can only be predicted from
macroblocks from the previous I or P-frame (forward prediction).
o B-frames – The inter-coded macroblocks in B-frames can use forward prediction, backward
prediction or an average of both forward and backward predictions, the so-called bidirectional
prediction. These predictions may only be based on the adjacent I and P-frames. B-frames
typically require fewer bits than any other frame type for a certain quality; however, if too
many B-frames are successively used, the coding delay increases and the compression
efficiency is reduced since the reference frames (I or P) for the B frames will be farther away
and, thus, the prediction error will be higher.
It is important to stress that the typical additional compression efficiency of P-frames regarding I-frames and of
B-frames regarding P-frames is deeply related to the additional complexity associated with the motion estimation
process (with one and two reference frames for P and B frames, respectively) and the additional delay for B
frames.
The MPEG-1 video encoder architecture is presented in Figure A.13.
Figure A.13 – Basic MPEG-1 Video encoder architecture [53].
The walkthrough of the architecture shown in Figure A.13 is presented next:
For intra-coded macroblocks
o Forward DCT – After splitting the macroblock in 8×8 blocks, the samples are transformed
using a 2-D forward DCT.
o Quantization – Subsequently, the DCT coefficients are quantized.
o Entropy encoder – Finally, the quantized DCT coefficients are entropy encoded using
Huffman coding.
For inter-coded macroblocks
o Motion estimation – The previous and the future (I or P) prediction frame(s) macroblocks are
compared to the current macroblock. If this operation detects motion, the motion vectors are
entropy encoded. MPEG-1 Video uses half-pixel motion estimation accuracy to allow a more
precise estimation of the motion with the consequent reduction of the prediction error.
o Sending the differences – If there is motion detected, the differences are coded using motion
compensation. Otherwise, they are simply predicted by the relevant prediction frame(s)
corresponding macroblock(s). These differences are then transformed, quantized and entropy
encoded.
This walkthrough is valid for all the standards presented in the next sections; thus, it will not be repeated, and
only relevant differences will be referred.
A.4.3. Transform and Quantization
The MPEG-1 Video standard uses a 2-D separable DCT of size 8×8; this is not different from the transform used
in both the JPEG and H.261 standards.
The quantization process used in MPEG-1 Video is similar to the one used in JPEG. The quantization step may
be different for each DCT coefficient and it is defined with quantization matrices. There are two basic standard
quantization matrices: one for intra-coding and another for inter-coding (see Figure A.14). For inter-coding, the
high frequency coefficients are not necessarily associated to high frequency content since they can result from
block effects in the reference image(s), poor motion compensation or camera noise; in this context the
quantization steps are constant. For intra-coding, absolute energies are being coded and, thus, their quantization
should take into account the visual sensitivity to the various spatial frequencies. The quantization matrices may
be changed to achieve a better coding efficiency. Like H.261, the DC coefficients of intra-coded macroblocks
are always quantized with step 8.
Figure A.14 – MPEG-1 Video standard quantization matrices [54].
In MPEG-1 Video, the DC coefficients are differentially coded within each macroblock and between neighbor
macroblocks. This is done in order to exploit the similarities between the adjacent blocks DC coefficients.
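The differential DC coding mentioned above can be sketched in a few lines; in this simplified illustration (the function name is a choice made here) the predictor simply starts at zero, whereas the real codec resets it according to the standard's rules:

```python
def dc_differences(dc_values):
    """Sketch of differential DC coding: each DC coefficient is coded
    as its difference with respect to the previous DC coefficient,
    starting from a predictor of zero in this simplified example."""
    diffs, prev = [], 0
    for dc in dc_values:
        diffs.append(dc - prev)
        prev = dc
    return diffs
```

Since neighboring blocks tend to have similar average brightness, the resulting differences are small and cheap to entropy encode, which is the whole point of exploiting this inter-block similarity.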
A.4.4. Performance Evaluation
The technical improvements introduced in MPEG-1 Video bring a significant increase in terms of compression
efficiency when compared to H.261, notably the bidirectional predictions and the half pixel motion accuracy;
this increase typically comes at the cost of some computational complexity and delay. For video storage, these
costs are not as critical as for real-time video communications. Therefore, MPEG-1 Video fulfils its main
objective of providing a powerful video compression solution for video storage.
For less complex sequences and lower bitrates, H.261 typically achieves higher compression ratios than MPEG-
1 Video at comparable qualities, since MPEG-1 Video was optimized for bitrates in the range of 1.2 Mbit/s [55].
Thus, for videotelephony and videoconferencing, whose content typically has less complex motion, where lower
bitrates are typically available and where lower computational complexity and real-time performance are required,
the H.261 standard may still be the better choice between these two standards. However, for more general video
content, like movies, MPEG-1 Video provides significant compression efficiency advantages at the costs already
mentioned.
A.5. MPEG-2 Video Standard
The MPEG-2 Video standard (MPEG-2 Part 2) was finalized around 1996 in a joint collaborative team where
MPEG and ITU-T joined efforts [56]. It is formally known as ISO/IEC standard 13818-2 and Recommendation
ITU-T H.262.
A.5.1. Objectives
Jointly developed by both the ISO/IEC MPEG and the ITU-T VCEG standardization groups, this was the first
video coding standard created for both broadcasting and storage. MPEG-2 Video is designed to code high
quality, high resolution video sequences without noticeable quality loss, notably with the following quality
targets:
Secondary distribution – For broadcasting to the users, the signal quality at 3-5 Mbit/s must be better
than or similar to the quality of the available analogue systems, i.e. PAL, SECAM and NTSC.
Primary distribution – For contribution (e.g. transmission between studios), the signal quality at 8-10
Mbit/s must be similar to the original quality; this means the quality of the raw PCM representation.
The main MPEG-2 Video target applications are digital television transmission (i.e. cable, satellite and terrestrial
broadcasting) and Digital Video Disc (DVD) storage. Initially, the MPEG-2 Video standard was intended to
cover video coding up to 10 Mbit/s, leaving higher bitrates and spatial resolutions for another standard to
be labeled MPEG-3. However, MPEG-3 was never defined, since MPEG-2 Video also addressed the HD space
in an efficient way.
Unlike the previously reviewed standards, MPEG-2 Video targets the coding of interlaced video content, in
addition to the usual progressive video content. This is useful for historical reasons, as analogue TV is
interlaced. Another feature introduced by this standard is scalability (i.e. temporal, spatial and fidelity); this
functionality may be especially useful to accommodate transmissions in heterogeneous networks and to various
types of terminals, e.g. with standard or HD resolution.
Because the MPEG-2 Video standard addresses a vast range of applications, the standard and, thus, the
associated tools have been structured in terms of Profiles and Levels. A Profile defines a subset of the coding
tools and, thus, of the bitstream syntax, providing a variety of features required by some applications with a
certain degree of complexity, e.g. interlaced coding, B-frames and scalability. Within each Profile, Levels are
defined to limit the range of operating parameters, such as the spatial resolution (352×288 to 1920×1152) and
bitrate (4 Mbit/s to 80 Mbit/s).
A.5.2. Technical Approach and Architecture
The coding tools used in MPEG-2 Video are very similar to those used in MPEG-1 Video. The two main
differences are related to the two main additional functionalities:
Interlaced coding – With the MPEG-2 Video standard, it is possible to code interlaced video content,
which is the format used by analogue broadcast TV systems.
Scalable coding – The MPEG-2 Video standard allows temporal scalability (i.e. change of frame rate),
spatial scalability (i.e. change of resolution) and fidelity scalability (i.e. change of quality). When
creating scalable bitstreams, a bitrate overhead typically arises when compared to the corresponding
non-scalable streams.
The MPEG-2 Video encoder core architecture, i.e. without scalable coding, is presented in Figure A.15. It
is important to mention that temporal scalability is already available in the MPEG-1 Video standard, as it
naturally results from the I, P and B temporal prediction structure, without any bitrate burden. This means that
the additional scalability capabilities in MPEG-2 Video mainly refer to spatial resolution and quality scalability.
Figure A.15 – Basic MPEG-2 Video encoder architecture [35].
A.5.3. Transform and Quantization
The MPEG-2 Video standard uses the same 2-D DCT used in MPEG-1 Video. For interlaced video content, it is
possible to use an alternate scanning order for the DCT coefficients (shown in Figure A.16). With this alternative
scanning order, the DCT coefficients corresponding to vertical transitions are privileged in terms of scanning
order, since the vertical correlation is reduced for interlaced pictures with more motion.
MPEG-2 Video uses the same quantization techniques used in MPEG-1 Video, also making use of previously
presented quantization matrices. Once again, the DC coefficients of intra-coded macroblocks are always
quantized with step 8.
Figure A.16 – Zigzag and alternate scanning order for interlaced video content [35].
A.5.4. Performance Evaluation
In comparison to MPEG-1 Video, it is clear that MPEG-2 Video can produce better quality for interlaced video
regardless of the motion content [57]. However, for progressive video and for MPEG-1 Video target bitrates
(around 1.2 Mbit/s), MPEG-1 Video outperforms MPEG-2 Video. This is due to the fact that MPEG-2 Video
has a more complicated syntactical structure, which can increase the overhead information burden at lower
bitrates. For higher bitrates (3 Mbit/s and above), even for progressive video, the MPEG-2 Video standard can
achieve improved quality (for the same rate) in comparison to MPEG-1 Video [58], since the latter was not
optimized for this range of bitrates. These results are in conformity with the initially defined standard objectives,
allowing very efficient compression for high resolution and quality video.
A.6. H.263 Recommendation
Recommendation H.263 was finalized around 1995 by the ITU-T VCEG standardization group [59]. It is
formally known as ITU-T Recommendation H.263.
A.6.1. Objectives
The H.263 standard was created with the intention of replacing H.261 by improving its compression efficiency,
notably for lower bitrates. The main motivation behind the creation of this standard was the lack of a
standard that could assure interoperability between digital videotelephony terminals over the analogue telephone
network (PSTN) and the emerging mobile networks. The standardization process had to be quick to allow a
fast deployment of interoperable products in the market. Thus, the H.263 standard is mostly based on
existing technology, particularly the H.261 and MPEG-1 Video coding tools.
A.6.2. Technical Approach and Architecture
Although H.261 and H.263 share the same basic coding structure, there are some differences between them.
Some of these differences are improvements that were already present in MPEG-1 Video. The main differences
between the H.261 and the H.263 standards are:
Target bitrate – The H.261 target bitrate is p×64 kbit/s (p = 1,2,…,30), whereas H.263 also aims at
bitrates below 64 kbit/s to allow videotelephony over the PSTN.
Picture formats – Besides the formats already used in H.261 (i.e. QCIF and CIF), H.263 also supports
the sub-QCIF, 4CIF and 16CIF formats.
Motion compensation accuracy – Like the MPEG-1 Video standard, H.263 supports half-pixel
accuracy.
Motion vector prediction – Motion vectors are coded differentially as in H.261, but, besides the
preceding macroblock, also macroblocks in the previous macroblock-row are used for motion vector
prediction; this allows increasing the bitstream error resilience.
PB-frames mode – A PB-frame consists of two pictures coded as one unit. The P-frame is predicted
from the last decoded P-frame and the B-frame is predicted from both the last and the current P-frame.
This allows increasing the decoded frame rate at a rather low bitrate cost.
VLC tables – H.263 codes the DCT coefficients using triplets that add a last-coefficient flag to the
(run, level) pairs, thus avoiding the explicit coding of the eob (End Of Block) symbol.
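The triplet construction can be illustrated as follows, assuming a zigzag-scanned coefficient list as input (a sketch of the event formation only; the actual VLC table lookup is omitted):

```python
def runlevel_triplets(scanned):
    """Convert a zigzag-scanned coefficient list into (last, run, level)
    events: 'run' counts the zeros preceding each nonzero coefficient,
    'level' is its value, and last = 1 marks the final nonzero
    coefficient, replacing an explicit end-of-block symbol."""
    events, run = [], 0
    for c in scanned:
        if c == 0:
            run += 1
        else:
            events.append([0, run, c])   # last flag fixed up below
            run = 0
    if events:
        events[-1][0] = 1                # mark the final event as 'last'
    return [tuple(e) for e in events]

print(runlevel_triplets([9, 0, 0, -2, 1, 0, 0, 0]))
# → [(0, 0, 9), (0, 2, -2), (1, 0, 1)]
```

Note how the trailing zeros never produce an event: the decoder stops after the triplet flagged as last, which is exactly what makes the explicit eob symbol unnecessary.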
The H.263 encoder architecture is presented in Figure A.17.
Figure A.17 – Basic H.263 encoder architecture [60].
A.6.3. Transform and Quantization
In H.263, the transform used is the same 2-D DCT used in H.261. As usual, this transform is applied to 8×8
blocks.
In terms of quantization, H.263 uses the same method described for H.261. The same step size (even values
between 2 and 62) is used for all the coefficients in the same macroblock, except the DC coefficient of the intra-
coded macroblocks, which is quantized with step 8.
A.6.4. Performance Evaluation
The H.263 standard outperforms H.261 in terms of compression efficiency for any bitrate, even above 64 kbit/s [61].
With this performance, it is possible to say that H.263 successfully replaced H.261 as the video compression
standard for lower bitrate communications. Furthermore, the H.263 complexity is only marginally higher than
the H.261 complexity [61].
A.7. MPEG-4 Visual Standard
The MPEG-4 Visual standard was finalized around 1999 by MPEG [62]. It is also called MPEG-4 Part 2 and it
is formally known as ISO/IEC 14496-2.
A.7.1. Objectives
The MPEG-4 Visual has the main target of specifying the codecs for various types of visual objects to be used in
the context of the MPEG-4 standard which adopted for the first time an object-based (and not frame-based)
visual representation paradigm. In this context, the MPEG-4 standard targets a large range of applications (e.g.
surveillance, mobile communications, streaming over the Internet/Intranet, digital TV, studio postproduction,
etc). The MPEG-4 Visual standard specifies codecs for natural and synthetic visual objects; in terms of video
codecs, it specifies both codecs for rectangular and arbitrarily shaped video objects. For rectangular objects, the
spatial resolution goes from sub-QCIF to studio resolutions around 4k×4k pixels; naturally, a frame, as
considered in the previous standards, is a particular case of a video object.
A.7.2. Technical Approach and Architecture
The MPEG-4 Visual standard includes tools for coding natural video and still images (visual textures). This
allows the coding of scenes containing both moving and still images using the same standard. Each scene to be
coded can be composed of one or several video objects. In object-based coding, the video frames are defined in
terms of Video Object Planes (VOP). Each VOP is then the momentary video representation of a specific object
of interest to be coded or to be interacted with. Each video object is encapsulated by a rectangular bounding box,
which is then divided into 16×16 pixel macroblocks that can be classified as (see Figure A.18):
Transparent – Macroblocks in the bounding box that are completely outside the VOP; these
macroblocks do not need to be coded.
Opaque – Macroblocks in the bounding box that are completely inside the video object plane; these
macroblocks are intra or inter-coded using motion compensation and DCT encoding.
Boundary – Macroblocks in the bounding box that include the boundary of the video object plane;
these macroblocks are processed with specific tools for coding arbitrarily shaped objects.
Figure A.18 – Macroblock classification in MPEG-4 Visual [62].
Regarding rectangular (or frame-based) video coding, which is functionally similar to the frame-based coding
solutions previously reviewed, there are some improvements introduced by MPEG-4 Visual, notably in terms of
motion compensation:
Quarter-pixel motion compensation – Motion compensation supports motion vectors with an
increased accuracy, notably one-quarter pixel, allowing improved predictions, thus reducing the
prediction error.
Global motion compensation – Instead of using local motion vectors for each macroblock, this tool
also allows using a single motion vector for a whole video object plane (which may be a frame). This can be
important for sequences with a large portion of global translational motion (e.g. a camera panning) and
also for non-translational motion (e.g. zoom or rotation).
Direct mode in bidirectional prediction – This is a generalization of the PB-frames mode introduced in
H.263. Both forward and backward predictions are used, but the required motion vectors are derived
from the motion vector of the collocated macroblock in the backward reference, and only a correction
term called delta vector is transmitted.
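A minimal sketch of this derivation, assuming the usual temporal-distance scaling of the collocated vector (the integer-division and rounding details of the actual standard are omitted, and the function name is hypothetical):

```python
def direct_mode_vectors(mv_col, trb, trd, delta=(0, 0)):
    """Sketch of direct-mode motion vector derivation: the forward and
    backward vectors of a B macroblock are scaled from the collocated
    macroblock's vector mv_col using the temporal distances trb
    (B picture to past reference) and trd (past to future reference);
    only the small correction 'delta' is transmitted."""
    mvf = tuple(trb * v // trd + d for v, d in zip(mv_col, delta))
    if delta == (0, 0):
        mvb = tuple((trb - trd) * v // trd for v in mv_col)
    else:
        mvb = tuple(f - v for f, v in zip(mvf, mv_col))
    return mvf, mvb

# Collocated vector (6, -3), B frame one third of the way between references
print(direct_mode_vectors(mv_col=(6, -3), trb=1, trd=3))
```

The appeal of the mode is visible here: two full motion vectors are obtained from the bitstream cost of, at most, one small delta vector.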
The MPEG-4 Visual rectangular video objects encoder architecture is presented in Figure A.19.
Figure A.19 – Basic MPEG-4 Visual encoder architecture (for rectangular video objects) [60].
MPEG-4 Visual still objects, also called visual textures, are coded based on a wavelet transform coding solution,
similar to the one adopted by the JPEG 2000 standard.
A.7.3. Transform and Quantization
MPEG-4 Visual also uses a 2-D DCT to transform the 8×8 blocks that compose a macroblock. In MPEG-4
Visual, it is possible to quantize the DCT coefficients in two ways:
MPEG-2 Video quantization – The first quantization method is derived from the quantization used in
MPEG-2 Video. This method takes into account the properties of the human visual system, allowing a
different quantization step for each transform coefficient by means of quantization matrices. The default
MPEG-4 Visual quantization matrices are shown in Figure A.20.
Default weighting matrix for intra-coded MBs:
 8 17 18 19 21 23 25 27
17 18 19 21 23 25 27 28
20 21 22 23 24 26 28 30
21 22 23 24 26 28 30 32
22 23 24 26 28 30 32 35
23 24 26 28 30 32 35 38
25 26 28 30 32 35 38 41
27 28 30 32 35 38 41 45

Default weighting matrix for inter-coded MBs:
16 17 18 19 20 21 22 23
17 18 19 20 21 22 23 24
18 19 20 21 22 23 24 25
19 20 21 22 23 24 26 27
20 21 22 23 25 26 27 28
21 22 23 24 26 27 28 30
22 23 24 26 27 28 30 31
23 24 25 27 28 30 31 33
Figure A.20 – Default MPEG-4 Visual quantization matrices [62].
H.263 quantization – The second quantization method is derived from the quantization used in H.263.
This method is less complex and easier to implement [62], but it only allows one step size value per
macroblock.
The selection of the quantization method to use is decided at the encoder side. This decision is then transmitted
to the decoder as side information. For intra-coded blocks, the DC coefficient is quantized using a fixed
quantization step size.
As mentioned before, MPEG-1 Video predicts the DC coefficient values from the DC coefficients of neighboring
blocks. For some of the DC and AC coefficients of neighboring blocks, there exist statistical dependencies,
i.e., the value in one block can be predicted from the corresponding value in one of the neighboring blocks. This
fact is exploited in MPEG-4 Visual by the so-called DC/AC prediction. It should be noted that this
prediction is only applied in the case of intra-coded macroblocks. The idea behind the DC/AC prediction tool is
presented in Figure A.21.
Figure A.21 – DC/AC prediction process for intra-coded macroblocks [62].
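A sketch of the adaptive DC predictor selection, based on the gradient rule used by MPEG-4 Visual (the function name is hypothetical; A denotes the block to the left of the current block, B the block above-left, and C the block above):

```python
def predict_dc(dc_a, dc_b, dc_c):
    """Sketch of the MPEG-4 Visual adaptive DC predictor for an intra
    block: A is the block to the left, B above-left, C above. The
    predictor direction follows the smaller local gradient."""
    if abs(dc_a - dc_b) < abs(dc_b - dc_c):
        return dc_c, "vertical"     # predict from the block above
    return dc_a, "horizontal"       # predict from the block to the left

pred, direction = predict_dc(dc_a=100, dc_b=102, dc_c=140)
# |100-102| = 2 < |102-140| = 38 → the horizontal gradient is small,
# so the content likely changes vertically: predict from C (above)
print(pred, direction)   # 140 vertical
```

Only the residual between the actual DC value and the selected predictor is then coded, which is where the bitrate saving comes from.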
For the scanning of the DCT coefficients, which corresponds to a 2D-to-1D conversion of the DCT coefficient
information, there are two additional scanning modes available, besides the traditional zigzag scanning used in
most standards; see Figure A.22.
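The traditional zigzag scanning can be sketched as follows (a generic illustration of the 2D-to-1D conversion, not the normative scan tables):

```python
import numpy as np

def zigzag_scan(block):
    """2D-to-1D zigzag conversion of an NxN coefficient block: the
    anti-diagonals are visited in order, alternating direction, so
    that low-frequency coefficients come first in the output vector."""
    n = block.shape[0]
    order = sorted(((i, j) for i in range(n) for j in range(n)),
                   key=lambda p: (p[0] + p[1],                      # diagonal index
                                  p[0] if (p[0] + p[1]) % 2 else -p[0]))
    return [int(block[i, j]) for i, j in order]

b = np.arange(16).reshape(4, 4)      # toy 4x4 "coefficient" block
print(zigzag_scan(b))
# → [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]
```

Since quantization tends to zero out the high-frequency (bottom-right) coefficients, this ordering produces the long trailing runs of zeros that run-length coding exploits.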
For boundary macroblocks, this standard also supports the usage of a special transform called Shape-Adaptive
DCT. Basically, the aim of this transform is to code only the opaque pixels within the boundary macroblocks
which are not completely filled with texture data [62].
Figure A.22 – Alternative MPEG-4 scanning modes for converting the 2D coefficients matrix into a 1D vector of
DCT coefficients [62].
A.7.4. Performance Evaluation
The main target of the MPEG-4 Visual standard was not additional compression efficiency; however, some of
the MPEG-4 Visual profiles, specifically those targeting frame-based video coding, provide some compression
efficiency benefits over previous standards due to the additionally included coding tools.
For higher bitrates (i.e. 5 Mbit/s to 15 Mbit/s), MPEG-2 Video is already a well performing standard and, thus,
for this range of bitrates, MPEG-4 Visual does not bring any significant improvement. However, as mentioned
above, for low and medium bitrates (i.e. up to 3 Mbit/s), MPEG-2 Video does not assure a good compression
performance and is even outperformed by MPEG-1 Video. Therefore, for both low and medium bitrates, MPEG-
4 Visual comes as an improvement, showing some compression performance superiority over MPEG-1
Video for any type of video sequence [62].
For very low bitrates (i.e. 50 kbit/s), H.263 still provides some coding gain over MPEG-4 Visual. For these
bitrates, MPEG-4 Visual does not use all of the available coding tools, in order to reduce the complexity and the
delay caused by their usage; this corresponds to the MPEG-4 Visual Simple Profile. However, for higher
bitrates (i.e. 1.5 Mbit/s), MPEG-4 Visual provides better compression performance than H.263, making use of
all the available coding tools (e.g. B-frames, quarter-pixel motion compensation, MPEG-2-style quantization,
global motion compensation); this is provided through the MPEG-4 Visual Advanced Simple Profile [63].
A.8. H.264/AVC Standard
The H.264/AVC standard (also known as MPEG-4 AVC, MPEG-4 Part 10 or ISO/IEC 14496-10) is a video
coding standard jointly developed by the ITU-T VCEG and ISO/IEC MPEG standardization groups; its first
version was finalized around 2003 [64].
A.8.1. Objectives
This standard has the main target of providing the same quality achieved by the previously available video coding
standards (e.g. MPEG-2 Video, H.263 and MPEG-4 Visual) at substantially lower bitrates, typically half the
bitrate or less, corresponding to around 50% bitrate savings.
Additionally, it was designed to provide enough flexibility to allow its deployment in a wide range of application
scenarios, considering low to high bitrates and low to high spatial resolutions.
A.8.2. Technical Approach and Architecture
The H.264/AVC standard no longer uses the object-based coding paradigm introduced in MPEG-4 Visual and
has returned to the usual frame-based video coding paradigm5. It only addresses rectangular objects/frames, as in
the video coding standards before MPEG-4 Visual. To achieve the proposed objective, H.264/AVC uses many
new coding tools capable of increasing the compression efficiency, typically at the cost of increased
encoding and decoding complexity. The main technical improvements introduced in H.264/AVC are:
Temporal redundancy tools
o Variable block size – Unlike other standards, H.264/AVC supports various block sizes for motion
estimation. For fast moving and changing areas, smaller blocks may be adopted to increase the
motion compensation accuracy. For slow moving and changing areas, larger blocks may be adopted
to save bits.
o Quarter-pixel motion estimation – This tool was already introduced in MPEG-4 Visual to
improve the motion vectors accuracy, thus increasing the compression efficiency.
o Multiple reference frames – With H.264/AVC, it is possible to adopt multiple reference frames
for a single MB (up to 31 frames), in the past or in the future. This is useful for situations where the
neighboring frames are not the most similar to the current frame.
o Generalized B-frames – Additionally, with H.264/AVC, B-frames can also serve as prediction references
for other B-frames, with or without motion compensation, removing the B-frame limitations in
terms of prediction referencing that existed since MPEG-1 Video.
Spatial redundancy and irrelevancy tools
o Transform and Quantization – There are some significant improvements in H.264/AVC
concerning the transform coding and quantization process which will be analyzed in detail in the
next section.
o Intra prediction – In contrast to the previously presented video compression standards, where the
spatial redundancy is only removed by means of transform coding, H.264/AVC can predict an intra-
coded MB using pixels from neighboring macroblocks within the same frame, afterwards applying
transform coding to the intra-prediction residual. Intra prediction
may be performed for 4×4 or 16×16 blocks. For 4×4 blocks, the intra prediction can be made in 9
different ways, depending on the correlation direction between neighboring blocks. For 16×16 blocks,
four intra coding modes are available; this intra prediction block size is typically useful for image
areas with smooth variations.
Statistical redundancy tools
o The H.264/AVC entropy coder includes two main alternatives with different complexities and
efficiencies:
Context-Adaptive Binary Arithmetic Coding (CABAC), which is more complex but provides
additional compression efficiency.
Context-Adaptive Variable-Length Coding (CAVLC), which is less complex but also less
efficient. This alternative also uses Exponential-Golomb (Exp-Golomb) coding; Exp-Golomb is a
common, simple and highly structured Variable Length Coding (VLC) technique.
Perceptual redundancy tools
o In-loop deblocking filtering – To reduce the negative subjective impact of the blocking artifacts,
and also improve the compression efficiency, H.264/AVC uses an in-loop deblocking filter. This
filter is applied to the vertical and horizontal edges of all 4×4 blocks in a macroblock.
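The Exp-Golomb codes mentioned above are simple to construct: the codeword for an unsigned value n is the binary representation of n+1, preceded by as many zeros as there are bits after its leading one. A minimal sketch:

```python
def exp_golomb_encode(n):
    """Unsigned Exp-Golomb codeword for n >= 0: write (n+1) in binary
    and prefix it with len-1 zeros, so small values get short codes."""
    binary = bin(n + 1)[2:]
    return "0" * (len(binary) - 1) + binary

for n in range(5):
    print(n, exp_golomb_encode(n))
# 0 → 1, 1 → 010, 2 → 011, 3 → 00100, 4 → 00101
```

The structure makes decoding trivial: count the leading zeros, read that many more bits after the one, and subtract 1, with no code table needed.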
The H.264/AVC encoder architecture is presented in Figure A.23.
5 To be more precise, H.264/AVC specifies additional codecs for rectangular objects in the context of the object-
based MPEG-4 representation framework.
Figure A.23 – Basic H.264/AVC encoder architecture [17].
A.8.3. Transform and Quantization
In H.264/AVC, several transforms are specified (see Figure A.24):
A 2-D Hadamard transform of size 4×4 for the luminance DC coefficients for 16×16 intra-coded
macroblocks.
A 2-D Hadamard transform of size 2×2 for the chrominance DC coefficients of any macroblock.
A 2-D Integer DCT (ICT) of size 4×4 for all the other blocks; this is considered to be the “core”
transform.
Figure A.24 – H.264/AVC transforms [17].
The reduction of the transform block size from 8×8 (the block size used in the previous video coding standards)
to 4×4 allows a more locally-adaptive representation of the input signal. With a smaller block size also available
for motion compensation, H.264/AVC obtains higher temporal prediction efficiency as well. The ICT is based on the DCT but
with some fundamental differences:
It is an integer transform, thus, all operations can be processed with integer arithmetic, without any loss
of accuracy.
Since the inverse transform is defined by the exact integer operations, inverse-transform mismatches
between encoders and decoders should not occur.
The core part of this transform only requires additions and shifts, being easier to implement than the
DCT.
A scaling multiplication is integrated into the quantizer, reducing the total number of multiplications.
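The "additions and shifts" claim can be illustrated with the well-known 1-D butterfly of the order-4 core transform (a sketch, not the reference implementation):

```python
def forward_ict_1d(x):
    """1-D H.264/AVC core transform (one row or column of the 4x4
    ICT) using additions and shifts only: y = Cf · x for
    Cf = [[1,1,1,1],[2,1,-1,-2],[1,-1,-1,1],[1,-2,2,-1]]."""
    s0, s1 = x[0] + x[3], x[1] + x[2]     # butterfly sums
    d0, d1 = x[0] - x[3], x[1] - x[2]     # butterfly differences
    return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]

def forward_ict_2d(block):
    """Separable 2-D core transform: rows first, then columns."""
    rows = [forward_ict_1d(r) for r in block]
    cols = [forward_ict_1d(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

print(forward_ict_1d([1, 2, 3, 4]))   # → [10, -7, 0, -1]
```

Since every operation is an integer addition, subtraction or shift, the result is bit-exact on any platform, which is precisely what eliminates the encoder/decoder transform mismatch of earlier DCT-based standards.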
The block diagram for the H.264/AVC transform and quantization processes is presented in Figure A.25.
Figure A.25 – (a) Forward transform and quantization. (b) Re-scaling and inverse transform.
The forward 2-D ICT is arranged into a core transform (Cf) and a scaling matrix (Sf) defined as

Cf = [ 1   1   1   1
       2   1  -1  -2
       1  -1  -1   1
       1  -2   2  -1 ]                                            (A.2)

Sf = [ a²    ab/2  a²    ab/2
       ab/2  b²/4  ab/2  b²/4
       a²    ab/2  a²    ab/2
       ab/2  b²/4  ab/2  b²/4 ],  with a = 1/2 and b = √(2/5)     (A.3)

In this way, the forward 2-D ICT is given by

Y = (Cf X Cf^T) ⊗ Sf                                              (A.4)

where Y is the ICT coefficients matrix, X corresponds to the input block samples and ⊗ denotes element-wise
multiplication.
Again, the inverse 2-D ICT is arranged into a core transform (Ci) and a scaling matrix (Si) defined as

Ci = [ 1    1    1    1
       1   1/2 -1/2  -1
       1   -1   -1    1
      1/2  -1    1  -1/2 ]                                        (A.5)

Si = [ a²  ab  a²  ab
       ab  b²  ab  b²
       a²  ab  a²  ab
       ab  b²  ab  b² ]                                           (A.6)

In this way, the inverse 2-D ICT is given by

Z = Ci^T (Y ⊗ Si) Ci                                              (A.7)

where Z is the reconstructed block matrix.
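Once the element-wise scaling by Sf and Si is included, the forward/inverse pair achieves perfect reconstruction. A small numerical check, using the commonly published H.264/AVC order-4 matrices and implementing ⊗ as element-wise multiplication:

```python
import numpy as np

# Core matrices and scaling factors of the H.264/AVC order-4 ICT
# (a = 1/2, b = sqrt(2/5)); the scaling matrices Sf and Si are the
# outer products of the per-row scale vectors below.
a, b = 0.5, np.sqrt(2.0 / 5.0)
Cf = np.array([[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]])
Ci = np.array([[1, 1, 1, 1], [1, .5, -.5, -1], [1, -1, -1, 1], [.5, -1, 1, -.5]])
Sf = np.outer([a, b / 2, a, b / 2], [a, b / 2, a, b / 2])
Si = np.outer([a, b, a, b], [a, b, a, b])

X = np.arange(16, dtype=float).reshape(4, 4)   # any 4x4 input block
Y = (Cf @ X @ Cf.T) * Sf                       # forward transform, Eq. (A.4)
Z = Ci.T @ (Y * Si) @ Ci                       # inverse transform, Eq. (A.7)
print(np.allclose(Z, X))                       # True: perfect reconstruction
```

In the actual codec the Sf and Si multiplications are folded into the quantization and re-scaling stages, so the transform core itself stays multiplication-free.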
Besides the transforms specified in the first version of H.264/AVC, a new 8×8 ICT was introduced in an extension
to the original standardization project, called the Fidelity Range Extensions (FRExt). These extensions were
developed to enable higher quality video coding. Several features were included in the FRExt project, such as an
adaptive switching between order-4 and order-8 ICTs, depending on the characteristics of the input samples.
Sometimes, a 4×4 block size can improve the temporal prediction but compromise the spatial compaction; on
the other hand, an 8×8 block size may achieve better spatial compaction while sacrificing the
temporal prediction.
In H.264/AVC, a quantization parameter (QP) is used to determine the quantization of the transform coefficients. This
parameter can take 52 values, which are related to the quantization step through a table. An increment of the
quantization parameter by 1 implies an increase of the quantization step by approximately 12% and a reduction
of the bitrate by approximately 12% as well [64]. The same quantization parameter is used for all the transform
coefficients in a macroblock.
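The QP-to-step mapping can be sketched as follows; the base step values for QP 0 to 5 are the commonly tabulated ones (an assumption here, not taken from this document), and the step doubles every 6 QP increments, i.e. each +1 multiplies it by 2^(1/6) ≈ 1.12, the ~12% growth mentioned above:

```python
def qstep(qp):
    """Approximate H.264/AVC quantization step for QP in [0, 51]:
    a table of six base steps, doubled for every 6 QP increments."""
    base = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]
    return base[qp % 6] * (2 ** (qp // 6))

print(qstep(4), qstep(10), qstep(28))   # 1.0 2.0 16.0
```

This logarithmic design covers a very wide fidelity range (steps from under 1 to over 200) with a compact 6-bit parameter.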
A.8.4. Performance Evaluation
H.264/AVC coding provides quality similar to MPEG-2 Video coding at approximately half the bitrate.
However, the advantage of H.264/AVC diminishes as the bitrate and spatial resolution increase, such that at
very high bitrates (above 18 Mbit/s) there is very little difference between MPEG-2 Video and H.264/AVC
[65]. This basically shows that both MPEG-2 Video and H.264/AVC have been optimized for lower resolutions
(notably H.264/AVC for CIF and ITU-R 601 resolutions) and do not perform very well for very high
resolutions, notably beyond HD.
In comparison to H.263, previously the most efficient video coding standard for low bitrates, H.264/AVC
achieves around 24% average compression gain [66]. With this performance, the H.264/AVC standard is
currently considered the state-of-the-art in video coding for a large range of applications, bitrates and
resolutions.
Appendix B
Recent Advances on Transform Coding
To improve the compression efficiency of predictive video coding solutions, notably for resolutions above high
definition, some new transforms have been introduced in the recent years. In this appendix, the most relevant
advances on transform coding are briefly reviewed. Each of the four solutions presented adopts a different
approach, although always using previously reviewed concepts.
B.1. Increasing the Transform Block Size
The first solution presented was proposed by Dong et al. [67] in 2009 and consists of two 2-D order-16 integer
transforms. These transforms are expected to be more efficient than the transforms already in use (particularly
the 2-D order-4 and order-8 transforms) in exploiting the spatial correlation present in HD video sequences.
B.1.1. Objectives
A statistical analysis of the correlation between adjacent prediction error blocks, which represent the difference
between the original and the motion compensated prediction blocks, reveals that, for higher definitions, the
spatial correlation of the prediction errors increases. To exploit this property, the authors of this solution propose
the usage of larger transform blocks. In particular, they propose 16×16 blocks, since previous
studies show that there is no significant improvement in using even larger blocks for HD video sequences
[68]. With 16×16 blocks, it is possible to better exploit the spatial correlation between neighboring samples, at
the cost of increasing the transform complexity and penalizing the entropy coding process (typically run-
length based), which becomes suboptimal due to the much larger dynamic range of the runs. The
proposed order-16 transforms are developed taking these advantages and drawbacks into consideration and are
not simple extensions of the already used ICT (Section A.8.3).
It is important to note that this solution still considers the order-4 and order-8 transforms as alternatives for more
detailed areas where the spatial correlation is less significant.
B.1.2. Architecture and Walkthrough
The transforms proposed in [67] have the same architecture as other transforms, more particularly as the ICT
transforms used in the H.264/AVC standard (see Section A.8.3). The details on the transforms are presented in
the next section.
B.1.3. Details on the Transform
The authors propose two new 2-D order-16 transforms, both integer and derived from the 2-D order-16 ICT.
The general transform matrix of an order-16 ICT, T16, is defined as [67]
(B.1)
This matrix has alternating even and odd symmetry across its rows. In this way, it can be defined by its
even part, T8e, and odd part, T8o [67]
(B.2)
(B.3)
The even part is an order-8 ICT, as the order-8 transform used in H.264/AVC, and its element set is given by
(B.4)
To maintain the orthogonality of the transform matrix, the element set of the odd part has to be represented with
large magnitudes, of at least 6 bits; however, this significantly increases the associated computational
complexity. To avoid this complexity, while keeping the goal of better exploiting the spatial redundancy, the
authors developed the following transforms:
2-D order-16 Non-orthogonal ICT (NICT) – To reduce its complexity, this transform uses values for
the element set of the odd part that do not guarantee the orthogonality of the transform. This is a trade-
off between complexity and performance, since a non-orthogonal transform does not have the best
energy compaction performance. However, the proposed NICT preserves all the other ICT properties,
such as bit-exact implementation and a rather low complexity. The element set of the even part is a
scaled version of the transform matrix of the order-8 ICT from H.264/AVC given by
(B.5)
However, a non-orthogonal transform does not achieve perfect reconstruction, and the reconstruction errors
can even be larger than the errors introduced by the quantization process. Thus, to define the element
set of the odd part, various solutions were analyzed to find the one with the best balance between
the approximation to the DCT performance and the magnitudes of the used values (related to the
computational complexity). With this in mind, the authors proposed the following solution:
(B.6)
This element set was selected from a group of sets tested in order to determine their DCT distortions
and the upper bounds of the average variance of the reconstruction error, as shown in Table
B.1.
Table B.1 – Performance comparison of various element sets [67].
2-D order-16 Modified ICT (MICT) – The second order-16 transform proposed is obtained by
modifying the structure of the order-16 ICT matrix, thus taking the name modified ICT. This
modification is performed using the principle of dyadic symmetry6. The even part of the transform
matrix remains unaltered, with the element set in Eq. (B.2), while the odd part is given by [67]
(B.7)
Since the MICT is based on the ICT, its basis vectors are inherently orthogonal no matter what the
element sets are. With this property, it is possible to select smaller magnitude elements without losing
the orthogonality. Thus, to select the best element set, it is important to obtain a trade-off between the
performance and the magnitude of the elements (related to the computational complexity). In this
solution, the authors selected the following element set for the odd part
6 A vector of 2^m elements [a0, a1, …, a(2^m − 1)] is said to have the Sth dyadic symmetry if aj = c·a(j⊕S), where ⊕ is the
'exclusive-OR', j lies in the range [0, 2^m − 1], S lies in the range [1, 2^m − 1] and c is a constant determining the type
of dyadic symmetry, i.e., if c = 1 then the symmetry is said to be 'even' and if c = −1 then the symmetry is said to
be 'odd'.
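This symmetry is easy to check programmatically. The following sketch (the function name is ours) verifies the definition above, with `^` as the bitwise exclusive-OR:

```python
def has_dyadic_symmetry(a, S, c):
    """Return True if vector a (length 2**m) has the Sth dyadic symmetry,
    i.e. a[j] == c * a[j ^ S] for every j, with ^ the bitwise exclusive-OR."""
    n = len(a)
    assert n & (n - 1) == 0 and 0 < S < n, "length must be a power of two"
    return all(a[j] == c * a[j ^ S] for j in range(n))

# For n = 4, S = 3 pairs index j with n-1-j, i.e. a simple mirror symmetry:
# [1, 2, 2, 1] has even (c = 1) symmetry; [1, 2, -2, -1] has odd (c = -1).
```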
(B.8)
To make this selection, the authors considered three conditions. First, the magnitudes should be
comparable to those of the even part set; second, the waveforms of the MICT basis vectors should
resemble those of the DCT; and third, the selected set should be suitable for a fast algorithm.
As referred above, the NICT inherits the ICT fast algorithm. However, for the MICT, the authors had to develop
a new fast algorithm, which is described in more detail in [67].
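Such fast algorithms rely on the even/odd decomposition common to these order-16 designs: even-indexed basis vectors act on sums of mirrored sample pairs, odd-indexed ones on their differences. A minimal numerical sketch follows, where generic matrices E and O stand in for the actual even and odd element sets, which are not reproduced here:

```python
import numpy as np

def order16_butterfly(x, E, O):
    """Even/odd butterfly for an order-16 transform built from an 8x8 even
    part E and an 8x8 odd part O: even-indexed outputs are the order-8
    transform of the mirrored-pair sums, odd-indexed outputs that of the
    mirrored-pair differences."""
    x = np.asarray(x, dtype=float)
    s = x[:8] + x[15:7:-1]   # x[i] + x[15-i], i = 0..7
    d = x[:8] - x[15:7:-1]   # x[i] - x[15-i], i = 0..7
    y = np.empty(16)
    y[0::2] = E @ s          # even part acts on the sums
    y[1::2] = O @ d          # odd part acts on the differences
    return y
```

This structure roughly halves the number of multiplications with respect to a direct 16×16 matrix-vector product.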
B.1.4. Performance Evaluation
To objectively evaluate the proposed order-16 integer transforms, they were integrated into the H.264/AVC
reference software, more specifically using the H.264/AVC High Profile. The tests were performed under the
conditions listed in Table B.2.
Table B.2 – Test conditions for the NICT and MICT [67].
Platform JM11
Sequence structure IBBPBBP…
Intra frame period 0.5 s
Entropy coding Arithmetic coding
Fast motion estimation On
Deblocking filter On
R-D optimization On
Quantization Parameter Fixed (20, 24, 28, 32)
Rate control Off
Reference frame 5
Search range ±32
Frame number 60
The experimental results shown in Table B.3 allow analyzing the performance gains of the two proposed
transforms in comparison with H.264/AVC when using both the 2-D order-4 and order-8 ICTs. The
improvements are measured in terms of PSNR gain for the same bitrate or in terms of bitrate saving for the same
quality (the same PSNR).
Table B.3 – Experimental results of the proposed NICT and MICT versus H.264/AVC [67].
For the NICT, the performance gain is, on average, more than 0.2 dB. For all tested sequences, the improvement
is larger than 0.1 dB and the maximum gain, up to 0.48 dB, is achieved for the sequence "Riverbed".
This sequence has smooth textures and global motion, so the prediction errors have low energy, resulting in low
amplitude coefficients when using an order-16 transform.
For the MICT, the gains are smaller than for the NICT, on average around 0.06 dB. This transform does not
outperform the NICT for any HD sequence, but in some cases it comes very close, e.g. Crew, Pedestrian,
RushHour and Sunflower.
For both cases, the usage of a 2-D order-4 ICT does not bring noticeable performance gains or losses. This is
confirmed in Figure B.1, where the percentage of macroblocks coded with each transform is shown for four
cases: (a) using order-8 ICT and order-16 NICT; (b) using order-8 ICT and order-16 MICT; (c) using order-4
and order-8 ICTs and order-16 NICT; (d) using order-4 and order-8 ICTs and order-16 MICT.
Figure B.1 – Proportion of different block size transforms for the HD sequences City, Crew, Station and
Sunflower [67].
As expected, the percentage of macroblocks using the 2-D order-4 ICT is very small for both the NICT (c) and
the MICT (d). Thus, the authors proposed a variable block size scheme that does not include an order-4
transform for HD video content. The importance of order-16 transforms for HD video coding is well shown in
Figure B.1 where, on average, more than half of the macroblocks are coded using this type of transform. For
some HD sequences (e.g. Sunflower and Station), order-16 transforms are used in up to 80% of the macroblocks.
Figure B.1 also shows that, as the bitrate increases (i.e., as the QP decreases), the order-16 transforms are used
less often. This is because at high bitrates more high frequency coefficients are transmitted, resulting in larger runs
for entropy encoding due to the large block size of the order-16 transforms. Thus, in these cases, the order-8 ICT is
more likely to be selected.
To evaluate the subjective improvements associated with the proposed transforms, the authors used two cropped
images (150×150 pixels) from two video sequences: City (720p) and Station (1080p). The tests were made with
a QP of 32, and the experimental results shown in Figure B.2 indicate for each image the number of bits and the
associated PSNR.
Figure B.2 – Images cropped from City and Station using (a) and (d) H.264/AVC, (b) and (e) H.264/AVC with
additional 2-D order-16 NICT and (c) and (f) H.264/AVC with additional 2-D order-16 MICT [67].
For low bitrates (QP = 32), the usage of the order-16 transforms allows the details to be better preserved, providing
better visual quality. This is noticeable in the vertical edges of the buildings in the sequence City and the
horizontal edges of the railway sleepers in the sequence Station. For higher bitrates, the quality achieved without
the order-16 transforms is already good enough; thus, their usage does not bring any noticeable improvements.
B.1.5. Summary
This solution proposes the usage of order-16 transforms to better exploit the spatial correlation in HD videos,
which tend to have more spatial redundancy than lower resolution videos. To this end, two order-16 integer
transforms are proposed: a non-orthogonal ICT and a modified ICT. Both allow a freer selection of the
transform matrix elements, with this selection made with a compression performance versus complexity
trade-off in mind.
The developed transforms have been integrated in the H.264/AVC standard, along with the order-4 and order-8
ICTs. This variable block size scheme is later reduced by removing the order-4 ICT, since it is shown not to be very
useful for HD video. The experimental results show that both 2-D order-16 integer transforms can improve the
current H.264/AVC coding efficiency, particularly for HD video coding.
B.2. Directional Discrete Cosine Transforms
The second novel transform solution to be reviewed in this section is based on one of the directional transform
approaches mentioned in Section 2.1.5 and was proposed by Zeng and Fu in 2008 [69]. This transform uses a
directional DCT to provide a better compression performance for image blocks containing directional edges.
B.2.1. Objectives
The main objective of this novel transform is to better exploit the spatial correlation within each block,
particularly when the block contains directional edges other than horizontal and vertical ones. Currently, the 2-D
DCT (or ICT in the H.264/AVC case) used in most image and video coding standards only exploits the
correlation along the vertical and horizontal directions, by performing two separable 1-D transforms.
This is useful since the human eye is highly sensitive to vertical and horizontal edges and many image blocks do
contain these types of edges. However, many images have other directional edges, whose spatial redundancy is
not fully exploited by the currently used non-directional transforms.
B.2.2. Architecture and Walkthrough
The transform introduced in this solution is performed in three steps illustrated in Figure B.3.
Figure B.3 – Transform architecture.
A short walkthrough of the transform is presented next:
1. 1-D Directional DCT – First, a 1-D DCT is performed along the direction of the detected edge; the
DCT coefficients are then arranged into a group of column vectors.
2. 1-D Horizontal DCT – Next, the second 1-D DCT is applied to each row; the resulting coefficients are
then pushed horizontally to the left in order to facilitate the next step.
3. Modified Zigzag Scan – Finally, the DCT coefficients are zigzag scanned to convert them into a 1-D
sequence to be used for run-length based VLC.
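The three steps can be sketched as follows for the diagonal down-left direction, with variable-length 1-D DCTs along the lines; the actual modes in [69] follow the H.264/AVC intra prediction geometry, and the names `dct1` and `directional_dct` are ours:

```python
import numpy as np

def dct1(x):
    """Orthonormal 1-D DCT-II of arbitrary length, needed because the
    directional lines of a block have different lengths."""
    L = len(x)
    n, k = np.arange(L), np.arange(L)[:, None]
    C = np.sqrt(2.0 / L) * np.cos(np.pi * (2 * n + 1) * k / (2 * L))
    C[0] /= np.sqrt(2.0)
    return C @ x

def directional_dct(block):
    N = block.shape[0]
    # Step 1: 1-D DCT along each diagonal down-left line of the block.
    cols = [dct1(np.diagonal(np.fliplr(block), offset=N - 1 - d))
            for d in range(2 * N - 1)]
    # Arrange the coefficients top-aligned and push each row to the left, so
    # that row r gathers all coefficients with index r (row 0: the DCs).
    rows = [np.array([c[r] for c in cols if len(c) > r]) for r in range(N)]
    # Step 2: 1-D horizontal DCT along each (variable-length) row.
    return [dct1(r) for r in rows]
```

The resulting N rows hold exactly N² coefficients, ready to be converted into a 1-D sequence by the modified zigzag scan.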
To better understand this process, consider an 8×8 image block with a vertical-right edge. The illustration of the
three steps above is shown in Figure B.4.
Figure B.4 – (a) 1-D Directional DCT along the vertical-right direction in the first step. (b) 1-D Horizontal DCT
in the second step. (c) Modified Zigzag Scan in the last step [69].
As referred in Section 2.1.5 and illustrated in Figure B.4 (b), the transform performed in the second step is
horizontal since the first row contains all DC coefficients and each of the other rows contains all AC coefficients
with the same index.
B.2.3. Details on the Transform
Taking advantage of the directional intra prediction modes used in H.264/AVC, the 1-D directional DCT
included in this solution uses six directional modes defined in a similar way. These modes are presented in
Figure B.5 for an 8×8 block. It must be noted that Mode 0 (vertical prediction) and Mode 1 (horizontal prediction)
are not defined, since these directions are already exploited by the non-directional 2-D DCT. Naturally, Mode 2
(the DC mode) is not used since it is not a directional mode.
Figure B.5 – Six directional modes similar to those used in H.264/AVC intra prediction for the 8×8 block size
[69].
To perform these six directional transforms, only two sets of basis functions are necessary. In this case, only the
basis functions for the Mode 3 DCT, which is a directional DCT performed along the direction defined by Mode 3,
and the Mode 5 DCT, which is a directional DCT performed along the direction defined by Mode 5, are defined,
besides the basis functions for the non-directional DCT (see Figure B.6). The basis functions for the other
prediction modes may be easily obtained by a symmetric transformation (flipping or transposing) applied to the
Mode 3 and Mode 5 basis functions: Mode 4 can be obtained by flipping Mode 3 either horizontally or vertically;
Mode 6 can be obtained by transposing Mode 5; and Mode 7/8 can be obtained by flipping Mode 5/6, either
horizontally or vertically.
Figure B.6 – Basis function images for the non-directional DCT (Mode 0/1), Mode 3 DCT and Mode 5 DCT for
an 8×8 block size [69].
A directional DCT (chosen from Modes 3-8) cannot be applied directly to the image blocks, because the result
would suffer from the so-called mean weighting defect. This defect is related to the different weighting factors
used in the various transforms applied to a block, which can produce more non-zero AC coefficients than needed.
To solve this problem, this solution proposes the utilization of a DC correction method which comprises two
steps:
1. DC separation
First, the mean value m of a block is computed and quantized like the DC component of the non-
directional 2-D DCT. Then, m is subtracted from the initial block samples. Next, the transforms are
performed, as illustrated in Figure B.3, and the resulting coefficients are pushed horizontally to the left.
2. ΔDC correction
In this step, the DC component is set to zero while all the other coefficients are quantized. Next, in the
inverse transform process, the first IDCT is applied to each row of the coefficient array. Then, a ΔDC
correction term is computed for each column as
(B.9)
taking into account the length of the kth column. The correction term is then subtracted from each
coefficient of the kth column. After the ΔDC correction, the second IDCT is performed on each column and the
results are placed back in the corresponding diagonal down-left line to generate a reconstructed N×N block.
Finally, the quantized mean value is added back to the reconstructed block.
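The DC separation step can be sketched as follows, with a hypothetical scalar quantizer of step `qstep` standing in for the actual DC quantization used in [69]:

```python
import numpy as np

def dc_separation(block, qstep):
    """Quantize the block mean like a DC coefficient, subtract it before the
    directional transforms and add it back after reconstruction."""
    m_q = qstep * np.round(block.mean() / qstep)  # quantized mean
    residual = block - m_q                        # goes through the transforms
    # ... directional transforms, quantization and inverse transforms here ...
    reconstructed = residual + m_q                # mean restored at the decoder
    return m_q, residual, reconstructed
```

Subtracting the quantized (rather than exact) mean ensures that the decoder, which only knows the quantized value, can restore it without drift.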
B.2.4. Performance Evaluation
To assess the performance of the proposed transforms, the authors selected four video sequences: Akiyo,
Foreman, Stefan and Mobile (all in CIF format). The first frames of these video sequences are shown in Figure
B.7. The video sequences were coded with H.263's quantization/VLC while fixing the block size at 8×8.
Figure B.7 – First frames of the selected video sequences [69].
The RD performance results (PSNR versus bit/pixel) comparing the use of a non-directional DCT (called here
Conventional DCT) and the proposed directional DCT are shown in Figure B.8.
Figure B.8 – RD performance for the first frames of Akiyo, Foreman, Mobile and Stefan [69].
The results in Figure B.8 show that only a very marginal RD performance gain has been achieved. This is due to
the fact that most blocks have selected prediction Mode 0 or 1 (for which there is zero gain). In this context, to
better show the contribution of Modes 3-8, the RD performance for these modes is isolated in the charts
presented in Figure B.9, meaning that only the blocks selecting Modes 3-8 are considered.
Figure B.9 – RD performance for the first frames of Akiyo, Foreman, Mobile and Stefan when only the blocks
selecting Modes 3-8 are considered [69].
By isolating the results for Modes 3-8, it is possible to observe a clear gain of the directional DCT over the
non-directional DCT. This gain is more noticeable for the Akiyo and Foreman cases, where it ranges from
about 0.5 dB at high bitrates to about 2 dB at low bitrates.
To analyze the results of the directional DCT for motion-compensated residual frames, the motion vectors
between frames 2 and 3 and between frames 50 and 51 of Foreman and Mobile are generated using a search
window of size ±7×±7. Then, as before, the directional and non-directional DCTs are applied. The experimental
results are presented in Figure B.10 and Figure B.11, with the latter considering only the blocks using Modes 3-8.
Figure B.10 – RD performance for the motion compensated residual frames of Foreman and Mobile [69].
Figure B.11 – RD performance for the motion compensated residual frames of Foreman and Mobile when only
the blocks selecting Modes 3-8 are considered [69].
From the observation of Figure B.10 and Figure B.11, it is clear that RD performance gains are also achieved for
all residual frames. Compared to intra coding, the coding gain becomes even more significant; thus, the
directional transform seems to be even more useful for inter coding.
B.2.5. Summary
This solution proposes a block-based directional DCT which takes into consideration the direction of the block
edges in a digital image. With this directional transform, it is possible to exploit the directional edges existing in
a particular block, beyond the horizontal and vertical directions. This is done using a 1-D transform applied in
the direction of the edge and a second 1-D transform applied in the horizontal direction. In this solution, the
novel directions used are based on the intra prediction modes of H.264/AVC. Experimental results show
that this transform can achieve relevant compression gains compared to non-directional transforms, especially for
images with significant directional information.
B.3. 3-D Spatial and Temporal Transform
The third solution to be reviewed in this section was proposed by Furht et al. in 2003 [70] and it involves a 3-D
transform like those introduced in Section 2.1.4.
B.3.1. Objectives
In a video sequence, besides the spatial correlation within each frame, there is also temporal correlation
between neighboring frames. To exploit this correlation, this novel transform adopts a 3-D DCT. However, for
video sequences with high motion, the performance of a 3-D transform may be highly degraded since there is not
much temporal correlation. To solve this problem, the authors proposed an adaptive cube-size 3-D DCT
technique that dynamically performs motion analysis to adapt accordingly the size of the video cube to
be transformed and compressed.
B.3.2. Architecture and Walkthrough
The architecture of the adaptive 3-D DCT encoder including the proposed 3-D transform is presented in Figure
B.12.
Figure B.12 – Architecture of the adaptive 3-D DCT encoder [70].
A short walkthrough of this architecture is presented next:
1. Motion analyzer – First, the video sequence is analyzed to determine the level of motion. To perform
this analysis, 16×16×8 video cubes are used where the third dimension is time. There are three levels of
motion specified: no motion, low motion and high motion.
2. Selection of the cube size – Based on the determined level of motion, the adequate cube size is
selected. For high motion video, the spatial size of the cubes is reduced to prevent the degradation of
the image quality. Naturally, this operation also leads to a lower compression rate for a target quality.
3. Forward 3-D DCT – Next, the 3-D DCT is applied to the selected video cube. This transform is
described with more detail in the next section.
4. Quantization – The coefficients are then quantized to exploit the visual irrelevancy. The quantization
step depends on the type of motion, i.e., for high motion cubes the quantization step is lower than for
low motion cubes.
5. Huffman encoding – Finally, the resulting quantized coefficients are entropy encoded using a lossless
variable-length Huffman coding algorithm.
For the 3-D DCT decoder, the encoder steps are performed in the reverse order, except for the motion analysis.
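The forward 3-D DCT in step 3 is separable; a minimal sketch, assuming an orthonormal 1-D DCT applied along each axis in turn (the exact normalization used in [70] is not reproduced here):

```python
import numpy as np

def dct_matrix(L):
    """Orthonormal 1-D DCT-II matrix of size L x L."""
    n, k = np.arange(L), np.arange(L)[:, None]
    C = np.sqrt(2.0 / L) * np.cos(np.pi * (2 * n + 1) * k / (2 * L))
    C[0] /= np.sqrt(2.0)
    return C

def dct3d(cube):
    """Separable 3-D DCT of an Nc x Nr x Nf video cube: a 1-D DCT is applied
    successively along the two spatial axes and the temporal axis."""
    out = cube.astype(float)
    for axis in range(3):
        C = dct_matrix(out.shape[axis])
        out = np.apply_along_axis(lambda v: C @ v, axis, out)
    return out
```

Separability keeps the cost at three passes of 1-D transforms instead of a full Nc·Nr·Nf-point transform.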
B.3.3. Details on the Transform
There are two main tools introduced in this 3-D transform based video coding solution, which are now presented
with more detail:
Forward 3-D DCT – As noted in Section 2.1.4, to perform a 3-D transform it is necessary to divide the
video data into 3-D video cubes. Considering that Nc×Nr is a block of pixels in a frame and Nf is the
number of successive frames, the video cube has size Nc×Nr×Nf. In this way, the forward 3-D DCT
used in this solution is defined as
(B.10)
where
(B.11)
Motion analysis and selection – As noted before, fixed 16×16×8 video cubes are used for the motion
analysis. To determine the level of motion for each 16×16×8 video cube, the Normalized Pixel
Difference (NPD) between the first and the eighth frame is computed as
NPD = (1/N) · Σi |X(i)1 − X(i)8|    (B.12)
where X(i)1 are pixels from the first frame, X(i)8 are pixels from the eighth frame and N is the total
number of pixels in a 16×16 block (N = 256). The motion levels are then defined as
no motion, if NPD < t1; low motion, if t1 ≤ NPD < t2; high motion, if NPD ≥ t2    (B.13)
where t1 = 5 and t2 = 25. The values of t1 and t2 were selected based on a set of extensive experiments.
The cube sizes used for each motion level are shown in Table B.4 and are explained next:
Table B.4 – Cube Size for each Motion Level.
Motion Level Cube Size
No motion 16×16×1
Low motion 16×16×8
High motion 8×8×8
o No motion – When no motion is detected by the motion analyzer, the 3-D DCT is
applied to a 16×16×1 cube. Basically, this means that a 2-D DCT is applied to the 16×16 block
in the first frame only, since the remaining blocks are very similar. In the decoding process, the
corresponding block in the other seven frames is reconstructed by replication from the
first frame.
o Low motion – If there is low motion detected, the cube size remains unchanged and the 3-D
DCT is applied to a 16×16×8 cube; this allows an improved compression ratio while
maintaining a high quality.
o High motion – When the motion analyzer detects high motion, the cube is subdivided into
8×8×8 cubes and the 3-D DCT is then applied. With this approach, it is possible to achieve a
better quality versus rate trade-off.
As noted in [70], another motion level could be included for cubes with even higher motion than defined by the
t2 threshold. With this additional level, these higher motion cubes could use a 4×4×8 3-D DCT.
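The motion analysis and cube-size selection can be sketched as follows; the NPD is taken here as the mean absolute difference between the two frames, consistent with the thresholds above, and all names are ours:

```python
import numpy as np

T1, T2 = 5, 25                        # thresholds from [70]
CUBE_SIZE = {"no motion": (16, 16, 1),
             "low motion": (16, 16, 8),
             "high motion": (8, 8, 8)}

def motion_level(cube):
    """Classify a 16x16x8 cube (height x width x time) from the NPD between
    its first and eighth frames."""
    npd = np.abs(cube[:, :, 0].astype(float) - cube[:, :, 7].astype(float)).mean()
    if npd < T1:
        return "no motion"
    return "low motion" if npd < T2 else "high motion"

def select_cube_size(cube):
    """Map the detected motion level to the cube size of Table B.4."""
    return CUBE_SIZE[motion_level(cube)]
```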
B.3.4. Performance Evaluation
To evaluate the performance of the proposed adaptive 3-D DCT, the novel transform solution was applied to
two video sequences: Security, which is a low motion video sequence, and Football, which is a high motion
video sequence. The performance was assessed using the compression ratio, the number of bits/pixel and the
PSNR. The quantization tables (QT) were created using the following formula:
(B.14)
where Q(i,j) are the so-called quantization coefficients and quality specifies the quality factor. The recommended
range for the quality factor is from 1 to 25, with 1 corresponding to the best quality.
The results achieved when applying the adaptive 3-D DCT to eight frames of the sequence Security are
presented in Table B.5 while Figure B.13 shows some example decoded frames.
Table B.5 – Adaptive 3-D DCT applied to Security sequence [70].
Figure B.13 – First frame for the Security sequence: (a) original, (b) quality=5, (c) quality=10 and (d)
quality=20 [70].
The experimental results show that the proposed adaptive 3-D DCT can provide better compression (i.e., a lower
number of bits/pixel) while maintaining a good video quality. For a quality factor of 20, the video quality suffers
from the large quantization steps, showing some visible artifacts.
Next, for the video sequence Football, the authors also assessed the performance of the non-adaptive 3-D DCT
using 8×8×8 video cubes (besides the adaptive 3-D DCT). This video sequence has 56 frames, and the motion
analyzer detected 1116 (40%) video cubes with high motion, 875 (32%) with low motion and 781 (28%) with no
motion. The results of these experiments are shown in Table B.6 and Table B.7.
Table B.6 – Adaptive 3-D DCT applied to Football sequence [70].
Table B.7 – Non-adaptive 3-D DCT applied to Football sequence [70].
In comparison to the non-adaptive 3-D DCT, the adaptive 3-D DCT shows a superior performance, achieving
higher compression ratios while maintaining a similar quality. In the last row of Table B.6, different quality
factors are used for different motion levels: for high motion cubes the quality factor is 5, while for low and no
motion cubes it is 10. This selective quantization approach results in a similar distortion/quality
compared to the usage of a fixed quality factor of 5 while the compression ratio is much higher.
B.3.5. Summary
The solution presented in this section uses an adaptive 3-D DCT technique for video compression. This means
that the size of the 3-D transform is variable, depending on the level of motion in each particular video sequence.
For low motion sequences, this approach can obtain compression ratios from 1:300 to 1:400 while still
maintaining a relatively good video quality. This may be useful for low motion applications, such as
videotelephony, videoconferencing, surveillance, etc. Even for higher motion video sequences, this solution can
achieve compression ratios in the range of 80-150 while providing a high quality. This may be useful for
applications such as digital TV and HDTV.
As referred during this review, this solution can be further improved with the addition of more motion levels and
the consequent extension of the set of video cube sizes, using smaller sizes for higher motion cubes. It can also be
improved by using adaptive quantization tables depending on the motion level.
B.4. Multi-Dimensional Spatial Transform
The last solution to be reviewed in this section was developed by Choi et al. in 2008 [71] and proposes a so-
called Multi-Dimensional Transform (MDT).
B.4.1. Objectives
The main objective of the new transform reviewed here is to better exploit the spatial redundancy between
neighboring blocks in a video sequence. This is done by means of a novel MDT tool which exploits the correlation
between neighboring blocks, besides the correlation within blocks. This can greatly improve the compression
performance in comparison to the current state-of-the-art in video coding, the H.264/AVC standard. As referred
in Section A.8, H.264/AVC uses a 4×4 ICT. This locally-adaptive approach is useful to provide high temporal
prediction efficiency; however, because of the small block size used, the spatial redundancy reduction is limited.
With the MDT proposed in this solution, the authors target further exploiting the spatial redundancy while
maintaining the H.264/AVC temporal redundancy reduction capacity.
B.4.2. Architecture and Walkthrough
The developed MDT may have three (3DT) or four (4DT) dimensions. There are two types of 3DT: horizontal
direction 3DT (H3DT) and vertical direction 3DT (V3DT). The H3DT is applied to 16×8 sub-macroblocks and
the V3DT is applied to 8×16 sub-macroblocks. The 3DT block diagram is presented in Figure B.14.
Figure B.14 – Block diagrams for (a) H3DT and (b) V3DT [71].
Next, a short walkthrough of the H3DT process is presented; for the V3DT, the walkthrough is similar with the
exception of the direction.
1. Block rearrangement – After each 4×4 block is transformed using a 2-D transform (like in
H.264/AVC), the resulting coefficients are grouped in sixteen 4×1 arrays including the coefficients in
the same position of each of the four blocks.
2. 1-D transform – Next, these arrays are transformed using a 1-D transform.
3. Block reconstruction – Finally, sixteen coefficients corresponding to the same position among sixteen
4×1 blocks are collected in a 4×4 block.
For each sub-macroblock, this process is performed twice. The 4DT is performed on 16×16 macroblocks. The
4DT block diagram is presented in Figure B.15.
Figure B.15 – Block diagrams for the 4DT [71].
A short walkthrough of the 4DT process is presented next:
1. Block rearrangement – After performing a 2-D transform over all sixteen 4×4 blocks (like in
H.264/AVC), the resulting coefficients corresponding to the same spatial frequency are arranged in
sixteen 4×4 blocks.
2. 2-D transform – Next, a 2-D transform is performed over each 4×4 transform coefficient block.
Thus, both 3DT and 4DT produce 4×4 coefficients for each coefficient position among the sixteen 4×4 blocks.
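The 4DT steps can be sketched as follows, with an orthonormal 4×4 DCT standing in for the H.264/AVC integer core transform and its post-scaling:

```python
import numpy as np

def four_dt(mb):
    """Sketch of the 4DT on a 16x16 macroblock: sixteen 4x4 2-D transforms,
    regrouping of same-frequency coefficients, then a second 2-D transform
    over each regrouped 4x4 block."""
    n, k = np.arange(4), np.arange(4)[:, None]
    C = np.sqrt(0.5) * np.cos(np.pi * (2 * n + 1) * k / 8)   # 4x4 DCT matrix
    C[0] /= np.sqrt(2.0)
    t = mb.astype(float).reshape(4, 4, 4, 4)  # axes: (block row, i, block col, j)
    # Step 1: 2-D transform of each of the sixteen 4x4 blocks.
    step1 = np.einsum('ui,vj,aibj->aubv', C, C, t)
    # Step 2: gather the sixteen coefficients sharing frequency (u, v) into a
    # 4x4 block indexed by block position (a, b) and transform it again.
    return np.einsum('pa,qb,aubv->uvpq', C, C, step1)        # out[u, v]: 4x4
```

Here `out[u, v]` holds the 4×4 second-stage coefficients for spatial frequency (u, v), matching the description above of 4×4 coefficients produced for each coefficient position among the sixteen blocks.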
B.4.3. Details on the Transform
The proposed MDT is an integer-based transform. To implement the integer calculation of the transform, the
MDT can be divided into a core-transform part and a post-scaling part. For the 3DT, the core-transform part
consists of an H.264/AVC 2-D ICT and an additional 1-D ICT. The post-scaling part is separated into a 2-D post-
scaling and a 1-D post-scaling (see Figure B.16).
Figure B.16 – Core-transform and Scaling parts for the 3DT [71].
To better understand the process by which the 3DT and the 4DT are obtained, the previously studied 2-D DCT
and ICT are defined again. The 4×4 forward DCT can be computed as
(B.15)
where
(B.16)
The 4×4 forward ICT in H.264/AVC can be computed as
Y = CXCT ⊗ Ef    (B.17)
where CXCT is the 2-D core-transform (the product of the core matrix C, the input block X and the transpose of
C), Ef is a matrix of scaling factors and the symbol ⊗ indicates that each element of CXCT is multiplied by the
scaling factor in the same position in matrix Ef. In this way, the 3DT can be represented by simply adding both
the 1-D core-transform and the 1-D post-scaling,
(B.18)
where W is the matrix resulting from CXCT. Considering RT the matrix computed after the 2-D and 1-D core-
transforms, the 3DT can be represented by
(B.19)
where the scaling process for the 3DT can be represented by
(B.20)
In this solution, the authors have also designed an MDT quantizer. Considering that Zij is the quantized value of a
3DT coefficient, it is defined by
(B.21)
where Qstep is the quantization step size. Considering now the output of the combined transform and quantization
module, it can be expressed as
(B.22)
where
(B.23)
(B.24)
(B.25)
For the 4DT, the derivation process corresponds to the simple expansion of the 3DT process.
B.4.4. Performance Evaluation
To assess the performance of the proposed MDT, four video sequences have been used: Foreman, Harbour,
Carphone and Container. All these sequences consist of 300 frames, at CIF resolution, encoded at 30 frames/s.
The MDT is integrated in H.264/AVC and its baseline profile is used to code each sequence; only the first frame
is intra coded. In terms of quantization, five different quantization parameters (QP) were used: 32, 35, 38, 42 and
45. These experiments were also made using the H.264/AVC transform and quantizer. The results of these
experiments are shown in Figure B.17.
Figure B.17 – RD performance of the MDT versus the H.264/AVC 4×4 transform [71].
Figure B.17 shows that the usage of the MDT proposed in this solution brings a clear performance gain in
comparison to the 4×4 ICT and the quantization process used in H.264/AVC. With the MDT, it is possible to
achieve a quality improvement of 1-2 dB for QP above 24.
B.4.5. Summary
In this solution, the authors propose an MDT with high energy compaction capabilities. The MDT considers three
modes which are used depending on the block size: an H3DT for 16×8 sub-macroblocks, a V3DT for 8×16
sub-macroblocks and a 4DT for 16×16 macroblocks. The experimental results show that this transform can provide
better compression efficiency than H.264/AVC, thus offering more quality for the same bitrate.
References
[1] Mobile phone image: http://larryfire.wordpress.com/2009/01/19/youtube-offering-ipod-ready-video-
downloads/.
[2] Personal computer image: http://www.bell.ca/shopping/PrsShpTv_TV_online.page.
[3] LCD TV image: http://www.123brackets.co.uk/blog/2008/11/29/disposing-of-an-old-or-broken-plasma-or-
lcd-tv/.
[4] Ultra-high definition TV image: http://www.hdtvinfo.eu/news/hdtv-articles/82-inch-ultra-hd-lcd-tv-from-
samsung.html.
[5] Compression artifact: http://en.wikipedia.org/wiki/Compression_artifact.
[6] A.N. Netravali and B.G. Haskell, Digital Pictures: Representation, Compression and Standards, 2nd ed.
New York: Plenum Press, 1995.
[7] R. Westwater and B. Furht, Real-Time Video Compression - Techniques and Algorithms. Norwell, United
States of America: Kluwer Academic Publishers, 1997.
[8] Principal component analysis: http://en.wikipedia.org/wiki/Principal_component_analysis.
[9] Temics: Aurélie Martin: http://www.irisa.fr/temics/staff/martin/.
[10] Fast Fourier transform: http://en.wikipedia.org/wiki/Fast_Fourier_transform.
[11] Discrete cosine transform: http://en.wikipedia.org/wiki/Discrete_cosine_transform.
[12] Fast Hadamard transform: http://en.wikipedia.org/wiki/Fast_Hadamard_transform.
[13] Discrete wavelet transform: http://en.wikipedia.org/wiki/Discrete_wavelet_transform.
[14] F. Pereira. Digital Image Compression:
http://www.img.lx.it.pt/~fp/cav/ano2009_2010/Slides%202010/CAV_5_Digital_Pictures_2010_Web.pdf.
[15] M. Biswas, M.R. Pickering, and M.R. Frater, "Improved H.264-Based Video Coding Using an Adaptive
Transform," in Proceedings of 2010 IEEE 17th International Conference on Image Processing, Hong
Kong, September 2010, pp. 165-168.
[16] P. Waldemar, S.O. Aase, and J.H. Husoy, "A Critique of SVD-based Image Coding Systems," in
Proceedings of the 1999 IEEE International Symposium on Circuits and Systems, Orlando, FL, USA, July
1999, pp. 13-16.
[17] F. Pereira. Advanced Multimedia Coding:
http://www.img.lx.it.pt/~fp/cav/ano2009_2010/Slides%202010/CAV_9_Advanced_Compression_2010_W
eb.pdf.
[18] M.R. Pickering, Optimum Basis Function Estimation for Inter-frame Prediction Errors, 2010, Internal
document.
[19] H.264/AVC Software Coordination: http://iphome.hhi.de/suehring/tml/.
[20] HEVC software coordination: http://hevc.kw.bbc.co.uk/trac/browser/tags/0.9.
[21] JCT-VC, "Draft call for proposal on High-Performance Video Coding (HVC)," in Doc. N11113, Kyoto, JP,
January 2010.
[22] JCT-VC, "Suggestion for a Test Model," in JCTVC-A033r1, 1st meeting, Dresden, Germany, April 2010.
[23] M. Naccari and F. Pereira, "Integrating a Spatial Just Noticeable Distortion Model in the Under
Development HEVC Codec," in International Conference on Acoustics, Speech and Signal Processing,
Prague, Czech Republic, May 2011.
[24] M. Naccari, Recent Advances on High Efficiency Video Coding (HEVC), 2010, Internal document.
[25] W.H. Chen, C. Smith, and S. Fralick, "A Fast Computational Algorithm for the Discrete Cosine
Transform," IEEE Transactions on Communications, vol. 25, no. 9, pp. 1004-1009, September 1977.
[26] MATLAB: http://www.mathworks.com/products/matlab/.
[27] Discrete cosine transform matrix: http://www.mathworks.com/help/toolbox/images/ref/dctmtx.html.
[28] Rotation matrix: http://en.wikipedia.org/wiki/Rotation_matrix.
[29] Sine function: http://www.mathworks.com/help/techdoc/ref/sin.html.
[30] Cosine function: http://www.mathworks.com/help/techdoc/ref/cos.html.
[31] Reshape function: http://www.mathworks.com/help/techdoc/ref/reshape.html.
[32] Eigenvalues and eigenvectors function: http://www.mathworks.com/help/techdoc/ref/eig.html.
[33] Quantization (signal processing): http://en.wikipedia.org/wiki/Quantization_(signal_processing).
[34] 4x4 Transform and Quantization in H.264/AVC: http://www.vcodex.com/h264transform4x4.html.
[35] F. Pereira. Digital Image Compression:
http://amalia.img.lx.it.pt/~fp/cav/ano2010_2011/Slides%202011/CAV_5_Digital_Pictures_2011_Web.pdf.
[36] LZ77 and LZ78: http://en.wikipedia.org/wiki/LZ77_and_LZ78.
[37] Data compression LZ77: http://jens.quicknote.de/comp/LZ77-JensMueller.pdf.
[38] JCT-VC, "Common Test Conditions and Software Reference Configurations," in JCTVC-B300, 2nd
meeting, Geneva, Switzerland, July 2010.
[39] Peak signal-to-noise ratio: http://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio.
[40] G. Bjontegaard, "Calculation of the Average PSNR Differences Between RD-curves," Doc. VCEG-M33, 13th VCEG Meeting, Austin, TX, USA, April 2001.
[41] G. Valenzise. Bjontegaard metric: http://home.dei.polimi.it/valenzise/software.htm.
[42] ITU-T, Recommendation T.81, "Digital Compression and Coding of Continuous-Tone Still Images," 1992.
[43] JPEG: http://en.wikipedia.org/wiki/JPEG.
[44] M.W. Marcellin, M.J. Gormish, A. Bilgin, and M.P. Boliek, "An Overview of JPEG-2000," in Proceedings of the IEEE Data Compression Conference, pp. 523-541, 2000.
[45] A.N. Skodras, C.A. Christopoulos, and T. Ebrahimi, "JPEG2000: The Upcoming Still Image Compression Standard," Pattern Recognition Letters (Elsevier), vol. 22, 2001.
[46] JPEG 2000: http://en.wikipedia.org/wiki/JPEG_2000.
[47] C. Christopoulos, A. Skodras, and T. Ebrahimi, "The JPEG2000 Still Image Coding System: An
Overview," IEEE Transactions on Consumer Electronics, vol. 46, no. 4, pp. 1103-1127, November 2000.
[48] A. Skodras, C. Christopoulos, and T. Ebrahimi, "The JPEG 2000 Still Image Compression Standard," IEEE Signal Processing Magazine, pp. 36-58, September 2001.
[49] M. Liou, "Overview of the p×64 kbit/s Video Coding Standard," Communications of the ACM, vol. 34, no. 4, pp. 59-63, April 1991.
[50] M. Handley, H.261 Video: http://www.cs.ucl.ac.uk/teaching/GZ05/08-h261.pdf.
[51] H.261 Video Coding: http://www-mobile.ecs.soton.ac.uk/peter/h261/h261.html.
[52] MPEG-1: http://en.wikipedia.org/wiki/MPEG-1.
[53] MPEG-1: http://www.cs.ucf.edu/courses/cap6411/MPEG-1.PDF.
[54] F. Pereira. Digital Video Storage:
http://www.img.lx.it.pt/~fp/cav/ano2009_2010/Slides%202010/CAV_7_AV_Storage_2010_Web.pdf.
[55] T. von Roden, "H.261 and MPEG1 - A Comparison," in Proceedings of the Fifteenth Annual IEEE International Phoenix Conference on Computers and Communications, pp. 65-71, March 1996.
[56] MPEG-2 Part 2: http://en.wikipedia.org/wiki/H.262/MPEG-2_Part_2.
[57] S. Liu, "Performance Comparison of MPEG1 and MPEG2 Video Compression Standards," in Proceedings of IEEE COMPCON, pp. 199-203, 1996.
[58] L. Maki, Video Compression Standards:
http://www.cctvone.com/pdf/FAQ/Video%20Compression%20Standards%20Journal.pdf.
[59] ITU-T, Recommendation H.263, "Video Coding for Low Bit Rate Communication," 1996.
[60] Brogent Technologies Inc.: http://www.brogent.com/brogentENG/eng/tech/video.htm.
[61] B. Girod, E. Steinbach, and N. Färber, "Comparison of the H.263 and H.261 Video Compression
Standards," in Standards and Common Interfaces for Video Information Systems, 1995.
[62] F. Pereira and T. Ebrahimi, Eds., The MPEG-4 Book.: Prentice Hall, 2002.
[63] K. Panusopone and A. Luthra, "Performance Comparison of MPEG-4 and H.263+ for Streaming Video
Applications," Circuits Systems Signal Processing, vol. 20, no. 3, pp. 293-309, 2001.
[64] T. Wiegand, G.J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC Video Coding
Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576,
July 2003.
[65] M.H. Pinson, S. Wolf, and G. Cermak, "HDTV Subjective Quality of H.264 vs. MPEG-2, with and without
Packet Loss," IEEE Transactions on Broadcasting, vol. 56, no. 1, pp. 86-91, March 2010.
[66] N. Kamaci and Y. Altunbasak, "Performance Comparison of the Emerging H.264 Video Coding Standard,"
IEEE International Conference on Multimedia and Expo (ICME), pp. 6-9, 2003.
[67] J. Dong, K.N. Ngan, C.K. Fong, and W.K. Cham, "2-D Order-16 Integer Transforms for HD Video
Coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 10, pp. 1462-
1474, October 2009.
[68] S. Ma and C.-C. Kuo, "High-definition Video Coding with Super-macroblocks," Proceedings SPIE Visual
Communications and Image Processing, vol. 6508, no. 16, pp. 1-12, January 2007.
[69] B. Zeng and J. Fu, "Directional Discrete Cosine Transforms - A New Framework for Image Coding," IEEE
Transactions on Circuits and Systems for Video Technology, vol. 18, no. 3, pp. 305-313, March 2008.
[70] B. Furht, K. Gustafson, H. Huang, and O. Marques, "An Adaptive Three-Dimensional DCT Compression
Based on Motion Analysis," Proceedings of the ACM Symposium on Applied Computing, pp. 765-768,
2003.
[71] W.J. Choi, S.Y. Jeon, C.B. Ahn, and S.J. Oh, "A Multi-Dimensional Transform for Future Video Coding," in The 23rd International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC), pp. 1601-1604, July 2008.