Advances on Transforms for High Efficiency Video Coding
Miguel Lobato de Faria Pereira Capelo
Dissertation submitted for obtaining the degree of
Master in Electrical and Computer Engineering
Jury
President: Prof. José Bioucas Dias
Supervisor: Prof. Fernando Pereira
Co-Supervisor: Dr. Matteo Naccari
Members: Prof. Luís Ducla Soares
April 2011
Acknowledgments
First, I would like to thank Prof. Fernando Pereira for giving me this opportunity and for supervising my Thesis. The constant availability he showed in addressing my questions and the great amount of time he spent helping me improve this work were essential to its conclusion. His effective working methodology and organization really helped raise my working standards and will serve as a reference for the rest of my life.
I would also like to express my gratitude to Dr. Matteo Naccari for sharing his vast technical knowledge and experience. I thank him for always showing interest in my work, providing new input and valuable advice, even when that meant sacrificing his own working schedule.
A special thanks to all the Image Group members for providing such a great work environment and for always
being available to help.
I would also like to thank my Mother and my Father for giving me every possible condition to reach this stage and for always trusting my decisions. A special word to my Brother for challenging me, by his example, to become a better person. I would also like to show my gratitude to all my family for their support and motivation.
Finally, I would like to thank all my friends, especially those who helped me get this far in my academic life and those who kept me motivated and high-spirited during this period. A final word of thanks to my friend André Martins for his companionship over the last months of work.
Abstract
Nowadays, digital video is ubiquitous in multimedia applications. Digital video coding plays a key role in this phenomenon, as it provides the data compression necessary to transmit and store digital video content over the currently available networks and storage media. However, with the increasing presence of high and ultra high definition video content resulting from the continuous advances in video capturing and display technologies, the current state-of-the-art video coding standard, H.264/AVC, does not seem to provide the compression ratios required for transmission and storage with the currently available resources. This has created the need for new video coding tools that can provide further compression efficiency beyond the H.264/AVC state-of-the-art. As an answer to these needs, the ITU-T VCEG and ISO/IEC MPEG standardization bodies have started a new video coding standardization project, called High Efficiency Video Coding (HEVC), targeting a 50% reduction in coding rate for the same quality.
In this context, this Thesis focuses on the study, implementation and assessment of a novel coding technique related to the important transform coding module, which is present in all predictive video coding architectures. With this objective in mind, the state-of-the-art on transform coding is reviewed and the adopted transform coding technique is presented. Since the adopted transform coding technique is intended for integration in the emerging HEVC standard, the new coding tools introduced by this video coding standard are also studied. Finally, a video coding solution combining the adopted transform coding technique with the HEVC framework is developed, implemented and evaluated.
The performance results obtained with the adopted transform coding technique are encouraging in terms of bitrate savings and quality gains when compared to the usual DCT, particularly for high definition video content.
The main innovations presented in this Thesis are the integration of the adopted transform coding technique into the HEVC standard and its performance evaluation for high definition video content.
Keywords – Transform coding, discrete cosine transform, Karhunen-Loève transform, adaptive transform, High
Efficiency Video Coding standard.
Resumo
Nowadays, we are witnessing the widespread use of digital video in many multimedia applications. Digital video coding plays a central role in this phenomenon, enabling the transmission and storage of this type of data through its efficient compression. However, with the growing presence of high and ultra high definition video content resulting from the continuous advances in video capture and display technologies, the current state-of-the-art video coding standard, H.264/AVC, does not seem able to reach the compression factors needed to transmit and store this type of content with the transmission and storage resources available today. In this context, there is a need to develop new video coding tools that can increase the compression factors currently achieved with the H.264/AVC standard. In response to this need, ITU-T VCEG and ISO/IEC MPEG have started a new project to develop a new video coding standard, called High Efficiency Video Coding (HEVC), with the goal of achieving bitrate reductions of 50% for the same quality.
In this context, the work developed in this Thesis concerns the design, implementation and evaluation of a new coding technique for the transform coding module that is essential in predictive video coding architectures. With this objective in mind, the state-of-the-art on transform coding is reviewed and the adopted coding technique is presented. Since this technique is intended to be combined with the emerging HEVC standard, the new coding tools introduced by this video coding standard are also studied. Finally, a video coding solution using the adopted transform coding technique in the context of the HEVC standard is designed, implemented and evaluated.
The performance tests carried out with this coding technique reveal encouraging results in terms of bitrate savings and quality gains when compared to the commonly used DCT. This is especially true for high definition video content.
The main innovations presented in this Thesis concern the combination of the adopted transform coding technique with the HEVC standard and the performance evaluation carried out for high definition video content.
Keywords – Transform coding, discrete cosine transform, Karhunen-Loève transform, adaptive transform, High Efficiency Video Coding standard.
Table of Contents
Chapter 1 – Introduction
1.1. Context and Emerging Problem
1.2. Objectives
1.3. Thesis Structure
Chapter 2 – Reviewing the State-of-the-Art on Transform Coding
2.1. Basics on Transform Coding
2.1.1. Unitary Transforms
2.1.2. One-Dimensional Transforms
2.1.3. Two-Dimensional Transforms
2.1.4. Three-Dimensional Transforms
2.1.5. Directional Transforms
2.2. Most Important Transforms
2.2.1. Karhunen-Loève Transform
2.2.2. Discrete Fourier Transform
2.2.3. Discrete Cosine Transform
2.2.4. Walsh-Hadamard Transform
2.2.5. Discrete Wavelet Transform
2.3. Final Remarks
Chapter 3 – Main Background Technologies: Adaptive Transform and Early HEVC
3.1. An Adaptive Transform for Improved H.264/AVC-Based Video Coding
3.1.1. Objectives
3.1.2. Architecture and Walkthrough
3.1.3. Details on the Adaptive Transform
3.1.4. Performance Evaluation
3.1.5. Summary
3.2. Introduction to the High Efficiency Video Coding Standard
3.2.1. Objectives
3.2.2. Technical Approach
3.2.3. Transform and Quantization
3.2.4. Summary
3.3. Final Remarks
Chapter 4 – Adopted Coding Solution Functional Description and Implementation Details
4.1. Objectives
4.2. Architecture and Walkthrough
4.3. HEVC Framework Functional Description and Implementation Details
4.4. AT Encoder Functional Description and Implementation Details
4.4.1. Reference Frame Upsampling
4.4.2. Frame Partitioning
4.4.3. Motion Compensation Prediction Block Computation
4.4.4. Forward Adaptive Transform
4.4.5. Quantization
4.4.6. Entropy Encoder
4.5. AT Decoder Functional Description and Implementation Details
4.5.1. Entropy Decoder
4.5.2. Inverse Quantization
4.5.3. Inverse Adaptive Transform
4.5.4. Frame Reconstruction
4.6. Summary
Chapter 5 – Performance Evaluation
5.1. Test Conditions
5.1.1. Video Sequences
5.1.2. Coding Conditions
5.1.3. Performance Evaluation Metrics
5.1.4. Coding Benchmarks
5.2. Results and Analysis
5.2.1. Performance for CIF Resolution Video Sequences
5.2.2. Performance for HD Resolution Video Sequences
5.3. Summary
Chapter 6 – Conclusion
6.1. Summary and Conclusions
6.2. Future Work
Appendix A – Transforms in Available Image/Video Coding Standards
Appendix B – Recent Advances on Transform Coding
References
Index of Figures
Figure 1.1 – Digital video on a mobile phone, on a computer and on a television set [1,2,3].
Figure 1.2 – Ultra high definition television set [4].
Figure 2.1 – Typical transform-based image coding architecture.
Figure 2.2 – Example of block artifacts in a highly compressed image [5].
Figure 2.3 – 8×8×8 video cube [7].
Figure 2.4 – Example of image block with diagonal edges.
Figure 2.5 – Samples rearrangement for a diagonal down-left edge.
Figure 2.6 – 8×8 DFT basis functions [9].
Figure 2.7 – 8×8 DCT basis functions [9].
Figure 2.8 – Example of DFT versus DCT reconstruction periodicity effects.
Figure 2.9 – Analysis filter architecture [13].
Figure 2.10 – Example of a three-level 1D-DWT decomposition architecture [13].
Figure 2.11 – Example of a two-level 2D-DWT decomposition [14].
Figure 2.12 – Example of a three-level 2D-DWT decomposition [14].
Figure 3.1 – General architecture of the adaptive transform video coding solution [17].
Figure 3.2 – Forward adaptive transform architecture.
Figure 3.3 – Inverse adaptive transform architecture.
Figure 3.4 – (a) Original block. (b) MCP block. (c) Corresponding prediction error block [15].
Figure 3.5 – (a) Shifted and rotated MCP block (shift: -0.25 pixels vertically; rotation: -0.5°). (b) Difference between the MCP block and the shifted and rotated MCP block [15].
Figure 3.6 – Set of estimated prediction error blocks (shift: -0.5 to 0.5 pixels, horizontally and vertically; rotation: -0.5°) [15].
Figure 3.7 – Covariance matrix for a set of estimated prediction error blocks [18].
Figure 3.8 – Block of covariance values for the pixel in row 3, column 0, with the pixels in all other positions [18].
Figure 3.9 – MKLT basis functions for the example in Figure 3.7 [18].
Figure 3.10 – MKLT and DCT coefficients for the previous example [18].
Figure 3.11 – MKLT and DCT coefficients amplitude versus scan position [18].
Figure 3.12 – RD performance for the H.264 Standard and H.264 AT video coding solutions [15].
Figure 3.13 – Basic HEVC encoder architecture [24].
Figure 3.14 – Illustration of a recursive CTB structure with LCTB size = 128 and maximum hierarchical depth = 5 [22].
Figure 3.15 – Parameters defining the geometric partitioning of a PU [22].
Figure 3.16 – Signal flow graph of Chen's fast factorization for an order-16 DCT [22].
Figure 4.1 – Architecture of the developed coding solution.
Figure 4.2 – Example of (a) PU partitioning and (b) TU partitioning of a 32×32 CTB.
Figure 4.3 – TU depths for the CTB in Figure 4.2 (b).
Figure 4.4 – Coding modes (intra-coding = '0' and inter-coding = '1') for the CTB in Figure 4.2 (b).
Figure 4.5 – (a) Horizontal and (b) vertical motion vector values for the CTB in Figure 4.2 (a).
Figure 4.6 – Half and quarter-pixel motion positions illustration [22].
Figure 4.7 – Upsampled reference frame illustration.
Figure 4.8 – Example of MCP block computation for a 4×4 TU.
Figure 4.9 – MCP block for the example in Figure 4.8 after the downsampling operation.
Figure 4.10 – Architecture of the forward adaptive transform module.
Figure 4.11 – Adopted coordinate system for a 4×4 block.
Figure 4.12 – Rotation of a 4×4 UMCP block by an angle θ around its origin.
Figure 4.13 – Two vectors, v1 and v2, connecting the same point D to two different points, P1 and P2, respectively.
Figure 4.14 – Block positions (blue) converted to the Euclidean space (red) for the block in Figure 4.11.
Figure 4.15 – Shifts applied to a rotated UMCP block with a shift parameter equal to δ for the horizontal and vertical directions.
Figure 4.16 – Set of shifted and rotated UMCP blocks for all possible δ combinations (for each θ).
Figure 4.17 – Architecture of the entropy encoder module.
Figure 4.18 – LZ77 terminology considering the coding of the third character in the input symbol stream.
Figure 4.19 – Architecture of the entropy decoder module.
Figure 4.20 – Architecture of the inverse adaptive transform module.
Figure 5.1 – First frame of the selected CIF video sequences.
Figure 5.2 – First frame of the selected HD video sequence: Kimono sequence.
Figure 5.3 – Container sequence RD performance for the DCT, MKLT HRS and MKLT FRS.
Figure 5.4 – Container sequence RD performance for the DCT, AT HRS, AT FRS and AT HFRS codecs.
Figure 5.5 – Foreman sequence RD performance for the DCT, MKLT HRS and MKLT FRS.
Figure 5.6 – Foreman sequence RD performance for the DCT, AT HRS, AT FRS and AT HFRS.
Figure 5.7 – Mobile sequence RD performance for the DCT, MKLT HRS and MKLT FRS.
Figure 5.8 – Mobile sequence RD performance for the DCT, AT HRS, AT FRS and AT HFRS.
Figure 5.9 – Kimono sequence RD performance for the DCT, MKLT HRS and MKLT FRS.
Figure 5.10 – Kimono sequence RD performance for the DCT, AT HRS, AT FRS and AT HFRS.
Index of Tables
Table 3.1 – Approximated constants for an order-16 DCT [22].
Table 4.1 – 12-tap DCT-based interpolation filter coefficients [22].
Table 4.2 – Reference QPs with the corresponding Qstep [34].
Table 5.1 – Selected QPs and their corresponding Qstep values.
Table 5.2 – Container sequence average PSNR improvements and average bitrate savings for each AT mode against the DCT.
Table 5.3 – Container sequence percentage of inter-coded TUs for each QP and TU block size.
Table 5.4 – Container sequence percentage of TUs coded with the available transforms for each AT codec, QP and TU block size.
Table 5.5 – Foreman sequence average PSNR improvements and average bitrate savings for each AT mode against the DCT.
Table 5.6 – Foreman sequence percentage of inter-coded TUs for each QP and TU block size.
Table 5.7 – Foreman sequence percentage of TUs coded with the available transforms for each AT codec, QP and TU block size.
Table 5.8 – Mobile sequence average PSNR improvements and average bitrate savings for each AT mode against the DCT.
Table 5.9 – Mobile sequence percentage of inter-coded TUs for each QP and TU block size.
Table 5.10 – Mobile sequence percentage of TUs coded with the available transforms for each AT codec, QP and TU block size.
Table 5.11 – Kimono sequence average PSNR improvements and average bitrate savings for each AT mode against the DCT.
Table 5.12 – Kimono sequence percentage of inter-coded TUs for each QP and TU block size.
Table 5.13 – Kimono sequence percentage of TUs coded with the available transforms for each AT codec, QP and TU block size.
List of Acronyms
AT Adaptive Transform
CABAC Context-Adaptive Binary Arithmetic Coding
CAVLC Context-Adaptive Variable-Length Coding
CD Compact Disc
CfP Call for Proposals
CIF Common Intermediate Format
CTB Coding Tree Block
DCT Discrete Cosine Transform
DFT Discrete Fourier Transform
DVD Digital Versatile Disc
DWT Discrete Wavelet Transform
FRExt Fidelity Range Extensions
FRS Full Range shift and rotation parameters Set
GOP Group Of Pictures
HEVC High Efficiency Video Coding
HM HEVC test Model
HRS Half Range shift and rotation parameters Set
HVS Human Visual System
ICT Integer discrete Cosine Transform
ISDN Integrated Services Digital Network
ITU-T International Telecommunication Union – Telecommunication Standardization Sector
JCT-VC Joint Collaborative Team on Video Coding
JPEG Joint Photographic Experts Group
KLT Karhunen-Loève Transform
LCTB Largest Coding Tree Block
LF Loop Filter
MB Macroblock
MCP Motion Compensated Prediction
MDDT Mode-Dependent Directional Transform
MICT Modified Integer discrete Cosine Transform
MKLT Modified Karhunen-Loève Transform
MPEG Moving Picture Experts Group
MV Motion Vector
NICT Non-orthogonal Integer discrete Cosine Transform
PSNR Peak Signal-to-Noise Ratio
PSTN Public Switched Telephone Network
PU Prediction Unit
QCIF Quarter Common Intermediate Format
QVGA Quarter Video Graphics Array
RD Rate-Distortion
ROT Rotational Transform
SCTB Smallest Coding Tree Block
SIF Source Input Format
SVD Singular Value Decomposition
TMuC Test Model under Consideration
TU Transform Unit
UMCP Upsampled Motion Compensated Prediction
VCEG Video Coding Experts Group
VLC Variable-Length Coding
VLI Variable Length Integer
VOP Video Object Plane
WHT Walsh-Hadamard Transform
Chapter 1
Introduction
This first chapter introduces the motivation behind this Thesis. To do this, the relevant context is presented first, followed by the emerging problem that calls for an efficient solution. In this context, the main objectives of this work are defined. Finally, the Thesis structure is described.
1.1. Context and Emerging Problem
Digital video has been a regular presence in our lives for many years now. Whether used in digital television, personal computers, handheld devices or other multimedia applications (see Figure 1.1), its use has grown tremendously in recent years and this growth shows no signs of slowing down.
Figure 1.1 – Digital video on a mobile phone, on a computer and on a television set [1,2,3].
With the currently available transmission and storage capacities, this growth is only possible with the use of powerful compression tools that reduce the number of bits needed to represent the video content; these tools exploit the data correlation to remove redundant data and the limitations of the Human Visual System (HVS) to discard irrelevant data. Such compression tools have been included in several video coding standards defined by the International Telecommunication Union – Telecommunication Standardization Sector (ITU-T) and the Moving Picture Experts Group (MPEG) over the last two decades. Currently, the H.264/AVC coding standard, developed by the Joint Video Team (JVT) formed by the ITU-T Video Coding Experts Group (VCEG) and ISO/IEC MPEG bodies, is considered the state-of-the-art in video coding.
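At the heart of all these standards is a block transform, typically the DCT, that concentrates the energy of a correlated image block into a few coefficients. The sketch below is purely illustrative (it builds a floating-point orthonormal DCT-II from scratch, not the integer transform any standard actually specifies) and shows this energy compaction on a smooth 8×8 block:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)   # frequency index (rows)
    i = np.arange(n).reshape(1, -1)   # sample index (columns)
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] *= 1 / np.sqrt(2)         # DC row scaling for orthonormality
    return m * np.sqrt(2 / n)

def dct2(block):
    """Separable 2D DCT: transform along both dimensions."""
    c = dct_matrix(block.shape[0])
    return c @ block @ c.T

# A smooth 8x8 block (horizontal luminance ramp), typical of natural images.
block = np.tile(np.linspace(100, 140, 8), (8, 1))
coeffs = dct2(block)

# Energy is compacted into very few low-frequency coefficients.
energy = coeffs ** 2
top_two = np.sort(energy.ravel())[-2:].sum()
print(top_two / energy.sum())  # close to 1.0: two coefficients carry almost all energy
```

Because the transform is orthonormal, the total energy of the block is preserved; quantizing away the many near-zero high-frequency coefficients is what yields compression with little visible loss.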
However, with the recent advances in video capturing and display technologies (see Figure 1.2), the presence of
High Definition (HD) and Ultra High Definition (UHD) video contents in various multimedia applications is
quickly increasing. Clearly, these video resolutions require higher bandwidth for their transmission and
larger storage capacities. In this way, the compression ratios achieved by the current state-of-the-art video
coding standard for HD and UHD content do not seem sufficient taking into account the available transmission and
storage supports. With this in mind, the ITU-T VCEG and ISO/IEC MPEG bodies created the Joint
Collaborative Team on Video Coding (JCT-VC) which is currently developing a new video coding standard, the
High Efficiency Video Coding (HEVC) standard, with the objective of increasing the highest available
compression ratios, particularly for very high resolution video contents. To do this, new coding techniques have
to be developed that can guarantee better compression over the current ones even if at the price of some
additional complexity.
Figure 1.2 – Ultra high definition television set [4].
1.2. Objectives
In this context, this Thesis focuses on the design, implementation and assessment of a novel coding technique for
a particular data compression module: transform coding. Transform coding has been used since the first image and
video coding standards and it is still present in the current state-of-the-art video coding standard. The main
objective of transform coding is to remove the spatial redundancy present in a particular image or video frame by
transforming it from the spatial to the frequency domain. Since the HVS is less sensitive to the higher
frequencies than to the lower frequencies, this may also be an effective way to discard irrelevant data contained
in the higher frequency bands. Currently, all the available video coding standards make use of the Discrete
Cosine Transform (DCT), but in this work a novel coding solution is developed using a different transform
technique. This coding solution is intended to be used in the context of the emerging HEVC standard for high
and ultra high resolution video contents. In this way, this Thesis targets the following objectives:
Detailed review of the state-of-the-art on transform coding – First, it is necessary to make a detailed
review on the state-of-the-art on transform coding by studying its basic principles and concepts.
Study and implementation of the adopted transform coding technique – The second objective of
this work is to study and implement the adopted transform in order to allow its integration in a more
general video coding solution.
Study of the recent advances introduced in the HEVC standard – Then, the recent advances present
in the initial version of the emerging HEVC standard must be studied to allow the combination of the
adopted transform coding technique with this video codec.
Integration of the adopted transform coding technique in the HEVC context – With the previously
referred objectives achieved, it is then desirable to integrate (as much as possible) the adopted transform
coding technique in the HEVC codec, defining in this way the video coding solution adopted in this
Thesis.
Performance evaluation of the adopted video coding solution – Finally, the performance of the
developed coding solution is assessed to check its utility in the video coding context. Taking into account
the emerging problem considered and the target resolution of the HEVC standard, this evaluation is
intended to be made for high resolution video contents.
With the achievement of these objectives, it is possible to evaluate if the adopted video coding solution can be
useful for future application in the video coding context.
1.3. Thesis Structure
This Thesis is organized in six chapters and two appendixes, including this first chapter that is used to introduce
the work developed in this Thesis.
After this introductory chapter, Chapter 2 contains a review of the state-of-the-art on this Thesis' main object of
study: transform coding. In this review, the reader is introduced to the basic principles and concepts on transform
coding. Additionally, the most important transforms are introduced, and their basic principles and features are
presented.
In Chapter 3, the two main technical elements behind the studies and implementations performed in this Thesis
are presented. First, a video coding solution making use of the transform coding technique adopted in this work
is reviewed, with natural emphasis on the proposed transform. Then, the currently under development HEVC
standard is presented.
Chapter 4 introduces the reader to the combined coding solution developed in this Thesis. To do this, the general
architecture of the adopted coding solution is presented and the functional description and implementation
details of its main modules are explained.
After describing the adopted coding solution in detail, Chapter 5 reports its performance evaluation. To do this,
the used test conditions are first defined. Then, the performance results obtained with these conditions are
presented and analyzed.
The last chapter of this work, Chapter 6, identifies the conclusions taken from the work developed in this Thesis
and provides some details on future work that can be done in its context.
In Appendix A, the details on the transform coding usage in the context of the available image and video coding
standards are presented.
Appendix B presents a review of some of the most relevant advances on transform coding.
Chapter 2
Reviewing the State-of-the-Art on
Transform Coding
This chapter contains a brief review of the state-of-the-art on transform coding. The chapter starts by reviewing
the basic concepts and principles on transforms. Then, the most important transforms in the context of this Thesis
are presented in detail.
2.1. Basics on Transform Coding
Transform coding is one of the basic tools used in digital compression, notably image, video and also audio data.
In image and video compression, the transforms are mainly used to reduce the spatial redundancy by
representing the pixels in a frequency domain prior to data reduction through compaction and quantization.
Although this chapter will concentrate on reviewing transforms when applied with a coding/compression
purpose, transforms are a basic signal processing tool and, thus, they may be applied with other functional
purposes.
To achieve data compression, the original signal is decorrelated by using an appropriate transform, redistributing
its energy to a typically small number of transform coefficients, usually located in the low frequency region.
These coefficients can then be quantized with the aim of discarding perceptually irrelevant information, without
significantly affecting the subjective quality of the reconstructed/decoded image and video. Although the
transform process does not theoretically involve data losses, the closely associated quantization process is lossy,
since the original values cannot be recovered due to the associated quantization error. It may also happen that the
transform 'becomes' lossy due to the numerical limitations associated with the transform implementation, e.g.
roundings and truncations. The transform operation in the context of a typical image codec is illustrated in
Figure 2.1.
Figure 2.1 – Typical transform-based image coding architecture.
As shown in Figure 2.1, the original signal is usually segmented into square blocks, typically with 8×8 samples.
Each block is then individually transformed, an operation known as block transform. With this block based
processing, it is possible to reduce the computational and storage requirements (i.e. the transform complexity)
when compared to transforming the whole image simultaneously. Transforming each block independently can
also capture local information better, exploiting the correlation between block samples in a more efficient way;
however, the correlation between blocks is typically poorly (or not) exploited. Moreover, this approach can
cause noticeable reconstruction errors at the block boundaries resulting in blocking artifacts, i.e., the boundaries
between adjacent blocks are highly visible (see Figure 2.2). This phenomenon occurs when the higher frequency
components required to reconstruct the sharp boundaries of each block are discarded or highly quantized. Thus,
the higher the compression ratio, the more noticeable the blocking artifacts become.
Figure 2.2 – Example of block artifacts in a highly compressed image [5].
From the compression point of view, an 'ideal' transform should have the following characteristics:
Reversibility – A transform is reversible if the input signal can be recovered in its original domain after
applying the transform and its associated inverse transform without error (if no numerical constraints
exist). In image and video compression, this is an essential feature since the original data has to be
recovered in the spatial domain to be visualized.
Energy compaction – Energy compaction regards the capability to concentrate the signal energy in a few
coefficients without any loss of information by removing existing redundancy. This means that the ideal
transform must concentrate the original signal energy in the smallest number of coefficients possible.
Decorrelation – Decorrelated coefficients are coefficients that do not transmit the same information;
this assures that each coefficient carries additional information with no or small repetition and, thus, it
always adds value by itself.
Data-independent – A data-independent transform is a transform that is independent of the input signal;
ideally, the transform should achieve good compression efficiency for most image types. While it is
natural that the optimal transform depends on the input signal properties, the computational complexity
to find this optimal transform and the overhead required to transmit it to the decoder are typically
neither practical nor desirable.
Low complexity – The complexity of a transform is related with the computational resources required to
perform it, e.g., the number of operations required; it is naturally desirable that a transform can be
performed with the lowest possible computational complexity and this may require the development of
fast transform implementations.
These characteristics have been largely adopted as the requirements for the choice of the transform adopted in
existing image and video compression standards. Next, the most important properties regarding transforms used
for image and video compression are identified.
2.1.1. Unitary Transforms
A unitary transform of an input data vector x is defined by

y = B x   (2.1)

where B is a unitary square matrix and y is the vector with the transform coefficients. A square matrix is unitary
if its inverse is equal to its conjugate transpose, i.e., B^-1 = B^*T. If a unitary matrix only has real entries, i.e., B =
B^*, then its inverse is equal to its transpose, B^-1 = B^T, and it is known as an orthogonal matrix.
The column and row vectors of a unitary matrix are orthogonal (perpendicular to each other) and normalized (of
unit length), i.e., orthonormal. This can be defined by
b_k^{*T} b_l = 1 if k = l, and 0 otherwise   (2.2)

where b_k is the k-th column of the unitary matrix B.
The vectors bk constitute a set of orthonormal basis vectors. Basis vectors are a set of vectors which can be
linearly combined to represent any vector in a given vector space. In a similar way, basis functions are a set of
functions that can be linearly combined to represent any function in the function space. In this case, the unitary
matrix B represents the unitary transform basis functions.
Unitary transforms have very interesting properties, notably in terms of image compression:
Reversibility – As mentioned above, the unitary transform basis functions assure the reversibility of
these transforms (B^-1 = B^*T).
Energy conservation – All the energy from the input signal is preserved in the transform coefficients,
i.e.,

Σ_k |y(k)|² = Σ_n |x(n)|²   (2.3)
Energy compaction – Unitary transforms tend to pack a large fraction of the signal energy into just a
few transform coefficients.
Decorrelation – Most unitary transforms assure the decomposition of the initial signal into reasonably
uncorrelated transform coefficients.
Following these properties and the ideal characteristics for a compression transform as described above, unitary
transforms are the usual choice for the transforms used in image and video compression standards.
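As a concrete illustration of these properties, the following Python sketch (assuming the NumPy library is available; the 4-point DCT-II matrix, anticipated from Section 2.2.3, is just one convenient example of a real unitary, i.e., orthogonal, matrix) numerically verifies reversibility and energy conservation:

```python
import numpy as np

# Orthonormal 4-point DCT-II matrix, used here as an example of a real
# unitary (orthogonal) transform matrix B.
N = 4
k = np.arange(N)[:, None]          # row index (frequency)
n = np.arange(N)[None, :]          # column index (sample)
B = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
B[0, :] /= np.sqrt(2.0)            # DC row normalization

# Reversibility: for an orthogonal matrix, B^-1 = B^T.
assert np.allclose(B.T @ B, np.eye(N))

# Forward transform y = Bx and inverse x = B^T y.
x = np.array([3.0, 1.0, 4.0, 1.0])
y = B @ x
assert np.allclose(B.T @ y, x)

# Energy conservation: sum |y(k)|^2 == sum |x(n)|^2, as in Eq. (2.3).
assert np.isclose(np.sum(y ** 2), np.sum(x ** 2))
```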
2.1.2. One-Dimensional Transforms
Considering x(n) a block of N input samples (spatial-domain), like in a speech or audio signal, and y(k) a set of N
transform coefficients (frequency-domain), a one-dimensional (1-D) transform is given by

y(k) = Σ_{n=0}^{N−1} a(k,n) x(n),   k = 0, 1, …, N−1   (2.4)

where a(k,n) are the forward transform basis functions. The inverse transform used to recover the original signal
is defined by

x(n) = Σ_{k=0}^{N−1} b(k,n) y(k),   n = 0, 1, …, N−1   (2.5)

where b(k,n) are the inverse transform basis functions.
Taking into consideration that the first basis vector typically corresponds to the 'zero' frequency component, it
corresponds to a constant function and, thus, y(0) is known as the DC coefficient, which represents the mean
value of the waveform under transform (for the block transformed). This is the most important transform
coefficient since it is associated to the lowest frequency, to which the human perception systems are typically
very sensitive. All the other transform coefficients are known as AC coefficients.
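The role of the DC coefficient can be checked numerically. In this sketch (NumPy assumed; the orthonormal DCT-II is used as one possible choice for the basis functions a(k,n)), a constant 8-sample block is transformed and all the energy lands in y(0):

```python
import numpy as np

def transform_matrix(N):
    # Orthonormal DCT-II basis functions, one possible choice for a(k, n).
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    A = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    A[0, :] /= np.sqrt(2.0)
    return A

N = 8
x = np.full(N, 10.0)              # a constant ('zero'-frequency) block
y = transform_matrix(N) @ x

# Only the DC coefficient is non-zero; y[0] / sqrt(N) is the block mean.
assert np.allclose(y[1:], 0.0)
assert np.isclose(y[0] / np.sqrt(N), x.mean())
```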
2.1.3. Two-Dimensional Transforms
Considering now x(m,n) a two-dimensional (2-D) N×N array of samples, like in an image signal, and y(k,l) an
N×N array of transform coefficients, the forward and inverse 2-D transforms are given by

y(k,l) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} a(k,l,m,n) x(m,n)   (2.6)

x(m,n) = Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} b(k,l,m,n) y(k,l)   (2.7)

where a(k,l,m,n) and b(k,l,m,n) are the forward and inverse transform basis functions, respectively.
There are two important classes of 2-D transforms: non-separable 2-D transforms and separable 2-D transforms.
A non-separable 2-D transform is performed by simply stacking the N columns (or rows) of the input array end to
end to form a single column vector of length N², and then performing the transform in Eq. (2.1). Non-separable
2-D transforms exploit both the horizontal and vertical correlations in the input signal and typically require N⁴
arithmetic operations [6].
In a separable 2-D transform, the transform basis functions are separated into two independent, horizontal (row) and
vertical (column), operations:

a(k,l,m,n) = a_v(k,m) · a_h(l,n)   (2.8)

b(k,l,m,n) = b_v(k,m) · b_h(l,n)   (2.9)

With these operations, a separable 2-D transform can be performed in two independent steps, applied one after
the other (and not jointly to both directions). The first step uses the horizontal basis function, a_h(l,n), exploiting
the horizontal correlation in the data, while the second step uses the vertical basis function, a_v(k,m), exploiting
the vertical correlation in the data.
The separable 2-D transforms are implemented as two consecutive 1-D transform operations given by

y(k,l) = Σ_{m=0}^{N−1} a_v(k,m) [ Σ_{n=0}^{N−1} x(m,n) a_h(l,n) ]   (2.10)

In matrix notation

Y = A_v X A_h^T   (2.11)

For symmetrical basis functions, this means basis functions which are identical in the vertical and horizontal
directions, A_v = A_h = A, it follows that

Y = A X A^T   (2.12)

X = A^T Y A   (2.13)
The multiplication of two N×N matrices requires N³ arithmetic operations (N arithmetic operations for each
entry of the final result matrix). Therefore, a separable 2-D transform, which involves two matrix multiplications,
requires 2N³ arithmetic operations [6] (against the N⁴ operations of non-separable transforms). Thus, for the
usual case where N ≥ 2, a separable 2-D transform is normally preferred in terms of implementation complexity.
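Eqs. (2.12) and (2.13) translate directly into two matrix products. The sketch below (NumPy assumed; the orthonormal DCT-II is again a convenient example basis) applies a separable 2-D transform to a random 8×8 block and verifies perfect reconstruction:

```python
import numpy as np

def transform_matrix(N):
    # Orthonormal DCT-II matrix, used here for both A_v and A_h.
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    A = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    A[0, :] /= np.sqrt(2.0)
    return A

N = 8
A = transform_matrix(N)
X = np.random.default_rng(0).standard_normal((N, N))

Y = A @ X @ A.T          # forward separable 2-D transform, Eq. (2.12)
X_rec = A.T @ Y @ A      # inverse, Eq. (2.13)
assert np.allclose(X_rec, X)
```

Each matrix product costs N³ multiply-accumulate operations, so the 2N³ operation count quoted above is directly visible in the two products.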
2.1.4. Three-dimensional Transforms
Consider now x(m,n,p) a three-dimensional (3-D) N×N×N input signal. This signal has two spatial components
and one temporal component, forming an N×N×N cube. Figure 2.3 shows an illustration of an 8×8×8 video cube
formed by 8 frames, each providing an 8×8 data block.
Figure 2.3 – 8×8×8 video cube [7].
Considering y(k,l,q) the transform coefficients, the forward and inverse 3-D transforms are given by

y(k,l,q) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} Σ_{p=0}^{N−1} a(k,l,q,m,n,p) x(m,n,p)   (2.14)

x(m,n,p) = Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} Σ_{q=0}^{N−1} b(k,l,q,m,n,p) y(k,l,q)   (2.15)

where a(k,l,q,m,n,p) and b(k,l,q,m,n,p) are the forward and inverse transform basis functions, respectively.
With a 3-D transform, it is possible to exploit the correlation between the samples in the three main dimensions,
two in space and one in time. Particularly for video compression, it is possible to remove not only the spatial
redundancy (intra-frame coding), but also simultaneously the temporal redundancy (inter-frame coding).
Naturally, using this type of transform for a video sequence will cause a coding delay depending on the number
of frames accumulated to perform the 3-D transform. A particular 3-D transform is presented with more detail in
Section B.3.
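The general 3-D transform of Eqs. (2.14) and (2.15) becomes separable when the basis functions factor along the three axes. The following sketch (NumPy assumed; reusing the DCT-II basis along every axis is an illustrative choice, not the particular 3-D transform of Section B.3) transforms an 8×8×8 cube and recovers it exactly:

```python
import numpy as np

def transform_matrix(N):
    # Orthonormal DCT-II basis, reused along all three axes.
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    A = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    A[0, :] /= np.sqrt(2.0)
    return A

N = 8
A = transform_matrix(N)
cube = np.random.default_rng(1).standard_normal((N, N, N))  # 8x8x8 video cube

# Separable forward 3-D transform:
# y(k,l,q) = sum_{m,n,p} A[k,m] A[l,n] A[q,p] x(m,n,p)
Y = np.einsum('km,ln,qp,mnp->klq', A, A, A, cube)

# Inverse: the transposed (orthogonal) basis recovers the cube exactly.
X_rec = np.einsum('km,ln,qp,klq->mnp', A, A, A, Y)
assert np.allclose(X_rec, cube)
```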
2.1.5. Directional Transforms
A directional transform is a transform that uses information about the edges present in the input data to
better exploit the correlation between the samples. The objective of these transforms is to improve the coding
performance by detecting and removing more spatial redundancy than non-directional transforms, increasing the
compression ratio for the target quality.
As shown in the previous sections, a separable 2-D transform is independently implemented through two 1-D
transforms: one along the vertical direction (along the input data columns) and another along the horizontal
direction (along the input data rows). This approach is very useful since both vertical and horizontal directions
are important according to the HVS; it is also useful in cases where the input data has important horizontal
and/or vertical edges. However, for data containing other directional edges - a typical situation in many image
signals - a separable 2-D transform may not be the best choice. As an example, consider the image block
presented in Figure 2.4, where a diagonal line divides two uniform regions. In this case, a separable 2-D
transform would generate a rather high number of non-zero AC coefficients, deteriorating the transform
compression performance in terms of energy compaction.
Figure 2.4 – Example of image block with diagonal edges.
In these situations, directional transforms may be used and useful. There are various kinds of directional
transforms, notably:
Mode-dependent directional transform – One approach is to store a set of different basis functions,
each one suitable for a specific edge direction. After detecting the edge direction or the most relevant
edge direction of a given input data, the corresponding basis functions are used to perform the 2-D
transform.
1-D directional transform, followed by a 1-D horizontal transform – With this approach, the first
step is to perform a 1-D transform along the direction of the input data edge. The second step is to
perform a 1-D horizontal transform, since the first row contains all DC coefficients and each of the other
rows contains all AC coefficients with the same index.
Directional ordering of the data block, followed by a separable 2-D transform - Another approach is
to rearrange the samples in the input data according to its directional edge (see Figure 2.5). Afterwards, a 1-D
transform is performed along the columns and the rows of the rearranged data, similarly to the separable
2-D transform process.
Figure 2.5 – Samples rearrangement for a diagonal down-left edge.
This subject is addressed with more detail in Section B.2, for a particular directional transform solution.
2.2. Most Important Transforms
In this section, some unitary transforms of interest are presented, notably the Karhunen-Loève (KLT),
the Discrete Fourier (DFT), the Discrete Cosine (DCT), the Walsh-Hadamard (WHT) and the Discrete Wavelet
(DWT) transforms.
2.2.1. Karhunen-Loève Transform
The Karhunen-Loève Transform (KLT) is a unitary and orthogonal transform. It is non-separable and the
forward and inverse 1-D KLT for a vector x are defined by

y = Φ^T x,   x = Φ y   (2.16)

where the matrix Φ represents the KLT basis functions. The KLT does not have a fixed set of basis functions
since they depend on the original data. The KLT basis functions are determined with the following steps:
1. Computation of the covariance matrix of the input data – The covariance matrix Σ is defined as

Σ = cov(x) = E[(x − μ)(x − μ)^T]   (2.17)

where

μ_i = E[x_i]   (2.18)

is the expected value of the i-th entry in the vector x.
2. Computation of the eigenvectors¹ and eigenvalues² of the covariance matrix – Compute the matrix Φ of
eigenvectors of the covariance matrix Σ

Σ Φ = Φ Λ   (2.19)

where Λ is the diagonal matrix of eigenvalues of the covariance matrix Σ, i.e.,

Λ = diag(λ_0, λ_1, …, λ_{N−1})   (2.20)

where λ_m is the m-th eigenvalue of the covariance matrix Σ. The columns of matrix Φ correspond to the
eigenvectors of the covariance matrix Σ, representing the KLT basis functions.
The main KLT advantage is:
Best energy compaction – The KLT is theoretically the best transform in terms of energy compaction
when compared to other transforms. The KLT is able to pack more signal energy in the same fraction of
coefficients or to pack a certain fraction of the total energy in the smallest number of coefficients.
The main KLT drawbacks are:
Data-dependent – The KLT uses data-dependent basis functions; this implies the continuous
computation of the input signal covariance matrix as well as its storage and transmission.
High complexity – The high number of operations required to determine the KLT basis functions
significantly increases its complexity.
The use of the KLT for image and video compression is rather uncommon as it fails to fulfill two of the
characteristics typically asked of an ideal, efficient transform: data-independent basis functions and low
complexity. The KLT is also known as Principal Component Analysis (PCA) and it may be used as a tool in
exploratory data analysis and predictive models, where it is essential to have the best performance possible in an
energy-packing sense [8].
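The two-step basis derivation above can be sketched numerically (NumPy assumed; the AR(1)-style training model is a hypothetical choice used only to produce correlated data, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8

# Hypothetical training data: correlated 8-sample vectors drawn from an
# AR(1)-style model with correlation coefficient 0.95.
Sigma_model = 0.95 ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
X = rng.multivariate_normal(np.zeros(N), Sigma_model, size=2000)

# Step 1: covariance matrix of the input data.
Sigma = np.cov(X, rowvar=False)

# Step 2: eigendecomposition; the eigenvectors are the KLT basis functions.
eigvals, Phi = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]        # decreasing eigenvalue order
eigvals, Phi = eigvals[order], Phi[:, order]

# Forward KLT y = Phi^T x; inverse x = Phi y (reversibility).
x = X[0]
y = Phi.T @ x
assert np.allclose(Phi @ y, x)

# Energy compaction: for highly correlated data, the first eigenvalue
# (i.e., the expected energy of the first coefficient) dominates.
assert eigvals[0] / eigvals.sum() > 0.5
```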
2.2.2. Discrete Fourier Transform
The Discrete Fourier Transform (DFT) is a unitary and orthogonal transform that is used to decompose the
original data into its sine and cosine components. Its 2-D unitary forward and inverse versions are defined by

y(k,l) = (1/N) Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} x(m,n) e^{−j2π(km+ln)/N}   (2.21)

x(m,n) = (1/N) Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} y(k,l) e^{j2π(km+ln)/N}   (2.22)

for an N×N block of data samples.
The DFT basis functions correspond to sine and cosine waves with increasing frequencies. As noted before, the
first coefficient, y(0,0), represents the DC-component of the corresponding data; for example, for the image
luminance, it corresponds to its average brightness/luminance.
As the DFT is a separable transform, its basis functions can be represented as the product of two 1-D transforms,
given by

a(k,l,m,n) = [(1/√N) e^{−j2πkm/N}] · [(1/√N) e^{−j2πln/N}]   (2.23)

¹ Formally, if A is a linear transformation, a non-null vector x is an eigenvector of A if there is a scalar λ such
that Ax = λx.
² The scalar λ is said to be an eigenvalue of A corresponding to the eigenvector x.
The 8×8 DFT basis functions are visually shown in Figure 2.6.
Figure 2.6 – 8×8 DFT basis functions [9].
For an N-length vector, computing a 1-D DFT requires N² arithmetic operations. To reduce the complexity of this
transform, a fast DFT implementation is often used, well known as the Fast Fourier Transform (FFT). An
FFT algorithm can reduce the number of arithmetic operations to only N log N for a 1-D DFT [10].
The main DFT advantage is:
Fast implementation – Using an FFT algorithm, it is possible to significantly reduce the DFT
complexity; this is a great advantage in comparison to the KLT. When compared to the transforms
presented next, this is not very significant as all of them also have fast algorithms to simplify their
implementation.
The main DFT drawback is:
Complex coefficients – The DFT produces complex coefficients, with real and imaginary parts, i.e.,
with magnitude and phase; the storage and manipulation of these complex values may be a
disadvantage when compared to other available transforms, e.g. the DCT, which uses real (and not
complex) numbers.
The DFT is not usually used for image and video compression, as there are other transforms considered to be
more appropriate, e.g. the DCT. Instead, it is widely used for spectrum analysis, to solve partial differential
equations and to perform other operations such as convolutions or multiplying large integers.
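A quick numerical illustration of these points, using NumPy's FFT implementation (assumed available):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.fft.fft(x)          # DFT computed via an FFT algorithm

# The coefficients are complex-valued (the DFT's main drawback here).
assert np.iscomplexobj(y)

# y[0] is the DC component: for NumPy's unnormalized convention,
# it equals the sum of the samples (the mean, up to scaling).
assert np.isclose(y[0].real, x.sum()) and np.isclose(y[0].imag, 0.0)

# Reversibility: the inverse DFT recovers the original real samples.
assert np.allclose(np.fft.ifft(y).real, x)
```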
2.2.3. Discrete Cosine Transform
The Discrete Cosine Transform (DCT) is a unitary and orthogonal transform, conceptually rather similar to the
DFT but only using real numbers (and not complex ones anymore). For an N×N block of samples, the forward 2-D
DCT is defined by

y(k,l) = (2/N) c(k) c(l) Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} x(m,n) cos[(2m+1)kπ / 2N] cos[(2n+1)lπ / 2N]   (2.24)

and the inverse 2-D DCT is defined by

x(m,n) = (2/N) Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} c(k) c(l) y(k,l) cos[(2m+1)kπ / 2N] cos[(2n+1)lπ / 2N]   (2.25)

with

c(k) = 1/√2 for k = 0, and c(k) = 1 otherwise   (2.26)
Like the DFT, since the DCT is also a separable transform, it can be represented as the product of two 1-D
DCTs. The 8×8 2-D DCT basis functions are visually shown in Figure 2.7.
Figure 2.7 – 8×8 DCT basis functions [9].
Since the cosine function is real and even, i.e., cos(x) = cos(-x), and the input signal is also real, the inverse DCT
generates a function that is even and periodic in 2N, considering N the length of the original signal sequence. In
contrast, the inverse DFT produces a reconstruction signal that is periodic in N; these effects are illustrated in
Figure 2.8. In Figure 2.8, the original sequence in (a) is transformed and reconstructed in (b) by using a forward-
inverse DFT pair and in (c) by using a forward-inverse DCT pair. The periodicity of the inverse DCT is 10
samples long, twice as long as the periodicity of the inverse DFT. It can be noted that the DCT reconstruction
introduces less severe discontinuities at the end of the sequence than the DFT reconstruction. The importance of
this DCT property is that reconstruction errors at the block boundaries, and the consequent blocking artifacts, are
less severe in comparison to those of the DFT.
Figure 2.8 – Example of DFT versus DCT reconstruction periodicity effects.
For highly correlated signals, the DCT compaction performance comes very close to the KLT performance.
However, unlike the KLT, the DCT basis functions are not data-dependent, avoiding the computation of the data
covariance matrix, along with its storage and transmission.
There are also many fast DCT implementation algorithms, notably the Fast Cosine Transform (FCT)
algorithm. These algorithms can perform a 1-D DCT for a vector with length N with N log N arithmetic operations
[11].
The main DCT advantages are:
Fast implementation with only real computations – Like the DFT, the DCT can be implemented
using fast algorithms which can greatly reduce the number of operations and, thus, its computational
complexity. In addition to this, the DCT only requires real computations, avoiding the manipulation of
complex numbers as in the DFT.
Reduced blocking artifacts – The DCT properties in terms of its periodicity help avoiding border
discontinuities; this may considerably reduce the appearance of blocking artifacts.
With these advantages and no significant drawbacks, the DCT is by far the most widely used transform for
image (e.g. JPEG standard) and video compression (e.g. ITU-T H.26x recommendations and MPEG standards).
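The energy compaction advantage for highly correlated signals can be checked with SciPy's DCT-II implementation (assuming SciPy is available; any orthonormal DCT implementation would do):

```python
import numpy as np
from scipy.fft import dct, idct

x = np.arange(8, dtype=float)       # a smooth ramp: highly correlated
y = dct(x, norm='ortho')            # orthonormal forward DCT-II

# Energy conservation under the orthonormal DCT.
assert np.isclose(np.sum(y ** 2), np.sum(x ** 2))

# Keep only the two lowest-frequency coefficients and reconstruct.
y_trunc = np.where(np.arange(8) < 2, y, 0.0)
x_rec = idct(y_trunc, norm='ortho')

# Over 95% of the signal energy survives in just two coefficients.
assert np.sum(x_rec ** 2) / np.sum(x ** 2) > 0.95
```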
2.2.4. Walsh-Hadamard Transform
The Walsh-Hadamard Transform (WHT) is a unitary and orthogonal transform. It is separable and its forward
and inverse 1-D transforms for a vector x with length 2^m are defined by

y = H_m x,   x = H_m y   (2.27)

where the matrix H_m represents the WHT basis functions. The matrix H_m is a 2^m × 2^m Hadamard matrix, i.e., a
square matrix whose entries are either +1 or −1 (up to normalization) and whose rows are mutually orthogonal, given by

H_m = (1/√2) [ H_{m−1}  H_{m−1} ; H_{m−1}  −H_{m−1} ]   (2.28)

where 1/√2 is a normalization factor.
Some examples of these matrices for various block sizes are

H_0 = [ 1 ]   (2.29)

H_1 = (1/√2) [ 1  1 ; 1  −1 ]   (2.30)

H_2 = (1/2) [ 1  1  1  1 ; 1  −1  1  −1 ; 1  1  −1  −1 ; 1  −1  −1  1 ]   (2.31)
The Fast Walsh-Hadamard Transform (FWHT) is an efficient algorithm to compute the WHT. An FWHT
algorithm can reduce the number of arithmetic operations required to compute a 1-D WHT from N² to N log N
[12].
The main WHT advantage is:
Fast and simple implementation – The Hadamard transform matrices are purely real, containing
values that are either +1 or −1. In this way, the WHT only has to perform very simple real operations,
significantly reducing the transform's complexity. Moreover, with the usage of an FWHT algorithm, the
WHT is considered the best transform from a complexity point of view.
The main WHT drawback is:
Modest energy compaction – From an energy compaction perspective, the WHT is not as efficient as
alternative transforms like the DCT; in fact, compared to all the other transforms presented in this
chapter, the WHT has the worst compaction performance [6].
The WHT is used in many signal processing and data compression algorithms, mainly because of its fast
implementation. In video compression, it may be used as a secondary transform, e.g. applied on the primary
transform DC coefficients to obtain even more compression in smooth regions, like in the H.264/AVC video
compression standard.
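The recursive construction of Eq. (2.28) is a few lines of code. This sketch (NumPy assumed) builds the unnormalized matrix, checks the ±1 structure and row orthogonality, and verifies that the normalized transform is its own inverse:

```python
import numpy as np

def hadamard(m):
    # Unnormalized Hadamard matrix of size 2^m x 2^m, built recursively:
    # H_m = [[H_{m-1}, H_{m-1}], [H_{m-1}, -H_{m-1}]].
    H = np.array([[1.0]])
    for _ in range(m):
        H = np.block([[H, H], [H, -H]])
    return H

m = 2
H = hadamard(m)
N = 2 ** m

# Every entry is +1 or -1, and the rows are mutually orthogonal.
assert set(np.unique(H)) == {-1.0, 1.0}
assert np.allclose(H @ H.T, N * np.eye(N))

# Normalized WHT: forward y = Hx/sqrt(N); the same operation inverts it.
x = np.array([4.0, 2.0, 2.0, 0.0])
y = H @ x / np.sqrt(N)
assert np.allclose(H @ y / np.sqrt(N), x)
```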
2.2.5. Discrete Wavelet Transform
The Discrete Wavelet Transform (DWT) is a unitary, orthogonal and separable transform that is usually applied
to the whole input data (or large parts of it called tiles) but typically not to small data blocks like all the
previously reviewed transforms. The DWT of an input signal x is computed by passing it through a series of
filters. First, the input samples are decomposed using a low-pass filter, g, i.e., a filter that passes low-frequency
signals but attenuates the high-frequency ones, and a high-pass filter, h, i.e., a filter that passes high-frequency
signals but attenuates the low-frequency ones. This operation is given by

y_g[n] = Σ_k x[k] g[n − k],   y_h[n] = Σ_k x[k] h[n − k]   (2.32)

where y_g and y_h are the low-pass and high-pass band coefficients, respectively.
The filters g and h must be closely related to each other in order to split the input signal into two bands, forming
a quadrature mirror filter, i.e.,
(2.33)
where f is the frequency. This property assures there is no information loss in the decomposition process.
With the operation in Eq. (2.32), half the signal frequencies are removed in both bands. In this way, according to
the sampling theorem³, half the samples can also be discarded and the outputs of the two filters, g and h, can be
subsampled by 2. This operation is given by

y_g[n] = Σ_k x[k] g[2n − k],   y_h[n] = Σ_k x[k] h[2n − k]   (2.34)
The filter analysis process described above is illustrated in Figure 2.9.
Figure 2.9 – Analysis filter architecture [13].
After this process, most of the energy is usually located in the low-pass band. To increase the frequency resolution
in this band, further decompositions can be performed, repeating the operation in Eq. (2.34); this is illustrated in
Figure 2.10.
³ If a function x(t) contains no frequencies higher than B hertz, it is completely determined by giving its ordinates
at a series of points spaced 1/(2B) seconds apart.
Figure 2.10 – Example of a three-level 1D-DWT decomposition architecture [13].
Thus, for 1-D input signals, the successive application of the filters on the low-pass outputs results in a dyadic
decomposition, i.e., the number of coefficients for each novel lower band is half the number for the previous
decomposition.
As for 2-D input signals, the number of coefficients for each novel lower band is a quarter of the number for the
previous decomposition (2-D dyadic decomposition). In Figure 2.11, an explanation of a two-level DWT
decomposition for a 2-D input signal is shown; moreover, Figure 2.12 shows an example of a three-level DWT
decomposition for a real image.
Figure 2.11 – Example of a two-level 2D-DWT decomposition [14].
Figure 2.12 – Example of a three-level 2D-DWT decomposition [14].
As shown in Figure 2.11, the 2-D DWT results from applying a 1-D DWT first to the rows and after to the
columns of the input signal, which is typical of a separable 2-D transform.
The DWT is resolution-scalable, i.e., its coefficients allow the reconstruction of multiple spatial resolutions. For
example, in Figure 2.11, considering an NN image as the input signal, there are 3 different spatial resolutions
that can be recovered by the decoder:
LL2, with resolution N/4 × N/4;
LL1 = LL2 + LH2 + HL2 + HH2, with resolution N/2 × N/2; and
LL0 = LL1 + LH1 + HL1 + HH1, with resolution N × N.
Like other transforms, there are also many algorithms to perform a DWT in a more computationally efficient
way. These algorithms, known as Fast Wavelet Transform (FWT) algorithms, can compute a 1-D DWT for a
vector with length N with only N arithmetic operations [13].
The main DWT advantages are:
No blocking artifacts – Since the transform is applied to the full image and no block partitioning is
used, there are naturally no blocking artifacts in the decoded image.
Higher compression ratio – Transforming the whole input signal allows exploiting the correlation
between all neighbor samples and not only between samples of the same data block; this typically
allows reaching higher compression ratios.
Resolution-scalable – With the dyadic decompositions used in the DWT, it is possible to increase or
decrease the spatial resolution of the recovered data by simply increasing or decreasing the number of
coefficients decoded; these quality and spatial resolution scalability features are very useful for image
(and video) compression.
The main DWT drawback is:
High complexity – Performing a transform on the whole input signal, instead of dividing it into smaller
blocks, has a higher cost in terms of complexity: with a larger number of input samples, the number of
operations required to perform the transform also increases. This makes the DWT complexity
considerably high, even though its fast algorithms are the most efficient among all the other transforms
presented in this chapter [13].
The DWT is mainly used for signal compression, particularly image and video compression (e.g. JPEG 2000
standard); it is also used for signal analysis, e.g., voice or even seismic data.
2.3. Final Remarks
In this chapter the basics of transform coding were presented. For details on the transform coding usage in the
context of the available image and video coding standards refer to Appendix A. Moreover, Appendix B presents
a review of some of the most relevant recent advances on transform coding.
The next chapter introduces two essential technical elements for the development of the adopted coding solution:
the adopted transform coding solution and the HEVC standard.
Chapter 3
Main Background Technologies: Adaptive
Transform and Early HEVC
The main purpose of this chapter is to present the two main technical elements which are behind the
implementation and studies presented in the next chapters. The first main background technical element is an
adaptive transform (AT) proposed in 2010 by Biswas et al. [15] to improve the video coding performance in the
context of the H.264/AVC standard. This adaptive transform is based on the KLT applied to prediction error
blocks and does not require its associated basis functions to be encoded and then transmitted to the decoder, as
they are also estimated there. The main concepts and algorithms behind this technique are explained with more
detail in the first section of this chapter. The second main background technical element is the High Efficiency
Video Coding (HEVC) standard, currently under development by the JCT-VC group which was jointly
created by MPEG (ISO/IEC) and VCEG (ITU-T); the main objective of this recent standardization initiative is to
develop a new video codec for high and ultra high definition content with around 50% better compression
efficiency than the best H.264/AVC profile, the High profile.
3.1. An Adaptive Transform for Improved H.264/AVC-Based Video Coding
The spatial transform has always been a basic coding tool in all video coding standards developed in the past
decades. For most cases, a DCT has been adopted meaning that both the encoder and decoder know, since the
very beginning, which transform basis functions should be used. A main drawback of this type of solution is that
the transform basis functions do not consider the specific content to be coded and thus do not adapt to it,
reducing the energy compaction capabilities. However, there is the advantage that the transform basis functions
have to be neither computed nor transmitted.
An alternative solution is to adopt an adaptive transform which basis functions change depending on the content.
A solution following this principle is presented in [15] and will be adopted in this Thesis considering the
demonstrated compression performance. The authors propose a video coding solution allowing the adaptive selection of
the usual DCT or a modified KLT (MKLT), depending on the block to be coded. This solution allows adapting to
the block content without the burden of transmitting the KLT basis functions as they are equally estimated at
both the encoder and decoder sides. This method has been integrated in the H.264/AVC video coding
architecture to assess its performance against the standard H.264/AVC codec.
3.1.1. Objectives
As noted in Chapter 2, the KLT is the optimal transform in terms of energy compaction. Still, all the currently
available video coding standards make use of the DCT to represent the video information in the frequency
domain. This choice is due to the fact that the DCT, unlike the KLT, does not require the computation, coding
and transmission to the decoder of its basis functions for each block (i.e. it is data-independent) and can achieve
a near-optimal compression efficiency for highly correlated signals. A study presented in [16] reports that the
KLT improvements from an energy packing perspective when compared to the DCT are virtually lost by the
extra bits needed to represent its basis functions.
In [15], Biswas et al. propose a coding solution using an adaptive transform which allows a dynamic selection
between the DCT and a MKLT, depending on the block content. This solution does not require coding and
transmitting the MKLT basis functions. Instead, they are estimated in both the encoder and decoder using the
same technique, thus assuring equivalent transform basis at both ends of the coding chain. In this way, it is
possible to exploit the optimal behavior of the KLT, particularly for blocks which are hard to code using the
DCT (e.g. blocks with diagonal edges).
As the proposed KLT-based technique is only applicable to prediction error blocks, this adaptive transform
solution can only bring compression improvements for inter-coded blocks. This limitation is imposed by the
characteristics of the technique which will be later described.
3.1.2. Architecture and Walkthrough
As referred above, the AT solution was designed to be integrated in the standard H.264/AVC codec and improve
its compression efficiency. In this context, the main architecture of the proposed video codec is basically the
same as the H.264/AVC architecture (see Section A.8), with the exception of the forward and inverse transform
modules; however, as the bitstream syntax and semantics and the decoding behavior change, there is no
compatibility with H.264/AVC. The general architecture of this solution is shown in Figure 3.1.
Figure 3.1 – General architecture of the adaptive transform video coding solution [17].
A step-by-step walkthrough of the encoding process is presented next:
Macroblock splitting – First, the input video is split into 16×16 macroblocks as usual in H.264/AVC.
For the proposed coding solution, the authors use the FRExt extensions, implying that 8×8 blocks are
also available. However, in this case, only 8×8 blocks are used for the transform operation (the authors
provide no motivation for this choice).
Transform – To transform each input block, the encoder decides whether to use the standard
H.264/AVC DCT (Integer DCT) or the proposed MKLT. This choice is made in a rate-distortion
optimized manner and is signaled to the decoder using only 1 bit for each coded block. In the next
section, the proposed adaptive transform is explained in detail.
Quantization – The transform coefficients (DCT or MKLT) for each block are then quantized in the
standard H.264/AVC way.
Entropy encoder – Finally, the quantized coefficients are entropy coded using CAVLC. For the DCT
transformed blocks, the standard H.264/AVC scanning orders are used (i.e. zigzag and alternate scans),
while for the MKLT blocks the coefficients are arranged from the highest to the lowest variance into
four 4×4 blocks which are then passed to the entropy encoder.
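The rate-distortion optimized choice between the DCT and the MKLT described above can be sketched as follows. The Lagrangian cost J = D + λR and the λ(QP) formula below are common H.264-era heuristics, not values specified in [15], and the distortion/rate figures in the example are hypothetical:

```python
# Common H.264-era heuristic for the Lagrangian multiplier at QP = 28
# (an assumption here; [15] does not state the lambda model used).
LAMBDA = 0.85 * (2 ** ((28 - 12) / 3.0))

def rd_cost(distortion, rate_bits):
    """Lagrangian cost J = D + lambda * R used for the per-block decision."""
    return distortion + LAMBDA * rate_bits

def select_transform(dct_dist, dct_bits, mklt_dist, mklt_bits):
    """Return ('DCT' or 'MKLT', flag_bit); each candidate pays 1 extra bit for the flag."""
    j_dct = rd_cost(dct_dist, dct_bits + 1)
    j_mklt = rd_cost(mklt_dist, mklt_bits + 1)
    return ('MKLT', 1) if j_mklt < j_dct else ('DCT', 0)

# Hypothetical measured values for one 8x8 block:
print(select_transform(dct_dist=520.0, dct_bits=46, mklt_dist=430.0, mklt_bits=39))
```

The one-bit flag cost is charged to both candidates, so the decision reduces to comparing the two Lagrangian costs directly.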
As for the decoder, the only difference regarding the H.264/AVC video coding standard is related to the
transform module, which will be described in the next section. The choice between the inverse DCT and the
inverse MKLT is made according to the information included in the bitstream by the encoder for each block.
3.1.3. Details on the Adaptive Transform
As referred before, the proposed adaptive transform video coding solution makes use of one of two transforms:
the DCT and a novel MKLT [15]. This adaptive transform is only applicable to inter-coded blocks, where only
the prediction error is transformed and quantized. The architectures for the forward and inverse adaptive
transforms are shown in Figure 3.2 and Figure 3.3, respectively.
Figure 3.2 – Forward adaptive transform architecture.
As shown in Figure 3.2, the forward adaptive transform basically consists of the computation of both the
forward DCT and the forward MKLT, followed by the selection of the transform offering the best rate-
distortion performance. To compute the MKLT, the prediction error has to be estimated and the MKLT basis
functions computed based on the estimated prediction error, as described below.
Figure 3.3 – Inverse adaptive transform architecture.
At the decoder side, the inverse adaptive transform consists of the computation of an inverse DCT or an inverse
MKLT, depending on which transform was selected in the encoding process, as shown in Figure 3.3. Again, to
compute the MKLT, the prediction error has to be estimated and the MKLT basis functions computed based on
the estimated prediction error, as described below.
While the DCT is the same already used in H.264/AVC for 8×8 blocks, i.e. an order-8 (8×8) ICT
(see Section A.8.3), the MKLT, although similar to a standard KLT (Section 2.2.1), has some special features
which will be explained below. With this in mind, the next section is dedicated to that transform.
From the observation of both the forward and inverse adaptive transform architectures, it is possible to conclude
that the Modified Karhunen-Loève Transform (MKLT) process includes 3 main modules (see colored blocks in
Figure 3.2 and Figure 3.3): prediction error estimation, MKLT basis functions computation and MKLT
computation (whether it is a forward KLT, in the forward AT case, or an inverse KLT, for the inverse AT case).
Therefore, these 3 modules are described in the following.
1) Prediction error estimation module
The prediction error is the difference between the original block and the Motion Compensated Prediction (MCP)
block, which in H.264/AVC is coded using motion vectors associated with one or multiple reference frames. In
this context, the prediction error (in the spatial domain) is indispensable to compute the standard KLT basis
functions. However, although the actual prediction error is available at the encoder, it is not available at the decoder.
This makes it impossible to compute the prediction error basis functions at the decoder side and thus to use
the standard KLT. In this context, the only possibility to avoid the transmission of the basis functions to the
decoder, and the consequent bitrate cost, is to estimate the prediction error. To do that, Biswas et al. [15] assume
that the prediction error is caused by errors in the motion estimation process, particularly:
Interpolation errors – In the motion compensation process, some errors can occur when
interpolating the reference frame pixels for quarter-pixel accuracy.
Imprecise edge prediction – In blocks with strong diagonal edges, the motion vectors may not be
estimated with full accuracy, thus causing small shifts in the location of the edges between the original
and the MCP block.
Following these assumptions, Biswas et al. [15] propose to estimate the prediction error by simulating these
conditions. This is done by subtracting shifted and rotated versions of the MCP block from the MCP block itself,
which plays here the role of the 'original' data. The use of the MCP block for this purpose is natural as it is the
only piece of information that is simultaneously available at both the encoder and decoder. To exemplify this
operation, Figure 3.4 shows an original 8×8 block (a), its corresponding MCP block (b) and the prediction error
block (c). The MCP block is then shifted vertically by -0.25 pixels and rotated by -0.5°, resulting in the block
shown in Figure 3.5 (a). To complete the operation, the shifted and rotated MCP block is then subtracted from
the MCP block, Figure 3.5 (b).
Figure 3.4 – (a) Original block. (b) MCP block. (c) Corresponding prediction error block [15].
Figure 3.5 – (a) Shifted and rotated MCP block (shift: -0.25 pixels vertically; rotation: -0.5°). (b) Difference
between the MCP block and the shifted and rotated MCP block [15].
Despite the sign change when compared to the actual prediction error, see Figure 3.4 (c), the correlation between
the pixels in the estimated prediction error, Figure 3.5 (b), seems similar to the actual inter-pixel correlation in the
'true' prediction error. This is useful since the KLT basis functions are computed from the covariance
matrix of the input (error) block. To allow the exploitation of the above described prediction error properties in
the various directions, Biswas et al. [15] propose the following shifts and rotations of the MCP block for the
prediction error estimation:
Shifts – The MCP block is shifted horizontally and vertically by 0.0, ±0.25 and ±0.5 pixels.
Rotations – The MCP block is rotated by 0.0° and ±0.5°.
In [15], Biswas et al. do not explain what criterion was used to define the maximum shift and rotation parameters
(0.5 pixels and 0.5°, respectively). This is one of the reasons why other maximum parameters will be tested later
in this Thesis. The combination of all 5 shift parameters along the horizontal and vertical directions results in
25 shifted MCP blocks (5×5=25). These shifted MCP blocks can then be rotated with 3 different rotation
parameters (-0.5°, 0.0° and 0.5°), resulting in a set of 75 shifted and rotated MCP blocks (25×3=75). Then, the
difference between the actual MCP block and the set of shifted and rotated MCP blocks is computed in order to
obtain a set of 75 estimated prediction error blocks. As an example, consider Figure 3.6 where a set of 25
estimated prediction error blocks is shown; in this case, only the results for a -0.5° rotation are shown.
Figure 3.6 – Set of estimated prediction error blocks (shift: -0.5 to 0.5 pixels, horizontally and vertically;
rotation: -0.5°) [15].
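The shift part of this estimation procedure can be sketched in Python as follows (rotation is omitted for brevity; in [15] each shifted block is additionally rotated by 0.0° and ±0.5° to reach the full set of 75 blocks). The bilinear interpolation and the edge padding used here are implementation assumptions, as [15] does not detail the interpolation employed:

```python
import numpy as np

def bilinear_shift(block, dy, dx):
    """Shift a block by a fractional amount (dy, dx) using bilinear interpolation.
    Samples falling outside the block are taken from the nearest border (edge padding)."""
    n = block.shape[0]
    padded = np.pad(block.astype(float), 1, mode='edge')
    ys = np.arange(n)[:, None] + 1 - dy   # source coordinates in the padded block
    xs = np.arange(n)[None, :] + 1 - dx
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    wy, wx = ys - y0, xs - x0
    return ((1 - wy) * (1 - wx) * padded[y0, x0] + (1 - wy) * wx * padded[y0, x0 + 1]
            + wy * (1 - wx) * padded[y0 + 1, x0] + wy * wx * padded[y0 + 1, x0 + 1])

mcp = np.random.default_rng(0).integers(0, 256, (8, 8)).astype(float)  # stand-in MCP block
shifts = [-0.5, -0.25, 0.0, 0.25, 0.5]
errors = [mcp - bilinear_shift(mcp, dy, dx) for dy in shifts for dx in shifts]
print(len(errors))   # 25 estimated prediction error blocks (75 once rotations are added)
```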
With a set of estimated prediction error blocks, it is then possible to compute the MKLT basis functions.
2) MKLT basis functions computation module
As previously referred, the KLT is a unitary and orthogonal transform; however, unlike the DCT, it is non-
separable. Thus, to transform a two-dimensional block, it is necessary to first convert the given block into a
column or row vector. Then, the covariance matrix of the vector must be computed and its eigenvectors
determined; the columns of the eigenvector matrix represent the basis functions of the transform. This process
was described in more detail in Section 2.2.1.
The MKLT proposed by Biswas et al. [15] inherits the KLT characteristics referred above; however, in this case,
there are multiple input blocks representing a set of estimated prediction error blocks (as the 'true' prediction error
is not available at the decoder). To determine the covariance matrix of this set, it is necessary to define the
covariance between each pair of pixel positions. Thus, the covariance between a pixel in position (u,v) and a pixel in
position (r,s) for a set of n×n blocks is given by

Σ(j,k) = (1/N) · ∑ᵢ₌₁ᴺ [Eᵢ(u,v) − μ(u,v)] · [Eᵢ(r,s) − μ(r,s)]                (3.1)

where u, v, r, s = 0…(n−1), j = u + n·v, k = r + n·s, Eᵢ(u,v) is the estimated prediction error in position (u,v) of the i-th
block, μ(·) is the mean value at the given position over the set and N is the number of blocks in the set.
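Eq. (3.1) amounts to the sample covariance of the flattened error blocks, which can be computed directly, e.g. (the blocks here are random stand-ins for the estimated prediction errors):

```python
import numpy as np

def covariance_matrix(error_blocks):
    """Covariance matrix of a set of n x n estimated prediction error blocks,
    each flattened to a vector of length n*n, following Eq. (3.1)."""
    data = np.stack([b.reshape(-1) for b in error_blocks])  # shape (N, n*n)
    mean = data.mean(axis=0)                                # per-position mean over the set
    centered = data - mean
    return centered.T @ centered / len(error_blocks)        # (n*n) x (n*n) matrix

rng = np.random.default_rng(1)
blocks = [rng.standard_normal((8, 8)) for _ in range(75)]   # stand-in error set
cov = covariance_matrix(blocks)
print(cov.shape)   # (64, 64)
```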
Returning to the example shown above in Figure 3.4, Figure 3.5 and Figure 3.6, it is possible to determine the
covariance matrix for a set of estimated prediction blocks using Eq. (3.1), which is shown in Figure 3.7.
Figure 3.7 – Covariance matrix for a set of estimated prediction error blocks [18].
The row outlined in Figure 3.7 shows the covariance of the pixel in row 3, column 0, (considered here as the
reference pixel) with the pixels in all other positions. Rearranging this row to its original two-dimensional form
results in the covariance values block shown in Figure 3.8 where the red asterisk signals the reference pixel
position.
Figure 3.8 – Block of covariance values for the pixel in row 3, column 0, with the pixels in all other positions
[18].
Observing Figure 3.8, it is possible to conclude that the magnitude of the covariance with the reference pixel is
higher for the pixels along the direction of the edge, whether as a positive covariance (around the reference pixel
location) or as a negative covariance (on the other edge of the block).
With the covariance matrix for a particular set of estimated prediction error blocks available, it is then possible
to determine the associated eigenvectors and eigenvalues,

Σ Φ = Φ Λ                (3.2)

where Φ is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues of the covariance matrix Σ.
Subsequently, the transpose of the eigenvector matrix is computed, resulting in the MKLT basis
functions. For the example above, the basis functions are illustrated in Figure 3.9.
Figure 3.9 – MKLT basis functions for the example in Figure 3.7 [18].
In Figure 3.9, the set of basis functions is arranged in a horizontal raster scan order where it is possible to see
that the first basis functions (upper-left corner) show a subjective similarity to the actual prediction error.
3) MKLT computation module
After the determination of the MKLT basis functions, it is then possible to actually compute the MKLT both at
the encoder and decoder. Thus, the forward MKLT and the inverse MKLT are given by

c_MKLT = T_MKLT · x    and    x′ = T_MKLTᵀ · ĉ_MKLT                (3.3)

where x is the actual prediction error (arranged as a column vector), T_MKLT is the matrix of MKLT basis
functions, c_MKLT are the MKLT coefficients, ĉ_MKLT are the quantized MKLT coefficients and x′ is the
reconstructed prediction error.
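The whole chain, from covariance matrix to eigendecomposition to forward and inverse transform, can be sketched with numpy; the quantization step is omitted here, so the reconstruction is exact (the covariance comes from random stand-in data, not real prediction errors):

```python
import numpy as np

rng = np.random.default_rng(2)
# Covariance matrix of a set of 75 flattened 8x8 estimated error blocks (64 x 64)
data = rng.standard_normal((75, 64))
cov = np.cov(data, rowvar=False)

# Eigendecomposition of the covariance matrix; eigh returns ascending eigenvalues
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort basis functions by decreasing variance
T = eigvecs[:, order].T                    # MKLT basis functions as the rows of T

# Forward and inverse MKLT on the actual prediction error x
x = rng.standard_normal(64)                # prediction error block as a column vector
c = T @ x                                  # forward MKLT coefficients
x_rec = T.T @ c                            # inverse transform (T orthogonal: T^-1 = T^T)
print(np.allclose(x, x_rec))               # True: perfect reconstruction w/o quantization
```

The orthogonality of T is what makes the inverse transform a simple transposition, so the decoder needs only the (re-estimated) basis functions, never their inverse.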
Returning to the previous example, the MKLT coefficients of the actual prediction error block are shown in
Figure 3.10 alongside the DCT coefficients for the same block. It has to be noted that this block has strong
diagonal edges, for which the DCT (because of its separable nature) does not perform so well.
Figure 3.10 – MKLT and DCT coefficients for the previous example [18].
From Figure 3.10, it is possible to see not only the high energy compaction achieved by the MKLT (with almost
all the energy concentrated in the top-left coefficients), but it is also possible to compare it with the
corresponding DCT performance for this particular block, which distributes the same input signal energy along a
greater number of transform coefficients.
A further analysis can be made regarding the scan order for each transform. Considering a zigzag scan for the
DCT coefficients and ordering the MKLT coefficients by decreasing variance (as referred in the previous
section), it is possible to plot the coefficients in Figure 3.10 in terms of their amplitude and scan position, as
shown in Figure 3.11.
Figure 3.11 – MKLT and DCT coefficients amplitude versus scan position [18].
The chart in Figure 3.11 shows that, for this example, the MKLT not only compacts the input signal energy into
fewer coefficients, but these coefficients are also the first to be scanned. On the other hand, the input signal
energy is distributed along a larger number of DCT coefficients and, additionally, the zigzag scan does not seem
to efficiently arrange them by decreasing amplitude. As a consequence, it should be possible to entropy code the
MKLT coefficients with fewer bits than those required to code the DCT coefficients.
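A zigzag scan order of the kind used for the DCT coefficients can be generated as follows; this reproduces the standard 4×4 zigzag scan used in H.264/AVC (the alternate scan for interlaced content is not covered):

```python
def zigzag_order(n):
    """Flattened scan positions (row*n + col) of the zigzag scan for an n x n block:
    coefficients are visited diagonal by diagonal, alternating the traversal direction."""
    cells = [(i, j) for i in range(n) for j in range(n)]
    # Even diagonals run bottom-left -> top-right, odd ones top-right -> bottom-left
    cells.sort(key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else -p[0]))
    return [i * n + j for i, j in cells]

print(zigzag_order(4))
# [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]
```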
3.1.4. Performance Evaluation
To evaluate the performance of the proposed adaptive transform solution (dynamically selecting between the
DCT and the MKLT), Biswas et al. have integrated it into the H.264/AVC video coding standard, notably in the JM
reference software, version 10.1 [19]. The experimental tests were conducted with QCIF and CIF resolution
video sequences at a frame rate of 30 fps, namely Foreman, Mobile, Garden and Husky.
The tests were made by encoding 50 frames of each video sequence and measuring the resulting PSNR as

PSNR = 10 · log₁₀(255² / MSE)                (3.4)

where MSE is the mean square error between the original and the reconstructed video frames.
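The PSNR computation, assuming the usual 8-bit peak value of 255, can be implemented directly:

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """PSNR in dB from the mean square error, for 8-bit video by default."""
    mse = np.mean((original.astype(float) - reconstructed.astype(float)) ** 2)
    if mse == 0:
        return float('inf')        # identical frames: PSNR is unbounded
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.full((16, 16), 128.0)
b = a + 2.0                        # every pixel off by 2 -> MSE = 4
print(round(psnr(a, b), 2))        # 42.11
```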
To assess the benefits of the proposed solution (H.264 AT), its performance has been compared with the standard
H.264/AVC video codec (H.264 Standard) performance (the precise profile is not specified). The coded video
sequences use a regular pattern of one I-frame followed by P-frames for every group-of-pictures (GOP). The
test video sequences have been selected as especially difficult to code with the DCT, notably including high
detail and/or block areas with high variances and diagonal edges. Figure 3.12 shows the rate-distortion
performances for the proposed (H.264 AT) and benchmark codecs (H.264 Standard).
Figure 3.12 – RD performance for the H.264 Standard and H.264 AT video coding solutions [15].
Figure 3.12 shows an average PSNR gain of 0.5 dB for the proposed H.264 AT solution regarding the standard
H.264/AVC solution [15]. For the sequence Mobile, PSNR gains of about 0.9 dB or, alternatively, bitrate
savings of about 20% (for the same quality) can even be achieved [15].
3.1.5. Summary
In this section, the video coding solution proposed by Biswas et al. [15] including an adaptive transform has
been reviewed in detail as it will play a central role in this Thesis. This coding solution makes use of a modified
KLT which allows the exploitation of the KLT optimal properties without requiring the coding and transmission
of its basis functions to the decoder. In this way, and in conjunction with the usual DCT, this video coding
solution can achieve a significant improvement in terms of compression performance when compared to the
current state-of-the-art video coding standard, the H.264/AVC codec.
As already mentioned, the KLT-based technique proposed in this video coding solution is adopted for the video
coding solution to be studied in this Thesis. However, as this Thesis intends to address highly efficient solutions
for HD video content, it is more appropriate to integrate it in the HEVC standard, currently under development.
In this context, the next section is dedicated to the description of this emerging standard, focusing on its new
features but also on the main differences regarding the state-of-the-art H.264/AVC standard.
3.2. Introduction to the High Efficiency Video Coding Standard
The High Efficiency Video Coding (HEVC) standard is currently under development by the JCT-VC group
(jointly created by ISO/IEC MPEG and ITU-T VCEG) and it is planned to be ratified as a standard by January
2013 [20]. This new video coding standard targets providing 50% improved compression efficiency regarding
the state-of-the-art H.264/AVC video coding standard, notably for high and ultra high resolution video.
Officially, the HEVC standard development started in January 2010 with the publication of a Call for Proposals
(CfP) [21] asking for the submission of advanced video coding tools, specially targeting high and ultra high
resolution video. This CfP received 27 submissions with new coding tools and techniques providing encouraging
results in terms of coding efficiency when compared to H.264/AVC. These results led to the combination of the
most promising submitted coding tools into a new video codec called Test Model under Consideration (TMuC)
[22]. As the first available preview of the upcoming video coding standard, this test model will be used
throughout this Thesis as the target video codec to be improved.
3.2.1. Objectives
The development of technologies allowing the capture and display of high definition video content has led to
an increasing presence of these resolutions in emerging multimedia applications. In the coming years, this
growth will not be limited to HD but will also evolve toward ultra high definition video content (e.g. 7680×4320
pixels, which is 16 times the HD resolution). Undoubtedly, this type of content requires higher bandwidth and
storage capacities, which do not seem to be attainable with the currently available transmission and storage
solutions. This problem can only be overcome by a significant improvement over the compression efficiency
provided by the current video coding state-of-the-art, the H.264/AVC standard. Bearing this in
mind, the JCT-VC started the standardization of a new video coding standard with the objective of reducing by
half the bitrate needed to code a video sequence when compared to the H.264/AVC High profile, while maintaining
the same video quality. Clearly, this objective has the potential to increase the final codec
complexity. This standard targets the coding of progressively scanned content with video resolutions from QVGA
(320×240 pixels) to UHD (7680×4320 pixels).
3.2.2. Technical Approach
Since all the proposals submitted to HEVC Call for Proposals made use of the basic video coding architecture
used in previous video coding standards, particularly H.264/AVC, the HEVC coding architecture is also based
on intra and inter coding modes using motion compensated prediction and transform coding [23]. The basic
HEVC encoder architecture is presented in Figure 3.13.
Figure 3.13 – Basic HEVC encoder architecture [24].
Taking into account that a major difference between the HEVC and H.264/AVC standards relates to their target
resolutions, the submitted proposals focused their efforts on the exploitation of the higher spatial and temporal
redundancies available on high and ultra high definition video contents. These efforts resulted in various new
coding tools that change some of the main architectural modules as highlighted in Figure 3.13. These new
coding tools are described in the following, with the exception of those related to transform coding which are
explained later in more detail considering the topic of this Thesis.
Picture partitioning
First, the HEVC standard introduces a new picture partitioning scheme based on a novel coding unit definition,
no longer relying on the usual macroblocks. The previous macroblock concept is replaced by a more flexible
structure composed of Coding Tree Blocks (CTB). With this structure, each CTB can have various sizes (from
8×8 to 128×128, always powers of 2) and can be recursively split according to a quad-tree partitioning.
The maximum size of a CTB and the maximum depth of the quad-tree partitioning are defined at the sequence
level.
The largest coding unit is denominated LCTB and the smallest SCTB. Each picture is divided into non-
overlapping LCTBs, and each CTB is characterized by its LCTB size and its hierarchical depth relative to its
corresponding LCTB. To better understand this structure, Figure 3.14 illustrates an example where the LCTB
size is 128 and the maximum hierarchical depth is 5.
As shown in Figure 3.14, the recursive structure is represented by a series of split flags. If the split flag is equal
to 1, then the current CTB is split into four independent CTBs, characterized by an incremented depth and half
the size of the previous CTB. The picture partitioning is stopped when the split flag equals 0 or when the
maximum depth is reached, thus achieving the SCTB.
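The recursive split-flag parsing described above can be sketched as follows; the flag ordering (depth-first, quadrant by quadrant) is an assumption for illustration, as the exact coding order is defined in [22]:

```python
def parse_ctb(flags, size, min_size, origin=(0, 0)):
    """Consume 0/1 split flags depth-first and return the list of leaf CTBs as
    ((y, x), size) tuples; no flag is coded once the minimum (SCTB) size is reached."""
    if size > min_size and next(flags) == 1:
        half = size // 2
        y, x = origin
        leaves = []
        for dy, dx in ((0, 0), (0, half), (half, 0), (half, half)):
            leaves += parse_ctb(flags, half, min_size, (y + dy, x + dx))
        return leaves
    return [(origin, size)]

# LCTB size 64, SCTB size 8: split the root once, then split only its first quadrant
bits = iter([1, 1, 0, 0, 0, 0, 0, 0, 0])
leaves = parse_ctb(bits, 64, 8)
print(len(leaves), leaves[0])   # 7 ((0, 0), 16)
```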
With this new partitioning structure, it is possible to code large homogeneous areas with larger coding blocks
than the previous 16×16 macroblocks used in H.264/AVC, allowing a better exploitation of the spatial
redundancy. Additionally, this coding structure allows a more flexible choice of the block sizes for a more
efficient coding of various contents, targeting multiple applications and devices.
Figure 3.14 – Illustration of a recursive CTB structure with LCTB size = 128 and maximum hierarchical depth
= 5 [22].
When the splitting process is finalized, the leaf nodes of the CTB hierarchical tree become Prediction Units (PU)
and can be split in the following ways:
Intra PUs – The intra PUs are not split or are split into 4 equal partitions.
Inter PUs – The inter PUs can have 4 symmetric splittings, 4 asymmetric splittings or can be split with
a geometric partitioning mode. In this last mode, the block is divided into two regions by a straight line
which is characterized by two parameters: the distance between the partition line and the block origin
(ρ), which is measured by a line perpendicular to the partition line, and the angle subtended by this
perpendicular line and the x axis (θ); for an example, see Figure 3.15.
Figure 3.15 – Parameters defining the geometric partitioning of a PU [22].
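A sketch of the (ρ, θ) geometric partitioning: each pixel is classified by the signed distance to the partition line, x·cos θ + y·sin θ − ρ. Taking the block origin as the top-left corner is an assumption here; the exact convention is defined in [22]:

```python
import numpy as np

def geometric_mask(size, rho, theta_deg):
    """Partition a size x size PU by the line whose normal from the block origin
    has length rho and angle theta with the x axis: pixels on each side of the
    line (sign of x*cos(theta) + y*sin(theta) - rho) form the two regions."""
    theta = np.radians(theta_deg)
    y, x = np.mgrid[0:size, 0:size]
    return (x * np.cos(theta) + y * np.sin(theta) - rho) >= 0

mask = geometric_mask(8, rho=4.0, theta_deg=0.0)   # theta = 0: vertical split at x = 4
print(mask[0])   # [False False False False  True  True  True  True]
```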
Besides the CTBs and PUs, the HEVC standard also introduces the Transform Units (TU). These units are
defined for transform and quantization purposes and can be as large as the size of the corresponding CTB leaf,
i.e., the corresponding PU. The partitioning of TUs is also represented by quad-trees, with their maximum size
and hierarchical depth being signaled in the bitstream. The transform block sizes are constrained between the
minimum and maximum transform sizes, 4×4 and 64×64, respectively. These characteristics are reviewed in
more detail in the following section dedicated to the transforms and quantization.
Intra prediction
For intra-coded blocks, the HEVC standard supports up to 33 spatial prediction directions for 8×8 to 64×64
blocks; in addition, a planar prediction mode is available. For 4×4 blocks, the 9 prediction modes already
present in H.264/AVC are used.
Motion compensation
To allow the exploitation of quarter-pixel accuracy motion vectors, the reference frame has to be upsampled and
be able to provide quarter-pixel accuracy interpolation. In H.264/AVC, to achieve this interpolation, a 6-tap
fixed Wiener filter is first used for half-pixel accuracy interpolation, followed by a bilinear combination of
integer and half-pixel values. With HEVC, it is possible to use a 12-tap DCT-based interpolation filter to provide
the same quarter-pixel accuracy interpolation. In this way, only one filtering procedure is needed, allowing a
simplification of the implementation and a complexity reduction of the filtering process.
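For reference, the H.264/AVC 6-tap half-pel filter mentioned above has coefficients (1, −5, 20, 20, −5, 1)/32, which can be applied as follows (a sketch of the filtering step only; the surrounding padding and quarter-pel bilinear stage are not shown):

```python
import numpy as np

# H.264/AVC 6-tap half-pel interpolation filter (coefficients sum to 32)
HALF_PEL = np.array([1, -5, 20, 20, -5, 1])

def half_pel_interp(samples):
    """Half-pixel values between samples[2] and samples[3], [3] and [4], ...;
    the input must already include the 2 left and 3 right border samples."""
    out = np.convolve(samples, HALF_PEL[::-1], mode='valid')
    return np.clip((out + 16) >> 5, 0, 255)      # round, divide by 32, clip to 8 bits

flat = np.full(8, 100, dtype=np.int64)
print(half_pel_interp(flat))   # a flat area interpolates to the same value: [100 100 100]
```

Since the coefficients sum to 32, constant regions pass through unchanged, while the negative outer taps sharpen the response around edges.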
Deblocking filter and In Loop filter
The H.264/AVC deblocking filter has been adapted in the HEVC codec to support the new larger block sizes.
Moreover a symmetric Wiener filter has been added to allow a reduction of the quantization distortion in the
reconstructed blocks.
Entropy coding
The HEVC standard offers two kinds of entropy coding methods:
Low-complexity entropy coding – For low-complexity, 10 pre-determined VLC tables designed for
different probability distributions are used; each syntax element uses one of these 10 tables. For the
entropy coding of the transform coefficients, an improved CAVLC method is used.
High-complexity entropy coding – For high-complexity, a variation of the CABAC solution defined
in H.264/AVC is employed. The basis of this coder is similar, but the parallelization of the entropy
encoding and decoding is introduced.
These are the main technical novelties introduced in the emerging HEVC standard. In the following section, a
more comprehensive description of the adopted transforms is made, as this is the main topic of this Thesis.
3.2.3. Transform and Quantization
A larger transform can bring high performance improvements in terms of energy compaction and reduced
quantization error for large homogeneous areas (this is studied in more detail in Section B.1). HD sequences
tend to have more spatial correlation, i.e., correlation extending over larger areas of the picture. Thus, HEVC
introduces three additional transform sizes besides those already supported by H.264/AVC (4×4 and 8×8):
16×16, 32×32 and 64×64. With the increase of the transform size, the complexity also tends to increase. To
minimize this complexity, HEVC makes use of the fast DCT algorithm proposed by Chen in [25]. This type of
algorithm is used due to its reduced implementation complexity and its ready extension to larger transform sizes.
In Figure 3.16, the signal flow graph of Chen's fast factorization for an order-16 DCT is presented.
Figure 3.16 - Signal flow graph of Chen’s fast factorization for an order-16 DCT [22].
In Figure 3.16, the multiplication constants are represented by sinusoidal functions of particular angles, which
can result in floating point operations; to overcome this drawback, pre-defined values are used (see Table 3.1).
With this approximation, the transform loses its orthogonality property, but the associated errors are considered
less significant than the complexity increase that floating point operations would entail.
Table 3.1 - Approximated constants for an order-16 DCT [22].
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15
63/64 62/64 61/64 59/64 56/64 53/64 49/64 45/64 40/64 35/64 30/64 24/64 18/64 12/64 6/64
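The values in Table 3.1 coincide with a truncation of the exact cosine constants to 6 fractional bits. The following sketch reproduces them; it is an illustrative reconstruction, and the floor-based truncation rule is inferred from the table values, not taken from [22]:

```python
import math

# Reconstruct the Table 3.1 constants: a_k = floor(64 * cos(k*pi/32)) / 64
# for k = 1..15, i.e. the exact cosines truncated to 6 fractional bits.
# (The truncation rule is an inference from the table, not from [22].)
def approx_constants(frac_bits=6):
    scale = 1 << frac_bits
    return [math.floor(scale * math.cos(k * math.pi / 32)) / scale
            for k in range(1, 16)]

numerators = [round(a * 64) for a in approx_constants()]
# numerators: 63, 62, 61, 59, 56, 53, 49, 45, 40, 35, 30, 24, 18, 12, 6
```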
Besides the DCT, two types of directional transforms are adopted in the HEVC standard. These transforms are
used when the DCT basis functions do not offer a good transform performance, e.g. uncorrelated signals or
blocks with strong diagonal edges. The first directional transform is a Rotational Transform (ROT), which is
applied as a second transform after the DCT for blocks of 16×16 and larger sizes. The basic principle behind
this directional transform is the rotation of the transform basis coordinate system, instead of the rotation of the
input data. The used rotation matrices (for vertical and horizontal rotations) are [22]:

$$R_{vertical}(\alpha_1,\alpha_2,\alpha_3) = \begin{bmatrix}
\cos\alpha_1\cos\alpha_3 - \sin\alpha_1\cos\alpha_2\sin\alpha_3 & -\sin\alpha_1\cos\alpha_3 - \cos\alpha_1\cos\alpha_2\sin\alpha_3 & \sin\alpha_2\sin\alpha_3 & 0 \\
\cos\alpha_1\sin\alpha_3 + \sin\alpha_1\cos\alpha_2\cos\alpha_3 & -\sin\alpha_1\sin\alpha_3 + \cos\alpha_1\cos\alpha_2\cos\alpha_3 & -\sin\alpha_2\cos\alpha_3 & 0 \\
\sin\alpha_1\sin\alpha_2 & \cos\alpha_1\sin\alpha_2 & \cos\alpha_2 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}$$

$$R_{horizontal}(\alpha_4,\alpha_5,\alpha_6) = \begin{bmatrix}
\cos\alpha_4\cos\alpha_6 - \sin\alpha_4\cos\alpha_5\sin\alpha_6 & -\sin\alpha_4\cos\alpha_6 - \cos\alpha_4\cos\alpha_5\sin\alpha_6 & \sin\alpha_5\sin\alpha_6 & 0 \\
\cos\alpha_4\sin\alpha_6 + \sin\alpha_4\cos\alpha_5\cos\alpha_6 & -\sin\alpha_4\sin\alpha_6 + \cos\alpha_4\cos\alpha_5\cos\alpha_6 & -\sin\alpha_5\cos\alpha_6 & 0 \\
\sin\alpha_4\sin\alpha_5 & \cos\alpha_4\sin\alpha_5 & \cos\alpha_5 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}$$

The α angles represent the six possible rotation angles. From these six angles, only four rotation angles can be
quantized and used to minimize the complexity of the encoder. In this context, it has also to be noted that, for
TUs larger than 8×8, the ROT is only applied to the 8×8 lowest-frequency DCT coefficients.
The second type of directional transform is the Mode-Dependent Direction Transform (MDDT) which is used to
encode 4×4 and 8×8 intra prediction residuals and is paired with the selected intra prediction mode. The 33 intra
prediction modes for the 8×8 block size are grouped into nine separate directions; the MDDT is designed with
nine separate basis functions, one for each direction. These basis functions are estimated from the statistics of the
intra prediction residuals for each mode, using a separable transform based on the KLT, the Singular Value
Decomposition (SVD). This transform is used to better exploit the spatial redundancy (versus the DCT) without
excessively increasing the transform complexity (versus the KLT). In this way, the SVD is used first in the
vertical and then in the horizontal directions. Once again, to save computational effort, the transform matrices
are fixed-point approximated.
After the transform operation, the resulting coefficients are quantized in the same way as in H.264/AVC and
rearranged in a 1-D vector for entropy encoding. Besides the zigzag scanning order used for the DCT
transformed coefficients (including those using the ROT), a new scanning order is used for the MDDT
transformed coefficients, based on the directional mode used in the intra prediction coding. With this, it is
possible to compact the non-zero coefficients to the beginning of the resulting 1-D vector.
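For illustration, the zigzag rearrangement into a 1-D vector can be sketched as follows; this is the classic anti-diagonal scan, and the codec's exact scan tables may differ in detail:

```python
def zigzag_indices(n):
    """Visiting order (row, col) for an n-by-n zigzag scan, starting at
    the DC (top-left) coefficient."""
    coords = [(i, j) for i in range(n) for j in range(n)]
    # Coefficients on the same anti-diagonal share i + j; the traversal
    # direction alternates from one diagonal to the next.
    return sorted(coords,
                  key=lambda c: (c[0] + c[1],
                                 c[0] if (c[0] + c[1]) % 2 else -c[0]))

def scan(block):
    """Flatten a 2-D coefficient block into a 1-D vector in zigzag order,
    compacting the low-frequency coefficients to the front."""
    return [block[i][j] for i, j in zigzag_indices(len(block))]
```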
3.2.4. Summary
In this section, the HEVC standard under development was introduced. As this standard is not yet fully
developed, the previously reviewed coding tools correspond to the TMuC (Test Model under Consideration)
codec to be used later in this Thesis (more specifically TMuC software version 0.9). In the meantime, some tools
have been removed from the most recent versions of this codec - now called HEVC Test Model (HM) -
particularly the MDDT transform and the geometric partitioning mode.
The HEVC is being developed with the purpose of replacing H.264/AVC as the state-of-the-art video coding
standard. Moreover, it is designed taking into account that some main emerging applications will soon use
high and ultra high definition video contents. The new set of coding tools reflects this concern, as it focuses on
exploiting the higher spatial and temporal redundancies present in this type of sequences.
3.3. Final Remarks
In this chapter, the two most important background technologies for the solution to be implemented and studied
in this Thesis have been introduced. First, the adaptive transform technique proposed by Biswas [15] was
described. This transform uses the standard H.264/AVC DCT and a modified KLT which uses the MCP blocks
to estimate the prediction error and to subsequently calculate its basis functions. This technique is integrated in
both the H.264/AVC encoder and decoder and can bring improvements to the overall transform
performance when compared to the DCT alone, particularly for signals that are hard to compact using the DCT.
Secondly, the currently under development HEVC standard has been introduced; this is the codec adopted in this
Thesis as it is the most advanced. The HEVC standard, or at least its test model, introduces some new coding
tools that were designed to better exploit the special characteristics of high and ultra high definition video
contents, notably higher spatial and temporal correlations. Amongst the new coding tools, the main differences
in comparison to the H.264/AVC video coding standard associated to the topic of this Thesis (i.e. transform
coding) are related to the type of picture partitioning, notably with a more flexible partitioning allowing various
block sizes (from 8×8 to 128×128), and the transform sizes, notably with transform sizes from 4×4 to 64×64.
Unrelated to the type of video content, but related to transform coding, directional transforms are introduced to
better exploit the directional edges present in many blocks.
In the next chapter, the implementation details of the adopted transform coding solution will be presented.
Chapter 4
Adopted Coding Solution Functional Description and Implementation Details
After the introduction of the two most important background technologies in Chapter 3, this chapter intends to
describe in detail the coding solution adopted in this Thesis, notably a functional description of each module and
its main associated features and a detailed explanation of its implementation.
The adopted solution central technical element, the transform coding block, is based on the adaptive transform
proposed in [15]. To better understand the reasons for the development, implementation and evaluation of this
coding solution, its objectives are first defined. Afterwards, its general architecture is presented, followed by a brief
walkthrough. Finally, each module in the presented architecture is individually described, analyzed and
explained, both from the functional and implementation points of view.
4.1. Objectives
As reviewed in Section 3.1, the solution proposed in [15] can achieve significantly better compression
performance than the standard H.264/AVC codec. This is achieved by means of an adaptive transform that can
switch between the standard H.264/AVC DCT and a modified KLT, whose basis functions are computed using
the same estimation technique at both sides of the coding process, thus not requiring their transmission along with
the remaining bitstream. Additionally, as referred in Section 3.2, the JCT-VC team is currently developing a new
video coding standard, the HEVC standard, which is intended to double the H.264/AVC compression efficiency,
particularly for high and ultra high definition video contents. With this in mind, the coding solution adopted and
implemented in this work uses the adaptive transform proposed in [15] with three main goals:
Adaptive transform performance evaluation in the context of the HEVC standard – As noted
before, the solution proposed in [15] was integrated and evaluated in the context of the H.264/AVC
standard. To evaluate the coding performance of the referred adaptive transform in the context of the
emerging HEVC standard, this tool must be, at least partly, integrated in this new video coding
standard.
Adaptive transform performance evaluation for high definition video content – In [15], the
proposed adaptive transform was only evaluated for QCIF and CIF resolution video sequences.
However, as noted in Section 3.2, the use of high definition video contents in various multimedia
applications is growing quickly. Thus, it is very relevant to assess the performance of the adaptive
transform for HD video contents, using the HEVC codec, to understand if the performance gains
obtained for the lower resolution contents still persist [15].
Adaptive transform performance evaluation for larger shift and rotation parameters – Finally, it
was noted in Section 3.1 that the motivation behind the specific choice of the used maximum shift and
rotation parameters is not explained in [15]. In this context, it is relevant to assess the performance with
increasing parameter values to check if this change can bring further compression performance
improvements.
To achieve these goals, a new coding solution is designed, developed and then evaluated using the same
concepts of the adaptive transform proposed in [15], although with some implementation changes. The technical
aspects of this new coding solution are presented in the following sections.
4.2. Architecture and Walkthrough
As referred before, the adopted video coding solution is based on the tool proposed, in 2010, by Biswas et al.
[15]. Thus, it also uses a similar adaptive transform technique to code the prediction error associated to the inter-
coded blocks. In this solution, the adaptive transform can switch between the standard H.264/AVC DCT
(Section A.8.3) and a modified KLT (very similar to the MKLT presented in Chapter 3) to obtain a better
compression performance, depending on the particular details of the image area being coded. It was also referred
above that this coding solution is based on the new HEVC codec as a replacement for the H.264/AVC codec
used in [15]. However, it has to be noted that the proposed adaptive transform is not integrated in the codec
reference software that is usually made available by the standardization groups, in this case the JCT-VC team.
The full integration was not made because it would not only require detailed knowledge of the software structure
and organization, which in this case would involve significant extra time since this is new software still under
development, but would also require major software development and testing, which is not the main objective
of this Thesis. As a reasonable compromise, HEVC encoded and decoded data is obtained/extracted (using the
HEVC reference software) and used externally to simulate a large portion of the actual coding framework; for
example, the HEVC entropy coding tool is not used. In this way, the developed coding solution is only
applicable at the frame level, since the reference frames used for the inter-coded frames are always extracted
from the HEVC codec and are not decoded from previous codings using the developed coding solution.
The general architecture of the solution designed and implemented in this Thesis is presented in Figure 4.1. This
solution is only used to code the prediction error block; thus, it uses only the inter-coded frames as input.
Additionally, the bitstream generated by its encoding process and the reconstruction made by the decoding
process only contain information about the prediction error.
Figure 4.1 – Architecture of the developed coding solution.
The architecture presented in Figure 4.1 includes three main processes which are described next and clearly
identified in the figure with different colors:
HEVC framework – This process is used to extract data from the HEVC to be used in the encoding
and decoding processes of the adopted coding solution. In order to do this, the original frame is inter-
coded with the HEVC codec and the following data is extracted:
o Transform Units (TUs) split flags and coding modes.
o Reference frame.
o Prediction Units (PUs) motion vectors.
The extracted data is then provided to both the encoder and the decoder processes.
AT encoder – This process is used to encode each TU prediction error block using the proposed
adaptive transform. Additionally, the coefficients generated by this transform are also quantized and
entropy encoded. The modules of this process are processed in the following steps:
o Reference frame upsampling – First, the reference frame extracted from the HEVC
framework is upsampled to provide quarter-pixel accuracy.
o Frame partitioning – To process each TU individually, the original frame to be inter-coded is
first partitioned in its TUs using the HEVC defined partitioning method. This partition is made
with the split flags extracted from the HEVC framework. After the partitioning in TUs of the
full frame, only the inter-coded TUs continue the coding process. To verify which TUs were
HEVC inter or intra-coded, the coding modes extracted from the HEVC framework are used.
o MCP block computation – Then, the MCP block associated to each TU is computed using
the extracted motion vector and the upsampled reference frame. This MCP block is then
subtracted from the original TU, resulting in the prediction error block.
o Forward adaptive transform – The prediction error block is then transformed using the two
available transforms, the DCT and the MKLT. To compute the MKLT basis functions as in
[15], the motion vectors, the upsampled reference frame and the MCP block obtained in the
previous steps are used.
o Quantization – Each transform coefficient is then quantized using a uniform quantizer
described later.
o Entropy encoder – To finalize the encoding process, the quantized coefficients are entropy
encoded. Then, the resulting bitstreams for each transform (DCT and MKLT) are compared,
and the one corresponding to fewer bits is selected and sent to the decoder. This bitstream
corresponds to the adaptive transform bitstream in Figure 4.1.
With the adaptive transform bitstream sent to the decoder side, the encoding process is concluded. Finally,
the decoder process performs as follows:
AT decoder – This process is used to decode the adaptive transform bitstream sent by the encoder. To
do this, each inter-coded TU bitstream is entropy decoded, inverse quantized and inverse transformed
and the resulting reconstructed prediction error blocks are then rearranged to form the reconstructed
prediction error frame. The modules of this process are processed in the following steps:
o Reference frame upsampling – At the beginning of the decoding process, the reference frame
is once again upsampled to provide quarter-pixel accuracy.
o MCP block computation – Like the encoder, the MCP block is computed for each TU. In this
case it is only used for the MKLT basis functions computation.
o Entropy decoder – To decode the adaptive transform bitstream, it is first entropy decoded,
resulting in the adaptive transform quantized coefficients.
o Inverse quantization – The quantized coefficients obtained in the previous step are then
inverse quantized, resulting in the reconstructed adaptive transform coefficients.
o Inverse adaptive transform – These reconstructed coefficients are then inverse transformed
using the basis functions of the transform selected in the encoding process. To compute the
MKLT basis functions, in case they are needed, the motion vector, the upsampled reference
frame and the MCP block are once again used. With this operation, the reconstructed
prediction error block is obtained for each inter-coded TU.
o Frame reconstruction – To conclude the decoding process, all the reconstructed prediction
error blocks are arranged in the reconstructed prediction error frame, using the split flags and
the coding modes data extracted from the HEVC framework. This frame comprises only
the inter-coded TUs, since the intra-coded ones are not coded with the developed coding
solution.
With this, the decoding process of the adopted coding solution is concluded.
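The encoder-side competition between the two transforms can be sketched in simplified form. This is illustrative Python, not the implemented MATLAB code: an orthonormal DCT and the identity stand in for the DCT/MKLT pair, a uniform quantizer is assumed, and the number of non-zero quantized coefficients is a crude stand-in for the entropy coder's bit count.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; rows are the transform basis functions."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] /= np.sqrt(2)
    return m * np.sqrt(2.0 / n)

def encode_tu(residual, transforms, step=8.0):
    """Apply every candidate transform to the prediction error block,
    quantize uniformly, and keep the cheapest result.  The non-zero
    coefficient count is a crude proxy for the entropy-coded bit cost."""
    best = None
    for name, t in transforms.items():
        coef = t @ residual @ t.T                  # separable 2-D transform
        quant = np.round(coef / step).astype(int)  # uniform quantizer
        cost = np.count_nonzero(quant)
        if best is None or cost < best[2]:
            best = (name, quant, cost)
    return best

# A smooth (constant) block: the DCT compacts it into a single coefficient.
residual = np.full((4, 4), 10.0)
choice = encode_tu(residual, {"DCT": dct_matrix(4), "identity": np.eye(4)})
```

Here the DCT wins with a single non-zero coefficient, mirroring the selection of the bitstream corresponding to fewer bits.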
To implement the adopted coding solution, two programming environments were used: the TMuC software
(version 0.9 [20]) and the MATLAB numerical computing environment [26]. These environments were used
with the following purpose:
TMuC software (version 0.9) – This software was used to process the HEVC encoder and decoder
modules. This software code (programmed in C++) was changed in order to provide the necessary data
to the encoding and decoding processes.
MATLAB numerical computing environment – To implement the encoding and decoding processes, a
MATLAB script was programmed. This script comprises some functions already present in the
MATLAB toolboxes and others designed and programmed specifically for this solution by the author
of this Thesis. It has to be noted that the decoding process was not implemented independently from the
encoder. Thus, some of the modules present in the AT decoder architecture shown in Figure 4.1 were
not actually implemented, as the information provided by them was already known.
Following this short walkthrough of the developed video coding solution, the next sections will individually
explain each of the modules presented in Figure 4.1. To avoid repeating the more conceptual description made in
Chapter 3, this explanation will concentrate on the work developed by the author of this Thesis, notably focusing
on the implementation aspects. This explanation will be made first for the HEVC framework, then for the AT
encoder and, finally, for the AT decoder.
4.3. HEVC Framework Functional Description and Implementation Details
As referred before, the adopted coding solution uses a HEVC framework to provide the data needed by both the
AT encoder and the AT decoder in the right conditions. To do this, the original frame is coded using a slightly
modified version of the TMuC software (version 0.9). This software's code has been changed to provide the data
corresponding to each TU split flags and coding mode, used in the frame partitioning module, and the motion
data (i.e. each PU motion vector and the reference frame), used in the MKLT basis functions computation. To
understand how this data is extracted and stored, consider the CTB presented in Figure 4.2, with its PU
partitioning (a) and TU partitioning (b). As referred in Chapter 3, each CTB leaf represents a PU for motion
prediction coding and a TU for transform coding. These TUs can be further partitioned in smaller TUs using the
quad-tree partitioning method.
Figure 4.2 – Example of (a) PU partitioning and (b) TU partitioning of a 32×32 CTB.
The CTB in Figure 4.2 is a 32×32 block that is partitioned using the quad-tree technique employed in the HEVC
coding process. In this example, the CTB is partitioned in (a) PUs and in (b) TUs. The PU partitioning results in
7 PUs (3 of size 16×16 and 4 of size 8×8). The TU partitioning results in 10 TUs (3 of size 16×16, 3 of size 8×8
and 4 of size 4×4). With this example in mind, the extraction and storage procedures used in the HEVC decoder
are explained next:
TUs split flags – The partitioning split flags are extracted to provide the frame partitioning module
information on how to partition each frame in its corresponding TUs. Thus, the depth value of each TU
in relation to its corresponding LCTB is stored at the frame level with the granularity of the SCTB size.
This means that, for an R×C frame and an s×s SCTB, there will be a total of (R/s)×(C/s) depth values. Figure 4.3
shows how this data is stored considering that the CTB in Figure 4.2 represents an LCTB.
Figure 4.3 – TU depths for the CTB in Figure 4.2 (b).
From Figure 4.3, it is possible to see how the TU partitioning is signaled; to each TU corresponds a depth in
relation to the LCTB. In this way, three different TU sizes can be identified: 16×16 (depth = 1), 8×8 (depth
= 2) and 4×4 (depth = 3). The dashed grid identifies the SCTB size, in this case, 4×4. Each of these SCTB
sized blocks has a number corresponding to the depth of the TU where it is contained. The precision of the
saved data is ANSI-C int (32 bits).
TUs coding modes – Only the inter-coded TUs are processed with the adopted coding solution, since
the motion data is essential for the MKLT basis functions computation. Thus, only the TUs that were
inter-coded in the HEVC coding process can be coded with the proposed adaptive transform. With this
purpose, each TU coding mode must be identified. To do this, each TU must have a flag identifying its
coding mode: if it is a '0', then intra-coding was performed; otherwise ('1'), inter-coding was
performed. This information is also extracted for each TU and stored at the frame level with the
granularity of the SCTB size, resulting in (R/s)×(C/s) values for each frame. For the CTB in Figure 4.2, an
example result is shown in Figure 4.4.
Figure 4.4 – Coding modes (intra-coding = ‘0’ and inter-coding = ‘1’) for the CTB in Figure 4.2 (b).
By observation of Figure 4.4, it is possible to conclude that there are only 2 intra-coded TUs in the
considered CTB (both identified with a '0'). The values are stored in ANSI-C short precision (16 bits).
Motion vectors – To compute the MCP block and the prediction error estimations necessary for the
computation of the MKLT basis functions, also the motion vectors of each PU have to be extracted.
Thus, for each PU, the horizontal and the vertical motion vectors values are provided and stored. As for
both previously described cases, this extraction also uses the granularity of the SCTB size, resulting in
(R/s)×(C/s) values for each frame and for each direction (horizontal and vertical). Figure 4.5 shows the (a)
horizontal and (b) vertical motion vectors values for the CTB in Figure 4.2.
Figure 4.5 – (a) Horizontal and (b) vertical motion vectors values for the CTB in Figure 4.2 (a).
As expected, the TUs that were intra-coded (see Figure 4.4) do not have any motion vector value associated
to them. The motion vectors values are stored in ANSI-C int precision (32 bits).
Reference frame – The reference frame is essential to process the motion compensation module, since
it is the source referenced by the motion vectors values. In this way, the reference frame for each inter-
coded frame is stored in a R×C file with ANSI-C short precision (16 bits).
All the extracted data is saved in binary files (.bin) to be read by the developed MATLAB script. At this stage,
the developed coding solution implementation passes from the TMuC environment to the MATLAB
environment, and the developed MATLAB script starts its execution by reading the saved binary file values and
copying them to pre-allocated matrices with the sizes referred before. This results in the
following matrices:
Split flags matrix – A (R/s)×(C/s) matrix containing the split flag of each TU.
Coding modes matrix – A (R/s)×(C/s) matrix containing the coding mode of each TU.
Motion vectors matrices – A (R/s)×(C/s) matrix containing each PU horizontal motion vector component
and a (R/s)×(C/s) matrix containing each PU vertical motion vector component.
Reference frame matrix – A R×C matrix containing the pixel values of the current reference frame.
Besides these matrices, which are read from files saved with the TMuC software, the original frame available in
the original sequence file is also read and copied to a R×C matrix, denominated original frame matrix.
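The reading of the extracted binary files into matrices can be sketched as follows. This is illustrative Python rather than the actual MATLAB script, and the file names and `prefix` argument are hypothetical:

```python
import numpy as np

def load_extracted_data(rows, cols, sctb, prefix="frame0"):
    """Read the binary files exported by the modified TMuC software into
    matrices.  File names and the `prefix` argument are hypothetical;
    the data types mirror the precisions stated above."""
    g = (rows // sctb, cols // sctb)  # SCTB-granularity matrix shape
    split_flags = np.fromfile(prefix + "_split.bin", dtype=np.int32).reshape(g)
    modes = np.fromfile(prefix + "_modes.bin", dtype=np.int16).reshape(g)
    mv_x = np.fromfile(prefix + "_mvx.bin", dtype=np.int32).reshape(g)
    mv_y = np.fromfile(prefix + "_mvy.bin", dtype=np.int32).reshape(g)
    ref = np.fromfile(prefix + "_ref.bin", dtype=np.int16).reshape(rows, cols)
    return split_flags, modes, mv_x, mv_y, ref
```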
4.4. AT Encoder Functional Description and Implementation Details
In the adopted coding solution, the encoder is basically used to code the prediction error block associated to each
TU. The key tool to code this type of data is the transform coding, which is exactly the tool under
study in this work. This process includes the following modules (as shown in Figure 4.1): reference frame
upsampling, frame partitioning, MCP block computation, forward adaptive transform, quantization and entropy
encoder. These modules are explained in detail in the following.
4.4.1. Reference Frame Upsampling
To provide the half and quarter-pixel prediction accuracy associated to the motion vector and perform the
prediction error estimation technique used in the adaptive transform computation, the relevant reference frame
needs to be upsampled with an upsampling factor of 4 (L = 4). This operation is still done at the frame-level and
performed by means of the 12-tap DCT-based interpolation filter described in Section 3.2.2, whose coefficients
are listed in Table 4.1.
Table 4.1 – 12-tap DCT-based interpolation filter coefficients [22].
Interpolation | Filter coefficients
Quarter-pixel | {-1, 5, -12, 20, -40, 229, 76, -32, 16, -8, 4, -1} (18 additions, 6 shifts)
Half-pixel | {-1, 8, -16, 24, -48, 161, 161, -48, 24, -16, 8, -1} (15 additions, 4 shifts)
3 quarter-pixel | {-1, 4, -8, 16, -32, 76, 229, -40, 20, -12, 5, -1} (18 additions, 6 shifts)
To better understand the reference frame upsampling operation, consider the half and quarter-pixel motion
positions illustrated in Figure 4.6.
Figure 4.6 – Half and quarter-pixel motion positions illustration [22].
As shown in Figure 4.6, the interpolation of the integer pixels A, B, C and D results in the half and quarter-pixels
identified from a to o. These last pixels are determined using the above referred filter with the following
approach:
First, the half-pixel interpolations are computed using the integer pixels A, B and C. With this, it is
possible to obtain the pixels b (half-pixel horizontal interpolation of A and B) and h (half-pixel vertical
interpolation of A and C).
Then, the quarter-pixel interpolations are computed using the same integer pixels. With this, the pixels
a (quarter-pixel horizontal interpolation of A and B), c (3 quarter-pixel horizontal interpolation of A
and B), d (quarter-pixel vertical interpolation of A and C) and l (3 quarter-pixel vertical interpolation of
A and C) are obtained. These interpolations are computed for all the integer pixels.
After this, the pixels f, j and n are obtained by computing a half-pixel horizontal interpolation of d, h
and l and their corresponding pixels in relation to the integer pixel B, respectively.
Finally, the pixels e, i and m are obtained by computing a quarter-pixel horizontal interpolation and the
pixels g, k and o are obtained by computing 3 quarter-pixel horizontal interpolations of d, h and l and
their corresponding pixels in relation to the integer pixel B, respectively.
By applying these interpolations to all pixels in the reference frame, the upsampled reference frame is obtained
as shown in Figure 4.7.
Figure 4.7 – Upsampled reference frame illustration.
Figure 4.7 shows an illustration of an upsampled reference frame, where each darker blue square represents an
integer pixel, present in the actual reference frame. To implement this reference frame upsampling computation
in the developed MATLAB script, a function developed and provided by Dr. Matteo Naccari has been used. This
function receives the reference frame matrix as input and, by applying the interpolations described before,
returns the upsampled reference frame matrix.
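The application of the Table 4.1 filters along one dimension can be sketched as follows. This is illustrative Python, not the provided MATLAB function; the normalization by 256 with a rounding offset is an assumption consistent with the fact that each tap set sums to 256, and border padding is omitted:

```python
import numpy as np

# Table 4.1 tap sets; each sums to 256, so the filtered value is assumed
# to be normalized with a rounding offset and an 8-bit right shift.
FILTERS = {
    "quarter":   [-1, 5, -12, 20, -40, 229, 76, -32, 16, -8, 4, -1],
    "half":      [-1, 8, -16, 24, -48, 161, 161, -48, 24, -16, 8, -1],
    "3-quarter": [-1, 4, -8, 16, -32, 76, 229, -40, 20, -12, 5, -1],
}

def interpolate(samples, phase, pos):
    """Fractional-pel value between integer samples pos and pos + 1.
    `samples` must supply 5 pixels to the left and 6 to the right of
    `pos` (border padding is omitted from this sketch)."""
    window = samples[pos - 5: pos + 7]               # 12 integer samples
    acc = int(np.dot(FILTERS[phase], window))
    return min(255, max(0, (acc + 128) >> 8))        # normalize and clip
```

On a constant signal the filters leave the value unchanged, and on a linear ramp the half-pel output lands at the midpoint, as expected of a DCT-based interpolator.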
4.4.2. Frame Partitioning
This partition is performed to allow the transformation and quantization of each TU individually, as made in the
HEVC standard. With this in mind, the frame is divided into CTBs with the largest possible size, i.e. LCTBs,
defined at the sequence level. Then, the HEVC quad-tree partitioning is replicated, with each CTB recursively
split into 4 blocks with half the height and width of their parent CTB, until the maximum depth is reached. Each
leaf CTB in this operation is considered a TU and is processed individually. At this stage, the processing moves
from the frame level to the TU level, as intended.
After the frame partitioning, a verification has to be made to check if the processed TU was inter or intra-coded in
the HEVC encoder module. This verification is performed with the help of the coding modes data. If the TU was
intra-coded, its processing stops here; otherwise, if it is an inter-coded TU, its processing continues to the
following steps.
In terms of the actual implementation, this partitioning is made with a MATLAB function especially designed
and programmed for this purpose. This function uses the following steps:
Partitioning in LCTBs – First, the numbers of LCTBs in a row and in a column are computed by
dividing the numbers of rows and columns in the frame by the LCTB width (or height), respectively.
With this information, it is then possible to go through all the LCTBs' first pixel positions, using a
combination of two nested loops: one with a number of iterations equal to the number of LCTBs
in a row and the other with a number of iterations equal to the number of LCTBs in a column.
Partitioning in TUs – For each LCTB, a recursive function is then used. This function basically starts
with a reference depth value of 0, a reference width value equal to the LCTB width (the LCTB height
could also be used here) and the first pixel position of the current LCTB (current pixel position) as
inputs. Then, the reference depth value is compared to the depth value present in the split flags matrix
position corresponding to the current pixel position. If the reference depth value is smaller than the
depth value of the current pixel position, its value is incremented and the reference width value is
divided by 2. Then, the recursive function is computed again with the newly computed reference values
and with the following 4 pixel positions as inputs (corresponding to the first pixel positions of the 4 new
block partitions):
o The current pixel position.
o The pixel position distanced reference width pixels away from the current pixel position
horizontally.
o The pixel position distanced reference width pixels away from the current pixel position
vertically.
o The pixel position distanced reference width pixels away from the current pixel position both
horizontally and vertically.
This is done until the reference depth value is equal to the depth value of the pixel position being processed.
When this happens, the TU level has been reached.
TU coding mode – Next, to verify if a particular TU was inter or intra-coded, the coding modes matrix
value corresponding to the pixel position being processed is checked: if it is '0' then the TU was intra-
coded and thus its processing stops here; otherwise (if it is '1'), the TU was inter-coded and thus its
processing continues to the next step.
As a result of this partitioning, each inter-coded TU is clearly identified by its first pixel position and by its size
in the remaining steps of the adopted coding solution.
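The recursive TU collection described above can be sketched as follows. This is illustrative Python rather than the actual MATLAB function, and the example depth map is one layout consistent with the TU counts of Figure 4.2 (b); the exact arrangement is assumed:

```python
def collect_tus(depths, y, x, size, depth, sctb):
    """Recursively split a block according to the stored TU depths (one
    value per SCTB-sized cell) and yield (y, x, size) for each leaf TU."""
    if depths[y // sctb][x // sctb] > depth:
        half = size // 2
        for dy, dx in ((0, 0), (0, half), (half, 0), (half, half)):
            yield from collect_tus(depths, y + dy, x + dx, half,
                                   depth + 1, sctb)
    else:
        yield (y, x, size)

# A depth map for a 32x32 LCTB with a 4x4 SCTB, consistent with the TU
# counts of Figure 4.2 (b): three 16x16 (depth 1), three 8x8 (depth 2)
# and four 4x4 (depth 3) TUs.  The exact layout is assumed.
depth_map = [[1] * 8 for _ in range(8)]
for r in range(4, 8):
    for c in range(4, 8):
        depth_map[r][c] = 3 if (r >= 6 and c >= 6) else 2

tus = list(collect_tus(depth_map, 0, 0, 32, 0, 4))  # 10 TUs in total
```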
4.4.3. Motion Compensation Prediction Block Computation
As referred before, the MCP block is needed to compute the MKLT basis functions as proposed in [15]. Besides
this, it is also necessary to obtain the prediction error block at the encoder side, which is essential as this is the
data that is going to be transformed at the encoder and reconstructed at the decoder. Thus, the MCP block has to
be determined. To do this, the motion vector of a particular TU is used. This motion vector points to the position
of the MCP block at the reference frame. In reality, to provide half and quarter-pixel accuracy, it points to the
position of the Upsampled MCP (UMCP) block at the upsampled reference frame. Thus, to compute the MCP
block, the UMCP is first obtained and it is after downsampled at the end of this process.
Considering that a particular TU first pixel position (always considered the top-left corner pixel position) and
size are known (as a result of the frame partitioning module process which was described before), the
determination of its MCP block in the developed script is performed by the following sequence of steps:
1. Scaling of the first pixel position - First, the position of the first pixel of the currently being coded TU
is scaled by a factor equal to the upsampling factor used in the reference frame upsampling (L=4). To
do this, the variables representing the first pixel position (x and y) are multiplied by 4.
2. MCP first pixel position – Then, the motion vector values corresponding to the currently being coded
TU are obtained from the motion vectors matrix and added to the scaled pixel position obtained in step
1. This new position points to the first pixel position of the MCP block in the upsampled reference
frame.
3. Upsampled MCP block – It is then possible to crop the UMCP block from the upsampled reference
frame using the position obtained in step 2 as the first pixel of this block and considering that its size is
4 times the size of the currently being coded TU.
4. MCP block - Finally, to obtain the MCP block, the UMCP block is downsampled. To do this, all the
integer positions in the UMCP are arranged in a new block with the same size of the currently being
coded TU.
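The four steps above can be sketched as follows. This is an illustrative Python/NumPy rendering of the described processing, not the thesis's MATLAB script; all names are assumed:

```python
import numpy as np

L = 4  # upsampling factor used for the reference frame, as in the text

def mcp_block(up_ref, x, y, size, mv):
    """up_ref: upsampled reference frame; (x, y): TU top-left position in the
    original frame; size: TU width; mv: (mvx, mvy) in quarter-pel units."""
    # 1. scale the first pixel position by the upsampling factor
    ux, uy = x * L, y * L
    # 2. add the motion vector to locate the UMCP first pixel position
    ux, uy = ux + mv[0], uy + mv[1]
    # 3. crop the UMCP block (L times the TU size in each dimension)
    umcp = up_ref[uy:uy + L * size, ux:ux + L * size]
    # 4. downsample: keep only the integer pixel positions
    return umcp[::L, ::L]
```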
This operation is illustrated in Figure 4.8, for a particular 4×4 TU.
Figure 4.8 – Example of MCP block computation for a 4×4 TU.
The grid presented in Figure 4.8 represents the pixels of a portion of the upsampled reference frame. Pixel U
corresponds to the scaled position of the first pixel position of this particular TU (step 1). Adding the motion
vector values to pixel U, which are (3,-2) in this case, results in the position of pixel R, which is the first pixel of
the MCP block (step 2). Then, the UMCP block (delimited by a blue line in Figure 4.8) can be obtained by
cropping the resulting 16×16 block that starts on pixel R (step 3). Finally, to downsample this UMCP block, the
R and r pixels (representing the integer pixel positions of the UMCP block) are arranged in a 4×4 block, forming
the MCP block as in Figure 4.9 (step 4).
Figure 4.9 – MCP block for the example in Figure 4.8 after the downsampling operation.
After obtaining the MCP block, this module's job is concluded and the encoding process continues to the forward
adaptive transform computation.
4.4.4. Forward Adaptive Transform
This module implements the main tool of the developed coding solution: the adaptive transform. The adaptive
transform is used to convert the prediction error block from the spatial-domain to the frequency-domain. As
referred before, the proposed adaptive transform uses two transforms, the DCT and the MKLT; thus, both these
transforms are computed. The decision about which transform coefficients are coded is only made after the
entropy encoder module, as this decision requires knowing the number of bits required for each transform's
resulting bitstream. The architecture of the forward adaptive transform is shown in Figure 4.10.
Figure 4.10 – Architecture of the forward adaptive transform module.
A detailed walkthrough of the architecture presented in Figure 4.10 is now presented to better understand each
processing block.
1) Forward DCT
The first transform to be computed is the DCT. This transform is a standard floating point 2-D DCT, already
described in Chapter 2. Thus, the forward DCT of an n×n prediction error block X is given by
C_DCT = T_DCT · X · T_DCT^T    (4.1)
where C_DCT is the n×n DCT coefficients block and T_DCT is the n×n DCT basis functions matrix defined as
t_DCT(j, i) = c_j · cos((2i + 1)·j·π / (2n)),  with c_0 = √(1/n) and c_j = √(2/n) for j > 0    (4.2)
where tDCT (j, i) represents the value of the DCT basis functions matrix at position (j, i). In the developed
MATLAB script, this transform is computed using the MATLAB function dctmtx [27], which returns the n×n
DCT basis functions. This basis functions matrix is then used to compute the DCT coefficients according to Eq.
(4.1).
With the DCT coefficients for the prediction error obtained, it is then possible to quantize and entropy encode
them.
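The computation above can be sketched in a few lines. This illustrative Python/NumPy version builds the basis functions matrix the way MATLAB's dctmtx does and applies Eq. (4.1); names are assumed:

```python
import numpy as np

# Illustrative sketch of Eqs. (4.1)-(4.2), not the thesis script: build the
# n-by-n DCT basis functions matrix (as MATLAB's dctmtx returns it) and apply
# C_DCT = T_DCT * X * T_DCT' to a prediction error block X.

def dct_matrix(n):
    T = np.zeros((n, n))
    for j in range(n):
        scale = np.sqrt(1.0 / n) if j == 0 else np.sqrt(2.0 / n)
        for i in range(n):
            T[j, i] = scale * np.cos(np.pi * (2 * i + 1) * j / (2 * n))
    return T

def forward_dct(X):
    T = dct_matrix(X.shape[0])
    return T @ X @ T.T    # Eq. (4.1); T is orthonormal, so X = T' C T
```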
2) Forward MKLT
Besides the DCT, the adaptive transform may also use a modified KLT. This MKLT is similar to the one
proposed in [15], with the only difference being related to the used shift (δ) and rotation (θ) parameters. The
MKLT computation involves three main steps: prediction error estimation computation, basis functions
computation and the MKLT transform computation itself. After the more conceptual description made in
Chapter 3, these three steps are described next with more focus on the implementation aspects.
a. Prediction error estimation computation
The implemented solution uses the same prediction error estimation technique described in Chapter 3, meaning
that it also uses the MCP block to estimate the prediction error by subtracting rotated and shifted MCP blocks
from the actual MCP block, resulting in a set of estimated prediction error blocks. The block rotations and shifts are
explained in the following, noting that both the shifts and rotations are applied over the upsampled MCP block
(UMCP) (with L=4) as both operations require quarter-pixel accuracy. To obtain the UMCP, the process used in
the MCP block computation is repeated, but now without the final downsampling step.
Rotations processing
First, the UMCP block is rotated by an angle θ using the following steps:
Coordinate system definition – First of all, the block positions need to be converted to a new
coordinate system. This is done to use a rotation matrix R allowing the rotation of points in the xy-
Cartesian plane by an angle θ around the origin of the Cartesian coordinate system with a simple matrix
multiplication [28]. In this case, the block needs to be rotated around its centre; thus, the origin of the
Cartesian coordinate system must be the centre of the block. As all processed blocks have even width
and height (e.g. 4×4, 8×8, etc.), this centre is not a pixel position, but the intersection of the four central
pixel positions. With this in mind, the adopted coordinate system used to process the rotations is shown
in Figure 4.11 for a 4×4 block.
Figure 4.11 – Adopted coordinate system for a 4×4 block.
As shown in Figure 4.11, each pixel has size 2 in this coordinate system, so that integer coordinate values
can reference the centres of the pixel positions. The odd coordinates are located at the middle of the pixel
positions, while the even ones are located at their intersections. In this way, only odd coordinates refer to
actual pixel positions, in this case corresponding to the centre of the pixel. Consequently, converting the
block pixel positions to the adopted coordinate system yields only odd coordinates.
Rotation matrix definition – With the adopted coordinate system defined, it is now possible to
perform the rotation by an angle θ around the block origin by means of a matrix R given by [28]
R = [ cos θ   −sin θ ;  sin θ   cos θ ]    (4.3)
Rotated coordinates – With this matrix, it is then possible to rotate a block using the following matrix
multiplication [28]
[ x′ ; y′ ] = R · [ x ; y ]    (4.4)
where (x’,y’) are the coordinates of the point (x,y) after rotation. Clearly, this operation can result in values
that are not odd for the rotated coordinates (x’,y’). Thus, all the values are rounded to the nearest odd value,
so they can reference an actual pixel position.
With these definitions, it is possible to rotate the UMCP block. To this end, consider the UMCP block as part
of the upsampled reference frame, with the rotation axis centred at the centre of the UMCP block. To better
understand this, Figure 4.12 shows the rotation of an upsampled 4×4 MCP block (corresponding to a 16×16
block in reality) by an angle θ around its origin.
Figure 4.12 – Rotation of a 4×4 UMCP block by an angle θ around its origin.
Figure 4.12 shows the UMCP block (green coloured area) as part of the upsampled reference frame (blue
coloured area) before rotation. After an angle θ rotation is computed, the rotated UMCP block is identified by
the darker green and darker blue pixel positions. To better understand this operation, consider a window that
initially just shows the UMCP block, hiding the rest of the upsampled reference frame. Then, by rotating this
window, some previously shown pixel positions disappear and some previously hidden pixel positions appear;
this rotated window represents the rotated UMCP block.
As already referred, the rotations are computed for upsampled blocks. Thus, the angle by which the blocks are
rotated needs to be scaled by a convenient factor. To explain how this scaling factor is defined, consider Figure
4.13, where two vectors are displayed: a vector v1 connecting the point D (located on the x-axis and distanced d
from the origin) to point P1 (located on the y-axis and distanced h from the origin) and making an angle θ1 with
the x-axis, and a vector v2 connecting the same point D to point P2 (located on the y-axis and distanced L·h from
the origin, where L represents the upsampling factor) and making an angle θ2 with the x-axis.
Figure 4.13 – Two vectors, v1 and v2, connecting the same point D to two different points, P1 and P2,
respectively.
Considering that D can be the centre of a particular block, it is possible to consider that P1 is a pixel position and
P2 is the corresponding pixel position in the upsampled block (with an upsampling factor L). With this in mind,
it is also possible to consider that the angles θ1 and θ2 represent exactly the same angle before and after the
upsampling process, respectively. Thus, to determine the convenient scaling factor for the rotation angle, it is only necessary
to find the relation between θ1 and θ2. In this way, taking into account the definition of the tangent, the
tangents of θ1 and θ2 are given by
tan θ1 = h / d,    tan θ2 = (L·h) / d    (4.5)
Combining the two equations in Eq. (4.5), it is then possible to obtain the following relation between θ1 and θ2
θ2 = arctan(L · tan θ1)    (4.6)
Using Eq. (4.6), and knowing that L=4, it is simple to obtain the scaled rotation angle for any θ value.
Concerning the θ values used to perform the rotations, besides the 0.0° and 0.5° rotation angles already used in
[15] (both clockwise and counter-clockwise), the developed coding solution also considers rotations up to a 1.0°
angle (in both directions). This results in a total of 5 possible rotations for each TU, namely 0.0°, ±0.5° and ±1.0°.
Applying the scaling factor in Eq. (4.6) to the previously mentioned θ values, results in 0.0°, 1.99° and 3.99°
rotation angles for the UMCP blocks, which are approximated to 0.0°, 2.0° and 4.0°, respectively.
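The scaling relation of Eq. (4.6) can be checked numerically with a short sketch (illustrative, not the thesis script):

```python
import math

# Numerical check of Eq. (4.6): the angle applied to the upsampled block is
# theta2 = arctan(L * tan(theta1)), with L = 4 as in the thesis.

def scaled_angle(theta1_deg, L=4):
    return math.degrees(math.atan(L * math.tan(math.radians(theta1_deg))))
```

Evaluating this function at 0.5° and 1.0° gives approximately 1.999° and 3.994°, matching the 2.0° and 4.0° approximations used above.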
In the developed script, the rotations are performed by a purpose-written MATLAB function. This function
receives as inputs the position of the UMCP first pixel in the context of the upsampled reference frame
(determined as described in the MCP block computation module), the UMCP size, the upsampled reference
frame matrix and the rotation angle to be applied. Then the following steps are performed:
Rotation matrix definition – After the conversion from degrees to radians, the sine and the cosine of
the input angle are determined using the sin [29] and cos [30] MATLAB functions. With these values, it
is possible to define the rotation matrix as in Eq. (4.3).
New coordinate system definition – Then, each block position is converted to the coordinate system
defined before. To do this, each block position is arranged in a column vector and all these column
vectors (one for each block position) are arranged sequentially in a 3-D variable. With this, all these
positions are then multiplied by a factor which basically converts the block positions to the previously
defined coordinate system, centering the coordinates origin at the block centre. This is shown in Figure
4.14 for the 4×4 block used as example in Figure 4.11.
Figure 4.14 – Block positions (blue) converted to the adopted coordinate system (red) for the block in Figure
4.11.
Rotation computation – With the new coordinate system defined, each 3-D variable column vector is
then multiplied by the rotation matrix as done in Eq. (4.4) and the obtained values are rounded to the
nearest odd value. This results in a 3-D variable containing column vectors representing the rotated
coordinates.
Rotated UMCP block computation – These rotated coordinates are then converted to the
corresponding block positions, performing the inverse operation of the operation illustrated in Figure
4.14. Additionally, the obtained rotated block positions are incremented by the value of the UMCP
block first pixel position. With this, the resulting positions are collected from the upsampled reference
frame to form the rotated UMCP block.
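The coordinate conversion, rotation and odd-value rounding above can be sketched as follows. This is an illustrative Python version (not the thesis's MATLAB function) that maps each block position to its rotated position; names and the exact rounding helper are assumptions:

```python
import numpy as np

# Illustrative sketch of the rotation steps above for an n-by-n block: convert
# each position to the centred coordinate system of Figure 4.11 (pixel centres
# at odd coordinates), rotate with Eq. (4.4), round to the nearest odd value,
# and map back to block indices.

def nearest_odd(v):
    return 2 * int(np.floor(v / 2.0)) + 1

def rotate_positions(n, theta_deg):
    th = np.radians(theta_deg)
    R = np.array([[np.cos(th), -np.sin(th)],
                  [np.sin(th),  np.cos(th)]])    # rotation matrix, Eq. (4.3)
    mapping = {}
    for row in range(n):
        for col in range(n):
            x = 2 * col - (n - 1)                # centred odd coordinates
            y = (n - 1) - 2 * row
            xr, yr = R @ np.array([x, y])        # Eq. (4.4)
            xo, yo = nearest_odd(xr), nearest_odd(yr)
            mapping[(row, col)] = ((n - 1 - yo) // 2, (xo + n - 1) // 2)
    return mapping
```

A 0° rotation maps every position to itself, as expected for the identity case.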
With this rotated UMCP block, it is then possible to process the corresponding shifts.
Shifts processing
After each rotation, the resulting rotated UMCP block can then be shifted according to a parameter δ, expressed
in pixels. With the rotated UMCP block still considered as part of the upsampled reference frame, these shifts are
made by simply incrementing and decrementing the coordinates of each rotated UMCP block pixel position by δ pixels.
This operation is illustrated in Figure 4.15, where a rotated UMCP block is shifted in all possible directions with
a shift parameter equal to δ.
Figure 4.15 – Shifts applied to a rotated UMCP block with a shift parameter equal to δ for the horizontal and
vertical directions.
From Figure 4.15, it is possible to conclude that the combination of all possible shifts results in 8 shifted UMCP
blocks for each considered rotation. Reusing the window analogy, consider now that the rotated window
(representing the rotated UMCP block) is then displaced δ pixels in all possible directions, leading to 8 new
dispositions, representing the 8 possible shifted and rotated UMCP blocks.
Concerning the available δ values, besides the 0.00, 0.25 and 0.50 pixel shifts used in [15], the developed coding
solution also considers the δ values of 0.75 and 1.00 pixels. The combination of all these shift parameters results
in the set of blocks shown in Figure 4.16.
Figure 4.16 – Set of shifted and rotated UMCP blocks for all possible δ combinations (for each θ).
The set in Figure 4.16 includes 80 shifted and rotated UMCP blocks (from block 2 to block 81) and 1 purely
rotated UMCP block (block 1). In Figure 4.16, the blue coloured blocks correspond to those blocks already used
in the solution proposed in [15] (25 in total), while the green coloured blocks correspond to the newly introduced
shift and rotations combinations.
At this stage, with both the rotation and shift computations concluded, the rotated and shifted UMCP blocks can
be downsampled, using the same technique adopted for the MCP block computation. To obtain the set of
estimated prediction error blocks, the set of rotated and shifted MCP blocks is then subtracted from the actual
MCP block. In the developed MATLAB script, these estimated prediction error blocks are stored in a 4-D
variable, with the third dimension corresponding to the available rotations and the fourth dimension
corresponding to the available shifts.
Considering that each TU can have a set of 81 estimated prediction error blocks for each rotation and there can
be a maximum of 5 different rotations, it is possible to obtain a maximum of 405 estimated prediction error
blocks (in the solution presented in [15], a total of 75 estimated prediction error blocks are used).
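The candidate set can be enumerated with a short sketch. Assuming (as Figure 4.16 suggests) that the horizontal and vertical shift components are chosen independently from {0, ±0.25, ±0.5, ±0.75, ±1.0} pixels, each rotation yields 9×9 = 81 combinations (80 shifted blocks plus the purely rotated one), for 405 in total:

```python
# Illustrative enumeration of the candidate (theta, dx, dy) combinations
# described above; the independent per-axis shift values are an assumption
# consistent with the 81-blocks-per-rotation count of Figure 4.16.

thetas = (-1.0, -0.5, 0.0, 0.5, 1.0)                            # degrees
shifts = (0.0, 0.25, -0.25, 0.5, -0.5, 0.75, -0.75, 1.0, -1.0)  # pixels
candidates = [(t, dx, dy) for t in thetas for dx in shifts for dy in shifts]
```

Restricting the shift values to {0, ±0.25, ±0.5} reproduces the 25 combinations per rotation of the solution in [15].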
b. MKLT basis functions computation
With the set of estimated prediction error blocks determined, it is then possible to compute the MKLT basis
functions exactly as done in [15]. Thus, first, the covariance matrix Σ of the set of estimated prediction
error blocks needs to be determined. This is achieved by using the equation already presented in Section 3.1,
which defines the covariance between a pixel in position (u,v) and a pixel in position (r,s) for a set of n×n
estimated prediction error blocks as
Σ(j, k) = (1/N) · Σ_{i=1..N} [E_i(u, v) − Ē(u, v)]·[E_i(r, s) − Ē(r, s)]    (4.7)
where u, v, r, s = 0…(n−1), j = u + n·v, k = r + n·s, E_i(u, v) is the estimated prediction error in position (u, v) of the i-th
block, Ē(u, v) is the corresponding mean value over the set and N is the number of blocks in the set. To implement Eq. (4.7) in the developed
MATLAB script, a function was programmed whose processing steps are described next:
First, the 4-D variable used to store the set of prediction error blocks is converted to a n2×N matrix, with
each column containing the pixel values of each block arranged in a vector. This conversion is done
with the reshape function [31] included in the MATLAB toolbox.
Then, each row of the obtained matrix, representing the pixel values of a particular position for all the
set blocks, is fixed (working as a pivot) and multiplied element-by-element to all the matrix rows
individually. Each of these multiplications results in an N-element row vector whose elements are summed and
then divided by N².
As there are n² rows in the matrix, each pivot row produces n² results from the previous operation.
These results are arranged in a row vector representing the covariance of the pixel position
corresponding to a particular pivot row with all the other pixel positions.
Doing this for all the n² rows of the matrix results in an n²×n² matrix representing the covariance matrix.
With the covariance matrix Σ of size n²×n² determined, it is then straightforward to compute the eigenvalues and
eigenvectors of this matrix, given by
Σ·Φ = Φ·Λ    (4.8)
where Φ is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues of the covariance matrix Σ,
and both have size n²×n². To implement Eq. (4.8) in the developed MATLAB script, the eig function [32]
included in the MATLAB toolbox was used. This function returns the diagonal matrix of eigenvalues and the
matrix of eigenvectors for a particular input matrix, as intended. The transpose of the eigenvectors matrix
represents the MKLT basis functions and can be used to compute the actual transform.
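The covariance and eigendecomposition steps above can be sketched as follows. This is an illustrative Python/NumPy version operating on random stand-in data (not real prediction error blocks); eigh is used instead of MATLAB's eig because the covariance matrix is symmetric, and all names are assumed:

```python
import numpy as np

# Illustrative sketch of Eqs. (4.7)-(4.8): the N estimated prediction error
# blocks are arranged as columns of an n^2-by-N matrix, the n^2-by-n^2
# covariance matrix is formed, and its transposed eigenvectors give the MKLT
# basis functions. Random data stands in for the real block set.

rng = np.random.default_rng(1)
n, N = 4, 405
E = rng.standard_normal((n * n, N))       # columns: flattened error blocks
mean = E.mean(axis=1, keepdims=True)
Sigma = (E - mean) @ (E - mean).T / N     # Eq. (4.7) in outer-product form
eigvals, eigvecs = np.linalg.eigh(Sigma)  # Eq. (4.8); eigh suits symmetric Sigma
T_mklt = eigvecs.T                        # MKLT basis functions matrix
```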
c. Forward MKLT computation
As noted before, the KLT is non-separable and the MKLT inherits this property. In this way, the actual
prediction error has to be arranged in a column vector before its transformation. This is done by laying all the
prediction error block pixels end to end, resulting in an n²-element vector for an n×n prediction error block. The forward
MKLT for an input vector x is given by
c_MKLT = T_MKLT · x    (4.9)
where c_MKLT is the n²-element MKLT coefficients vector and T_MKLT is the n²×n² MKLT basis functions matrix. To have
some conformity with the DCT, the MKLT coefficients are then rearranged in an n×n block, denominated C_MKLT,
using once again the MATLAB reshape function.
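The flatten-transform-reshape sequence above can be sketched as follows (illustrative Python, not the thesis script; column-major reshaping mimics MATLAB's reshape, and the inverse is included only to show the round trip):

```python
import numpy as np

# Illustrative sketch of Eq. (4.9): the MKLT is non-separable, so the n-by-n
# block is flattened to an n^2 vector, multiplied by the basis functions
# matrix, and the coefficients are reshaped back to n-by-n.

def forward_mklt(X, T_mklt):
    n = X.shape[0]
    x = X.reshape(n * n, order="F")        # column-major, as MATLAB reshape
    c = T_mklt @ x                         # c_MKLT = T_MKLT * x
    return c.reshape(n, n, order="F")      # C_MKLT

def inverse_mklt(C, T_mklt):
    n = C.shape[0]
    c = C.reshape(n * n, order="F")
    return (T_mklt.T @ c).reshape(n, n, order="F")  # basis is orthonormal
```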
With both transforms performed, the following step is the quantization of the obtained coefficients.
4.4.5. Quantization
The quantization of both the DCT and MKLT coefficients is performed by means of a uniform quantizer, i.e., a
quantizer with fixed size for both the input decision intervals and the output reconstruction level differences
[33]. Thus, the quantized coefficients C_Q are given by
C_Q = round(C / Q_step)    (4.10)
where C are the coefficients (either DCT or MKLT coefficients) and Q_step is the adopted quantization step. This
quantization step is obtained as in the H.264/AVC standard using the following formula [34]
Q_step(QP) = Q_step(QP % 6) · 2^⌊QP/6⌋    (4.11)
where QP is the quantization parameter and x%y defines the remainder of the division of x by y. The necessary
reference QP values with their corresponding Q_step values are shown in Table 4.2.
Table 4.2 – Reference QPs with the corresponding Qstep [34].
QP    Qstep
0     0.625
1     0.702
2     0.787
3     0.884
4     0.992
5     1.114
6     1.250
With both the DCT and the MKLT coefficients quantized, it is then possible to entropy encode them.
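The quantization rule above and its inverse can be sketched as follows (illustrative Python, not the thesis script; function names are assumptions):

```python
import numpy as np

# Illustrative sketch of the uniform quantizer above: the H.264/AVC-style
# step size doubles every 6 QP values from the Table 4.2 reference steps.

QSTEP_REF = (0.625, 0.702, 0.787, 0.884, 0.992, 1.114)

def qstep(qp):
    return QSTEP_REF[qp % 6] * 2 ** (qp // 6)   # Eq. (4.11)

def quantize(C, qp):
    return np.round(C / qstep(qp))              # Eq. (4.10)

def dequantize(CQ, qp):
    return CQ * qstep(qp)                       # inverse quantization
```

Note that qstep(6) = 1.25 = 2 × qstep(0), consistent with the doubling every 6 QP values.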
4.4.6. Entropy Encoder
The entropy encoder module is the last module of the encoding process. Besides coding the quantized
coefficients into their corresponding bitstreams, this module is also used to decide which of the available
transforms must be used to code a particular TU. This decision is made based on the number of bits necessary to
represent each transform's coefficients. In this way, the transform which can be entropy encoded using
fewer bits is selected, and its bitstream is sent to the decoder side. The entropy encoder module architecture is
presented in Figure 4.17.
Figure 4.17 – Architecture of the entropy encoder module.
The entropy encoder module includes the following steps:
1) Transform coefficients scanning
To entropy encode the quantized coefficients, they have to be first arranged in a vector. To do this, the DCT
coefficients are scanned in zigzag order and the MKLT coefficients are rearranged into their original vector
form. In terms of implementation, this is done, for the DCT case, with a function programmed by the author of
this Thesis. This function receives a matrix containing the DCT coefficients and rearranges them according to
the zigzag scanning order, returning the corresponding vector. For the MKLT case, the conversion from a 2-D to
a 1-D representation is performed with the basic MATLAB manipulation tools.
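The zigzag scanning step can be sketched as follows. This is an illustrative Python version, not the MATLAB function written for the thesis; it walks the anti-diagonals of the block in the usual JPEG order:

```python
# Illustrative zigzag scan for an n-by-n coefficient block (names assumed).

def zigzag_indices(n):
    order = []
    for s in range(2 * n - 1):               # walk the anti-diagonals
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        order.extend(diag if s % 2 else diag[::-1])
    return order

def zigzag_scan(block):
    return [block[i][j] for i, j in zigzag_indices(len(block))]
```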
2) Run-level encoder
Then, both coefficient vectors are coded using the run-level coding method used in JPEG [35]. In this method,
the encoder basically organizes the quantized coefficients vector in (run, level) pairs, where the run indicates the
number of null coefficients between the last and the current non-null coefficient and the level indicates the
quantized amplitude of the current coefficient.
To implement this encoder, a MATLAB function was programmed. This function uses an auxiliary variable to
store the number of null coefficients, which is initialized with the value 0. Then, each coefficient can be coded in
two different ways:
Null coefficient – If the currently being coded coefficient is null, the auxiliary variable is incremented
and the coding proceeds to the next coefficient.
Non-null coefficient – If the currently being coded coefficient is non-null, the number of null
coefficients since the last non-null coefficient (stored in the auxiliary variable) and the coefficient's
amplitude are added to the output string as a (run, level) pair, and the auxiliary variable is
re-initialized to 0.
Doing this for all the coefficients in both transform vectors results in two strings with the corresponding (run,
level) pairs, one for each available transform.
3) LZ77 encoder
Finally, these strings comprised by (run, level) pairs are entropy encoded using the LZ77 lossless data
compression algorithm [36]. This algorithm is used mainly because of its simple implementation. Since the
entropy coding is not the object of study in this work, the author of this Thesis tried to find a solution that could
exploit the data statistical redundancy in the best possible way, without requiring too much time in its
development and implementation.
The LZ77 algorithm exploits the character redundancy in an input stream by replacing portions of the data with
references to matching data previously processed. To do this, a sliding window is used that is comprised by a
search buffer and a lookahead buffer. The search buffer goes from the beginning of the sliding window to the
character immediately before the current coding position. This buffer is used to search for data matches within
the lookahead buffer, which goes from the current coding position to the end of the sliding window. To better
understand this algorithm, consider the input stream in Figure 4.18 where the third character ('B') is being
coded.
Figure 4.18 – LZ77 terminology considering the coding of the third character in the input symbol stream.
The output of this encoder is a sequence of (length, distance) pairs followed by the explicit character that was
not found in the search buffer. The length indicates the number of characters that the decoder has to go back in
order to find the beginning of the match, while the distance indicates the number of characters that the decoder
has to copy to its output. In this way, the encoder‟s output for the input stream in Figure 4.18 would be:
(0,0) A; (1,1) B; (0,0) C; (2,1) B; (5,2) End
After some tests performed with blocks of size 4×4, 8×8, 16×16 and 32×32, it was noted that the LZ77 encoder
provides higher compression factors for sliding windows of size n² − 2 (considering an n×n block). In this way,
for 4×4 TUs, the sliding window size is fixed at 14 characters and, for 8×8 TUs, the sliding window size is fixed
at 62 characters. However, for 16×16 and 32×32 block sizes, the sliding windows are also defined with size 62,
since the use of larger window sizes does not bring a sufficient compression ratio improvement to compensate
the complexity increase. These tests were not performed for 64×64 blocks (another possible transform size in the
HEVC codec) because, at this stage, it was already decided that the maximum transform size used to test the
adopted coding solution would be 32×32 due to the very significant complexity increase caused by the
computation of 64×64 transforms.
To implement this entropy encoder, a MATLAB function was programmed with the following steps:
While the lookahead buffer is not empty, the search buffer is searched in order to find the longest match
with the lookahead buffer characters. One of two things can then happen:
o If a match is found, the function adds the corresponding (length, distance) pair and the next
character in the lookahead buffer (referred before as the explicit character) to the function
output.
o If no match is found in the search buffer, the function adds a (length, distance) pair with values
(0,0) and the next character in the lookahead buffer to the function output.
In either case, after the previous step, the sliding window is shifted by the number of coded characters
(if a match was found, this number is equal to distance + 1; otherwise, it is just 1). Then, the next
coding position is processed in the same way.
This method is processed until there are no more characters to code in the input stream. It has to be noted that
each (length, distance) pair element is coded with the same number of bits necessary to represent half the sliding
window size value in binary representation. In this way, the entropy decoder can unambiguously identify each
(length, distance) pair.
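The LZ77 round trip can be sketched as follows. This is an illustrative Python version, not the thesis's MATLAB function; it follows the pair convention used above, where length is how far the decoder steps back and distance is how many characters it copies, each token carrying the explicit character as a third element. The default window of 14 matches the 4×4 TU case:

```python
# Illustrative LZ77 sketch (names assumed): tokens are (length, distance,
# explicit char), with length = backward offset and distance = copy count,
# following the naming convention used in the text above.

def lz77_encode(data, window=14):
    tokens, pos = [], 0
    while pos < len(data):
        start = max(0, pos - window)
        best_back, best_count = 0, 0
        for back in range(1, pos - start + 1):
            count = 0
            # overlapping matches are allowed, as in LZ77
            while (pos + count < len(data) - 1 and
                   data[pos - back + (count % back)] == data[pos + count]):
                count += 1
            if count > best_count:
                best_back, best_count = back, count
        tokens.append((best_back, best_count, data[pos + best_count]))
        pos += best_count + 1
    return tokens

def lz77_decode(tokens):
    out = []
    for back, count, ch in tokens:
        start = len(out) - back
        for i in range(count):        # copy from earlier output (may overlap)
            out.append(out[start + i])
        out.append(ch)                # the explicit character
    return "".join(out)
```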
4) Decision module
With both transform bitstreams generated, it is then possible to select the one to be transmitted to the decoder
side. To do this, the size of the DCT and MKLT bitstreams are compared, and the one represented with fewer
bits is selected. For the decoder to recognize which transform was selected, an extra bit is included in the coded
bitstream; more precisely, a '0' is used for the DCT and a '1' is used for the MKLT.
With this, the adaptive transform bitstream is sent to the decoder and the encoding process is concluded. In the
next section, the decoding process is analyzed.
4.5. AT Decoder Functional Description and Implementation Details
After the functional description and the explanation of the implementation details of the HEVC framework and
the AT encoder processes, this section is dedicated to the explanation of the AT decoding process. From the
observation of Figure 4.1, it is possible to conclude that the decoding process includes six modules: the
reference frame upsampling, the MCP block computation, the entropy decoder, the inverse quantization, the
inverse adaptive transform and the frame reconstruction modules. These modules will be described and analyzed
in the following sections, with exception of the reference frame upsampling and the MCP block computation
modules that were already explained in the AT encoder section.
It has to be noted at this stage that the decoding process was not implemented independently from the encoder.
Instead, the inverse quantization and the inverse transform of each TU coefficients are performed immediately
after their transform and quantization in the developed MATLAB script. Thus, after the transform selection
made in the entropy encoder, the reconstructed prediction error block for the selected transform is copied to the
reconstructed prediction error frame, using the TU information obtained in the frame partitioning module. In this
way, some of the modules present in the decoding process were not really implemented, since the information
provided by them was already available as no real transmission was performed. This is the case of the entropy
decoder module, that decodes the adaptive transform quantized coefficients, and the frame reconstruction
module, that uses each TU split flag and coding mode to obtain the corresponding TU location and size in
relation to the frame. This approach does not influence the coding performance of the adopted coding solution,
but it just takes advantage of the fact that, in reality, both the encoding and decoding processes were
implemented in the same platform.
4.5.1. Entropy Decoder
To decode the AT bitstream created by the entropy encoder, an entropy decoder is used that basically performs
the inverse operation performed at the encoder. Once again, it has to be noted that this module was not
implemented in the developed MATLAB script, since the data provided by it was already known to the
implementation. The architecture of the entropy decoder module is presented in Figure 4.19.
Figure 4.19 – Architecture of the entropy decoder module.
A walkthrough of the processing blocks present in Figure 4.19 is now presented.
1) Selected transform bit extraction
The first operation to be performed is to read the extra bit that indicates the transform selected for the encoding
process of each TU ('0' for the DCT and '1' for the MKLT).
2) LZ77 decoder
Then, a LZ77 decoder is used to process the (length, distance) pairs contained in the bitstream. This decoder
uses the same sliding window size used in the encoding process. A LZ77 decoder is implemented with the
following steps [37]:
For each (length, distance) pair followed by an explicit character, the value of the variable length is
verified and one of the following steps is taken:
o If length has a value equal to 0, the explicit character is printed to the decoder output.
o Otherwise, a distance number of characters are copied from the current output (before the
process of this step) starting from the character distanced by length positions from the last
position of this output. The copied characters are then added to the output, along with the
explicit character.
This process is done for the whole bitstream, resulting in a number of (run, level) pairs, as defined in the forward
adaptive transform module.
3) Run-level decoder
These pairs are then processed to be arranged in a vector. To implement this decoder, considering an n×n TU, an n²-element
vector must first be pre-allocated with 0s. Then, starting with a pointer to the first position of the vector, for each
(run, level) pair, the number of null coefficients (run) must be added to this pointer and the coefficient amplitude
(level) must be copied to the position indicated by the pointer. This is performed until all the (run, level) pairs are
processed.
4) Coefficients arrangement
Depending on the selected transform, the vector obtained with the run-level decoder is then arranged in an n×n
block using inverse zigzag scanning (for the DCT) or sequential column order (for the MKLT). For the DCT case, if this
module was implemented, a MATLAB function would be developed performing exactly the opposite of the
function developed for the encoder process. Thus, this function would receive the vector with the DCT
coefficients arranged in the zigzag order, and would rearrange them returning a block of DCT coefficients. Once
again, for the MKLT, this operation is trivial using basic MATLAB functions; in this case, the reshape function
57
would be the ideal choice. In both cases, this operation results in a n×n block of reconstructed quantized
coefficients, C’Q.
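The inverse zigzag rearrangement that such a function would perform can be sketched as follows (illustrative Python with NumPy; the exact orientation of the diagonals is an assumption, since the encoder and decoder only need to agree on the same scan order):

```python
import numpy as np

def zigzag_order(n):
    """(row, col) positions of an n-by-n block in zigzag scan order:
    anti-diagonals of constant row+col, traversed in alternating direction."""
    coords = []
    for s in range(2 * n - 1):
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            diag.reverse()   # even diagonals run bottom-left to top-right
        coords.extend(diag)
    return coords

def inverse_zigzag(vec, n):
    """Rearrange a zigzag-ordered coefficient vector into an n-by-n block,
    the exact opposite of the encoder-side scan."""
    block = np.zeros((n, n))
    for k, (i, j) in enumerate(zigzag_order(n)):
        block[i, j] = vec[k]
    return block
```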
The decoding process is then continued with the inverse quantization and the inverse adaptive transform
modules.
4.5.2. Inverse Quantization
Having the quantized coefficient levels, it is then necessary to scale them back to their original amplitude range.
In this way, the inverse quantization of the n×n block of reconstructed quantized coefficients C'Q is given by

C' = C'Q × Qstep (4.12)
where C’ is the n×n block of reconstructed coefficients and Qstep is the quantization step, obtained in the same
way as in the forward adaptive transform module.
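For illustration, this element-wise rescaling amounts to the following Python sketch (assuming uniform scalar quantization without a rounding offset, matching the forward module description):

```python
import numpy as np

def inverse_quantize(cq_block, qstep):
    """Inverse quantization: C' = C'Q * Qstep, applied element-wise to the
    n-by-n block of reconstructed quantized coefficient levels."""
    return np.asarray(cq_block, dtype=float) * qstep
```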
4.5.3. Inverse Adaptive Transform
The inverse adaptive transform module is used to reconstruct the prediction error block for each TU. To do this,
the coefficients received from the inverse quantizer are inverse transformed using the transform indicated by the
selected transform bit, received from the entropy decoder. The architecture of this module is presented in Figure
4.20.
Figure 4.20 – Architecture of the inverse adaptive transform module.
The architecture of Figure 4.20 includes the following steps:
1) Selection module
At this stage, it is necessary to determine which inverse transform must be computed. This selection is
made according to the selected transform bit available from the entropy decoding process. If this bit is equal to
'0', then the inverse DCT is computed; if, on the other hand, the selected transform bit is equal to '1', then the
inverse MKLT is computed.
2) Inverse DCT
The inverse DCT of the n×n reconstructed coefficients block C' is given by

X' = TDCT^T × C' × TDCT (4.13)

where X' is the n×n reconstructed prediction error block and TDCT is the n×n DCT basis functions matrix given
once again by

TDCT(p, q) = √(1/n), for p = 0 and 0 ≤ q ≤ n−1
TDCT(p, q) = √(2/n) × cos[(2q + 1)pπ / (2n)], for 1 ≤ p ≤ n−1 and 0 ≤ q ≤ n−1 (4.14)

In practice, this computation is once again made by obtaining the DCT basis functions matrix with the
MATLAB dctmtx function and then computing the matrix multiplication in Eq. (4.13).
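This computation can be sketched in Python/NumPy as follows (replicating the dctmtx convention, and assuming the forward transform was C = T·X·T^T so that, by orthonormality of T, the inverse is X' = T^T·C'·T):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT basis matrix, equivalent to MATLAB's dctmtx(n):
    row p samples the p-th cosine basis function at the n positions q."""
    p = np.arange(n)[:, None]
    q = np.arange(n)[None, :]
    t = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * q + 1) * p / (2 * n))
    t[0, :] = np.sqrt(1.0 / n)   # the DC row has a different normalization
    return t

def inverse_dct(c_block):
    """2-D inverse DCT of a reconstructed coefficient block: X' = T^T C' T."""
    t = dct_matrix(c_block.shape[0])
    return t.T @ c_block @ t
```

A quick sanity check is that the basis is orthonormal (T·T^T = I), which guarantees perfect reconstruction of the forward transform in the absence of quantization.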
3) Inverse MKLT
If the forward adaptive transform module selected the MKLT, its basis functions have to be once again
computed as they were computed in the forward transform module. Thus, a set of estimated prediction error
blocks is first computed using the same technique described in the forward adaptive transform module. Then, the
estimated prediction error blocks set covariance matrix and the corresponding eigenvectors matrix are
determined. By computing the transpose of the eigenvectors matrix, it is possible to obtain the MKLT basis
functions.
With the MKLT basis functions available, the n×n reconstructed coefficients block C' has to be arranged in an n²
vector of reconstructed coefficients c'. Once again, this is done because the MKLT inherits the non-separable
property of the standard KLT. Then, the inverse MKLT of the n² reconstructed coefficients vector c' is given by

x' = TMKLT^T × c' (4.15)

where x' is the n² reconstructed prediction error vector and TMKLT is the n²×n² MKLT basis functions matrix.
After the inverse MKLT computation, the reconstructed prediction error vector is rearranged in an n×n block,
representing the prediction error block. In terms of the implementation details, there is nothing to add to the
forward MKLT details explained earlier.
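The basis recomputation and inverse transform can be sketched as follows (illustrative Python/NumPy; the orthonormality of the eigenvector matrix is what makes the transpose act as the inverse, and the column-order vectorization mirrors MATLAB's reshape):

```python
import numpy as np

def mklt_basis(estimated_blocks):
    """Compute KLT-style basis functions from a set of estimated prediction
    error blocks: vectorize the blocks, form their covariance matrix and
    take the transpose of its eigenvector matrix as the basis."""
    data = np.stack([b.reshape(-1, order="F") for b in estimated_blocks])
    cov = np.cov(data, rowvar=False)          # n^2 x n^2 covariance matrix
    _, eigvecs = np.linalg.eigh(cov)          # columns are eigenvectors
    return eigvecs.T                          # rows are the basis functions

def inverse_mklt(c_block, basis):
    """Inverse MKLT: vectorize C' in column order, apply x' = T^T c'
    (valid because the basis is orthonormal) and reshape back to n-by-n."""
    n = c_block.shape[0]
    x_vec = basis.T @ c_block.reshape(-1, order="F")
    return x_vec.reshape(n, n, order="F")
```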
At this stage, independently of the selected transform, this module ends its processing, having obtained the
reconstructed prediction error block. Then, all TUs prediction error blocks are sent to the frame reconstruction
module to be arranged in the final prediction error frame.
4.5.4. Frame Reconstruction
As referred before, this module is used to arrange the various prediction error blocks, corresponding to the inter-
coded TUs into which the frame was partitioned, in a single frame. In the developed implementation, this is done
by simply copying the prediction error blocks pixel values to their corresponding location in a matrix with the
size of the original frame. The location of each prediction error block is the same as the location of the
corresponding TU. Thus, using the information about the first pixel position and the size of each TU, it is
straightforward to obtain the final reconstructed prediction error frame. However, considering two separate
encoding and decoding platforms, the frame partitioning module described in the encoding process would have
to be computed once again, using the TU split flags and coding modes, to obtain the necessary TU information
(i.e. first pixel position and size).
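A minimal sketch of this copy operation follows (illustrative Python/NumPy; each TU's first pixel position and size are assumed to be available from the frame partitioning information):

```python
import numpy as np

def reconstruct_frame(tu_blocks, height, width):
    """Arrange the TU prediction error blocks in a single frame: each entry
    of tu_blocks is a (top, left, block) tuple, where (top, left) is the
    TU's first pixel position within the frame."""
    frame = np.zeros((height, width))
    for top, left, block in tu_blocks:
        h, w = block.shape
        frame[top:top + h, left:left + w] = block
    return frame
```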
With this module, the decoding process is concluded.
4.6. Summary
In this chapter, the coding solution developed and implemented in this Thesis was presented to the reader. This
solution is based on the solution proposed in [15] as it uses an adaptive transform that can switch between the
DCT and a modified KLT, depending on the content being coded. The main differences regarding the
solution in [15] are related to the codec with which the adaptive transform is combined, the HEVC standard
instead of the H.264/AVC standard, and the set of shift and rotation parameters used for the MKLT prediction
error estimation process. These differences were explained in detail in this chapter.
With this chapter, the reader was introduced not only to the main concepts behind the adopted solution, some
already presented in Section 3.1, but also to its implementation details. At this stage, it is possible to proceed to
the evaluation of the performance of the adopted coding solution. This is the objective of the next chapter, which
includes a detailed performance assessment of the implemented solution.
Chapter 5
Performance Evaluation
The main purpose of this chapter is to evaluate the performance of the video coding solution designed in Chapter
4, combining the HEVC codec under development and the adaptive transform described in [15]. This assessment
is the natural final step to check the utility and effectiveness of this solution in the current video coding
landscape, that is, also in comparison with the relevant, already available benchmarks. For this, a number of
experiments have been conducted with the proposed video coding solution, notably in terms of the adopted
transform. To achieve meaningful results, appropriate test conditions have to be adopted; these conditions are
presented in the first section of this chapter, including the video sequence details and the coding parameters, as
well as the considered benchmarks and the metrics used to assess the coding performance. After, the test results
are presented, followed by their analysis.
5.1. Test Conditions
To evaluate the performance of the proposed video coding solution in a solid and reliable way, appropriate test
conditions have to be first defined. This is also done to avoid differences in the testing methodology from one
experiment to another, which may lead to misleading results and conclusions. With this in mind, the next
subsections will first present the video sequence details and the coding parameters; after, the assessment metrics
and the benchmarks selected to evaluate the proposed coding solution performance are described.
5.1.1. Video Sequences
To obtain the results needed to evaluate the performance of the designed video coding solution, it is first
necessary to select the video sequences to be coded. These video sequences will play a major role in the obtained
results and derived conclusions, since their characteristics can heavily influence the behavior of the video codec
under test.
Spatial and temporal resolutions
For this study, two types of video resolutions have been used:
CIF resolution corresponding to 352×288 samples for the luminance and half this resolution in each
direction for the chrominances (4:2:0 content); for this spatial resolution, the adopted frame rate has been
30 fps. This resolution is used to allow the comparison of the adopted coding solution performance with
the results obtained in [15], which also used CIF resolution video sequences.
HD resolution corresponding to 1920×1080 samples for the luminance and half this resolution in each
direction for the chrominances (4:2:0 content); for this spatial resolution, the adopted frame rate has been
24 fps as this is the combination adopted by the JCT-VC team. This resolution has been selected as this
is one of the main target resolutions for the HEVC standard currently under development.
The selected video sequences corresponding to these resolutions are presented next.
CIF video sequences
Three CIF resolution video sequences have been selected: Container, Foreman and Mobile. All selected CIF
video sequences include 300 frames, and the full sequences have been coded to obtain the performance
results. The first frames of these video sequences are presented in Figure 5.1.
Figure 5.1 – First frame of the selected CIF video sequences.
Figure 5.1 (a) shows the first frame of the Container video sequence. In this sequence, the video camera
basically follows a container ship movement (i.e. panning movement); this results in small motion activity. In
terms of spatial complexity, this video sequence includes rather homogenous areas, with a large portion of the
frame dominated by the sea. For this type of content, it is possible to use larger coding blocks, which provide
similar quality to the use of various smaller coding blocks, but using a smaller number of bits. The only spatial
detail requiring smaller coding blocks is the waving flag, whose movement cannot be too well predicted.
In Figure 5.1 (b), the first frame of the Foreman sequence is presented. The first frames of this sequence have
almost no motion, just including small movements of the speaking person's head. At approximately frame 160, a
fast camera panning is done, showing a completely new scenario with a building under construction. With this
panning, this sequence can be considered to have high motion activity. In terms of spatial details, it has to be
noted that the background building has some strong directional edges at the separations between the various
floors, which can be harder to code with the DCT.
The first frame of the Mobile sequence is presented in Figure 5.1 (c). This sequence has consistent and medium
motion activity. However, it has a large number of spatial details, particularly in the calendar illustration; these
spatial details may cause the encoder to use smaller TU sizes in order to code them in a very efficient way.
HD video sequence
To test the adopted coding solution performance for HD video sequences, the Kimono sequence was selected.
Only one HD sequence was tested since the developed MATLAB script takes a long time to fully process this
type of video resolution, and so this was the only sequence whose coding finished before this Thesis submission.
This video sequence includes 240 frames but, in this case, only the first 50 frames were coded with the adopted
coding solution. This decision is related, once again, to the large amount of time needed to
code each of these high resolution sequences. The first frame of the Kimono sequence is presented in Figure 5.2.
Figure 5.2 – First frame of the selected HD video sequence: Kimono sequence.
The motion activity present in the first 50 frames of the Kimono sequence, which are the frames coded for this
sequence, is similar to the motion activity of the Container sequence. In this case, a smooth panning is made to
follow a woman's movement across a field with trees. With this, there are only small changes in the woman's
details, e.g. facial expression and clothes. The background changes slowly with the panning movement, but not
abruptly, since it is always dominated by trees and leaves.
5.1.2. Coding Conditions
After the definition of the test video sequences, it is then necessary to define the conditions and parameters used
in the coding process. Thus, the HEVC encoder configuration and the adaptive transform parameters are
specified in the following.
HEVC encoder configuration
To encode the selected test video sequences using the HEVC codec (TMuC software, version 0.9 [20]), the
“Random access, high-efficiency setting” defined by the JCT-VC team has been used (described in Section 4.3 of
[38]). This configuration was used since the objective of this work is to study the coding efficiency of the
developed coding solution and not its complexity, and this is the appropriate JCT-VC-defined configuration for
this purpose. This configuration is used with the following parameters:
Largest and smallest CTB size – To perform the tests, 32×32 and 4×4 sizes are used for the LCTB
and SCTB sizes, respectively. The HEVC encoder uses a rate-distortion optimization method to decide
how to partition the frame in CTBs. Thus, all the possible partitioning solutions are tested before this
decision is made. Since the maximum transform size was already defined to be 32×32 (as referred in
Chapter 4), there is no need to waste encoding time with CTBs bigger than 32×32.
Maximum and minimum transform size – The TU maximum and minimum sizes are also defined as
32×32 and 4×4, respectively. The maximum transform size is limited to 32×32 due to the high
computational time required by the application of 64×64 transforms, both for the HEVC codec and for
the developed MATLAB script, as already referred.
GOP structure and size – Each GOP starts with an intra-coded frame (I-frame) and is followed by P
inter-coded frames (P-frames) until its end (i.e. IPPP…P). The GOP size, corresponding to the period
between two intra-coded frames, is equal to 24 for both CIF and HD sequences.
Single reference frame – To simplify the implementation of the adopted coding solution, the motion
prediction is always based on only one reference frame, the previously coded frame.
Quantization parameters – To allow the performance evaluation of the various coding solutions for
several quality levels, five quantization parameters have been adopted: 16, 22, 27, 32 and 37. The last
four quantization parameters were selected according to the recommendation made by JCT-VC in [38];
the QP of 16 was added to extend the performance evaluation to higher bitrates. To determine the Qstep
values for the selected QPs, Eq. (4.11) was used, resulting in the Qstep values presented in Table 5.1.
Table 5.1 – Selected QPs and their corresponding Qstep values.
QP Qstep
16 63.4880
22 126.9760
27 226.3040
32 402.9440
37 718.8480
Finally, it has to be noted that the deblocking filter and the rotational transform were disabled in the TMuC
software. For the deblocking filter, this was done to allow the correct verification of the extracted data without
the need to recreate this process. The rotational transform was disabled to save the required computation effort as
its use was not essential in the context of this study. With the HEVC codec configuration defined, the adaptive
transform parameters are described next.
Adaptive transform parameters
To compare the adopted coding solution performance with the benchmarks defined in the following, three
coding modes for the proposed adaptive transform have been defined with the following parameters:
Adaptive transform with half range shift and rotation parameters – This mode of the adaptive
transform uses a Half Range shift and rotation parameters Set (HRS) to compute the MKLT basis
functions. This means that the maximum shift parameter is δ = 0.5 pixels and the maximum rotation
parameter is θ = 0.5°; this AT mode is basically the same as used in [15].
Adaptive transform with full range shift and rotation parameters – In this mode, the used MKLT
basis functions are computed with a Full Range shift and rotation parameters Set (FRS). This means
that the maximum shift parameter is δ = 1.0 pixels and the maximum rotation parameter is θ = 1.0°; this
AT mode uses more shifts and rotations to estimate the prediction error than those used in [15].
Adaptive transform with HRS and FRS – With this adaptive transform coding mode, the MKLT is
basically divided into two MKLTs: one using the HRS mode and another using the FRS mode. In this
way, at the decision module, the selection is made between 3 transforms: the DCT, the MKLT with
HRS and the MKLT with FRS. It has to be noted that the inclusion of this AT mode requires 2 bits to
signal the selected transform, instead of just 1 bit as used for the AT solution scheme proposed in [15].
With the coding conditions clearly defined, the metrics used to evaluate each solution performance are presented
next.
5.1.3. Performance Evaluation Metrics
To evaluate the performance of the various coding solutions, their Rate-Distortion (RD) curves are obtained.
These curves are obtained by plotting the objective quality metric value for the reconstructed prediction error as
a function of the amount of bits per second needed to code it. In the following, the adopted objective quality
metric is the PSNR, as commonly done in the literature. The PSNR and the bitrate metrics are defined as:
PSNR – The Peak Signal-to-Noise Ratio is a metric used to measure the ratio between the maximum
possible power of a signal and the power of the corrupting noise affecting the fidelity of its
representation [39]. In the video coding context, the PSNR is used to measure the objective quality of
the decoded signal in comparison to the original input signal. It is commonly defined via the Mean
Squared Error (MSE), which, for a m×n frame, is given by
MSE = (1/(m×n)) × Σi Σj [O(i, j) − D(i, j)]² (5.1)
where O(i, j) is the original input signal at position (i, j) and D(i, j) is the decoded signal at position (i, j).
With this, the PSNR is given by

PSNR = 10 × log10(MAX² / MSE) (5.2)
where MAX is the maximum value of the input signal (255 for 8 bits samples). In the context of this study,
the PSNR is used to measure the objective quality of the reconstructed prediction error in comparison to the
actual prediction error.
Bitrate – The bitrate metric is typically used to measure the amount of data needed to code the original
input signal. However, in the context of this work, it represents the number of bits per second (bits/s)
needed to code a particular video sequence's prediction error. In this case, the bitrate is not defined and
controlled directly through a rate control tool, but rather results from the selection of a particular QP.
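As an illustration, the MSE/PSNR computation of Eqs. (5.1) and (5.2) can be written as the following Python sketch:

```python
import numpy as np

def psnr(original, decoded, max_value=255.0):
    """PSNR of a decoded frame against the original: the mean squared
    error of Eq. (5.1) plugged into PSNR = 10*log10(MAX^2 / MSE)."""
    o = np.asarray(original, dtype=float)
    d = np.asarray(decoded, dtype=float)
    mse = np.mean((o - d) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)
```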
Naturally, the RD curves are expected to show lower PSNR values for the lower bitrates and higher PSNR
values for the higher bitrates. With the RD curves, it is possible to evaluate the average PSNR improvements and
the average bitrate savings of one coding solution against another. This may be performed by means of the
widely used Bjontegaard metric [40], which is described next:
Average PSNR improvement of one solution versus another – To measure the average PSNR
improvement of a particular coding solution over another, the bitrate axes of both RD curves first
have to be converted to a logarithmic scale. Then, the resulting RD curves are approximated by
cubic functions, in what is called a fitting process. With the cubic functions for both coding solutions
available, it is then simple to compute the integral of both functions over a given interval (ranging from
the minimum to the maximum available bitrate values). The difference between these two integral
values, normalized by the integration interval, results in the average PSNR improvement between the
two coding solutions.
Average bitrate savings of one solution versus another – To evaluate the average bitrate savings
between two coding solutions, the RD curves first have to be inverted, in order to provide the bitrate
as a function of the PSNR. Then, the bitrate axes are again converted to the logarithmic scale and the
resulting RD curves are approximated by cubic functions. With this, it is then possible to compute the
difference between the integrals of both coding solutions' cubic functions over the same interval
(ranging from the minimum to the maximum available PSNR values); this difference, again normalized
by the integration interval, gives the average bitrate saving between them.
The Bjontegaard metric described above has been computed by means of a MATLAB script developed by
Valenzise, available in [41].
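The BD-PSNR computation described above can be sketched as follows (illustrative Python/NumPy, not Valenzise's script; the interval-normalized integral difference gives the average vertical gap between the two fitted curves):

```python
import numpy as np

def bd_psnr(rate_a, psnr_a, rate_b, psnr_b):
    """Average PSNR improvement of solution B over solution A: fit cubic
    polynomials of PSNR versus log10(bitrate) to both RD curves and
    average the difference of their integrals over the overlapping range."""
    la = np.log10(np.asarray(rate_a, dtype=float))
    lb = np.log10(np.asarray(rate_b, dtype=float))
    pa = np.polyfit(la, np.asarray(psnr_a, dtype=float), 3)
    pb = np.polyfit(lb, np.asarray(psnr_b, dtype=float), 3)
    lo, hi = max(la.min(), lb.min()), min(la.max(), lb.max())
    int_a = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    int_b = np.polyval(np.polyint(pb), hi) - np.polyval(np.polyint(pb), lo)
    return (int_b - int_a) / (hi - lo)
```

The average bitrate saving (BD-rate) follows the same recipe with the roles of the axes swapped, fitting log-rate as a function of PSNR.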
5.1.4. Coding Benchmarks
Clearly, the main feature of the developed video coding solution is the adaptive transform. In this way, the
performance evaluation reported in this chapter had to focus on the coding performance changes related to the
use of this transform coding tool, notably in comparison with the usual DCT. To assess these changes, the
following coding solutions are benchmarked:
HEVC with DCT (DCT) – In this codec, relevant coding data is extracted from the HEVC codec, as
explained in Chapter 4, and after, instead of the proposed adaptive transform, the DCT is used for all
inter-coded TUs.
HEVC with the adaptive transform with HRS (AT HRS) – With this codec, the relevant coding data
is once again extracted from the HEVC codec and then the inter-coded TUs are transformed using the
proposed adaptive transform with HRS.
HEVC with adaptive transform with FRS (AT FRS) – To test the performance changes related to the
introduction of the new shift and rotation parameters, a codec using the proposed adaptive transform
with FRS is also used. Once again, this codec uses the relevant coding data extracted from the HEVC
codec.
HEVC with adaptive transform with HRS and FRS (AT HFRS) – Finally, the performance of a
codec using the adaptive transform with both HRS and FRS is assessed. This is a special case of the
proposed adaptive transform, which may choose between 3 transforms as explained before. To perform
the coding of the inter-coded TUs, this codec also uses the relevant coding data extracted from the
HEVC framework.
With these benchmarks, it is possible to evaluate the relative coding performance of the proposed adaptive
transform with the following objectives in mind:
To assess if the proposed HEVC with adaptive transform solution can obtain similar performance gains
to those achieved in [15] (where the adaptive transform was integrated in the H.264/AVC codec), but
now in the context of the emerging HEVC standard.
To evaluate the coding performance of the proposed HEVC with adaptive transform solution for high
definition video content, whose adoption is growing quickly, when compared to its coding performance
for lower resolution video sequences (i.e. CIF).
To evaluate the coding performance of the proposed HEVC with adaptive transform solution when
using larger shift and rotation parameters than those used in [15].
Having clearly defined the test contents, conditions and benchmarks, the results of the experiments performed
are presented next, followed by their analysis.
5.2. Results and Analysis
To analyze the obtained experimental results, the RD performance results obtained for the various tested video
sequences are presented first for the CIF video sequences and after for the HD sequence. These results include
RD curves for the three individual transforms that can be selected by the adaptive transform - DCT, MKLT HRS
and MKLT FRS - and also for the RD curves corresponding to the four codecs defined as benchmarks: the
HEVC codecs using the proposed adaptive transform – AT HRS, AT FRS and AT HFRS – and the HEVC codec
using the DCT. Naturally, the DCT transform RD curve corresponding to the individual transform comparison
and the HEVC with DCT RD curve are going to be the same. Additionally, the Bjontegaard metric is applied to
each adaptive transform mode versus the DCT, to measure the average PSNR improvement and the average
bitrate saving for each one of these codecs.
Besides the RD performance-based results, the statistics about the used TU sizes and the selected transforms for
each adaptive transform codec are also presented. These results are used to better understand the proposed
adaptive transform selection process.
5.2.1. Performance for CIF Resolution Video Sequences
As mentioned before, three CIF resolution video sequences have been coded to assess the adopted coding
solution: Container, Foreman and Mobile. The RD performance results obtained with these video sequences are
presented in the following.
Container video sequence
The Container sequence RD performance obtained for the DCT, the MKLT HRS and the MKLT FRS is shown
in Figure 5.3. From this figure, it is possible to draw the following conclusions:
The first observation that has to be made is related to the achieved bitrate values, for this and for all the
tests performed in this work. It is clear that these values are much higher than the ones achieved
with the current state-of-the-art video coding standard. However, the author is fully aware of this
fact and, thus, these values are never used in their absolute form to draw any type of conclusion about
the adopted coding solution performance. Instead, these values are always analyzed in a relative way.
The reason behind these high bitrate values is principally related to the entropy coder used in this
solution. As explained in Chapter 4, it is a very simple entropy coder which is not the object of study of
this work. Additionally, it is also true that the HEVC encodings are always performed using only P-frames
and a single reference frame, which does not allow the exploitation of all the motion prediction tools.
Comparing the MKLTs performance to the DCT, it is possible to see that both MKLTs can only
outperform the DCT for the low bitrates (corresponding approximately to the first two QPs). For
bitrates larger than 2 Mbit/s, the DCT starts to offer better prediction error PSNR than the MKLTs
(considering the same bitrate). This improvement becomes even more evident for the higher bitrates.
Comparing now the performance of both MKLTs, it seems that the use of the FRS versus the HRS
(used in [15]) can bring slight improvements in terms of RD performance. This improvement is almost
imperceptible for the lower bitrates, where the RD curves for the MKLT FRS and the MKLT HRS are
basically the same.
Figure 5.3 – Container sequence RD performance for the DCT, MKLT HRS and MKLT FRS.
To see how these results can influence the adaptive transform performance for the Container sequence, Figure
5.4 shows the RD performance for the DCT, AT HRS, AT FRS and AT HFRS codecs, while Table 5.2 shows
the results of the Bjontegaard metric for the three adaptive transform modes against the DCT.
Figure 5.4 – Container sequence RD performance for the DCT, AT HRS, AT FRS and AT HFRS codecs.
Table 5.2 – Container sequence average PSNR improvements and average bitrate savings for each AT mode
against the DCT.
Benchmark Average PSNR improvement
(dB) Average bitrate saving (%)
AT HRS versus DCT 0.54 6.5
AT FRS versus DCT 0.54 6.5
AT HFRS versus DCT 0.71 8.4
From the analysis of Figure 5.4 and Table 5.2, it is possible to conclude the following:
First, it is possible to observe that all adaptive transform modes (AT HRS, AT FRS and AT HFRS)
bring performance improvements in comparison to the codec only using the DCT. As expected from the
observation of Figure 5.3, these improvements are more noticeable for lower bitrates, approaching the
DCT RD curve for higher bitrates. For the entire bitrate range, the AT HFRS is always the best
performing codec in terms of RD performance, as proven by the average PSNR improvement (0.71
dB) and the average bitrate saving (8.4%) when compared to the DCT.
It can also be noted, by visual inspection, that the slight performance improvement offered by the
MKLT FRS in relation to the MKLT HRS does not materialize in any noticeable coding gain when
using the adaptive transform. In this way, both the AT HRS and the AT FRS have the same
average PSNR improvement (0.54 dB) and average bitrate saving (6.5%) in comparison to the DCT.
To better understand how these performance gains are achieved, the transform selection made by the adaptive
transform is now analyzed in detail. For this effect, Table 5.3 shows the percentage of inter-coded TUs for each
QP and TU block size and Table 5.4 shows the percentage of TUs coded with each available transform for each
AT codec, QP and TU block size, all this for the Container sequence.
Table 5.3 – Container sequence percentage of inter-coded TUs for each QP and TU block size.
QP TU sizes
4×4 8×8 16×16 32×32
16 54% 36% 9% 2%
22 44% 38% 14% 4%
27 33% 35% 20% 13%
32 20% 28% 23% 30%
37 10% 18% 21% 51%
Starting with the analysis of Table 5.3, it is possible to observe that, for lower QP values (thus higher bitrate
values), the HEVC codec tends to select a higher percentage of smaller TU sizes (54% of the TUs having size
4×4, 36% size 8×8, 9% size 16×16 and only 2% size 32×32), since these blocks can achieve better performance
in a RD optimization sense by offering a better coding efficiency. By increasing the QP values (thus reducing
the bitrate), the TU partition becomes more balanced in terms of the percentage of block sizes selected, with a
gradual increase of the larger blocks selection (and a consequent reduction of the smaller blocks selection). For
the higher QP values (thus lower bitrates), the selection pattern observed for lower QPs is completely reversed
with the larger TU sizes selected for the majority of the cases (10% of the TUs having size 4×4, 18% size 8×8,
21% size 16×16 and 51% size 32×32). Although the TU partitioning pattern depends greatly on the video
sequence motion activity and spatial details, it can be said that this trend, i.e., the reduction of the selection of the
smaller blocks and the increment of the use of the larger blocks with a QP value increase, is observed for all the
studied cases.
Focusing now on the results shown in Table 5.4, the following conclusions may be drawn:
For all TU sizes and adaptive transform codecs, an increase of the QP value results into an increase of
the percentage of TUs coded with the MKLTs. This increase is more noticeable for the larger TU block
sizes, e.g. the MKLT HRS used in the AT HRS is selected for only 26% of the 32×32 TUs for a QP of
16, but, for a QP of 37, it is selected 95% of the times for the same TU size.
For smaller block sizes (4×4 and 8×8 TUs), the choice between the DCT and the MKLT in the AT HRS
and the AT FRS codecs is fairly balanced, with the maximum difference occurring for the 8×8 TUs and
QP of 16 in the AT HRS case, where the DCT is selected 69% of the times. For the AT HFRS case, the
MKLTs are selected, on average, for 64% of the 4×4 TUs and for 61% of the 8×8 TUs.
For larger block sizes (16×16 and 32×32 TUs), the disparity between the DCT and the MKLTs
selection is, in general, very high. For the lower QP values (16 and 22), the DCT is selected in the
majority of the cases for all the available AT codecs, on average, 68% of the times. It has to be noted
that, for these QP values, these larger blocks are not used very often (as referred before). For higher QP
values (27, 32 and 37), the MKLTs become the most used transforms for all the AT codecs, being
selected, on average, 83% of the times.
Comparing the AT HRS with the AT FRS, it is possible to conclude that the percentage of TUs coded
with the MKLT HRS is practically the same as the percentage of TUs coded with the MKLT FRS. On
the other hand, from the AT HFRS results, it is possible to observe that the MKLT FRS is selected
more times than the MKLT HRS. This was expected since the decision module selects the MKLT FRS
for the cases where both MKLTs' bitstreams have the same number of bits. With this option, the author
intends to use the novel shift and rotation parameters in the maximum number of opportunities
possible, to evaluate the performance changes introduced by their utilization. If the complexity of the
developed solution were the major requirement, instead of its coding performance, clearly the MKLT
HRS should be the one selected in these cases.
It can also be noted that the use of both MKLTs in the same codec increases the MKLT percentage use
in comparison with the situation where they are used independently. This happens for all the studied
cases.
These results show that, for a CIF video sequence with low motion activity and few spatial details like the
Container sequence, the best performing adaptive transform (AT HFRS) can bring coding improvements over
the DCT of 0.71 dB in terms of objective prediction error quality and of 8.4% in terms of bitrate savings. They
also show that the use of the FRS alone cannot bring significant performance gains to the adaptive transform
(i.e. relative to the HRS). However, when used in an adaptive transform combining both available
shift and rotation parameter sets, it can provide a considerable performance improvement in comparison to the
adaptive transform with only the HRS (as used in [15]).
Table 5.4 – Container sequence percentage of TUs coded with the available transforms for each AT codec, QP
and TU block size.
Codec        Selected transform      4×4     8×8     16×16   32×32
QP = 16
AT HRS DCT 53% 69% 86% 74%
MKLT HRS 47% 31% 14% 26%
AT FRS DCT 52% 67% 85% 74%
MKLT FRS 48% 33% 15% 26%
AT HFRS
DCT 38% 54% 78% 70%
MKLT HRS 27% 21% 10% 14%
MKLT FRS 35% 25% 12% 15%
QP = 22
AT HRS DCT 50% 60% 68% 64%
MKLT HRS 50% 40% 32% 36%
AT FRS DCT 50% 59% 67% 64%
MKLT FRS 50% 41% 33% 36%
AT HFRS
DCT 36% 45% 57% 59%
MKLT HRS 26% 25% 20% 18%
MKLT FRS 38% 30% 23% 23%
QP = 27
AT HRS DCT 49% 52% 49% 37%
MKLT HRS 51% 48% 51% 63%
AT FRS DCT 49% 51% 47% 36%
MKLT FRS 51% 49% 53% 64%
AT HFRS
DCT 36% 37% 37% 30%
MKLT HRS 25% 28% 29% 31%
MKLT FRS 40% 35% 34% 39%
QP = 32
AT HRS DCT 47% 46% 34% 15%
MKLT HRS 53% 54% 66% 85%
AT FRS DCT 47% 46% 34% 14%
MKLT FRS 53% 54% 66% 86%
AT HFRS
DCT 35% 32% 24% 10%
MKLT HRS 23% 28% 31% 38%
MKLT FRS 42% 40% 45% 52%
QP = 37
AT HRS DCT 49% 42% 25% 5%
MKLT HRS 51% 58% 75% 95%
AT FRS DCT 48% 41% 25% 5%
MKLT FRS 52% 59% 75% 95%
AT HFRS
DCT 36% 29% 16% 2%
MKLT HRS 22% 27% 30% 28%
MKLT FRS 42% 44% 54% 70%
Foreman video sequence
After the presentation and analysis of the performance results for the Container sequence, Figure 5.5 shows the
obtained RD performance for the DCT, the MKLT HRS and the MKLT FRS transforms for the Foreman video
sequence.
Figure 5.5 – Foreman sequence RD performance for the DCT, MKLT HRS and MKLT FRS.
From Figure 5.5, it is possible to conclude:
First, the DCT clearly outperforms both MKLTs when used for all the video sequence TUs, thus always
providing better objective quality for the same bitrate. This does not mean that the MKLTs are not
useful at all in a more adaptive coding solution, as there can be a number of TUs which might be coded
more efficiently using a MKLT than a DCT; naturally, this will require a more complex, adaptive
transform.
It is also possible to observe that the use of the extended range of shift and rotation parameters (FRS)
can provide slightly better RD performance when compared to the HRS approach used in [15]. It
remains to be seen if this RD performance improvement is also reflected in the adaptive transform
performance. Moreover, since these RD performance gains are rather small, the associated complexity
increase may not be worthwhile.
To evaluate the proposed adaptive transform performance for the Foreman sequence, Figure 5.6 shows the RD
performance for the DCT and the three previously defined adaptive transform modes. Table 5.5 shows the
Bjontegaard metric results for the same adaptive transform modes versus the DCT.
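Throughout this chapter, the Bjontegaard metric is obtained by fitting a third-order polynomial to each RD curve (PSNR as a function of the logarithm of the bitrate) and averaging the gap between the two fits over the overlapping bitrate range. The sketch below illustrates the BD-PSNR side of this computation; it is a NumPy illustration for reference, not the MATLAB code used in the thesis:

```python
import numpy as np

def bd_psnr(rate_ref, psnr_ref, rate_test, psnr_test):
    """Average PSNR difference (test minus reference) following
    Bjontegaard's method: fit PSNR as a cubic polynomial of
    log10(rate), then integrate over the overlapping rate range."""
    lr_ref, lr_test = np.log10(rate_ref), np.log10(rate_test)
    p_ref = np.polyfit(lr_ref, psnr_ref, 3)
    p_test = np.polyfit(lr_test, psnr_test, 3)
    lo = max(lr_ref.min(), lr_test.min())
    hi = min(lr_ref.max(), lr_test.max())
    # Integrate each fitted polynomial over [lo, hi], then take the mean gap
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    return (int_test - int_ref) / (hi - lo)
```

This fitting step is also the reason, noted later in this chapter, why the averaged figures can exceed what the raw RD curves visually suggest at very low bitrates.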
Figure 5.6 – Foreman sequence RD performance for the DCT, AT HRS, AT FRS and AT HFRS.
Table 5.5 – Foreman sequence average PSNR improvements and average bitrate savings for each AT mode
against the DCT.
Benchmark            Average PSNR improvement (dB)   Average bitrate saving (%)
AT HRS vs. DCT       0.31                            4.6
AT FRS vs. DCT       0.32                            4.6
AT HFRS vs. DCT      0.44                            6.4
From Figure 5.6 and Table 5.5, the following analysis can be made:
First, it is possible to conclude that all three adaptive transform modes achieve better RD
performance than the DCT alone over the whole tested bitrate range, with average PSNR improvements
varying from 0.31 to 0.44 dB and average bitrate savings from 4.6 to 6.4%.
It can also be confirmed that the use of the FRS mode brings performance improvements to the adaptive
transform in comparison to the HRS mode, but these improvements are almost meaningless in the video
coding context (only 0.01 dB of average PSNR improvement). On the other hand, the use of the FRS
mode in combination with the HRS mode can bring a more significant improvement in relation to the
use of the HRS mode alone, despite the need for one extra bit of transform signalling (0.13 dB of
average PSNR improvement and 1.8% of average bitrate reduction). Again, this implies an encoding
and decoding complexity increase that needs to be assessed against the RD performance gains.
Following the RD performance results of the Foreman sequence, Table 5.6 shows the percentage of inter-coded
TUs for each QP and TU block size; moreover, Table 5.7 shows the percentage of TUs coded with the available
transforms for each AT codec, QP and TU block size, for the Foreman video sequence.
Table 5.6 – Foreman sequence percentage of inter-coded TUs for each QP and TU block size.
QP      4×4     8×8     16×16   32×32
16      69%     27%      4%      0%
22      55%     35%      8%      1%
27      34%     45%     17%      4%
32      15%     42%     30%     13%
37       4%     34%     34%     29%
Comparing the results in Table 5.6 with those obtained with the Container sequence, it is possible to see that the
sequence Foreman tends to use a higher percentage of smaller TU blocks. This was expected since the Foreman
sequence has higher motion activity and more spatial details than the Container sequence. However, it can still
be verified that the use of larger TU blocks grows with the QP value. In this particular case, only the 4×4 TUs
show a decreasing use as the QP rises. From Table 5.6, it is also possible to conclude that, for QP
values of 16 and 22, the use of 32×32 TUs is almost nonexistent.
Table 5.7 shows very similar results to those obtained for the sequence Container. Still, the following
conclusions may be taken:
Once again, the percentage of TUs coded with the MKLTs increases with the QP value. This is
especially noticeable for the larger TUs, as for the Container sequence, but also for the 8×8 TUs which
show a similar behaviour to the larger blocks in this case (e.g. for the AT HRS codec, the MKLT HRS
is selected only for 28% of 8×8 TUs for a QP of 16, while it is selected for 66% of the same TUs for a
QP of 37).
In this case, the smaller TUs (4×4 and 8×8) carry more weight, since there are more TUs of these
sizes than in the previous sequence. The MKLTs are selected, on average, for 54% of the 4×4 TUs,
while the 8×8 TUs (the most used TU size for this sequence) are coded with the MKLTs, on average,
38% of the time.
The larger TUs (16×16 and 32×32), which are rarely used for the first three QP values (16, 22 and 27),
show once again a large disparity between the first and the last QP (i.e. 16 and 37). For 16×16 TUs,
49% of the blocks are coded with the MKLTs, on average, while for the 32×32 TUs this number rises
to 58%.
Once again, the MKLT FRS is not selected more often than the MKLT HRS when the two operate
independently in the AT FRS and AT HRS codecs. However, as for the Container sequence, the
MKLT FRS is, in general, selected more often than the MKLT HRS in the AT HFRS.
These results show that, for a CIF video sequence with high motion activity and medium spatial
detail complexity such as the Foreman video sequence, the best adaptive transform (MKLT with HRS and
FRS) can bring performance improvements over the DCT of 0.44 dB in terms of objective reconstructed
prediction error quality and 6.4% in terms of bitrate savings (always on average). Once again, it was verified that
the use of FRS can only bring coding gains when combined with the HRS mode.
Table 5.7 – Foreman sequence percentage of TUs coded with the available transforms for each AT codec, QP
and TU block size.
Codec        Selected transform      4×4     8×8     16×16   32×32
QP = 16
AT HRS DCT 56% 72% 88% 88%
MKLT HRS 44% 28% 12% 12%
AT FRS DCT 55% 71% 88% 87%
MKLT FRS 45% 29% 12% 13%
AT HFRS
DCT 41% 59% 82% 85%
MKLT HRS 27% 19% 8% 7%
MKLT FRS 32% 22% 10% 8%
QP = 22
AT HRS DCT 53% 63% 70% 63%
MKLT HRS 47% 37% 30% 37%
AT FRS DCT 53% 62% 69% 64%
MKLT FRS 47% 38% 31% 36%
AT HFRS
DCT 39% 49% 59% 56%
MKLT HRS 25% 23% 18% 18%
MKLT FRS 36% 28% 23% 25%
QP = 27
AT HRS DCT 52% 54% 53% 37%
MKLT HRS 48% 46% 47% 63%
AT FRS DCT 51% 54% 52% 37%
MKLT FRS 49% 46% 48% 63%
AT HFRS
DCT 38% 41% 42% 30%
MKLT HRS 23% 25% 25% 30%
MKLT FRS 39% 34% 33% 40%
QP = 32
AT HRS DCT 49% 44% 39% 21%
MKLT HRS 51% 56% 61% 79%
AT FRS DCT 48% 44% 37% 21%
MKLT FRS 52% 56% 63% 79%
AT HFRS
DCT 37% 33% 29% 16%
MKLT HRS 20% 24% 28% 36%
MKLT FRS 43% 43% 43% 48%
QP = 37
AT HRS DCT 44% 34% 21% 10%
MKLT HRS 56% 66% 79% 90%
AT FRS DCT 43% 35% 21% 9%
MKLT FRS 57% 65% 79% 91%
AT HFRS
DCT 34% 25% 15% 7%
MKLT HRS 15% 23% 29% 35%
MKLT FRS 51% 52% 56% 58%
Mobile video sequence
After the analysis of the results for the Foreman sequence, the results obtained for the Mobile video sequence are
presented next. Figure 5.7 shows the RD performance of the individual transforms later used by the
adaptive transform.
Figure 5.7 – Mobile sequence RD performance for the DCT, MKLT HRS and MKLT FRS.
From the results in Figure 5.7, the following conclusions can be taken:
Once again, the DCT shows a better RD performance than the two available MKLTs, particularly for
the higher bitrates.
The use of the FRS mode brings, once again, marginal RD performance benefits in comparison to the
HRS mode. Taking into consideration the two previously studied cases (the Container and Foreman
sequences), it is expected that this improvement will not be reflected in the RD performance of the
AT FRS versus the AT HRS solution. However, again from the previous results, it is predictable that
the use of both MKLTs in the AT HFRS will bring some RD performance improvement over the other
two adaptive transform modes.
Next, the RD performance for the DCT and the three available adaptive transform modes is presented in Figure
5.8, followed by the corresponding Bjontegaard metric results for these three adaptive transforms versus the
DCT in Table 5.8.
Figure 5.8 – Mobile sequence RD performance for the DCT, AT HRS, AT FRS and AT HFRS.
Table 5.8 – Mobile sequence average PSNR improvements and average bitrate savings for each AT mode
against the DCT.
Benchmark            Average PSNR improvement (dB)   Average bitrate saving (%)
AT HRS vs. DCT       0.47                            4.0
AT FRS vs. DCT       0.49                            4.2
AT HFRS vs. DCT      0.68                            5.8
The obtained results lead to the following conclusions:
As for the previous video sequences, all the adaptive transform modes bring performance
improvements when compared to the DCT used alone. In this case, these improvements range from
0.47 to 0.68 dB in average PSNR improvement and from 4.0 to 5.8% in average bitrate
reduction.
As expected, both prediction error estimation modes (FRS and HRS) provide similar RD performance
when used independently, with the FRS mode performing slightly better (0.02 dB of average PSNR
improvement and 0.2% of average bitrate reduction). The combination of these two prediction error
estimation modes provides, once again, the best solution in terms of RD performance (with a 0.21 dB
improvement of the average PSNR and 1.8% of average bitrate saving in comparison to the adaptive
transform with the HRS mode).
Consider now the percentage of inter-coded TUs for each QP and TU block size for the Mobile sequence in
Table 5.9. The percentage of TUs coded with the available transforms for each AT codec, QP and TU block size
for the same sequence are presented in Table 5.10.
Table 5.9 – Mobile sequence percentage of inter-coded TUs for each QP and TU block size.
QP      4×4     8×8     16×16   32×32
16      82%     16%      2%      0%
22      77%     20%      2%      0%
27      69%     27%      3%      0%
32      53%     39%      7%      1%
37      32%     45%     19%      4%
As referred in Section 5.1.1, the Mobile sequence has a large amount of spatial detail. Thus, as expected, the TU
partitioning for this sequence is largely dominated by smaller TU blocks, with 4×4 and 8×8 sizes. Despite this,
the 16×16 TUs still see significant use for a QP of 37 (with 19% of the blocks). On the other hand, the TUs
with size 32×32 are almost nonexistent, with a maximum of 4% of the blocks for a QP of 37. From Table
5.10, the following conclusions may be taken:
Like the two previous cases (Container and Foreman sequences), the percentage of selected MKLTs
for every AT codec and TU block size grows with the QP increase. Once again, this growth is more
evident for the larger sized TUs.
For the smaller TUs, the selection between the DCT and the MKLTs is very balanced, with the MKLTs
selected, on average, for 54% and 47% of the coded 4×4 and 8×8 TUs, respectively.
For the larger TUs, it is only important to analyze the results for the 16×16 TUs for a QP of 37, since
all the other results are obtained for an insignificant number of TUs. Thus, the MKLTs are selected for
the 16×16 TU coding, on average, 61% of the times, for a QP of 37.
In terms of the comparison between the MKLT HRS and the MKLT FRS, both transforms show similar
selection percentages when operated individually. However, once again, for the AT HFRS codec, the
MKLT FRS is selected more often than the MKLT HRS.
The previous results show that, for a video sequence with medium amount of motion activity and high spatial
detail complexity such as the Mobile video sequence, the proposed coding solution can bring an average PSNR
improvement of 0.68 dB and an average bitrate saving of 5.8% for the reconstructed prediction error over the
DCT. With all results of the three selected CIF resolution video sequences presented and analyzed, the following
conclusions may be taken regarding the adopted coding solution for this low resolution:
The codec using the adaptive transform with HRS can achieve an average objective prediction error
quality improvement of 0.44 dB and average bitrate savings of 5% over the codec only using the DCT.
These results cannot be directly compared to those in [15], since for this case the PSNR is only
measured for the prediction error and the adaptive transform is not fully integrated in a video coding
standard. Still, for the Mobile sequence, which was also coded in [15], the results obtained with the
HEVC data are inferior to those obtained in [15] with the H.264/AVC codec, with 4% of bitrate
savings obtained in the test made with the adopted HEVC based coding solution against the 20%
obtained in [15].
From the performance results for the CIF resolution sequences, it is possible to conclude that the use of
a MKLT with a FRS does not bring any significant improvement to the adaptive transform
performance, while increasing the complexity.
The third adaptive transform mode, the AT HFRS, which combines the HRS and FRS modes, can
achieve an average prediction error PSNR improvement of 0.61 dB and average bitrate savings of 7%
over the DCT. With these results, it is possible to state that the introduction of a FRS mode can
improve the adaptive transform performance when both the available shift and rotation
parameter sets are used.
Finally, it has to be noted that the TU partitioning used in all the performed tests is decided by the HEVC
encoder in a RD optimization sense. In this way, the decision is made based on the performance obtained with
the HEVC DCT solution. Thus, it remains to be seen if the effective integration of the adaptive transform in the
HEVC codec would cause a different TU partitioning influenced by the use of the MKLT.
Table 5.10 – Mobile sequence percentage of TUs coded with the available transforms for each AT codec, QP
and TU block size.
Codec        Selected transform      4×4     8×8     16×16   32×32
QP = 16
AT HRS DCT 53% 71% 90% 99%
MKLT HRS 47% 29% 10% 1%
AT FRS DCT 53% 70% 89% 99%
MKLT FRS 47% 30% 11% 1%
AT HFRS
DCT 38% 58% 85% 98%
MKLT HRS 28% 19% 7% 1%
MKLT FRS 34% 23% 8% 1%
QP = 22
AT HRS DCT 52% 63% 80% 95%
MKLT HRS 48% 37% 20% 5%
AT FRS DCT 51% 62% 79% 94%
MKLT FRS 49% 38% 21% 6%
AT HFRS
DCT 37% 50% 73% 93%
MKLT HRS 27% 22% 11% 3%
MKLT FRS 36% 28% 16% 4%
QP = 27
AT HRS DCT 51% 57% 69% 81%
MKLT HRS 49% 43% 31% 19%
AT FRS DCT 50% 57% 68% 80%
MKLT FRS 50% 43% 32% 20%
AT HFRS
DCT 36% 44% 59% 74%
MKLT HRS 26% 24% 18% 11%
MKLT FRS 38% 32% 24% 14%
QP = 32
AT HRS DCT 50% 52% 58% 56%
MKLT HRS 50% 48% 42% 44%
AT FRS DCT 50% 52% 56% 55%
MKLT FRS 50% 48% 44% 45%
AT HFRS
DCT 37% 38% 45% 44%
MKLT HRS 24% 25% 23% 26%
MKLT FRS 39% 36% 31% 30%
QP = 37
AT HRS DCT 49% 45% 45% 29%
MKLT HRS 51% 55% 55% 71%
AT FRS DCT 48% 45% 42% 28%
MKLT FRS 52% 55% 58% 72%
AT HFRS
DCT 36% 32% 31% 19%
MKLT HRS 21% 27% 29% 37%
MKLT FRS 43% 41% 40% 44%
5.2.2. Performance for HD Resolution Video Sequences
After the presentation and analysis of the performance results obtained for the CIF resolution video sequences,
the performance for the selected HD resolution video sequence (the Kimono sequence) is now analyzed. To start
with, Figure 5.9 presents the RD performance of the Kimono sequence for the DCT, MKLT HRS and MKLT FRS
individual transform solutions.
Figure 5.9 – Kimono sequence RD performance for the DCT, MKLT HRS and MKLT FRS.
From Figure 5.9, the following analysis can be made:
In comparison to the DCT RD curve, both MKLTs clearly offer worse RD performance. This
performance difference is more noticeable for the bitrates defined by a QP of 22 and is attenuated for
the lower bitrates (i.e. QP values of 32 and 37). In comparison to the CIF resolution results, the
performance losses shown here appear to be larger.
Comparing the RD performance of both MKLTs, it is possible to observe that the MKLT FRS achieves
a better coding performance, especially for the higher bitrates. In comparison to the CIF resolution
results, the difference between the MKLT FRS and the MKLT HRS performance seems to be
significantly higher. It remains to be seen if this difference can bring coding improvements to the AT
FRS over the AT HRS solutions, something that was not achieved with the CIF resolution sequences.
Following the method used to present the CIF resolution results, Figure 5.10 shows the RD performance for the
DCT, AT HRS, AT FRS and AT HFRS codecs with the Kimono sequence while Table 5.11 shows the
Bjontegaard metric results of the three adaptive transform modes against the DCT for the same HD sequence.
Figure 5.10 – Kimono sequence RD performance for the DCT, AT HRS, AT FRS and AT HFRS.
Table 5.11 – Kimono sequence average PSNR improvements and average bitrate savings for each AT mode
against the DCT.
Benchmark            Average PSNR improvement (dB)   Average bitrate saving (%)
AT HRS vs. DCT       1.67                            14.7
AT FRS vs. DCT       1.12                            11.9
AT HFRS vs. DCT      1.89                            16.0
From Figure 5.10 and Table 5.11, it is possible to derive the following conclusions:
In comparison to the codec only making use of the DCT, the AT codecs only achieve RD performance
gains for the lower bitrates (approximately below 20 Mbit/s). For the rest of the bitrate range, the AT
codecs have a very similar behaviour to the DCT based codec, being outperformed between the QP
values of 27 and 22, but achieving small coding gains for the higher bitrates. However, in terms of the
Bjontegaard metric, this is the sequence achieving the best average results in terms of prediction error
PSNR improvement (1.89 dB for the AT HFRS) and bitrate savings (16.0% for the AT HFRS) over the
DCT. Although the RD curves do not seem to show these substantial gains, they result from the fitting
process performed by the Bjontegaard metric computation for the very low bitrates.
Once again, the AT HFRS is the best adaptive transform solution available. However, in this case, the
AT HRS outperforms the AT FRS (with 0.55 dB of average PSNR improvement and 2.8% of average
bitrate savings). This comes as a surprise, notably taking into account the results in Figure 5.9;
however, as referred before, the presented MKLTs RD curves represent the average PSNR using the
same transform for all the inter-coded TUs. Clearly, the MKLT HRS, despite having a worse behaviour
in general when compared to the MKLT FRS, can provide better coding efficiency for some particular
TUs, and this is what determines the adaptive transform performance.
The Kimono sequence percentage of inter-coded TUs for each QP and TU block size results are presented in
Table 5.12, while the percentage of TUs coded with the available transforms for each AT codec, QP and TU
block size for the same sequence are presented in Table 5.13.
Table 5.12 – Kimono sequence percentage of inter-coded TUs for each QP and TU block size.
QP      4×4     8×8     16×16   32×32
16      49%     35%     13%      3%
22       5%     38%     39%     18%
27       2%     30%     44%     24%
32       0%     22%     47%     30%
37       0%     15%     45%     40%
From Table 5.12, it is possible to observe that, for this particular HD resolution video sequence, the percentage
of 4×4 TUs used for QP values of 22, 27, 32 and 37 is almost insignificant. On the other hand, the percentage of
large TUs (16×16 and 32×32) used for these QPs is considerably higher than the corresponding average results
for the CIF resolution video sequences. This shows that, for this type of sequence, the HEVC takes advantage
of the larger homogeneous areas existing in HD video sequences, partitioning each frame using larger coding
blocks.
From Table 5.13, the following conclusions can be taken:
Like for the CIF resolution sequences, the selection of the MKLTs increases with the QP for all TU
block sizes. In this case, it is possible to observe that the MKLT becomes the most selected transform
from the QP of 27 until the last QP value (37).
Regarding the 4×4 TUs, it is only important to analyze the results obtained for a QP of 16, since this
is the only QP value where this TU size shows a significant utilization percentage. In this case, the
DCT is selected in 57%, 56% and 42% of the cases for the AT HRS, AT FRS and AT HFRS,
respectively. For the 8×8 TUs, the DCT is selected, on average, 67% of the cases, for all the codecs in
the first two QP values (16 and 22); however, for the remaining QPs (27, 32 and 37), an average of
76% of the cases are coded with the MKLT.
Once again, the larger blocks seem to be better coded with the DCT for lower QPs (with 95% of the
16×16 and 32×32 TUs being coded with this transform for a QP of 16); however, for the higher QPs,
this pattern is totally reversed, with the majority of the larger TUs being coded with the MKLT (with
92% of the 16×16 and 32×32 TUs being coded with the MKLTs for a QP of 37).
The RD performance differences between the AT HRS and the AT FRS noted before (occurring for
the lower bitrates, i.e. higher QP values) seem to be due to the slightly higher MKLT HRS use in the
AT HRS codec in relation to the MKLT FRS use in the AT FRS codec for the 8×8 and 16×16 TUs
and for QP values of 32 and 37.
In conclusion, for the HD resolution video sequence Kimono, which has motion activity and spatial detail
characteristics similar to those of the CIF resolution sequence offering the best adaptive transform RD
performance over the DCT (the Container sequence), the proposed adaptive transform can achieve an average
prediction error PSNR improvement of 1.89 dB and an average bitrate saving of 16.0% over the DCT using the
mode making use of the three available transforms (DCT, MKLT HRS and MKLT FRS). In this particular case, the performance gain of
this last codec in relation to the AT codec using only the HRS mode is not as significant as observed for the CIF
resolution sequences. In this way, it remains to be seen if the complexity increase caused by the computation of
the MKLT basis functions with a set of 405 prediction error blocks, instead of the 75 blocks used with the HRS,
is worthwhile in relation to the performance gains achieved. This issue is even more relevant for the HD
sequences, as they tend to use larger TUs which require a larger computational effort.
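The 75 and 405 block counts are consistent with, for example, sampling the shift in each spatial direction in steps of 0.25 pixel and the rotation in steps of 0.5°; the grids below are an assumption used only to illustrate how the FRS multiplies the number of estimated prediction error blocks (the exact parameter steps are defined in the thesis and in [15], not here):

```python
from itertools import product

def parameter_set(max_shift, max_angle, shift_step=0.25, angle_step=0.5):
    """Enumerate (dx, dy, theta) triples over symmetric grids.

    The step sizes are assumed values chosen to reproduce the 75 (HRS)
    and 405 (FRS) estimated block counts quoted in the text."""
    n_shift = int(round(2 * max_shift / shift_step)) + 1
    n_angle = int(round(2 * max_angle / angle_step)) + 1
    shifts = [-max_shift + i * shift_step for i in range(n_shift)]
    angles = [-max_angle + i * angle_step for i in range(n_angle)]
    return list(product(shifts, shifts, angles))

hrs = parameter_set(max_shift=0.5, max_angle=0.5)  # 5 x 5 x 3 = 75 blocks
frs = parameter_set(max_shift=1.0, max_angle=1.0)  # 9 x 9 x 5 = 405 blocks
```

Under this assumed grid, the FRS uses 405/75 = 5.4 times more estimated blocks than the HRS, matching the complexity factor quoted in the summary below.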
Table 5.13 – Kimono sequence percentage of TUs coded with the available transforms for each AT codec, QP
and TU block size.
Codec        Selected transform      4×4     8×8     16×16   32×32
QP = 16
AT HRS DCT 57% 76% 94% 98%
MKLT HRS 43% 24% 6% 2%
AT FRS DCT 56% 75% 94% 97%
MKLT FRS 44% 25% 6% 3%
AT HFRS
DCT 42% 65% 90% 96%
MKLT HRS 25% 16% 5% 2%
MKLT FRS 32% 19% 5% 2%
QP = 22
AT HRS DCT 57% 67% 83% 86%
MKLT HRS 43% 33% 17% 14%
AT FRS DCT 55% 66% 82% 86%
MKLT FRS 45% 34% 18% 14%
AT HFRS
DCT 44% 55% 76% 82%
MKLT HRS 19% 19% 11% 8%
MKLT FRS 37% 26% 13% 9%
QP = 27
AT HRS DCT 49% 41% 51% 55%
MKLT HRS 51% 59% 49% 45%
AT FRS DCT 49% 42% 51% 55%
MKLT FRS 51% 58% 49% 45%
AT HFRS
DCT 39% 32% 43% 52%
MKLT HRS 17% 20% 25% 22%
MKLT FRS 43% 47% 32% 27%
QP = 32
AT HRS DCT 47% 24% 23% 22%
MKLT HRS 53% 76% 77% 78%
AT FRS DCT 43% 25% 25% 22%
MKLT FRS 57% 75% 75% 78%
AT HFRS
DCT 35% 18% 19% 20%
MKLT HRS 13% 15% 29% 34%
MKLT FRS 52% 66% 52% 46%
QP = 37
AT HRS DCT 39% 13% 10% 6%
MKLT HRS 61% 87% 90% 94%
AT FRS DCT 39% 15% 13% 6%
MKLT FRS 61% 85% 87% 94%
AT HFRS
DCT 30% 10% 8% 5%
MKLT HRS 9% 11% 27% 35%
MKLT FRS 61% 79% 66% 60%
5.3. Summary
In this chapter, the proposed adaptive transform has been tested to evaluate its performance against the DCT. To
do this, three different adaptive transforms were used besides the usual DCT: one using a Half Range shift and
rotation parameters Set (HRS) to compute the MKLT basis functions (with maximum δ = 0.5 pixel and θ = 0.5°
as used in [15]), another using a Full Range shift and rotation parameters Set (FRS) to compute the MKLT basis
functions (introducing a maximum δ = 1 pixel and θ = 1°), and a final adaptive transform that can use both the
HRS and FRS modes to compute the MKLT basis functions.
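In all three modes, the MKLT basis functions are computed as the KLT of the set of estimated prediction error blocks generated with the corresponding shift and rotation parameters. A minimal sketch of this step, assuming each estimated block is vectorized and the basis is taken as the eigenvectors of the sample covariance matrix (function names and the exact normalization are illustrative; the thesis implementation is in MATLAB):

```python
import numpy as np

def mklt_basis(estimated_blocks):
    """Compute a KLT basis from N estimated prediction error blocks.

    estimated_blocks: array of shape (N, B, B) holding, e.g., the 75 (HRS)
    or 405 (FRS) shifted/rotated versions of a BxB prediction error block.
    Returns a (B*B, B*B) matrix whose rows are the basis functions,
    ordered by decreasing eigenvalue (i.e. decreasing ensemble energy)."""
    n, b, _ = estimated_blocks.shape
    x = estimated_blocks.reshape(n, b * b).astype(float)
    x -= x.mean(axis=0)                      # zero-mean the ensemble
    cov = (x.T @ x) / n                      # sample covariance, (B*B, B*B)
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]        # strongest components first
    return eigvecs[:, order].T

# Transforming a block: coefficients = basis @ block.reshape(-1)
```

Because the basis depends on the (shifted and rotated) prediction, the decoder can recompute the same basis without any transmitted basis coefficients, which is what makes this adaptive transform viable.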
The performance tests were made using two types of video sequence resolutions: CIF and HD. The first type was
used to test the proposed adaptive transform using similar sequences to those tested in [15], which used the
H.264/AVC codec. As the HEVC codec is being developed with the high and ultra high definition video
contents in mind, one HD resolution video sequence was tested, to assess its performance benefits in comparison
to the lower resolution video sequences.
The obtained results have shown that the proposed adaptive transform using a combination of the HRS and FRS
modes can achieve a 0.61 dB objective prediction error quality gain and 7% bitrate savings for the CIF sequences,
always on average and over the DCT. For the other two adaptive transforms, the average results are very similar,
with a prediction error PSNR improvement of 0.44 dB and bitrate savings of 5% over the DCT. These results
show that the use of an additional FRS mode does not bring any compression improvement when used alone,
but can bring approximately 0.2 dB of average PSNR improvement and 2% of average bitrate savings when used
in combination with the HRS mode.
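The 0.61 dB and roughly 7% figures quoted above are the plain averages of the per-sequence AT HFRS results reported earlier in this chapter (0.71 dB / 8.4% for Container, 0.44 dB / 6.4% for Foreman, 0.68 dB / 5.8% for Mobile):

```python
# AT HFRS vs. DCT, per CIF sequence (Container, Foreman, Mobile)
psnr_gains = [0.71, 0.44, 0.68]   # dB
rate_savings = [8.4, 6.4, 5.8]    # %

avg_psnr = sum(psnr_gains) / len(psnr_gains)       # 0.61 dB
avg_rate = sum(rate_savings) / len(rate_savings)   # ~6.9%, quoted as 7%
```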
For the HD resolution video sequence, the obtained results revealed considerably higher coding gains than those
obtained for the CIF sequences, although these gains are only verified for the low bitrate values. The
adaptive transform using both the HRS and FRS modes was able to achieve 1.89 dB better prediction error
objective quality and a 16.0% bitrate saving in relation to the DCT, always on average. In this case, the adaptive
transform only using the HRS mode to compute the MKLT basis functions clearly outperformed the adaptive
transform using the FRS mode. Thus, the adopted coding solution with an adaptive transform as proposed
in [15] could achieve a prediction error PSNR improvement of 1.67 dB and bitrate savings of 14.7% over the
DCT, always on average. Since the use of a FRS mode introduces a significant complexity increase in the video
codec (as it uses 5.4 times more estimated prediction error blocks), the similarity between the results with only
the HRS mode and with both the HRS and FRS modes indicates that the use of the FRS may not be useful for HD
resolution video sequences.
Chapter 6
Conclusions and Future Work
This chapter concludes this Thesis report by presenting a brief summary of what was presented in each of the
previous chapters. Additionally, some conclusions are drawn regarding the objectives initially defined for this
Thesis. Finally, some ideas for future work are presented.
6.1. Summary and Conclusions
The first chapter of this report introduced the reader to the context in which this work is relevant, as well as to
the emerging problem that calls for a solution: the efficient compression of HD and UHD content. Besides this,
the objectives of this Thesis were also defined.
Chapter 2 introduced the basic principles and concepts about transform coding. Additionally, the most important
transforms in the signal processing context were reviewed, namely the DCT and the KLT used in the developed
solution.
Prior to the actual presentation of the adopted coding solution, the two main background technical elements were
introduced. First, the video coding solution proposed by Biswas et al. in [15] was described in detail,
with a natural focus on the proposed adaptive transform, as it serves as the basis for the adaptive transform
used in this Thesis. Then, the emerging HEVC standard was presented. This standard, still under development,
intends to become the next state-of-the-art video coding standard and targets halving the bitrate
currently needed by the H.264/AVC standard to code a video sequence at a given quality.
Chapter 4 presented the adopted coding solution. This presentation included a functional description of each of
its coding processes (encoder and decoder) and of the HEVC framework used to extract the data processed by
the HEVC codec. Besides these functional descriptions, the implementation details were also described,
focusing on the MATLAB script developed to implement the proposed adaptive transform.
Finally, Chapter 5 presented a performance evaluation of the adopted coding solution. This evaluation was
performed by coding three CIF sequences and one HD sequence with the adopted coding solution and
comparing the obtained RD results with those obtained using the popular DCT. With this, it was possible to
conclude that the adaptive transform can achieve encouraging bitrate savings over the DCT, particularly for the
tested HD sequence.
In summary, it can be said that the objectives defined in Chapter 1 were achieved. Thus, a recent advance related
to transform coding was studied, implemented and assessed in the context of the HEVC standard. Although the
integration of the studied adaptive transform in the HEVC standard was not fully accomplished (for the reasons
explained in Chapter 4), it was possible to extract the necessary data to simulate as much as possible a full
integration scenario. With this, a performance evaluation of a video coding solution including the adopted
adaptive transform was successfully made, showing positive results when compared to the currently used
transform coding tools.
6.2. Future Work
Clearly, the first improvement that can be made to the coding solution developed in this Thesis is related to the
full integration of the used adaptive transform in the HEVC codec. Future software releases of this codec are
expected to become more legible and better organized from a programmer's point of view. The
proposed adaptive transform should then be fully integrated in HEVC to allow a more complete and accurate
evaluation of the performance changes introduced by the adaptive transform. A full integration of the adaptive
transform in the HEVC codec would allow the following evaluation improvements regarding the work
developed in this Thesis:
Frame partitioning – By integrating the adaptive transform in the HEVC codec, it would be possible
for the encoder to perform the frame partitioning in a RD optimization sense using not only the DCT (as
done in this Thesis), but also the proposed MKLT.
Reference frame – All the reference frames used in the coding solution developed in this Thesis are
obtained from previous codings made with the HEVC codec. With a fully integrated model, these
reference frames would also reflect previous codings using the proposed adaptive transform.
Quantization and entropy coding – As mentioned in Chapter 4, the quantization and the entropy coder
used in the adopted coding solution are not the same as those currently present in the HEVC codec.
In this way, using the actual HEVC coding tools would allow a more accurate evaluation of the
performance results.
Other improvements – With the integration of the adaptive transform in the HEVC codec, it would be
possible to use other test conditions not used in this Thesis due to the necessary implementation
simplifications. For example, it would be possible to use B-frames and multiple reference frames.
If positive RD performance gains are obtained with the fully integrated coding solution proposed above,
then the next step should target the study of the computational complexity associated with this solution,
which was not considered in this Thesis. This would be important to evaluate the trade-off between the
additional complexity and the coding gains associated with the adaptive transform and to possibly develop
new algorithms allowing faster encoding and decoding.
Appendix A
Transforms in Available Image/Video
Coding Standards
All the available image and video coding standards make use of transform tools in their coding architecture. To
give an idea of the used transforms and their details, the available coding standards are briefly reviewed in the
following with particular emphasis on the transform related aspects. Besides the transform details, this appendix
also contains a brief review of the objectives, main features, technical improvements and performance of each
standard. The first two standards reviewed – JPEG and JPEG 2000 – are image coding standards; the following
standards – H.261, MPEG-1 Video, MPEG-2 Video, H.263, MPEG-4 Visual and H.264/AVC – are all video
coding standards.
A.1. JPEG Standard
The JPEG image coding standard was defined in 1992 by the Joint Photographic Experts Group (JPEG) [42]. It
is formally known as Recommendation ITU-T T.81 and ISO/IEC 10918-1 standard. This standard specifies two
classes of encoding and decoding processes: lossy and lossless. For this review, only the lossy class is
considered, since it is the only one using transform coding. This class is known as the JPEG Baseline Sequential
process and it is the most used JPEG coding solution.
A.1.1. Objectives
The objective of this standard is to define a generic compression standard for multilevel photographic images. Its
main requirements are:
Efficiency – It must be based on the most efficient compression techniques available, in order to use the
smallest possible amount of bits for a particular target quality.
Adjustable compression/quality – The level of compression must be adjustable, allowing a selectable
trade-off between number of bits used and image quality obtained.
Generic – It must be applicable to all kinds of multilevel photographic images, independently of their
resolution, aspect ratio, etc.
Low complexity – It must be implemented with reasonably low complexity, in order to allow its
implementation on a wide range of platforms and applications.
With these requirements, JPEG is designed to be used in a wide range of applications, e.g., digital photography,
color facsimile, medical and scientific images, etc.
A.1.2. Technical Approach and Architecture
The JPEG coding process adopted a DCT-based image coding architecture, which is presented in Figure A.1.
Figure A.1 – JPEG encoder architecture [42].
A short walkthrough of the encoding process is presented next:
1. Block splitting – The original image is divided into blocks of 8×8 samples. If the input data does not
represent an integer number of blocks, then the encoder must fill the incomplete blocks with some
dummy data.
2. Forward DCT – Each 8×8 block is then transformed using a 2-D forward DCT, resulting in a set of
8×8 (64) DCT coefficients.
3. Quantization – Each of the 64 coefficients is then quantized using a specific quantization matrix.
4. Entropy encoder – After quantization, the quantized DCT coefficients are arranged into a one-
dimensional zigzag sequence (see Figure A.2). Using this sequence ensures that the encoder will
encounter all non-zero DCT coefficients in the block as early as possible. Moreover, since this zigzag
ordering roughly corresponds to the coefficients' perceptual relevance, its usage guarantees that more
perceptually important coefficients are always transmitted before less perceptually important ones. The
next step is to create a (run, level) pair for each non-zero coefficient. The run is the number of
null DCT coefficients preceding the coefficient being coded in the zigzag sequence. The level is the
quantized amplitude of the coefficient to be coded. The run and the number of bits used to code the
level (size) are then encoded using Huffman tables and the level is encoded using a Variable Length
Integer (VLI) code. To better exploit the spatial correlation, the DC coefficient of each block is coded
as the difference with respect to the DC coefficient of the previous neighbor block.
Figure A.2 – Zigzag sequencing for the DCT coefficients within a block in JPEG [42].
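The zigzag scanning and (run, level) pairing described in step 4 can be sketched in a few lines of code. The following Python sketch is purely illustrative (the function names are choices made here, and the separate DC handling and the end-of-block symbol of the real JPEG entropy coder are omitted):

```python
def zigzag_order(n=8):
    """Return the (row, col) coordinates of an n x n block in zigzag
    scan order: anti-diagonals of increasing index, with the scan
    direction alternating on each diagonal."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def run_level_pairs(block):
    """Convert a quantized block into (run, level) pairs, where run
    counts the zero coefficients preceding each non-zero coefficient
    in zigzag order and level is the quantized amplitude."""
    pairs, run = [], 0
    for r, c in zigzag_order(len(block)):
        level = block[r][c]
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))
            run = 0
    return pairs
```

For instance, a block whose only non-zero quantized coefficients are 12 at (0,0), -3 at (0,1) and 5 at (1,1) would yield the pairs (0,12), (0,-3) and (2,5), the last run of 2 reflecting the two zeros crossed at (1,0) and (2,0).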
The decoding process is essentially the inverse of the encoding process. The entropy decoder decodes the zigzag
sequence of quantized DCT coefficients and then, after the inverse quantization process, the DCT coefficients
are transformed to an 8×8 block of samples by the inverse DCT. Since the inverse DCT implementation is not
fully specified, there may exist some mismatches with respect to the original image due to truncations and
roundings in the finite arithmetic implementations.
A.1.3. Transform and Quantization
As mentioned above, the JPEG Baseline Sequential mode uses a 2-D DCT. This transform is unitary (and
orthogonal) and separable and is given by
y(k,l) = (1/4) C(k) C(l) Σ_{m=0}^{7} Σ_{n=0}^{7} x(m,n) cos[(2m+1)kπ/16] cos[(2n+1)lπ/16] (A.1)
where:
y(k,l) is the DCT coefficient at coordinates (k,l)
x(m,n) is the sample value – luminance or chrominance – at coordinates (m,n)
C(0) = 1/√2 and C(k) = C(l) = 1 for all other indices
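As a complement, the 8×8 2-D DCT of Equation (A.1) can be computed directly as in the following Python sketch. This brute-force version (function name chosen here for illustration) is only meant to make the formula concrete; practical codecs exploit separability and fast factorizations instead:

```python
import math

def dct2_8x8(x):
    """Direct, non-optimized 8x8 2-D DCT following Equation (A.1):
    y(k,l) = 1/4 C(k) C(l) sum_m sum_n x(m,n) cos((2m+1)k*pi/16)
    cos((2n+1)l*pi/16), with C(0) = 1/sqrt(2) and C(k) = 1 otherwise."""
    def c(k):
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0
    y = [[0.0] * 8 for _ in range(8)]
    for k in range(8):
        for l in range(8):
            acc = 0.0
            for m in range(8):
                for n in range(8):
                    acc += (x[m][n]
                            * math.cos((2 * m + 1) * k * math.pi / 16.0)
                            * math.cos((2 * n + 1) * l * math.pi / 16.0))
            y[k][l] = 0.25 * c(k) * c(l) * acc
    return y
```

As a sanity check, a constant block of value 128 transforms to a single DC coefficient of 1024 (i.e. 8×128), with all AC coefficients equal to zero, as expected for a signal with no spatial variation.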
The quantization matrices are not standardized, but JPEG suggests a quantization matrix using values
corresponding to the minimum perceptual differences for each DCT coefficient; this basic quantization matrix
may be used to generate 'lower quality' quantization matrices by multiplying this matrix by a certain integer
quantization factor. Considering the HVS characteristics, the quantization steps used are typically lower for the
lower frequencies and higher for the higher frequencies. In this way, more quantization noise is injected in the
less perceptually relevant frequencies, the higher frequency coefficients; this is very important to exploit the
signal irrelevance, this means avoiding the transmission of image information that cannot be visually perceived.
The quantization matrices have to be transmitted, or simply signaled in case the suggested quantization matrix is used.
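The matrix scaling and the quantization itself can be sketched as follows. This is a simplified illustration (function names and the plain integer-factor scaling scheme are assumptions made here in the spirit of the text, not the exact scaling used by common JPEG implementations):

```python
def scale_matrix(base_q, factor):
    """Derive a 'lower quality' quantization matrix by multiplying the
    base perceptual matrix by an integer quantization factor."""
    return [[q * factor for q in row] for row in base_q]

def quantize(coeffs, qmat):
    """Uniform quantization: map each DCT coefficient to the nearest
    multiple of its (position-dependent) quantization step."""
    return [[round(c / q) for c, q in zip(crow, qrow)]
            for crow, qrow in zip(coeffs, qmat)]
```

Doubling the factor doubles every step, so more coefficients fall to zero and more quantization noise is injected, which is precisely the compression/quality trade-off discussed above.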
A.1.4. Performance Evaluation
The quality of the JPEG decoded images greatly depends on the quantization steps used for the encoding
process. For higher quantization steps, the compression ratio will increase, but the quality of the reconstructed
image will suffer from the data reduction; this means fewer coefficients are coded or the same coefficients are
coded but with more quantization noise. It is important to understand that the compression performance of a
JPEG encoder will strongly depend on the choices made by the encoder in terms of which coefficients are coded
and which quantization steps are used for each coded coefficient. For example, in Figure A.3, the same image is
encoded with a small quantization step (left side) and with a large quantization step (right side). Despite the
greater compression ratio achieved for the right side image in Figure A.3, it has very low quality when compared
to the left side image, with extreme loss of color and detail. The image coded using large quantization steps
shows very well the typical coding artifact resulting from a block based transform coding solution like JPEG: the
block effect. Since the image is coded as (artificially) independent blocks, with the exception of the DC
coefficient prediction, when the number of bits per block is reduced, fewer coefficients are sent and more
quantization noise is inserted, boosting the impact of the block boundaries; this is very evident for some blocks
of the right side image in Figure A.3 where only the DC coefficient is transmitted.
Figure A.3 – Image coded with JPEG using small quantization steps (compression ratio is 2.6:1) on the left side
and using large quantization steps (compression ratio is 144:1) on the right side [43].
The compression ratio achieved for a specific image will also depend on its particular characteristics, e.g., for
highly detailed images there isn't much spatial redundancy to exploit; thus, the amount of data required to
represent these images can't be as reduced as for smoother, lower frequency images. For example in [14], it is
stated that transparent quality may be typically reached at about 1.5-2 bit/pixel while a medium to good quality,
enough for some applications, may be reached at about 0.25-0.5 bit/pixel.
A.2. JPEG 2000 Standard
JPEG 2000 is another image coding standard created by the JPEG committee around 2000 [44], that is, more
than 10 years after the JPEG standard. Officially, JPEG 2000 corresponds to the ISO/IEC International Standard
15444-1.
A.2.1. Objectives
The JPEG 2000 standard was created with the objective of providing improved compression performance and
subjective image quality when compared to the existing standard from the same standardization body, the JPEG
standard. It was also intended to be more flexible than the JPEG standard, being suitable for different types of
still images (e.g. bilevel, grayscale, color, etc), with different characteristics (e.g. natural, computer generated,
medical, text, etc) and with different imaging models (e.g. real-time transmission, image library archival, limited
bandwidth resources, etc), that is, suitable for a wide number of applications, e.g. Internet, color facsimile,
printing, scanning, digital photography, medical imagery, E-commerce, etc. To fulfill these goals, the JPEG
2000 was created with a number of requirements in mind, mainly:
Good compression performance at low bitrates;
Lossless and lossy compression;
Progressive transmission by quality, resolution, component and spatial locality (i.e. scalability);
Random (spatial) access to the bitstream;
Robustness to bit-errors.
Besides the improvement of the compression performance and quality when compared to the JPEG standard,
JPEG 2000 defined a very important new objective: scalability. Thus, JPEG 2000 is defined in such a way to
allow the extraction of different resolutions, pixel fidelities, SNR and visual quality, and more, all from a single
compressed bit-stream. With this feature, it is possible to use this standard for any target device, transmitting only
the essential or possible data.
A.2.2. Technical Approach and Architecture
The JPEG 2000 encoder architecture is illustrated in Figure A.4. Before proceeding with the walkthrough of the
encoding process illustrated in Figure A.4, it should be noted that each image may be coded as a whole or
divided into tiles. Tiles are rectangular non-overlapping areas that are compressed independently, as if they were
entirely distinct images; most often there is a single tile, meaning the full image is one tile.
Figure A.4 – JPEG 2000 encoder architecture [45].
A short walkthrough of the encoding process is presented next:
1. Forward DWT – First, each tile (or the whole image) is transformed using a 2-D DWT. With this
transform, the image components, e.g. typically luminance and chrominances, are decomposed into
different resolution levels. These decomposition levels are made up of sub-bands populated with DWT
coefficients describing the frequency characteristics of local areas of each image component.
2. Quantization – The DWT coefficients are then quantized. This quantization process is described with
more detail in the next section.
3. Entropy encoder – Then, each sub-band of the DWT decomposition is divided up into regular non-
overlapping rectangular blocks, called code-blocks. Entropy coding is performed independently on each
code-block, bitplane by bitplane. Bitplanes are binary arrays representing a code-block from its Most
Significant Bit (MSB) to its Least Significant Bit (LSB), as shown in Figure A.5. Each individual
bitplane is coded with Context-based Adaptive Binary Arithmetic Coding (CABAC), resulting in
compressed bit-streams for each code-block.
Figure A.5 – Example of a bitplane from a particular code-block [14].
4. Bit-stream organization – In this step, the compressed bit-streams are organized in packets. Each
packet can be interpreted as one quality increment, for one resolution level, at one spatial location.
These packets can also be grouped in layers where each layer can be interpreted as one quality
increment for the entire image at full resolution.
With the utilization of a wavelet transform and the organization of the codestream as described above, JPEG
2000 assures quality and spatial resolution scalability.
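The decomposition of a code-block into bitplanes in step 3 can be illustrated with the following Python sketch. It is deliberately simplified (the function name is an assumption here, and the sign handling and the actual EBCOT coding passes of JPEG 2000 are omitted):

```python
def bitplanes(codeblock):
    """Split the magnitudes of a code-block into binary bitplanes,
    most significant plane first. Only the magnitude bits are shown;
    signs and the JPEG 2000 coding passes are omitted in this sketch."""
    mags = [[abs(v) for v in row] for row in codeblock]
    max_mag = max(max(row) for row in mags)
    nplanes = max(1, max_mag.bit_length())
    planes = []
    for p in range(nplanes - 1, -1, -1):  # from MSB plane down to LSB plane
        planes.append([[(v >> p) & 1 for v in row] for row in mags])
    return planes
```

Truncating the bitstream after a given number of planes yields a coarser (but still decodable) version of the coefficients, which is exactly what enables the quality scalability described above.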
A.2.3. Transform and Quantization
As noted above, the JPEG 2000 standard uses a 2-D DWT. This transform can be:
Irreversible – The default irreversible transform is implemented by means of the Daubechies 9/7 filter;
this filter is used for lossy coding. The analysis and the corresponding synthesis filter coefficients are
given in Table A.1.
Table A.1 – Irreversible Daubechies 9/7 analysis and synthesis filter coefficients [45].
Reversible – The default reversible transform is implemented by means of the 5/3 filter, whose
coefficients are given in Table A.2; this filter is used for lossless coding.
Table A.2 – Reversible 5/3 analysis and synthesis filter coefficients [45].
Figure A.6 shows an example of the DWT used in JPEG 2000. In this case, a three-level DWT decomposition
using the Daubechies 9/7 filter is shown.
Figure A.6 – Example of a 3-levels DWT decomposition as used in JPEG 2000 [46].
From the observation of Figure A.6, it is possible to identify the various DWT decomposition levels, with each
level providing more data to the final image, thus allowing a higher resolution.
After the transformation, the DWT coefficients are subject to uniform scalar quantization, employing a fixed
dead-zone around the origin. This is accomplished by dividing the magnitude of each coefficient by a
quantization step size and rounding down. One quantization step size is allowed per sub-band. The standard does
not define any method for the step size selection, so several methods can be used at will. A possible way to
select the quantization steps is related to the visual importance of each sub-band's coefficients for the final image
quality, selecting larger step sizes for the less important coefficients and vice-versa.
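The dead-zone scalar quantizer described above (divide the magnitude by the step size and round down, keeping the sign) can be sketched as follows; the function name is a choice made here for illustration:

```python
import math

def deadzone_quantize(coeff, step):
    """Uniform scalar quantization with a dead-zone around the origin,
    as applied to JPEG 2000 sub-band coefficients: the magnitude is
    divided by the quantization step and rounded down (truncated),
    and the sign is kept."""
    sign = -1 if coeff < 0 else 1
    return sign * math.floor(abs(coeff) / step)
```

Because the magnitude is always rounded down, the interval around zero that maps to index 0 is twice as wide as the other intervals, which is what gives the quantizer its 'dead-zone' and helps discard perceptually negligible coefficients.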
A.2.4. Performance Evaluation
To check if the initially defined goals were achieved, the JPEG 2000 performance is compared here with that of
the earlier JPEG standard. The superiority of JPEG 2000 can be subjectively judged with the help of Figure
A.7, where part of the reconstructed image Woman is shown after compression at 0.125 bpp (bits per pixel), and
Figure A.8, which shows the same result after compression at 0.25 bpp.
Figure A.7 – Reconstructed images compressed at 0.125 bpp by means of (a) JPEG and (b) JPEG 2000 [47].
Figure A.8 – Reconstructed images compressed at 0.25 bpp by means of (a) JPEG and (b) JPEG 2000 [47].
For the lower bitrates, the quality of the reconstructed images using JPEG 2000 is clearly better than using
JPEG, as shown in Figure A.7 and Figure A.8, since JPEG 2000 does not suffer from the block effect. As the
bitrate increases, the JPEG 2000 performance superiority decreases also because the block effect tends to
disappear. Visual comparisons of JPEG compressed images and JPEG 2000 compressed images show that, for a
large category of images, JPEG 2000 file sizes are on average 11% smaller than JPEG at 1.0 bpp, 18% smaller at
0.75 bpp, 36% smaller at 0.5 bpp and 53% smaller at 0.25 bpp [47]. However, even though JPEG 2000 can
achieve higher compression ratios for the same quality when compared to JPEG, this comes at the price of
additional complexity [48], which can be perceived as a drawback for some applications requiring low
complexity coding. For these applications, JPEG may still be the best solution.
A.3. H.261 Recommendation
H.261 is a 1990 video coding standard developed by the VCEG of the ITU-T [49]. It is officially known as
Recommendation ITU-T H.261 and was the first international video coding standard with relevant market adoption.
A.3.1. Objectives
This standard was designed for videotelephony and videoconference applications over Integrated Services
Digital Network (ISDN) telephone lines. The ISDN lines typically have bitrates that are multiples of 64 kbit/s
(p×64 kbit/s). H.261 operates at bitrates between 40 kbit/s and 2 Mbit/s and supports QCIF (176×144 pixels)
and, optionally, CIF (352×288 pixels) spatial resolutions at 4:2:0 subsampling (each chrominance is subsampled
by a factor of 2, both horizontally and vertically). The coding algorithm operates over progressive content at 30
frames/s but this frame-rate can be reduced by skipping 1, 2 or 3 frames for each transmitted one.
Because of its target applications, this standard has critical delay requirements, in order to allow a normal
bidirectional conversation. On the other hand, its quality requirements are not so critical since, in this case, a
lower or intermediate quality may be enough for a good personal communication.
4 Formally speaking, ITU issues recommendations and ISO/IEC issues standards.
A.3.2. Technical Approach and Architecture
To achieve high compression efficiency, video coding solutions have to exploit the spatial redundancy, typically
using a transform, the temporal redundancy, typically making some prediction in time, and the statistical
redundancy, typically through entropy coding. This would result in a lossless video coding solution. However,
since a lossless video coding solution would not achieve the necessary compression factors, video coding
solutions also exploit the visual irrelevancy to eliminate, through quantization, all the information which is not
perceptually relevant; this would result in transparent quality (perceptually similar to the original quality). If
higher compression factors are necessary, the encoder may also eliminate relevant information, thus implying
there is some quality degradation regarding the original quality (although this should happen in the most
graceful way possible).
The basic units for H.261 video coding are the macroblocks (MBs). Each macroblock corresponds to 16×16
luminance samples. In H.261, there are two main ways of coding each macroblock:
Intra-coding – These macroblocks are basically coded using the same techniques used in JPEG, which
are applied to the macroblock. In this case, no temporal redundancy is exploited. Intra-coding is
mainly used for the first picture, for later pictures after a change of scene and also for the macroblocks
corresponding to novel 'objects' in the scene. For the intra-coded macroblocks, the encoding process
has the following steps (illustrated in Figure A.9):
o Forward DCT – The macroblock is divided in 8×8 blocks, which are transformed using a 2-
D forward DCT.
o Quantization – The resulting DCT coefficients are then quantized.
o Entropy encoder - All quantized coefficients are then ordered in a 1-D zigzag sequence.
Each coefficient is represented using a bi-dimensional symbol, (run, level), where its position
and quantization level are indicated. To exploit the statistical redundancy, these symbols are
then coded using Huffman coding.
Figure A.9 – Basic H.261 intra-encoding architecture [50].
Inter-coding – With this coding mode, it is possible to use information from previous frames to code
the current frame, taking advantage of the temporal redundancy between neighbor frames. Moreover,
this coding mode can also detect, estimate and compensate the motion in the sequence, making much
improved temporal predictions, thus reducing the prediction error. Inter-coding is used in sequences of
similar pictures, including those containing moving objects. For the inter-coded macroblocks, the
encoding process considers the following steps (illustrated in Figure A.10):
o Motion estimation – To assess the existence of motion, the current macroblock is compared
with the macroblocks in the neighborhood of the corresponding macroblock in the previous
frame. If motion is detected, its horizontal and vertical directions are stored in two integers,
the motion vector components. The motion vectors (MV) are then entropy encoded. Although
very important to increase the compression efficiency, motion estimation implies a very
high computational effort.
o Sending the differences – If there is motion estimated in the previous step, the difference
between the current macroblock and the prediction macroblock is computed performing the
so-called motion compensation. Otherwise, the difference (prediction error) is computed
between the current macroblock and the corresponding macroblock in the previous frame.
These differences, which should ideally be as small as possible, are then transformed,
quantized and entropy encoded.
Figure A.10 – Basic H.261 inter-encoding architecture [50].
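The motion estimation step above can be illustrated with a full-search block matching sketch using the Sum of Absolute Differences (SAD) as matching criterion. H.261 does not standardize the search strategy, so the exhaustive search and the function name below are assumptions for illustration only:

```python
def full_search(cur_block, ref_frame, bx, by, search_range, n=16):
    """Exhaustive block-matching motion estimation: compare the current
    n x n block (top-left at (bx, by)) against every candidate block in
    the reference frame within +/- search_range, and return the motion
    vector (dx, dy) of the candidate with the smallest SAD."""
    h, w = len(ref_frame), len(ref_frame[0])
    best, best_sad = (0, 0), float('inf')
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > w or y0 + n > h:
                continue  # candidate block falls outside the frame
            sad = sum(abs(cur_block[r][c] - ref_frame[y0 + r][x0 + c])
                      for r in range(n) for c in range(n))
            if sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best, best_sad
```

The quadratic number of candidates per block makes clear why motion estimation dominates the encoder's computational effort, and why fast (sub-optimal) search algorithms are used in practice.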
It is important to note that for both coding modes – intra and inter – the encoder has to perform the corresponding
decoding process in order to store the decoded information for future inter-coding. The prediction process may
be modified by a loop filter (LF) that can be switched on and off to improve the picture quality by removing
high-frequency noise when needed.
A.3.3. Transform and Quantization
The transform used in the H.261 standard is very similar to the one used in JPEG. It is a 2-D separable DCT of
size 8×8. Before the computation of the transform, the data range is also arranged to be centered on zero; this
means a subtraction of 128 is applied to the samples in the 0-255 range for 8-bit samples.
H.261 can use as quantization steps all even values between 2 and 62. Within each macroblock, all DCT
coefficients are quantized with the same quantization step, with the exception of the DC coefficient of intra-
coded macroblocks, which is always quantized with step 8, due to its critical perceptual relevance.
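These H.261 quantization rules can be sketched as follows. The function name and the simple truncation used for reconstruction are illustrative choices; the exact reconstruction rules of the standard are omitted:

```python
def h261_quantize_intra(coeffs, step):
    """Sketch of H.261-style intra-block quantization: the DC
    coefficient (position 0,0) always uses step 8, while every AC
    coefficient uses the macroblock quantization step, which must be
    an even value between 2 and 62."""
    assert step % 2 == 0 and 2 <= step <= 62, "invalid H.261 step"
    out = [row[:] for row in coeffs]
    for r in range(8):
        for c in range(8):
            q = 8 if (r, c) == (0, 0) else step
            out[r][c] = int(coeffs[r][c] / q)  # truncate toward zero
    return out
```

Keeping the DC step fixed at 8 whatever the macroblock step protects the perceptually critical average brightness of each block even at coarse quantization.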
A.3.4. Performance Evaluation
As the first video coding standard, H.261 does not have any previous video coding standard to be compared
with. Still, it is possible to evaluate its performance depending on the available bitrate, the characteristics of the
video sequences and, very importantly, the used encoding tools. For example, Figure A.11 shows the image
quality (using the PSNR as quality metric) versus the bitrate for the well-known videotelephony sequence Miss
America with QCIF resolution; the chart shows RD performance results for the sequence coded at 30 frames/s
and at 10 frames/s; moreover, results are shown with and without motion vectors, and with and without a low-
pass loop filter when motion vectors are used (+MV+LF).
Figure A.11 – Average PSNR (dB) versus bitrate (kbit/s) for various H.261 combinations of tools for the Miss
America sequence [51].
Observing Figure A.11, it is clear that the image quality depends greatly on the available bitrate; as expected, for
the lower bitrates, the average PSNR is lower than for higher bitrates. For a certain bitrate, the video sequence at
10 frames/s has, on average, more bits per frame than the video sequence at 30 frames/s. Thus, the average
PSNR for the video sequence with the lower frame rate is typically higher, although the motion impression may
not be as good if the sequence has more intense motion. Since the motion estimation process is lossless, using it
reduces the prediction error and increases the average PSNR: for a certain fixed bitrate, the saved bits allow
reducing the quantization step applied to the coefficients of the inter-coded and intra-coded macroblocks.
In Figure A.12, the PSNR variation against the compression ratio is shown for the same video sequence. With
the increase of the compression ratio, the number of bits available to represent the video sequence decreases; this
results in a reduction of the average PSNR value and a consequent degradation of the image quality.
Figure A.12 – Average PSNR (dB) versus compression ratio for various H.261 combinations of tools for the
Miss America sequence [51].
Analyzing both charts, it is possible to conclude that the introduction of motion compensation and a loop filter
always improves the quality of the reconstructed video sequence for all the bitrates and compression ratios,
naturally at the price of some additional computational complexity. The improvements are more noticeable for
the lower bitrates and higher compression ratios.
A.4. MPEG-1 Video Standard
The MPEG-1 Video standard was the first video coding standard defined by the MPEG. It was finalized around
1993 and it is formally known as ISO/IEC 11172-2 [52].
A.4.1. Objectives
The main target of the MPEG-1 standard was to efficiently compress audiovisual information for digital storage,
notably to digitally store a VHS quality audiovisual sequence on a Compact Disc (CD). Thus, the MPEG-1
standard defines video and audio codecs in its associated Video and Audio parts.
For MPEG-1 Video, the target bitrate is around 1.2 Mbit/s to compress CIF resolution video at 25 Hz. Unlike
H.261, MPEG-1 Video does not have critical real-time requirements since the main target is not real-time
applications; however, it has some other critical requirements related to digital video storage, such as random
access, to provide the typical storage functionalities, such as fast forward and reverse playback, edition, etc. This
standard was originally optimized for the SIF, which has 352×288 pixels at 25 Hz and 352×240 pixels at 30 Hz, with
4:2:0 subsampling.
A.4.2. Technical Approach and Architecture
Besides the prediction methods used in H.261, where a macroblock can be predicted from a macroblock in the
previous frame (forward prediction), MPEG-1 Video also adopts backward prediction, based on the principle
that a macroblock can be predicted also taking as reference a future frame macroblock. This type of temporal
prediction has its costs, especially in terms of coding delay and complexity, which may be acceptable considering
that real-time applications are not the main target and offline coding is the main application scenario.
Because of the required storage facilities referred before, MPEG-1 Video defines three types of frames
depending on the coding tools used:
Intra-frames (I-frames) – The I-frames include only intra-coded macroblocks. These frames are
mainly used to provide random access since they do not depend on any other frames. They also prevent
error propagation associated with channel errors, since all the other frame types depend on other
frames and, thus, may propagate their errors.
Inter-frames – The inter-frames may include intra and inter-coded macroblocks. There are two
classes of inter-frames in MPEG-1 Video:
o P-frames – In these frames, the inter-coded macroblocks can only be predicted from
macroblocks from the previous I or P-frame (forward prediction).
o B-frames – The inter-coded macroblocks in B-frames can use forward prediction, backward
prediction or an average of both forward and backward predictions, the so-called bidirectional
prediction. These predictions may only be based on the adjacent I and P-frames. B-frames
typically require fewer bits than any other frame type for a certain quality; however, if too
many B-frames are successively used, the coding delay increases and the compression
efficiency is reduced since the reference frames (I or P) for the B frames will be farther away
and, thus, the prediction error will be higher.
It is important to stress that the typical additional compression efficiency of P-frames regarding I-frames and of
B-frames regarding P-frames is deeply related to the additional complexity associated with the motion estimation
process (with one and two reference frames for P and B frames, respectively) and the additional delay for B
frames.
The MPEG-1 video encoder architecture is presented in Figure A.13.
Figure A.13 – Basic MPEG-1 Video encoder architecture [53].
The walkthrough of the architecture shown in Figure A.13 is presented next:
For intra-coded macroblocks
o Forward DCT – After splitting the macroblock in 8×8 blocks, the samples are transformed
using a 2-D forward DCT.
o Quantization – Subsequently, the DCT coefficients are quantized.
o Entropy encoder – Finally, the quantized DCT coefficients are entropy encoded using
Huffman coding.
For inter-coded macroblocks
o Motion estimation – The previous and the future (I or P) prediction frame(s) macroblocks are
compared to the current macroblock. If this operation detects motion, the motion vectors are
entropy encoded. MPEG-1 Video uses half-pixel motion estimation accuracy to allow a more
precise estimation of the motion with the consequent reduction of the prediction error.
o Sending the differences – If there is motion detected, the differences are coded using motion
compensation. Otherwise, they are simply predicted by the relevant prediction frame(s)
corresponding macroblock(s). These differences are then transformed, quantized and entropy
encoded.
This walkthrough is valid for all the standards presented in the next sections; thus, it will not be repeated, and
only relevant differences will be referred.
A.4.3. Transform and Quantization
The MPEG-1 Video standard uses a 2-D separable DCT of size 8×8; this is not different from the transform used
in both the JPEG and H.261 standards.
The quantization process used in MPEG-1 Video is similar to the one used in JPEG. The quantization step may
be different for each DCT coefficient and it is defined with quantization matrices. There are two basic standard
quantization matrices: one for intra-coding and another for inter-coding (see Figure A.14). For inter-coding, the
high frequency coefficients are not necessarily associated to high frequency content since they can result from
block effects in the reference image(s), poor motion compensation or camera noise; in this context the
quantization steps are constant. For intra-coding, absolute energies are being coded and, thus, their quantization
should take into account the visual sensitivity to the various spatial frequencies. The quantization matrices may
be changed to achieve a better coding efficiency. Like H.261, the DC coefficients of intra-coded macroblocks
are always quantized with step 8.
Figure A.14 – MPEG-1 Video standard quantization matrices [54].
In MPEG-1 Video, the DC coefficients are differentially coded within each macroblock and between neighbor
macroblocks. This is done in order to exploit the similarities between the adjacent blocks DC coefficients.
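The differential DC coding mentioned above can be sketched in a few lines; in this simplified illustration (the function name is a choice made here) the predictor simply starts at zero, whereas the real codec resets it according to the standard's rules:

```python
def dc_differences(dc_values):
    """Sketch of differential DC coding: each DC coefficient is coded
    as its difference with respect to the previous DC coefficient,
    starting from a predictor of zero in this simplified example."""
    diffs, prev = [], 0
    for dc in dc_values:
        diffs.append(dc - prev)
        prev = dc
    return diffs
```

Since neighboring blocks tend to have similar average brightness, the resulting differences are small and cheap to entropy encode, which is the whole point of exploiting this inter-block similarity.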
A.4.4. Performance Evaluation
The technical improvements introduced in MPEG-1 Video bring a significant increase in terms of compression
efficiency when compared to H.261, notably the bidirectional predictions and the half pixel motion accuracy;
this increase typically comes at the cost of some computational complexity and delay. For video storage, these
costs are not as critical as for real-time video communications. Therefore, MPEG-1 Video fulfils its main
objective of providing a powerful video compression solution for video storage.
For less complex sequences and lower bitrates, H.261 typically achieves higher compression ratios than MPEG-
1 Video at comparable qualities, since MPEG-1 Video was optimized for bitrates in the range of 1.2 Mbit/s [55].
Thus, for videotelephony and videoconferencing, whose content typically has less complex motion, where lower
bitrates are typically available and where lower computational complexity and real-time performance are required,
the H.261 standard may still be the better choice between these two standards. However, for more general video
content, like movies, MPEG-1 Video provides significant compression efficiency advantages at the costs already
mentioned.
A.5. MPEG-2 Video Standard
The MPEG-2 Video standard (MPEG-2 Part 2) was finalized around 1996 in a joint collaborative team where
MPEG and ITU-T joined efforts [56]. It is formally known as ISO/IEC standard 13818-2 and Recommendation
ITU-T H.262.
A.5.1. Objectives
Jointly developed by both the ISO/IEC MPEG and the ITU-T VCEG standardization groups, this was the first
video coding standard created for both broadcasting and storage. MPEG-2 Video is designed to code high
quality, high resolution video sequences without noticeable quality loss, notably with the following quality
targets:
Secondary distribution – For broadcasting to the users, the signal quality at 3-5 Mbit/s must be better
than or similar to the quality of the available analogue systems, i.e. PAL, SECAM and NTSC.
Primary distribution – For contribution (e.g. transmission between studios), the signal quality at 8-10
Mbit/s must be similar to the original quality; this means the quality of the raw PCM representation.
The main MPEG-2 Video target applications are digital television transmission (i.e. cable, satellite and terrestrial
broadcasting) and Digital Video Disc (DVD) storage. Initially, the MPEG-2 Video standard was intended to
cover video coding up to 10 Mbit/s, leaving higher bitrates and spatial resolutions for another standard to
be labeled MPEG-3. However, MPEG-3 was never defined, since MPEG-2 Video also addressed the HD space
in an efficient way.
Unlike the previously reviewed standards, MPEG-2 Video targets the coding of interlaced video content, in
addition to the usual progressive video content. This is useful for historical reasons, as analogue TV is
interlaced. Another feature introduced by this standard is scalability (i.e. temporal, spatial and fidelity); this
functionality may be especially useful to accommodate transmissions in heterogeneous networks and to various
types of terminals, e.g. with standard or HD resolution.
Because the MPEG-2 Video standard addresses a vast range of applications, the standard and, thus, the
associated tools have been structured in terms of Profiles and Levels. A Profile defines a subset of the coding
tools and, thus, of the bitstream syntax, providing a variety of features required by some applications with a
certain degree of complexity, e.g. interlaced coding, B-frames and scalability. Within each Profile, Levels are
defined to limit the range of operating parameters, such as the spatial resolution (352×288 to 1920×1152) and
bitrate (4 Mbit/s to 80 Mbit/s).
A.5.2. Technical Approach and Architecture
The coding tools used in MPEG-2 Video are very similar to those used in MPEG-1 Video. The two main
differences are related to the two main additional functionalities:
Interlaced coding – With the MPEG-2 Video standard, it is possible to code interlaced video content,
which is the format used by analogue broadcast TV systems.
Scalable coding – The MPEG-2 Video standard allows temporal scalability (i.e. change of frame rate),
spatial scalability (i.e. change of resolution) and fidelity scalability (i.e. change of quality). When
creating scalable bitstreams, a bitrate overhead typically arises when compared to the corresponding
non-scalable streams.
The MPEG-2 Video encoder core architecture, i.e. without scalable coding, is presented in Figure A.15. It
is important to mention that temporal scalability is already available in the MPEG-1 Video standard, as it
naturally results from the I, P and B temporal prediction structure, without any bitrate burden. This means that
the additional scalability capabilities in MPEG-2 Video mainly refer to spatial resolution and quality scalability.
Figure A.15 – Basic MPEG-2 Video encoder architecture [35].
A.5.3. Transform and Quantization
The MPEG-2 Video standard uses the same 2-D DCT used in MPEG-1 Video. For interlaced video content, it is
possible to use an alternate scanning order for the DCT coefficients (shown in Figure A.16). With this alternative
scanning order, the DCT coefficients corresponding to vertical transitions are privileged in terms of scanning
order, since the vertical correlation is reduced for interlaced pictures with more motion.
MPEG-2 Video uses the same quantization techniques used in MPEG-1 Video, also making use of previously
presented quantization matrices. Once again, the DC coefficients of intra-coded macroblocks are always
quantized with step 8.
Figure A.16 – Zigzag and alternate scanning order for interlaced video content [35].
A.5.4. Performance Evaluation
In comparison to MPEG-1 Video, it is clear that MPEG-2 Video can produce better quality for interlaced video
regardless of the motion content [57]. However, for progressive video and for MPEG-1 Video target bitrates
(around 1.2 Mbit/s), MPEG-1 Video outperforms MPEG-2 Video. This is due to the fact that MPEG-2 Video
has a more complicated syntactical structure, which can increase the overhead information burden at lower
bitrates. For higher bitrates (3 Mbit/s and above), even for progressive video, the MPEG-2 Video standard can
achieve improved quality (for the same rate) in comparison to MPEG-1 Video [58], since the latter was not
optimized for this range of bitrates. These results are in conformity with the initially defined standard objectives,
allowing very efficient compression for high resolution and quality video.
A.6. H.263 Recommendation
Recommendation H.263 was finalized around 1995 by the ITU-T VCEG standardization group [59]. It is
formally known as ITU-T Recommendation H.263.
A.6.1. Objectives
The H.263 standard was created with the intention of replacing H.261 by improving its compression efficiency,
notably for lower bitrates. The main motivation behind the creation of this standard was the lack of a
standard that could assure interoperability between digital videotelephony terminals over the analogue telephone
network (PSTN) and the emerging mobile networks. The standardization process had to be quick to allow a
fast deployment of interoperable products in the market. Thus, the H.263 standard is mostly based on
existing technology, particularly the H.261 and MPEG-1 Video coding tools.
A.6.2. Technical Approach and Architecture
Although H.261 and H.263 share the same basic coding structure, there are some differences between them.
Some of these differences are improvements that were already present in MPEG-1 Video. The main differences
between the H.261 and the H.263 standards are:
Target bitrate – The H.261 target bitrate is p×64 kbit/s (p = 1,2,…,30), whereas H.263 also aims at
bitrates below 64 kbit/s to allow videotelephony over the PSTN.
Picture formats – Besides the formats already used in H.261 (i.e. QCIF and CIF), H.263 also supports
the sub-QCIF, 4CIF and 16CIF formats.
Motion compensation accuracy – Like the MPEG-1 Video standard, H.263 supports half-pixel
accuracy.
Motion vector prediction – Motion vectors are coded differentially as in H.261, but, besides the
preceding macroblock, also macroblocks in the previous macroblock-row are used for motion vector
prediction; this allows increasing the bitstream error resilience.
PB-frames mode – A PB-frame consists of two pictures coded as one unit. The P-frame is predicted
from the last decoded P-frame and the B-frame is predicted from both the last and the current P-frame.
This allows increasing the decoded frame rate at a rather low bitrate cost.
VLC tables – H.263 codes the DCT coefficients using triplets that add a last-coefficient flag to the
(run, level) pairs, thus avoiding the explicit coding of the eob (End Of Block) symbol.
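The triplet construction can be illustrated as follows, assuming a zigzag-scanned coefficient list as input (a sketch of the event formation only; the actual VLC table lookup is omitted):

```python
def runlevel_triplets(scanned):
    """Convert a zigzag-scanned coefficient list into (last, run, level)
    events: 'run' counts the zeros preceding each nonzero coefficient,
    'level' is its value, and last = 1 marks the final nonzero
    coefficient, replacing an explicit end-of-block symbol."""
    events, run = [], 0
    for c in scanned:
        if c == 0:
            run += 1
        else:
            events.append([0, run, c])   # last flag fixed up below
            run = 0
    if events:
        events[-1][0] = 1                # mark the final event as 'last'
    return [tuple(e) for e in events]

print(runlevel_triplets([9, 0, 0, -2, 1, 0, 0, 0]))
# → [(0, 0, 9), (0, 2, -2), (1, 0, 1)]
```

Note how the trailing zeros never produce an event: the decoder stops after the triplet flagged as last, which is exactly what makes the explicit eob symbol unnecessary.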
The H.263 encoder architecture is presented in Figure A.17.
Figure A.17 – Basic H.263 encoder architecture [60].
A.6.3. Transform and Quantization
In H.263, the transform used is the same 2-D DCT used in H.261. As usual, this transform is applied to 8×8
blocks.
In terms of quantization, H.263 uses the same method described for H.261. The same step size (even values
between 2 and 62) is used for all the coefficients in the same macroblock, except the DC coefficient of the intra-
coded macroblocks, which is quantized with step 8.
A.6.4. Performance Evaluation
The H.263 standard outperforms H.261 in terms of compression efficiency for any bitrate, even above 64 kbit/s [61].
With this performance, it is possible to say that H.263 successfully replaced H.261 as the video compression
standard for lower bitrate communications. Furthermore, the H.263 complexity is only marginally higher than
the H.261 complexity [61].
A.7. MPEG-4 Visual Standard
The MPEG-4 Visual standard was finalized around 1999 by MPEG [62]. It is also called MPEG-4 Part 2 and it
is formally known as ISO/IEC 14496-2.
A.7.1. Objectives
The MPEG-4 Visual has the main target of specifying the codecs for various types of visual objects to be used in
the context of the MPEG-4 standard which adopted for the first time an object-based (and not frame-based)
visual representation paradigm. In this context, the MPEG-4 standard targets a large range of applications (e.g.
surveillance, mobile communications, streaming over the Internet/Intranet, digital TV, studio postproduction,
etc). The MPEG-4 Visual standard specifies codecs for natural and synthetic visual objects; in terms of video
codecs, it specifies both codecs for rectangular and arbitrarily shaped video objects. For rectangular objects, the
spatial resolution goes from sub-QCIF to studio resolutions around 4k×4k pixels; naturally, a frame, as
considered in the previous standards, is a particular case of a video object.
A.7.2. Technical Approach and Architecture
The MPEG-4 Visual standard includes tools for coding natural video and still images (visual textures). This
allows the coding of scenes containing both moving and still images using the same standard. Each scene to be
coded can be composed of one or several video objects. In object-based coding, the video frames are defined in
terms of Video Object Planes (VOP). Each VOP is then the momentary video representation of a specific object
of interest to be coded or to be interacted with. Each video object is encapsulated by a rectangular bounding box,
which is then divided into 16×16 pixel macroblocks that can be classified as (see Figure A.18):
Transparent – Macroblocks in the bounding box that are completely outside the VOP; these
macroblocks do not need to be coded.
Opaque – Macroblocks in the bounding box that are completely inside the video object plane; these
macroblocks are intra or inter-coded using motion compensation and DCT encoding.
Boundary – Macroblocks in the bounding box that include the boundary of the video object plane;
these macroblocks are processed with specific tools for coding arbitrarily shaped objects.
Figure A.18 – Macroblock classification in MPEG-4 Visual [62].
Regarding rectangular (or frame-based) video coding, which is functionally similar to the frame-based coding
solutions previously reviewed, there are some improvements introduced by MPEG-4 Visual, notably in terms of
motion compensation:
Quarter-pixel motion compensation – Motion compensation supports motion vectors with an
increased accuracy, notably one-quarter pixel, allowing improved predictions, thus reducing the
prediction error.
Global motion compensation – Instead of using local motion vectors for each macroblock, this tool
also allows using a single motion vector for a whole video object plane (which may be a frame). This can be
important for sequences with a large portion of global translational motion (e.g. a camera panning) and
also for non-translational motion (e.g. zoom or rotation).
Direct mode in bidirectional prediction – This is a generalization of the PB-frames mode introduced in
H.263. Both forward and backward predictions are used, but the required motion vectors are derived
from the motion vector of the collocated macroblock in the backward reference, and only a correction
term called delta vector is transmitted.
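A minimal sketch of this derivation, assuming the usual temporal-distance scaling of the collocated vector (the integer-division and rounding details of the actual standard are omitted, and the function name is hypothetical):

```python
def direct_mode_vectors(mv_col, trb, trd, delta=(0, 0)):
    """Sketch of direct-mode motion vector derivation: the forward and
    backward vectors of a B macroblock are scaled from the collocated
    macroblock's vector mv_col using the temporal distances trb
    (B picture to past reference) and trd (past to future reference);
    only the small correction 'delta' is transmitted."""
    mvf = tuple(trb * v // trd + d for v, d in zip(mv_col, delta))
    if delta == (0, 0):
        mvb = tuple((trb - trd) * v // trd for v in mv_col)
    else:
        mvb = tuple(f - v for f, v in zip(mvf, mv_col))
    return mvf, mvb

# Collocated vector (6, -3), B frame one third of the way between references
print(direct_mode_vectors(mv_col=(6, -3), trb=1, trd=3))
```

The appeal of the mode is visible here: two full motion vectors are obtained from the bitstream cost of, at most, one small delta vector.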
The MPEG-4 Visual rectangular video objects encoder architecture is presented in Figure A.19.
Figure A.19 – Basic MPEG-4 Visual encoder architecture (for rectangular video objects) [60].
MPEG-4 Visual still objects, also called visual textures, are coded based on a wavelet transform coding solution,
similar to the one adopted by the JPEG 2000 standard.
A.7.3. Transform and Quantization
MPEG-4 Visual also uses a 2-D DCT to transform the 8×8 blocks that compose a macroblock. In MPEG-4
Visual, it is possible to quantize the DCT coefficients in two ways:
MPEG-2 Video quantization – The first quantization method is derived from the quantization used in
MPEG-2 Video. This method takes into account the properties of the human visual system, allowing a
different quantization step for each transform coefficient by means of quantization matrices. The default
MPEG-4 Visual quantization matrices are shown in Figure A.20.
Default weighting matrix for intra-coded MBs:
 8 17 18 19 21 23 25 27
17 18 19 21 23 25 27 28
20 21 22 23 24 26 28 30
21 22 23 24 26 28 30 32
22 23 24 26 28 30 32 35
23 24 26 28 30 32 35 38
25 26 28 30 32 35 38 41
27 28 30 32 35 38 41 45

Default weighting matrix for inter-coded MBs:
16 17 18 19 20 21 22 23
17 18 19 20 21 22 23 24
18 19 20 21 22 23 24 25
19 20 21 22 23 24 26 27
20 21 22 23 25 26 27 28
21 22 23 24 26 27 28 30
22 23 24 26 27 28 30 31
23 24 25 27 28 30 31 33
Figure A.20 – Default MPEG-4 Visual quantization matrices [62].
H.263 quantization – The second quantization method is derived from the quantization used in H.263.
This method is less complex and easier to implement [62], but it only allows one step size value per
macroblock.
The selection of the quantization method to use is decided at the encoder side. This decision is then transmitted
to the decoder as side information. For intra-coded blocks, the DC coefficient is quantized using a fixed
quantization step size.
As mentioned before, MPEG-1 Video predicts the DC coefficient values from the DC coefficients of neighboring
blocks. For some of the DC and AC coefficients of neighboring blocks, there exist statistical dependencies,
i.e., the value in one block can be predicted from the corresponding value in one of the neighboring blocks. This
fact is exploited in MPEG-4 Visual by the so-called DC/AC prediction. It should be noted that this
prediction is only applied in the case of intra-coded macroblocks. The idea behind the DC/AC prediction tool is
presented in Figure A.21.
Figure A.21 – DC/AC prediction process for intra-coded macroblocks [62].
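A sketch of the adaptive DC predictor selection, based on the gradient rule used by MPEG-4 Visual (the function name is hypothetical; A denotes the block to the left of the current block, B the block above-left, and C the block above):

```python
def predict_dc(dc_a, dc_b, dc_c):
    """Sketch of the MPEG-4 Visual adaptive DC predictor for an intra
    block: A is the block to the left, B above-left, C above. The
    predictor direction follows the smaller local gradient."""
    if abs(dc_a - dc_b) < abs(dc_b - dc_c):
        return dc_c, "vertical"     # predict from the block above
    return dc_a, "horizontal"       # predict from the block to the left

pred, direction = predict_dc(dc_a=100, dc_b=102, dc_c=140)
# |100-102| = 2 < |102-140| = 38 → the horizontal gradient is small,
# so the content likely changes vertically: predict from C (above)
print(pred, direction)   # 140 vertical
```

Only the residual between the actual DC value and the selected predictor is then coded, which is where the bitrate saving comes from.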
For the scanning of the DCT coefficients, which corresponds to a 2D-to-1D conversion of the DCT coefficient
information, there are two additional scanning modes available, besides the traditional zigzag scanning used in
most standards; see Figure A.22.
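The traditional zigzag scanning can be sketched as follows (a generic illustration of the 2D-to-1D conversion, not the normative scan tables):

```python
import numpy as np

def zigzag_scan(block):
    """2D-to-1D zigzag conversion of an NxN coefficient block: the
    anti-diagonals are visited in order, alternating direction, so
    that low-frequency coefficients come first in the output vector."""
    n = block.shape[0]
    order = sorted(((i, j) for i in range(n) for j in range(n)),
                   key=lambda p: (p[0] + p[1],                      # diagonal index
                                  p[0] if (p[0] + p[1]) % 2 else -p[0]))
    return [int(block[i, j]) for i, j in order]

b = np.arange(16).reshape(4, 4)      # toy 4x4 "coefficient" block
print(zigzag_scan(b))
# → [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]
```

Since quantization tends to zero out the high-frequency (bottom-right) coefficients, this ordering produces the long trailing runs of zeros that run-length coding exploits.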
For boundary macroblocks, this standard also supports the usage of a special transform called Shape-Adaptive
DCT. Basically, the aim of this transform is to code only the opaque pixels within the boundary macroblocks
which are not completely filled with texture data [62].
Figure A.22 – Alternative MPEG-4 scanning modes for converting the 2D coefficients matrix into a 1D vector of
DCT coefficients [62].
A.7.4. Performance Evaluation
The main target of the MPEG-4 Visual standard was not additional compression efficiency; however, some of
the MPEG-4 Visual profiles, specifically those targeting frame-based video coding, provide some compression
efficiency benefits over previous standards due to the additionally included coding tools.
For higher bitrates (i.e. 5 Mbit/s to 15 Mbit/s), MPEG-2 Video is already a well performing standard and, thus,
for this range of bitrates, MPEG-4 Visual does not bring any significant improvement. However, as mentioned
above, for low and medium bitrates (i.e. up to 3 Mbit/s), MPEG-2 Video does not assure a good compression
performance and is even outperformed by MPEG-1 Video. Therefore, for both low and medium bitrates, MPEG-
4 Visual comes as an improvement, showing some compression performance superiority over MPEG-1
Video for any type of video sequence [62].
For very low bitrates (i.e. 50 kbit/s), H.263 still provides some coding gain over MPEG-4 Visual. For these
bitrates, MPEG-4 Visual does not use all of the available coding tools, in order to reduce the complexity and the
delay caused by their usage; this corresponds to the MPEG-4 Visual Simple Profile. However, for higher
bitrates (i.e. 1.5 Mbit/s), MPEG-4 Visual provides better compression performance than H.263, making use of
all the available coding tools (e.g. B-frames, quarter-pixel motion compensation, MPEG-2-style quantization,
global motion compensation); this is provided through the MPEG-4 Visual Advanced Simple Profile [63].
A.8. H.264/AVC Standard
The H.264/AVC standard (also known as MPEG-4 AVC, MPEG-4 Part 10 or ISO/IEC 14496-10) is a video
coding standard jointly developed by the ITU-T VCEG and ISO/IEC MPEG standardization groups; its first
version was finalized around 2003 [64].
A.8.1. Objectives
This standard has the main target of providing the same quality achieved by the previously available video coding
standards (e.g. MPEG-2 Video, H.263 and MPEG-4 Visual) at substantially lower bitrates, typically half the
bitrate or less, corresponding to around 50% bitrate savings.
Additionally, it was designed to provide enough flexibility to allow its deployment in a wide range of application
scenarios, considering low to high bitrates and low to high spatial resolutions.
A.8.2. Technical Approach and Architecture
The H.264/AVC standard no longer uses the object-based coding paradigm introduced in MPEG-4 Visual and
has returned to the usual frame-based video coding paradigm5. It only addresses rectangular objects/frames, as in
the video coding standards before MPEG-4 Visual. To achieve the proposed objective, H.264/AVC uses many
new coding tools capable of increasing the compression efficiency, typically at the cost of increased
encoding and decoding complexity. The main technical improvements introduced in H.264/AVC are:
Temporal redundancy tools
o Variable block size – Unlike other standards, H.264/AVC supports various block sizes for motion
estimation. For fast moving and changing areas, smaller blocks may be adopted to increase the
motion compensation accuracy. For slow moving and changing areas, larger blocks may be adopted
to save bits.
o Quarter-pixel motion estimation – This tool was already introduced in MPEG-4 Visual to
improve the motion vectors accuracy, thus increasing the compression efficiency.
o Multiple reference frames – With H.264/AVC, it is possible to adopt multiple reference frames
for a single MB (up to 31 frames), in the past or in the future. This is useful for situations where the
neighboring frames are not the most similar to the current frame.
o Generalized B-frames – Additionally, with H.264/AVC, B-frames can also serve as prediction references
for other B-frames, with or without motion compensation, removing the B-frame limitations in
terms of prediction referencing that existed since MPEG-1 Video.
Spatial redundancy and irrelevancy tools
o Transform and Quantization – There are some significant improvements in H.264/AVC
concerning the transform coding and quantization process which will be analyzed in detail in the
next section.
o Intra prediction – In contrast to the previously presented video compression standards, where the
spatial redundancy is only removed by means of transform coding, H.264/AVC can predict an intra-
coded MB using pixels from neighboring macroblocks within the same frame, afterwards applying
transform coding to the intra-prediction residual. Intra prediction
may be performed for 4×4 or 16×16 blocks. For 4×4 blocks, the intra prediction can be made in 9
different ways, depending on the correlation direction between neighboring blocks. For 16×16 blocks,
four intra coding modes are available; this intra prediction block size is typically useful for image
areas with smooth variations.
Statistical redundancy tools
o The H.264/AVC entropy coder includes two main alternatives with different complexities and
efficiencies:
Context-Adaptive Binary Arithmetic Coding (CABAC), which is more complex but provides
additional compression efficiency.
Context-Adaptive Variable-Length Coding (CAVLC), which is less complex but also less
efficient. This alternative also uses Exponential-Golomb (Exp-Golomb) coding; Exp-Golomb is a
common, simple and highly structured Variable Length Coding (VLC) technique.
Perceptual redundancy tools
o In-loop deblocking filtering – To reduce the negative subjective impact of the blocking artifacts,
and also improve the compression efficiency, H.264/AVC uses an in-loop deblocking filter. This
filter is applied to the vertical and horizontal edges of all 4×4 blocks in a macroblock.
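The Exp-Golomb codes mentioned above are simple to construct: the codeword for an unsigned value n is the binary representation of n+1, preceded by as many zeros as there are bits after its leading one. A minimal sketch:

```python
def exp_golomb_encode(n):
    """Unsigned Exp-Golomb codeword for n >= 0: write (n+1) in binary
    and prefix it with len-1 zeros, so small values get short codes."""
    binary = bin(n + 1)[2:]
    return "0" * (len(binary) - 1) + binary

for n in range(5):
    print(n, exp_golomb_encode(n))
# 0 → 1, 1 → 010, 2 → 011, 3 → 00100, 4 → 00101
```

The structure makes decoding trivial: count the leading zeros, read that many more bits after the one, and subtract 1, with no code table needed.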
The H.264/AVC encoder architecture is presented in Figure A.23.
5 To be more precise, H.264/AVC specifies additional codecs for rectangular objects in the context of the object-
based MPEG-4 representation framework.
Figure A.23 – Basic H.264/AVC encoder architecture [17].
A.8.3. Transform and Quantization
In H.264/AVC, several transforms are specified (see Figure A.24):
A 2-D Hadamard transform of size 4×4 for the luminance DC coefficients for 16×16 intra-coded
macroblocks.
A 2-D Hadamard transform of size 2×2 for the chrominance DC coefficients of any macroblock.
A 2-D Integer DCT (ICT) of size 4×4 for all the other blocks; this is considered to be the “core”
transform.
Figure A.24 – H.264/AVC transforms [17].
The reduction of the transform block size from 8×8 (the block size used in the previous video coding standards)
to 4×4 allows a more locally-adaptive representation of the input signal. With a smaller block size also available
for motion compensation, H.264/AVC obtains higher temporal prediction efficiency as well. The ICT is based on the DCT but
with some fundamental differences:
It is an integer transform, thus, all operations can be processed with integer arithmetic, without any loss
of accuracy.
Since the inverse transform is defined by the exact integer operations, inverse-transform mismatches
between encoders and decoders should not occur.
The core part of this transform only requires additions and shifts, being easier to implement than the
DCT.
A scaling multiplication is integrated into the quantizer, reducing the total number of multiplications.
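The "additions and shifts" claim can be illustrated with the well-known 1-D butterfly of the order-4 core transform (a sketch, not the reference implementation):

```python
def forward_ict_1d(x):
    """1-D H.264/AVC core transform (one row or column of the 4x4
    ICT) using additions and shifts only: y = Cf · x for
    Cf = [[1,1,1,1],[2,1,-1,-2],[1,-1,-1,1],[1,-2,2,-1]]."""
    s0, s1 = x[0] + x[3], x[1] + x[2]     # butterfly sums
    d0, d1 = x[0] - x[3], x[1] - x[2]     # butterfly differences
    return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]

def forward_ict_2d(block):
    """Separable 2-D core transform: rows first, then columns."""
    rows = [forward_ict_1d(r) for r in block]
    cols = [forward_ict_1d(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

print(forward_ict_1d([1, 2, 3, 4]))   # → [10, -7, 0, -1]
```

Since every operation is an integer addition, subtraction or shift, the result is bit-exact on any platform, which is precisely what eliminates the encoder/decoder transform mismatch of earlier DCT-based standards.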
The block diagram for the H.264/AVC transform and quantization processes is presented in Figure A.25.
Figure A.25 – (a) Forward transform and quantization. (b) Re-scaling and inverse transform.
The forward 2-D ICT is arranged into a core transform (Cf) and a scaling matrix (Sf) defined as

Cf = [ 1   1   1   1
       2   1  -1  -2
       1  -1  -1   1
       1  -2   2  -1 ]                                            (A.2)

Sf = [ a²    ab/2  a²    ab/2
       ab/2  b²/4  ab/2  b²/4
       a²    ab/2  a²    ab/2
       ab/2  b²/4  ab/2  b²/4 ],  with a = 1/2 and b = √(2/5)     (A.3)

In this way, the forward 2-D ICT is given by

Y = (Cf X Cf^T) ⊗ Sf                                              (A.4)

where Y is the ICT coefficients matrix, X corresponds to the input block samples and ⊗ denotes element-wise
multiplication.
Again, the inverse 2-D ICT is arranged into a core transform (Ci) and a scaling matrix (Si) defined as

Ci = [ 1    1    1    1
       1   1/2 -1/2  -1
       1   -1   -1    1
      1/2  -1    1  -1/2 ]                                        (A.5)

Si = [ a²  ab  a²  ab
       ab  b²  ab  b²
       a²  ab  a²  ab
       ab  b²  ab  b² ]                                           (A.6)

In this way, the inverse 2-D ICT is given by

Z = Ci^T (Y ⊗ Si) Ci                                              (A.7)

where Z is the reconstructed block matrix.
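Once the element-wise scaling by Sf and Si is included, the forward/inverse pair achieves perfect reconstruction. A small numerical check, using the commonly published H.264/AVC order-4 matrices and implementing ⊗ as element-wise multiplication:

```python
import numpy as np

# Core matrices and scaling factors of the H.264/AVC order-4 ICT
# (a = 1/2, b = sqrt(2/5)); the scaling matrices Sf and Si are the
# outer products of the per-row scale vectors below.
a, b = 0.5, np.sqrt(2.0 / 5.0)
Cf = np.array([[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]])
Ci = np.array([[1, 1, 1, 1], [1, .5, -.5, -1], [1, -1, -1, 1], [.5, -1, 1, -.5]])
Sf = np.outer([a, b / 2, a, b / 2], [a, b / 2, a, b / 2])
Si = np.outer([a, b, a, b], [a, b, a, b])

X = np.arange(16, dtype=float).reshape(4, 4)   # any 4x4 input block
Y = (Cf @ X @ Cf.T) * Sf                       # forward transform, Eq. (A.4)
Z = Ci.T @ (Y * Si) @ Ci                       # inverse transform, Eq. (A.7)
print(np.allclose(Z, X))                       # True: perfect reconstruction
```

In the actual codec the Sf and Si multiplications are folded into the quantization and re-scaling stages, so the transform core itself stays multiplication-free.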
Besides the transforms specified in the first version of H.264/AVC, a new 8×8 ICT was introduced in an extension
to the original standardization project, called the Fidelity Range Extensions (FRExt). These extensions were
developed to enable higher quality video coding. Several features were included in the FRExt project, such as an
adaptive switching between order-4 and order-8 ICTs, depending on the characteristics of the input samples.
Sometimes, a 4×4 block size can improve the temporal prediction but compromise the spatial compaction; on
the other hand, an 8×8 block size may achieve better spatial compaction while sacrificing the
temporal prediction.
In H.264/AVC, a quantization parameter (QP) is used to determine the quantization of the transform coefficients. This
parameter can take 52 values, which are related to the quantization step through a table. An increment of the
quantization parameter by 1 implies an increase of the quantization step by approximately 12% and a reduction
of the bitrate by approximately 12% as well [64]. The same quantization parameter is used for all the transform
coefficients in a macroblock.
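The QP-to-step mapping can be sketched as follows; the base step values for QP 0 to 5 are the commonly tabulated ones (an assumption here, not taken from this document), and the step doubles every 6 QP increments, i.e. each +1 multiplies it by 2^(1/6) ≈ 1.12, the ~12% growth mentioned above:

```python
def qstep(qp):
    """Approximate H.264/AVC quantization step for QP in [0, 51]:
    a table of six base steps, doubled for every 6 QP increments."""
    base = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]
    return base[qp % 6] * (2 ** (qp // 6))

print(qstep(4), qstep(10), qstep(28))   # 1.0 2.0 16.0
```

This logarithmic design covers a very wide fidelity range (steps from under 1 to over 200) with a compact 6-bit parameter.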
A.8.4. Performance Evaluation
H.264/AVC coding provides quality similar to MPEG-2 Video coding at approximately half the bitrate.
However, the advantage of H.264/AVC diminishes as the bitrate and spatial resolution increase, such that at
very high bitrates (above 18 Mbit/s) there is very little difference between MPEG-2 Video and H.264/AVC
[65]. This basically shows that both MPEG-2 Video and H.264/AVC have been optimized for lower resolutions
(notably H.264/AVC for CIF and ITU-R 601 resolutions) and do not perform very well for very high
resolutions, notably beyond HD.
In comparison to H.263, previously the most efficient video coding standard for low bitrates, H.264/AVC
achieves around 24% average compression gain [66]. With this performance, the H.264/AVC standard is
currently considered the state-of-the-art in video coding for a large range of applications, bitrates and
resolutions.
Appendix B
Recent Advances on Transform Coding
To improve the compression efficiency of predictive video coding solutions, notably for resolutions above high
definition, some new transforms have been introduced in the recent years. In this appendix, the most relevant
advances on transform coding are briefly reviewed. Each of the four solutions presented adopts a different
approach, although always using previously reviewed concepts.
B.1. Increasing the Transform Block Size
The first solution presented was proposed by Dong et al. [67] in 2009 and consists of two 2-D order-16 integer
transforms. These transforms are expected to be more efficient than the transforms already in use (particularly
the 2-D order-4 and order-8 transforms) in exploiting the spatial correlation present in HD video sequences.
B.1.1. Objectives
A statistical analysis of the correlation between adjacent prediction error blocks, which represent the difference
between the original and the motion compensated prediction blocks, reveals that, for higher definitions, the
spatial correlation of the prediction errors increases. To exploit this property, the authors of this solution propose
the usage of larger transform blocks. In particular, they propose 16×16 blocks, since previous
studies show that there is no significant improvement in using even larger blocks for HD video sequences
[68]. With 16×16 blocks, it is possible to better exploit the spatial correlation between neighboring samples, at
the cost of increasing the transform complexity and penalizing the entropy coding process (typically run-
length based), which becomes suboptimal due to the much larger dynamic range of the runs. The
proposed order-16 transforms are developed taking these advantages and drawbacks into consideration and are
not simple extensions of the already used ICT (Section A.8.3).
It is important to note that this solution still considers the order-4 and order-8 transforms as alternatives for more
detailed areas where the spatial correlation is less significant.
B.1.2. Architecture and Walkthrough
The transforms proposed in [67] have the same architecture as other transforms, more particularly as the ICT
transforms used in the H.264/AVC standard (see Section A.8.3). The details on the transforms are presented in
the next section.
B.1.3. Details on the Transform
The authors propose two new 2-D order-16 transforms, both integer and derived from the 2-D order-16 ICT.
The general transform matrix of an order-16 ICT, T16, is defined as [67]
(B.1)
This matrix has alternating even and odd symmetry across its rows. In this way, it can be defined by its
even part, T8e, and odd part, T8o [67]
(B.2)
(B.3)
The even part is an order-8 ICT, as the order-8 transform used in H.264/AVC, and its element set is given by
(B.4)
To maintain the orthogonality of the transform matrix, the element set of the odd part has to be represented with
large magnitudes, of at least 6 bits; however, this significantly increases the associated computational
complexity. To avoid this complexity, while keeping the goal of better exploiting the spatial redundancy, the
authors developed the following transforms:
2-D order-16 Non-orthogonal ICT (NICT) – To reduce its complexity, this transform uses values for
the element set of the odd part that do not guarantee the orthogonality of the transform. This is a trade-
off between complexity and performance, since a non-orthogonal transform does not have the best
energy compaction performance. However, the proposed NICT preserves all the other ICT properties,
such as bit-exact implementation and a rather low complexity. The element set of the even part is a
scaled version of the transform matrix of the order-8 ICT from H.264/AVC given by
(B.5)
However, a non-orthogonal transform does not achieve perfect reconstruction, and the reconstruction errors
can even be larger than the errors introduced by the quantization process. Thus, to define the element
set of the odd part, various solutions were analyzed to find the one with the best balance between
the approximation to the DCT performance and the magnitudes of the used values (related to the
computational complexity). With this in mind, the authors proposed the following solution:
(B.6)
This element set was selected from a group of sets tested in order to determine their DCT distortions
and the upper bounds of the average variance of the reconstruction error, as shown in Table
B.1.
Table B.1 – Performance comparison of various element sets [67].
2-D order-16 Modified ICT (MICT) – The second order-16 transform proposed is obtained by
modifying the structure of the order-16 ICT matrix, thus taking the name modified ICT. This
modification is performed using the principle of dyadic symmetry6. The even part of the transform
matrix remains unaltered, with the element set in Eq. (B.2), while the odd part is given by [67]
(B.7)
Since the MICT is based on the ICT, its basis vectors are inherently orthogonal no matter what the
element sets are. With this property, it is possible to select smaller magnitude elements without losing
the orthogonality. Thus, to select the best element set, it is important to obtain a trade-off between the
performance and the magnitude of the elements (related to the computational complexity). In this
solution, the authors selected the following element set for the odd part
6 A vector of 2^m elements [a0, a1, …, a(2^m − 1)] is said to have the Sth dyadic symmetry if aj = c·a(j⊕S), where ⊕ is the
'exclusive-OR', j lies in the range [0, 2^m − 1], S lies in the range [1, 2^m − 1] and c is a constant determining the type
of dyadic symmetry, i.e., if c = 1 then the symmetry is said to be 'even' and if c = −1 then the symmetry is said to
be 'odd'.
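This symmetry is easy to check programmatically. The following sketch (the function name is ours) verifies the definition above, with `^` as the bitwise exclusive-OR:

```python
def has_dyadic_symmetry(a, S, c):
    """Return True if vector a (length 2**m) has the Sth dyadic symmetry,
    i.e. a[j] == c * a[j ^ S] for every j, with ^ the bitwise exclusive-OR."""
    n = len(a)
    assert n & (n - 1) == 0 and 0 < S < n, "length must be a power of two"
    return all(a[j] == c * a[j ^ S] for j in range(n))

# For n = 4, S = 3 pairs index j with n-1-j, i.e. a simple mirror symmetry:
# [1, 2, 2, 1] has even (c = 1) symmetry; [1, 2, -2, -1] has odd (c = -1).
```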
(B.8)
To make this selection, the authors considered three conditions. First, the magnitudes should be
comparable to those of the even part set; second, the waveforms of the MICT basis vectors should
resemble those of the DCT; and third, the selected set should be suitable for a fast algorithm.
As referred above, the NICT inherits the ICT fast algorithm. However, for the MICT, the authors had to develop
a new fast algorithm, which is described in more detail in [67].
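Such fast algorithms rely on the even/odd decomposition common to these order-16 designs: even-indexed basis vectors act on sums of mirrored sample pairs, odd-indexed ones on their differences. A minimal numerical sketch follows, where generic matrices E and O stand in for the actual even and odd element sets, which are not reproduced here:

```python
import numpy as np

def order16_butterfly(x, E, O):
    """Even/odd butterfly for an order-16 transform built from an 8x8 even
    part E and an 8x8 odd part O: even-indexed outputs are the order-8
    transform of the mirrored-pair sums, odd-indexed outputs that of the
    mirrored-pair differences."""
    x = np.asarray(x, dtype=float)
    s = x[:8] + x[15:7:-1]   # x[i] + x[15-i], i = 0..7
    d = x[:8] - x[15:7:-1]   # x[i] - x[15-i], i = 0..7
    y = np.empty(16)
    y[0::2] = E @ s          # even part acts on the sums
    y[1::2] = O @ d          # odd part acts on the differences
    return y
```

This structure roughly halves the number of multiplications with respect to a direct 16×16 matrix-vector product.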
B.1.4. Performance Evaluation
To objectively evaluate the proposed order-16 integer transforms, they were integrated into the H.264/AVC
reference software, more specifically using the H.264/AVC High Profile. The tests were performed under the
conditions listed in Table B.2.
Table B.2 – Test conditions for the NICT and MICT [67].
Platform JM11
Sequence structure IBBPBBP…
Intra frame period 0.5 s
Entropy coding Arithmetic coding
Fast motion estimation On
Deblocking filter On
R-D optimization On
Quantization Parameter Fixed (20, 24, 28, 32)
Rate control Off
Reference frame 5
Search range ±32
Frame number 60
The experimental results shown in Table B.3 allow analyzing the performance gains of the two proposed
transforms in comparison with H.264/AVC when using both the 2-D order-4 and order-8 ICTs. The
improvements are measured in terms of PSNR gain for the same bitrate or in terms of bitrate saving for the same
quality (the same PSNR).
Table B.3 – Experimental results of the proposed NICT and MICT versus H.264/AVC [67].
For the NICT, the performance gain is, on average, more than 0.2 dB. For all tested sequences, the improvement
is larger than 0.1 dB and the maximum gain, up to 0.48 dB, is achieved for the sequence "Riverbed".
This sequence has smooth textures and global motion, so the prediction errors have low energy, resulting in low
amplitude coefficients when using an order-16 transform.
For the MICT, the gains are smaller than for the NICT, on average around 0.06 dB. This transform does not
outperform the NICT for any HD sequence, but in some cases it comes very close, e.g. Crew, Pedestrian,
RushHour and Sunflower.
For both cases, the usage of a 2-D order-4 ICT does not bring noticeable performance gains or losses. This is
confirmed in Figure B.1, where the percentage of macroblocks coded with each transform is shown for four
cases: (a) using order-8 ICT and order-16 NICT; (b) using order-8 ICT and order-16 MICT; (c) using order-4
and order-8 ICTs and order-16 NICT; (d) using order-4 and order-8 ICTs and order-16 MICT.
Figure B.1 – Proportion of different block size transforms for the HD sequences City, Crew, Station and
Sunflower [67].
As expected, the percentage of macroblocks using the 2-D order-4 ICT is very small for both the NICT (c) and
the MICT (d). Thus, the authors proposed a variable block size scheme that does not include an order-4
transform for HD video content. The importance of order-16 transforms for HD video coding is well shown in
Figure B.1 where, on average, more than half of the macroblocks are coded using this type of transform. For
some HD sequences (e.g. Sunflower and Station), order-16 transforms are used in up to 80% of the macroblocks.
Figure B.1 also shows that, as the bitrate increases (i.e., as the QP decreases), the order-16 transforms are used
less often. This is because at high bitrates more high frequency coefficients are transmitted, resulting in larger runs
for entropy encoding due to the large block size of the order-16 transforms. Thus, in these cases, the order-8 ICT is
more likely to be selected.
To evaluate the subjective improvements associated with the proposed transforms, the authors used two cropped
images (150×150 pixels) from two video sequences: City (720p) and Station (1080p). The tests were made with
a QP of 32, and the experimental results shown in Figure B.2 indicate for each image the number of bits and the
associated PSNR.
Figure B.2 – Images cropped from City and Station using (a) and (d) H.264/AVC, (b) and (e) H.264/AVC with
additional 2-D order-16 NICT and (c) and (f) H.264/AVC with additional 2-D order-16 MICT [67].
For low bitrates (QP = 32), the usage of the order-16 transforms allows the details to be better preserved, providing
better visual quality. This is noticeable in the vertical edges of the buildings in the sequence City and the
horizontal edges of the railway sleepers in the sequence Station. For higher bitrates, the quality achieved without
the order-16 transforms is already good enough; thus, their usage does not bring any noticeable improvements.
B.1.5. Summary
This solution proposes the usage of order-16 transforms to better exploit the spatial correlation in HD videos,
which tend to have more spatial redundancy than lower resolution videos. To this end, two order-16 integer
transforms are proposed: a non-orthogonal ICT and a modified ICT. Both allow a freer selection of the
transform matrix elements, with this selection made with a compression performance versus complexity
trade-off in mind.
The developed transforms have been integrated in the H.264/AVC standard, along with the order-4 and order-8
ICTs. This variable block size scheme is later reduced by removing the order-4 ICT, since it is shown not to be very
useful for HD video. The experimental results show that both 2-D order-16 integer transforms can improve the
current H.264/AVC coding efficiency, particularly for HD video coding.
B.2. Directional Discrete Cosine Transforms
The second novel transform solution to be reviewed in this section is based on one of the directional transform
approaches mentioned in Section 2.1.5 and was proposed by Zeng and Fu in 2008 [69]. This transform uses a
directional DCT to provide a better compression performance for image blocks containing directional edges.
B.2.1. Objectives
The main objective of this novel transform is to better exploit the spatial correlation within each block,
particularly when the block contains directional edges other than horizontal and vertical ones. Currently, the 2-D
DCT (or ICT in the H.264/AVC case) used in most image and video coding standards only exploits the
correlation along the vertical and horizontal directions, by performing two separable 1-D transforms.
This is useful since the human eye is highly sensitive to vertical and horizontal edges and many image blocks do
contain these types of edges. However, many images have other directional edges, whose spatial redundancy is
not fully exploited by the currently used non-directional transforms.
B.2.2. Architecture and Walkthrough
The transform introduced in this solution is performed in three steps illustrated in Figure B.3.
Figure B.3 – Transform architecture.
A short walkthrough of the transform is presented next:
1. 1-D Directional DCT – First, a 1-D DCT is performed along the direction of the detected edge; the
DCT coefficients are then arranged into a group of column vectors.
2. 1-D Horizontal DCT – Next, the second 1-D DCT is applied to each row; the resulting coefficients are
then pushed horizontally to the left in order to facilitate the next step.
3. Modified Zigzag Scan – Finally, the DCT coefficients are zigzag scanned to convert them into a 1-D
sequence to be used for run-length based VLC.
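The three steps can be sketched as follows for the diagonal down-left direction, with variable-length 1-D DCTs along the lines; the actual modes in [69] follow the H.264/AVC intra prediction geometry, and the names `dct1` and `directional_dct` are ours:

```python
import numpy as np

def dct1(x):
    """Orthonormal 1-D DCT-II of arbitrary length, needed because the
    directional lines of a block have different lengths."""
    L = len(x)
    n, k = np.arange(L), np.arange(L)[:, None]
    C = np.sqrt(2.0 / L) * np.cos(np.pi * (2 * n + 1) * k / (2 * L))
    C[0] /= np.sqrt(2.0)
    return C @ x

def directional_dct(block):
    N = block.shape[0]
    # Step 1: 1-D DCT along each diagonal down-left line of the block.
    cols = [dct1(np.diagonal(np.fliplr(block), offset=N - 1 - d))
            for d in range(2 * N - 1)]
    # Arrange the coefficients top-aligned and push each row to the left, so
    # that row r gathers all coefficients with index r (row 0: the DCs).
    rows = [np.array([c[r] for c in cols if len(c) > r]) for r in range(N)]
    # Step 2: 1-D horizontal DCT along each (variable-length) row.
    return [dct1(r) for r in rows]
```

The resulting N rows hold exactly N² coefficients, ready to be converted into a 1-D sequence by the modified zigzag scan.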
To better understand this process, consider an 8×8 image block with a vertical-right edge. The illustration of the
three steps above is shown in Figure B.4.
Figure B.4 – (a) 1-D Directional DCT along the vertical-right direction in the first step. (b) 1-D Horizontal DCT
in the second step. (c) Modified Zigzag Scan in the last step [69].
As referred in Section 2.1.5 and illustrated in Figure B.4 (b), the transform performed in the second step is
horizontal since the first row contains all DC coefficients and each of the other rows contains all AC coefficients
with the same index.
B.2.3. Details on the Transform
Taking advantage of the directional intra prediction modes used in H.264/AVC, the 1-D directional DCT
included in this solution uses six directional modes defined in a similar way. These modes are presented in
Figure B.5 for an 8×8 block. It must be noted that Mode 0 (vertical prediction) and Mode 1 (horizontal prediction)
are not defined, since these directions are already exploited by the non-directional 2-D DCT. Naturally, Mode 2
(the DC mode) is not used since it is not a directional mode.
Figure B.5 – Six directional modes similar to those used in H.264/AVC intra prediction for the 8×8 block size
[69].
To perform these six directional transforms, only two sets of basis functions are necessary. In this case, only the
basis functions for the Mode 3 DCT, which is a directional DCT performed along the direction defined by Mode 3,
and the Mode 5 DCT, which is a directional DCT performed along the direction defined by Mode 5, are defined,
besides the basis functions for the non-directional DCT (see Figure B.6). The basis functions for the other
prediction modes may be easily obtained by a symmetric transformation (flipping or transposing) applied to the
Mode 3 and Mode 5 basis functions: Mode 4 can be obtained by flipping Mode 3 either horizontally or vertically;
Mode 6 can be obtained by transposing Mode 5; and Mode 7/8 can be obtained by flipping Mode 5/6, either
horizontally or vertically.
Figure B.6 – Basis function images for the non-directional DCT (Mode 0/1), Mode 3 DCT and Mode 5 DCT for
an 8×8 block size [69].
A directional DCT (chosen from Modes 3-8) cannot be applied directly to the image blocks, because the result
would suffer from the so-called mean weighting defect. This defect is related to the different weighting factors
used in the various transforms applied to a block, which can produce more non-zero AC coefficients than needed.
To solve this problem, this solution proposes the utilization of a DC correction method which comprises two
steps:
1. DC separation
First, the mean value m of a block is computed and quantized like the DC component of the non-
directional 2-D DCT. Then, m is subtracted from the initial block samples. Next, the transforms are
performed, as illustrated in Figure B.3, and the resulting coefficients are pushed horizontally to the left.
2. ΔDC correction
In this step, the DC component is set to zero while all the other coefficients are quantized. Next, in the
inverse transform process, the first IDCT is applied to each row of the coefficient array. Then, a ΔDC
correction term is computed for each column as
(B.9)
taking into account the length of the kth column. The correction term is then subtracted from each
coefficient of the kth column. After the ΔDC correction, the second IDCT is performed on each column and the
results are placed back in the corresponding diagonal down-left line to generate a reconstructed N×N block.
Finally, the quantized mean value is added back to the reconstructed block.
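The DC separation step can be sketched as follows, with a hypothetical scalar quantizer of step `qstep` standing in for the actual DC quantization used in [69]:

```python
import numpy as np

def dc_separation(block, qstep):
    """Quantize the block mean like a DC coefficient, subtract it before the
    directional transforms and add it back after reconstruction."""
    m_q = qstep * np.round(block.mean() / qstep)  # quantized mean
    residual = block - m_q                        # goes through the transforms
    # ... directional transforms, quantization and inverse transforms here ...
    reconstructed = residual + m_q                # mean restored at the decoder
    return m_q, residual, reconstructed
```

Subtracting the quantized (rather than exact) mean ensures that the decoder, which only knows the quantized value, can restore it without drift.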
B.2.4. Performance Evaluation
To assess the performance of the proposed transforms, the authors selected four video sequences: Akiyo,
Foreman, Stefan and Mobile (all in CIF format). The first frames of these video sequences are shown in Figure
B.7. The video sequences were coded with H.263's quantization/VLC while fixing the block size at 8×8.
Figure B.7 – First frames of the selected video sequences [69].
The RD performance results (PSNR versus bit/pixel) comparing the use of a non-directional DCT (called here
Conventional DCT) and the proposed directional DCT are shown in Figure B.8.
Figure B.8 – RD performance for the first frames of Akiyo, Foreman, Mobile and Stefan [69].
The results in Figure B.8 show that only a very marginal RD performance gain has been achieved. This is due to
the fact that most blocks have selected prediction Mode 0 or 1 (for which there is zero gain). In this context, to
better show the contribution of Modes 3-8, the RD performance for these modes is isolated in the charts
presented in Figure B.9, meaning that only the blocks selecting Modes 3-8 are considered.
Figure B.9 – RD performance for the first frames of Akiyo, Foreman, Mobile and Stefan when only the blocks
selecting Modes 3-8 are considered [69].
By isolating the results for Modes 3-8, it is possible to observe a clear gain of the directional DCT over the
non-directional DCT. This gain is more noticeable for the Akiyo and Foreman cases, where it ranges from
about 0.5 dB at high bitrates to about 2 dB at low bitrates.
To analyze the results of the directional DCT for motion-compensated residual frames, the motion vectors
between frames 2 and 3 and between frames 50 and 51 of Foreman and Mobile are generated using a search
window of size ±7×±7. Then, as before, the directional and non-directional DCTs are applied. The experimental
results are presented in Figure B.10 and Figure B.11, with the latter considering only the blocks using Modes 3-8.
Figure B.10 – RD performance for the motion compensated residual frames of Foreman and Mobile [69].
Figure B.11 – RD performance for the motion compensated residual frames of Foreman and Mobile when only
the blocks selecting Modes 3-8 are considered [69].
From the observation of Figure B.10 and Figure B.11, it is clear that RD performance gains are also achieved for
all residual frames. Compared to intra coding, the coding gain becomes even more significant; thus, the
directional transform seems to be even more useful for inter coding.
B.2.5. Summary
This solution proposes a block-based directional DCT which takes into consideration the direction of the block
edges in a digital image. With this directional transform, it is possible to exploit the directional edges existing in
a particular block, beyond the horizontal and vertical directions. This is done using a 1-D transform applied in
the direction of the edge and a second 1-D transform applied in the horizontal direction. In this solution, the
novel directions used are based on the intra prediction modes of H.264/AVC. Experimental results show
that this transform can achieve relevant compression gains compared to non-directional transforms, especially for
images with significant directional information.
B.3. 3-D Spatial and Temporal Transform
The third solution to be reviewed in this section was proposed by Furht et al. in 2003 [70] and it involves a 3-D
transform like those introduced in Section 2.1.4.
B.3.1. Objectives
In a video sequence, besides the spatial correlation within each frame, there is also temporal correlation
between neighboring frames. To exploit this correlation, this novel transform adopts a 3-D DCT. However, for
video sequences with high motion, the performance of a 3-D transform may be highly degraded since there is not
much temporal correlation. To solve this problem, the authors proposed an adaptive cube-size 3-D DCT
technique that dynamically performs motion analysis to adapt accordingly the size of the video cube to
be transformed and compressed.
B.3.2. Architecture and Walkthrough
The architecture of the adaptive 3-D DCT encoder including the proposed 3-D transform is presented in Figure
B.12.
Figure B.12 – Architecture of the adaptive 3-D DCT encoder [70].
A short walkthrough of this architecture is presented next:
1. Motion analyzer – First, the video sequence is analyzed to determine the level of motion. To perform
this analysis, 16×16×8 video cubes are used where the third dimension is time. There are three levels of
motion specified: no motion, low motion and high motion.
2. Selection of the cube size – Based on the determined level of motion, the adequate cube size is
selected. For high motion video, the spatial size of the cubes is reduced to prevent the degradation of
the image quality. Naturally, this operation also leads to a lower compression rate for a target quality.
3. Forward 3-D DCT – Next, the 3-D DCT is applied to the selected video cube. This transform is
described with more detail in the next section.
4. Quantization – The coefficients are then quantized to exploit the visual irrelevancy. The quantization
step depends on the type of motion, i.e., for high motion cubes the quantization step is lower than for
low motion cubes.
5. Huffman encoding – Finally, the resulting quantized coefficients are entropy encoded using a lossless
variable-length Huffman coding algorithm.
For the 3-D DCT decoder, the encoder steps are performed in the reverse order, except for the motion analysis.
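The forward 3-D DCT in step 3 is separable; a minimal sketch, assuming an orthonormal 1-D DCT applied along each axis in turn (the exact normalization used in [70] is not reproduced here):

```python
import numpy as np

def dct_matrix(L):
    """Orthonormal 1-D DCT-II matrix of size L x L."""
    n, k = np.arange(L), np.arange(L)[:, None]
    C = np.sqrt(2.0 / L) * np.cos(np.pi * (2 * n + 1) * k / (2 * L))
    C[0] /= np.sqrt(2.0)
    return C

def dct3d(cube):
    """Separable 3-D DCT of an Nc x Nr x Nf video cube: a 1-D DCT is applied
    successively along the two spatial axes and the temporal axis."""
    out = cube.astype(float)
    for axis in range(3):
        C = dct_matrix(out.shape[axis])
        out = np.apply_along_axis(lambda v: C @ v, axis, out)
    return out
```

Separability keeps the cost at three passes of 1-D transforms instead of a full Nc·Nr·Nf-point transform.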
B.3.3. Details on the Transform
There are two main tools introduced in this 3-D transform based video coding solution, which are now presented
with more detail:
Forward 3-D DCT – As noted in Section 2.1.4, to perform a 3-D transform it is necessary to divide the
video data into 3-D video cubes. Considering that Nc×Nr is a block of pixels in a frame and Nf is the
number of successive frames, the video cube has size Nc×Nr×Nf. In this way, the forward 3-D DCT
used in this solution is defined as
(B.10)
where
(B.11)
Motion analysis and selection – As noted before, fixed 16×16×8 video cubes are used for the motion
analysis. To determine the level of motion for each 16×16×8 video cube, the Normalized Pixel
Difference (NPD) between the first and the eighth frame is computed as
NPD = (1/N) · Σi |X(i)1 − X(i)8|    (B.12)
where X(i)1 are pixels from the first frame, X(i)8 are pixels from the eighth frame and N is the total
number of pixels in a 16×16 block (N = 256). The motion levels are then defined as
no motion, if NPD < t1; low motion, if t1 ≤ NPD < t2; high motion, if NPD ≥ t2    (B.13)
where t1 = 5 and t2 = 25. The values of t1 and t2 were selected based on a set of extensive experiments.
The cube sizes used for each motion level are shown in Table B.4 and are explained next:
Table B.4 – Cube Size for each Motion Level.
Motion Level Cube Size
No motion 16×16×1
Low motion 16×16×8
High motion 8×8×8
o No motion – When no motion is detected by the motion analyzer, the 3-D DCT is
applied to a 16×16×1 cube. Basically, this means that a 2-D DCT is applied to the 16×16 block
in the first frame only, since the remaining blocks are very similar. In the decoding process, the
corresponding block in the other seven frames is reconstructed by replication from the
first frame.
o Low motion – If there is low motion detected, the cube size remains unchanged and the 3-D
DCT is applied to a 16×16×8 cube; this allows an improved compression ratio while
maintaining a high quality.
o High motion – When the motion analyzer detects high motion, the cube is subdivided into
8×8×8 cubes and the 3-D DCT is then applied. With this approach, it is possible to achieve a
better quality versus rate trade-off.
As noted in [70], another motion level could be included for cubes with even higher motion than defined by the
t2 threshold. With this additional level, these higher motion cubes could use a 4×4×8 3-D DCT.
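The motion analysis and cube-size selection can be sketched as follows; the NPD is taken here as the mean absolute difference between the two frames, consistent with the thresholds above, and all names are ours:

```python
import numpy as np

T1, T2 = 5, 25                        # thresholds from [70]
CUBE_SIZE = {"no motion": (16, 16, 1),
             "low motion": (16, 16, 8),
             "high motion": (8, 8, 8)}

def motion_level(cube):
    """Classify a 16x16x8 cube (height x width x time) from the NPD between
    its first and eighth frames."""
    npd = np.abs(cube[:, :, 0].astype(float) - cube[:, :, 7].astype(float)).mean()
    if npd < T1:
        return "no motion"
    return "low motion" if npd < T2 else "high motion"

def select_cube_size(cube):
    """Map the detected motion level to the cube size of Table B.4."""
    return CUBE_SIZE[motion_level(cube)]
```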
B.3.4. Performance Evaluation
To evaluate the performance of the proposed adaptive 3-D DCT, the novel transform solution was applied to
two video sequences: Security, which is a low motion video sequence, and Football, which is a high motion
video sequence. The performance was assessed using the compression ratio, the number of bits/pixel and the
PSNR. The quantization tables (QT) were created using the following formula:
(B.14)
where Q(i,j) are the so-called quantization coefficients and quality specifies the quality factor. The recommended
range for the quality factor is from 1 to 25, with 1 corresponding to the best quality.
The results achieved when applying the adaptive 3-D DCT to eight frames of the sequence Security are
presented in Table B.5 while Figure B.13 shows some example decoded frames.
Table B.5 – Adaptive 3-D DCT applied to Security sequence [70].
Figure B.13 – First frame for the Security sequence: (a) original, (b) quality=5, (c) quality=10 and (d)
quality=20 [70].
The experimental results show that the proposed adaptive 3-D DCT can provide better compression (i.e., a lower
number of bits/pixel) while maintaining a good video quality. For a quality factor of 20, the video quality suffers
from the large quantization steps, showing some visible artifacts.
Next, for the video sequence Football, the authors also assessed the performance of the non-adaptive 3-D DCT
using 8×8×8 video cubes (besides the adaptive 3-D DCT). This video sequence has 56 frames, and the motion
analyzer detected 1116 (40%) video cubes with high motion, 875 (32%) with low motion and 781 (28%) with no
motion. The results of these experiments are shown in Table B.6 and Table B.7.
Table B.6 – Adaptive 3-D DCT applied to Football sequence [70].
Table B.7 – Non-adaptive 3-D DCT applied to Football sequence [70].
In comparison to the non-adaptive 3-D DCT, the adaptive 3-D DCT shows a superior performance, achieving
higher compression ratios while maintaining a similar quality. In the last row of Table B.6, different quality
factors are used for different motion levels: for high motion cubes the quality factor is 5, while for low and no
motion cubes it is 10. This selective quantization approach results in a similar distortion/quality
compared to the usage of a fixed quality factor of 5 while the compression ratio is much higher.
B.3.5. Summary
The solution presented in this section uses an adaptive 3-D DCT technique for video compression. This means
that the size of the 3-D transform is variable, depending on the level of motion in each particular video sequence.
For low motion sequences, this approach can obtain compression ratios from 1:300 to 1:400 while still
maintaining a relatively good video quality. This may be useful for low motion applications, such as
videotelephony, videoconferencing, surveillance, etc. Even for higher motion video sequences, this solution can
achieve compression ratios in the range of 80-150 while providing a high quality. This may be useful for
applications such as digital TV and HDTV.
As referred during this review, this solution can be further improved with the addition of more motion levels and
the consequent extension of the set of video cube sizes, using smaller sizes for higher motion cubes. It can also be
improved by using adaptive quantization tables depending on the motion level.
B.4. Multi-Dimensional Spatial Transform
The last solution to be reviewed in this section was developed by Choi et al. in 2008 [71] and proposes a so-
called Multi-Dimensional Transform (MDT).
B.4.1. Objectives
The main objective of the new transform reviewed here is to better exploit the spatial redundancy between
neighboring blocks in a video sequence. This is done by means of a novel MDT tool which exploits the correlation
between neighboring blocks, besides the correlation within blocks. This can greatly improve the compression
performance in comparison to the current state-of-the-art in video coding, the H.264/AVC standard. As referred
in Section A.8, H.264/AVC uses a 4×4 ICT. This locally-adaptive approach is useful to provide high temporal
prediction efficiency; however, because of the small block size used, the spatial redundancy reduction is limited.
With the MDT proposed in this solution, the authors target further exploiting the spatial redundancy while
maintaining the H.264/AVC temporal redundancy reduction capacity.
B.4.2. Architecture and Walkthrough
The developed MDT may have three (3DT) or four (4DT) dimensions. There are two types of 3DT: horizontal
direction 3DT (H3DT) and vertical direction 3DT (V3DT). The H3DT is applied to 16×8 sub-macroblocks and
the V3DT is applied to 8×16 sub-macroblocks. The 3DT block diagram is presented in Figure B.14.
Figure B.14 – Block diagrams for (a) H3DT and (b) V3DT [71].
Next, a short walkthrough of the H3DT process is presented; for the V3DT, the walkthrough is similar with the
exception of the direction.
1. Block rearrangement – After each 4×4 block is transformed using a 2-D transform (like in
H.264/AVC), the resulting coefficients are grouped in sixteen 4×1 arrays including the coefficients in
the same position of each of the four blocks.
2. 1-D transform – Next, these arrays are transformed using a 1-D transform.
3. Block reconstruction – Finally, sixteen coefficients corresponding to the same position among sixteen
4×1 blocks are collected in a 4×4 block.
For each sub-macroblock, this process is performed twice. The 4DT is performed on 16×16 macroblocks. The
4DT block diagram is presented in Figure B.15.
Figure B.15 – Block diagrams for the 4DT [71].
A short walkthrough of the 4DT process is presented next:
1. Block rearrangement – After performing a 2-D transform over all sixteen 4×4 blocks (like in
H.264/AVC), the resulting coefficients corresponding to the same spatial frequency are arranged in
sixteen 4×4 blocks.
2. 2-D transform – Next, a 2-D transform is performed over each 4×4 transform coefficient block.
Thus, both 3DT and 4DT produce 4×4 coefficients for each coefficient position among the sixteen 4×4 blocks.
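The 4DT steps can be sketched as follows, with an orthonormal 4×4 DCT standing in for the H.264/AVC integer core transform and its post-scaling:

```python
import numpy as np

def four_dt(mb):
    """Sketch of the 4DT on a 16x16 macroblock: sixteen 4x4 2-D transforms,
    regrouping of same-frequency coefficients, then a second 2-D transform
    over each regrouped 4x4 block."""
    n, k = np.arange(4), np.arange(4)[:, None]
    C = np.sqrt(0.5) * np.cos(np.pi * (2 * n + 1) * k / 8)   # 4x4 DCT matrix
    C[0] /= np.sqrt(2.0)
    t = mb.astype(float).reshape(4, 4, 4, 4)  # axes: (block row, i, block col, j)
    # Step 1: 2-D transform of each of the sixteen 4x4 blocks.
    step1 = np.einsum('ui,vj,aibj->aubv', C, C, t)
    # Step 2: gather the sixteen coefficients sharing frequency (u, v) into a
    # 4x4 block indexed by block position (a, b) and transform it again.
    return np.einsum('pa,qb,aubv->uvpq', C, C, step1)        # out[u, v]: 4x4
```

Here `out[u, v]` holds the 4×4 second-stage coefficients for spatial frequency (u, v), matching the description above of 4×4 coefficients produced for each coefficient position among the sixteen blocks.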
B.4.3. Details on the Transform
The proposed MDT is an integer-based transform. To implement the integer calculation of the transform, the
MDT can be divided into a core-transform part and a post-scaling part. For the 3DT, the core-transform part
consists of an H.264/AVC 2-D ICT and an additional 1-D ICT. The post-scaling part is separated into a 2-D post-
scaling and a 1-D post-scaling (see Figure B.16).
Figure B.16 – Core-transform and Scaling parts for the 3DT [71].
To better understand the process by which the 3DT and the 4DT are obtained, the previously studied 2-D DCT
and ICT are defined again. The 4×4 forward DCT can be computed as
(B.15)
where
(B.16)
The 4×4 forward ICT in H.264/AVC can be computed as
Y = CXCT ⊗ Ef    (B.17)
where CXCT is the 2-D core-transform (the product of the core matrix C, the input block X and the transpose of
C), Ef is a matrix of scaling factors and the symbol ⊗ indicates that each element of CXCT is multiplied by the
scaling factor in the same position in matrix Ef. In this way, the 3DT can be represented by simply adding both
the 1-D core-transform and the 1-D post-scaling,
(B.18)
where W is the matrix resulting from CXCT. Considering RT the matrix computed after the 2-D and 1-D core-
transforms, the 3DT can be represented by
(B.19)
where the scaling process for the 3DT can be represented by
(B.20)
In this solution, the authors have also designed an MDT quantizer. Considering that Zij is the quantized value of a
3DT coefficient, it is defined by
(B.21)
where Qstep is the quantization step size. Considering now the output of the combined transform and quantization
module, it can be expressed as
(B.22)
where
(B.23)
(B.24)
(B.25)
For the 4DT, the derivation process corresponds to the simple expansion of the 3DT process.
B.4.4. Performance Evaluation
To assess the performance of the proposed MDT, four video sequences have been used: Foreman, Harbour,
Carphone and Container. All these sequences consist of 300 frames, at CIF resolution, encoded at 30 frames/s.
The MDT is integrated in H.264/AVC and its baseline profile is used to code each sequence; only the first frame
is intra coded. In terms of quantization, five different quantization parameters (QP) were used: 32, 35, 38, 42 and
45. These experiments were also made using the H.264/AVC transform and quantizer. The results of these
experiments are shown in Figure B.17.
Figure B.17 – RD performance of the MDT versus the H.264/AVC 4×4 transform [71].
Figure B.17 shows that the usage of the MDT proposed in this solution brings a clear performance gain in
comparison to the 4×4 ICT and the quantization process used in H.264/AVC. With the MDT, it is possible to
achieve a quality improvement of 1-2 dB for QP above 24.
B.4.5. Summary
In this solution, the authors propose an MDT with high energy compaction capabilities. The MDT considers three
modes which are used depending on the block size: an H3DT for 16×8 sub-macroblocks, a V3DT for 8×16
sub-macroblocks and a 4DT for 16×16 macroblocks. The experimental results show that this transform can provide
better compression efficiency than H.264/AVC, thus offering more quality for the same bitrate.
References
[1] Mobile phone image: http://larryfire.wordpress.com/2009/01/19/youtube-offering-ipod-ready-video-
downloads/.
[2] Personal computer image: http://www.bell.ca/shopping/PrsShpTv_TV_online.page.
[3] LCD TV image: http://www.123brackets.co.uk/blog/2008/11/29/disposing-of-an-old-or-broken-plasma-or-
lcd-tv/.
[4] Ultra-high definition TV image: http://www.hdtvinfo.eu/news/hdtv-articles/82-inch-ultra-hd-lcd-tv-from-
samsung.html.
[5] Compression artifact: http://en.wikipedia.org/wiki/Compression_artifact.
[6] A.N. Netravali and B.G. Haskell, Digital Pictures: Representation, Compression and Standards, 2nd ed.
New York: Plenum Press, 1995.
[7] R. Westwater and B. Furht, Real-Time Video Compression - Techniques and Algorithms. Norwell, United
States of America: Kluwer Academic Publishers, 1997.
[8] Principal component analysis: http://en.wikipedia.org/wiki/Principal_component_analysis.
[9] Temics: Aurélie Martin: http://www.irisa.fr/temics/staff/martin/.
[10] Fast Fourier transform: http://en.wikipedia.org/wiki/Fast_Fourier_transform.
[11] Discrete cosine transform: http://en.wikipedia.org/wiki/Discrete_cosine_transform.
[12] Fast Hadamard transform: http://en.wikipedia.org/wiki/Fast_Hadamard_transform.
[13] Discrete wavelet transform: http://en.wikipedia.org/wiki/Discrete_wavelet_transform.
[14] F. Pereira. Digital Image Compression:
http://www.img.lx.it.pt/~fp/cav/ano2009_2010/Slides%202010/CAV_5_Digital_Pictures_2010_Web.pdf.
[15] M. Biswas, M.R. Pickering, and M.R. Frater, "Improved H.264-Based Video Coding Using an Adaptive
Transform," in Proceedings of 2010 IEEE 17th International Conference on Image Processing, Hong
Kong, September 2010, pp. 165-168.
[16] P. Waldemar, S.O. Aase, and J.H. Husoy, "A Critique of SVD-based Image Coding Systems," in
Proceedings of the 1999 IEEE International Symposium on Circuits and Systems, Orlando, FL, USA, July
1999, pp. 13-16.
[17] F. Pereira. Advanced Multimedia Coding:
http://www.img.lx.it.pt/~fp/cav/ano2009_2010/Slides%202010/CAV_9_Advanced_Compression_2010_W
eb.pdf.
[18] M.R. Pickering, Optimum Basis Function Estimation for Inter-frame Prediction Errors, 2010, Internal
document.
[19] H.264/AVC Software Coordination: http://iphome.hhi.de/suehring/tml/.
[20] HEVC software coordination: http://hevc.kw.bbc.co.uk/trac/browser/tags/0.9.
[21] JCT-VC, "Draft call for proposal on High-Performance Video Coding (HVC)," in Doc. N11113, Kyoto, JP,
January 2010.
[22] JCT-VC, "Suggestion for a Test Model," in JCTVC-A033r1, 1st meeting, Dresden, Germany, April 2010.
[23] M. Naccari and F. Pereira, "Integrating a Spatial Just Noticeable Distortion Model in the Under
Development HEVC Codec," in International Conference on Acoustics, Speech and Signal Processing,
Prague, Czech Republic, May 2011.
[24] M. Naccari, Recent Advances on High Efficiency Video Coding (HEVC), 2010, Internal document.
[25] W.H. Chen, C. Smith, and S. Fralick, "A Fast Computational Algorithm for the Discrete Cosine
Transform," IEEE Transactions on Communications, vol. 25, no. 9, pp. 1004-1009, September 1977.
[26] MATLAB: http://www.mathworks.com/products/matlab/.
[27] Discrete cosine transform matrix: http://www.mathworks.com/help/toolbox/images/ref/dctmtx.html.
[28] Rotation matrix: http://en.wikipedia.org/wiki/Rotation_matrix.
[29] Sine function: http://www.mathworks.com/help/techdoc/ref/sin.html.
[30] Cosine function: http://www.mathworks.com/help/techdoc/ref/cos.html.
[31] Reshape function: http://www.mathworks.com/help/techdoc/ref/reshape.html.
[32] Eigenvalues and eigenvectors function: http://www.mathworks.com/help/techdoc/ref/eig.html.
[33] Quantization (signal processing): http://en.wikipedia.org/wiki/Quantization_(signal_processing).
[34] 4x4 Transform and Quantization in H.264/AVC: http://www.vcodex.com/h264transform4x4.html.
[35] F. Pereira. Digital Image Compression:
http://amalia.img.lx.it.pt/~fp/cav/ano2010_2011/Slides%202011/CAV_5_Digital_Pictures_2011_Web.pdf.
[36] LZ77 and LZ78: http://en.wikipedia.org/wiki/LZ77_and_LZ78.
[37] Data compression LZ77: http://jens.quicknote.de/comp/LZ77-JensMueller.pdf.
[38] JCT-VC, "Common Test Conditions and Software Reference Configurations," in JCTVC-B300, 2nd
meeting, Geneva, Switzerland, July 2010.
[39] Peak signal-to-noise ratio: http://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio.
[40] G. Bjontegaard, "Calculation of the Average PSNR Differences Between RD-curves," Doc. VCEG-M33, 13th VCEG Meeting, Austin, TX, USA, April 2001.
[41] G. Valenzise. Bjontegaard metric: http://home.dei.polimi.it/valenzise/software.htm.
[42] ITU-T, Recommendation T.81, "Digital Compression and Coding of Continuous-Tone Still Images," 1992.
[43] JPEG: http://en.wikipedia.org/wiki/JPEG.
[44] M.W. Marcellin, M.J. Gormish, A. Bilgin, and M.P. Boliek, "An Overview of JPEG-2000," in Proceedings of the IEEE Data Compression Conference, pp. 523-541, 2000.
[45] A.N. Skodras, C.A. Christopoulos, and T. Ebrahimi, "JPEG2000: The Upcoming Still Image Compression Standard," Pattern Recognition Letters (Elsevier), vol. 22, 2001.
[46] JPEG 2000: http://en.wikipedia.org/wiki/JPEG_2000.
[47] C. Christopoulos, A. Skodras, and T. Ebrahimi, "The JPEG2000 Still Image Coding System: An
Overview," IEEE Transactions on Consumer Electronics, vol. 46, no. 4, pp. 1103-1127, November 2000.
[48] A. Skodras, C. Christopoulos, and T. Ebrahimi, "The JPEG 2000 Still Image Compression Standard," IEEE Signal Processing Magazine, pp. 36-58, September 2001.
[49] M. Liou, "Overview of the p×64 kbit/s Video Coding Standard," Communications of the ACM, vol. 34, no. 4, pp. 59-63, April 1991.
[50] M. Handley, H.261 Video: http://www.cs.ucl.ac.uk/teaching/GZ05/08-h261.pdf.
[51] H.261 Video Coding: http://www-mobile.ecs.soton.ac.uk/peter/h261/h261.html.
[52] MPEG-1: http://en.wikipedia.org/wiki/MPEG-1.
[53] MPEG-1: http://www.cs.ucf.edu/courses/cap6411/MPEG-1.PDF.
[54] F. Pereira. Digital Video Storage:
http://www.img.lx.it.pt/~fp/cav/ano2009_2010/Slides%202010/CAV_7_AV_Storage_2010_Web.pdf.
[55] T. von Roden, "H.261 and MPEG1 - A Comparison," in Proceedings of the Fifteenth Annual IEEE International Phoenix Conference on Computers and Communications, pp. 65-71, March 1996.
[56] MPEG-2 Part 2: http://en.wikipedia.org/wiki/H.262/MPEG-2_Part_2.
[57] S. Liu, "Performance Comparison of MPEG1 and MPEG2 Video Compression Standards," in Proceedings of IEEE COMPCON, pp. 199-203, 1996.
[58] L. Maki, Video Compression Standards:
http://www.cctvone.com/pdf/FAQ/Video%20Compression%20Standards%20Journal.pdf.
[59] ITU-T, Recommendation H.263, "Video Coding for Low Bit Rate Communication," 1996.
[60] Brogent Technologies Inc.: http://www.brogent.com/brogentENG/eng/tech/video.htm.
[61] B. Girod, E. Steinbach, and N. Färber, "Comparison of the H.263 and H.261 Video Compression
Standards," in Standards and Common Interfaces for Video Information Systems, 1995.
[62] F. Pereira and T. Ebrahimi, Eds., The MPEG-4 Book.: Prentice Hall, 2002.
[63] K. Panusopone and A. Luthra, "Performance Comparison of MPEG-4 and H.263+ for Streaming Video
Applications," Circuits Systems Signal Processing, vol. 20, no. 3, pp. 293-309, 2001.
[64] T. Wiegand, G.J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC Video Coding
Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576,
July 2003.
[65] M.H. Pinson, S. Wolf, and G. Cermak, "HDTV Subjective Quality of H.264 vs. MPEG-2, with and without
Packet Loss," IEEE Transactions on Broadcasting, vol. 56, no. 1, pp. 86-91, March 2010.
[66] N. Kamaci and Y. Altunbasak, "Performance Comparison of the Emerging H.264 Video Coding Standard,"
IEEE International Conference on Multimedia and Expo (ICME), pp. 6-9, 2003.
[67] J. Dong, K.N. Ngan, C.K. Fong, and W.K. Cham, "2-D Order-16 Integer Transforms for HD Video
Coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 10, pp. 1462-
1474, October 2009.
[68] S. Ma and C.-C. Kuo, "High-definition Video Coding with Super-macroblocks," Proceedings SPIE Visual
Communications and Image Processing, vol. 6508, no. 16, pp. 1-12, January 2007.
[69] B. Zeng and J. Fu, "Directional Discrete Cosine Transforms - A New Framework for Image Coding," IEEE
Transactions on Circuits and Systems for Video Technology, vol. 18, no. 3, pp. 305-313, March 2008.
[70] B. Furht, K. Gustafson, H. Huang, and O. Marques, "An Adaptive Three-Dimensional DCT Compression
Based on Motion Analysis," Proceedings of the ACM Symposium on Applied Computing, pp. 765-768,
2003.
[71] W.J. Choi, S.Y. Jeon, C.B. Ahn, and S.J. Oh, "A Multi-Dimensional Transform for Future Video Coding," in The 23rd International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC), pp. 1601-1604, July 2008.