Image and Video Compression
A presentation to Avocent

Noel O’Connor, Andrew Kinane, Daniel Larkin
Centre for Digital Video Processing

19/09/2006




Overview

• Lossless Compression
  – Entropy coding: a brief review
    • Huffman Coding
    • Arithmetic Coding
  – Lossless Compression Standards
    • The FAX Group Standards, JBIG, Lossless JPEG
• Lossy Compression
  – Generic Codec Structure
    • DCT/IDCT
    • Quantization
    • Motion Estimation
    • Motion Compensation
  – Lossy Compression Standards
    • JPEG, JPEG2000, H.261 / H.263 / H.264, MPEG-1/-2/-4
• Image Analysis Techniques
  – Visual Feature Extraction




Lossless Compression

Entropy Coding




Entropy Coding

• Also referred to as source coding
• Assign each symbol a binary codeword
  – Allocate a specific string of bits to a symbol
• Based on information theory:
  – S = {s1 … sN} is the set of symbols to encode, with probabilities p1 … pN
  – Entropy H(S) is a measure of the information content, in bits/symbol:
    $$H(S) = -\sum_{i=1}^{N} p_i \log_2 p_i$$
  – Specifies a lower bound on the average code-word length of any lossless code
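
The entropy bound can be checked numerically. The following is a minimal sketch in Python; the function name `entropy` and the two example distributions are illustrative assumptions, not taken from the slides.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits/symbol; zero-probability symbols contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Four equally likely symbols: no code can average fewer than 2 bits/symbol.
uniform_bits = entropy([0.25, 0.25, 0.25, 0.25])

# A skewed source carries less information per symbol, so it compresses better.
skewed_bits = entropy([0.5, 0.25, 0.125, 0.125])

print(uniform_bits, skewed_bits)
```

The skewed source has probabilities that are all negative powers of 2, which is the case where a variable length code can reach the bound exactly.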




Huffman Coding

• A form of Variable Length Coding:
  – Assign shorter code-words to the symbols most likely to occur, longer ones to those less likely
• Problem: must choose code-words carefully!
  – Must obey the prefix condition so the decoder can parse the bitstream

[Figure: the sequence s1, s4, s3, s2 is encoded as the bitstream 1 0 1 0 0 1 1 0 1. The decoder should recover s1 s4 s3 s2, but without the prefix condition it cannot tell whether s1 is followed by s2 or s4.]




Huffman Coding

• Ensures instantaneously parseable code-words
• 100% efficient when p1 … pN are negative integer powers of 2 (0.5, 0.25, etc …)
• Algorithm: generate Huffman coding tree:
  – Form the tree:
    • Sort the symbols by their probabilities
    • Merge the two smallest probabilities by adding them and produce a new node in the tree
    • Repeat until only a single node is reached
  – Assign bits:
    • Traverse the tree from the root to the leaf nodes, assigning each branch encountered a one or a zero.
• Decoding is based on storing the code-words in a specially constructed look-up table (LUT)
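
The tree-building steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `huffman_code` and the heap-based bookkeeping are assumptions, and the 0/1 branch labelling is arbitrary. The probabilities are those of the grey-level example used in these slides.

```python
import heapq
import math

def huffman_code(probabilities):
    """Return {symbol: bit string} for a dict {symbol: probability}."""
    # Heap entries: (probability, unique tie-breaker, {symbol: partial code-word}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least probable nodes into a new tree node.
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"s1": 0.125, "s2": 0.484, "s3": 0.25, "s4": 0.125, "s5": 0.016}
codes = huffman_code(probs)

# Average coding rate (bits/symbol) compared with the entropy bound.
rate = sum(p * len(codes[s]) for s, p in probs.items())
bound = -sum(p * math.log2(p) for p in probs.values())
print(codes, rate, bound, bound / rate)
```

For these probabilities the average rate works out to 1.923 bits/symbol against an entropy of roughly 1.85 bits/symbol, so the code is close to, but not exactly at, the bound.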




Huffman Coding

• Generate code-words for each grey level

• S = {s1 s2 s3 s4 s5} = {0,4,5,6,7}

• p1 p2 p3 p4 p5 = 0.125, 0.484, 0.25, 0.125, 0.016








Huffman Coding

• Efficiency:
  – Calculate the Average Coding Rate R
    • Sum over all symbols of symbol probability (pi) x code-word length (li):
      $$R = \sum_{i=1}^{N} p_i l_i$$
  – Compare to the entropy: efficiency = H(S) / R




Huffman Coding

• Problems:
  – Lower bound of 1 bit/symbol
  – Does not facilitate adaptive coding
• Example




Arithmetic Coding

• Treat groups of symbols … but maintain a symbol-by-symbol encoding mechanism

• Assign a single codeword to a group of symbols

• Codeword represents a half-open interval on [0.0, 1.0)

• By assigning enough precision bits, one interval can be distinguished from another

• Symbols with higher probabilities correspond to larger intervals, thereby requiring fewer precision bits




Arithmetic Coding

• S = {a, b}, with p1, p2 = 1/3, 2/3
• First symbol narrows the interval [0.0, 1.0) to that symbol’s range:
  – Subsequent symbols further restrict the current interval.
• Decoding reverses this:
  – Receives a number in [0.0, 1.0)
  – Checks which symbol’s range contains this number & decodes that symbol
  – Since the lower & upper bounds of the symbol are known, their effects on the encoded number can be reversed
  – This gives a new number …
  – REPEAT

[Figure: the interval from 0.0 to 1.0 subdivided in proportion to the symbol probabilities]
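
The interval narrowing and its reversal can be sketched with exact fractions. This is a hedged illustration for the two-symbol alphabet above, not a practical coder: the names `RANGES`, `encode` and `decode` are hypothetical, the message length is assumed to be known to the decoder, and unbounded-precision `Fraction` values stand in for the incremental integer arithmetic a real implementation would use.

```python
from fractions import Fraction

# Each symbol owns a half-open sub-range of [0, 1) sized by its probability.
RANGES = {"a": (Fraction(0), Fraction(1, 3)),
          "b": (Fraction(1, 3), Fraction(1))}

def encode(message):
    """Narrow [0, 1) once per symbol; any number in the final interval codes the message."""
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        width = high - low
        sym_low, sym_high = RANGES[symbol]
        low, high = low + width * sym_low, low + width * sym_high
    return low, high

def decode(value, length):
    """Find the range containing the value, emit its symbol, then undo that symbol's effect."""
    out = []
    for _ in range(length):
        for symbol, (sym_low, sym_high) in RANGES.items():
            if sym_low <= value < sym_high:
                out.append(symbol)
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return "".join(out)

low, high = encode("abb")
decoded = decode((low + high) / 2, 3)
print(low, high, decoded)
```

Here "abb" maps to the interval [5/27, 1/3), and any value inside it, such as the midpoint, decodes back to the same three symbols.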

Arithmetic Coding

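• Incremental transmission
• Example: message “BILL<space>GATES”

[Figure: the leading digits of the code value are transmitted one at a time as the interval narrows (2, 25, 257, 2572, …, 2572167)]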




Arithmetic Coding

• Can be performed very efficiently using 16/32 bit integer mathematics
• Bits are transmitted as they become available
• Simplification: use the value 0.999… rather than 1.0
  – In binary arithmetic this corresponds to 0.111…
  – Only use the fractional part => only need integers
  – High initially stores 0xFFFF, whilst Low stores 0x0000
• For each symbol encoded, examine the most significant bit of both High and Low:
  – If these bits are the same, output that bit




Lossless Compression

Standards




ITU-T Facsimile

• ITU-T Rec. T.4 (Group 3)
• Targets scanned business documents:
  – Binary images: white (1), black (0)
• Two modes:
  – Modified Huffman (MH):
    • Run-length encoding is used to form runs of 1s and 0s for each line in the image;
    • Huffman coding applied to these (run, symbol) pairs;
    • Different Huffman codes for runs of 1s and 0s;
    • A special end-of-line (EOL) symbol is encoded for error detection purposes.
  – Modified Read (MR):
    • Pixel values from the previous line used as predictors for current pixels to be encoded;
    • Prediction residual is then encoded using Huffman coding.
  – MR mode is periodically interspersed with MH mode.
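
The run-length step of MH mode can be illustrated as follows. This sketch only forms the (run, symbol) pairs for one scan line; the Huffman tables and EOL codes of the actual standard are not reproduced, and the function names and sample line are made up for illustration.

```python
from itertools import groupby

def run_lengths(line):
    """Turn one scan line of 0/1 pixels into (run, symbol) pairs."""
    return [(len(list(group)), symbol) for symbol, group in groupby(line)]

def expand(pairs):
    """Inverse operation: rebuild the scan line from its (run, symbol) pairs."""
    return [symbol for run, symbol in pairs for _ in range(run)]

line = [1, 1, 1, 1, 0, 0, 1, 1, 1, 0]
pairs = run_lengths(line)
print(pairs)
```

Long white or black runs, typical of scanned text, collapse into a handful of pairs, which is what makes the subsequent Huffman stage effective.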




JBIG

• Joint Bi-level Image Experts Group (JBIG) standard, developed jointly by ITU-T and ISO
• Targets bi-level images:
  – may be either business documents or grey-scale images of natural scenes rendered as bi-level images.
• Uses adaptive arithmetic encoding:
  – Modeling step estimates probability of next symbol based on a context consisting of local pixels;
  – Probability is then used to drive the arithmetic encoder;
  – JBIG can be applied to grey-scale images by treating each bit plane of the grey-level image as a bi-level image.




Lossless JPEG

• Joint Photographic Experts Group (JPEG) has a lossless image compression mode.
• Prediction for pixel to be encoded based on a context of previously encoded pixels:
  – Different ways for forming the prediction;
  – Method used encoded as side-information for each scan line.
• To encode the prediction residual:
  – (length, magnitude) pair formed;
  – length indicates the number of bits used to encode the magnitude:
    • A static Huffman code is used.
  – magnitude is the actual residual value directly encoded.




Lossless JPEG

• Pixel to be encoded: p = 190
• Previously encoded pixels: p1 = 184, p2 = 176
• Prediction: P = (p1 + p2) / 2 = 180
• Residual: R = P - p = 180 - 190 = -10
• Encoded as the event (4, 0101)
  – Negative residuals encoded as 1s complement (10 = 1010, so -10 => 0101)
  – Huffman code for 4 is 001, which gives the final codeword “0010101”
• Decoder:
  – Calculates the prediction value (180)
  – Parses the Huffman code, which gives the length (4) and so allows decoding of the magnitude (0101)
  – Detects a leading zero => knows the value must be negative, so the four magnitude bits are decoded as -10.
  – Reconstruction: p = P - R = 180 - (-10) = 190
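
The (length, magnitude) mapping in this example can be sketched as below. The function names are hypothetical, the sign convention R = P - p follows the slide, and only the single Huffman code-word quoted on the slide (001 for length 4) is used; the rest of the Huffman table is not reproduced.

```python
def encode_residual(r):
    """Return (length, magnitude bits); negative values use the 1s complement."""
    if r == 0:
        return 0, ""
    length = abs(r).bit_length()
    mask = (1 << length) - 1
    value = r if r > 0 else (~(-r)) & mask
    return length, format(value, "0{}b".format(length))

def decode_residual(length, bits):
    """A leading 1 means positive; a leading 0 means a 1s complement negative."""
    if length == 0:
        return 0
    value = int(bits, 2)
    if bits[0] == "1":
        return value
    return -((~value) & ((1 << length) - 1))

length, bits = encode_residual(180 - 190)   # R = P - p
codeword = "001" + bits                      # Huffman code for length 4, from the slide
reconstructed = 180 - decode_residual(length, bits)
print(length, bits, codeword, reconstructed)
```

The leading-bit test works because a positive magnitude of a given length always starts with 1, so its 1s complement always starts with 0.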




Lossy Compression

Generic structure of a video codec




Redundancy in Video Sequences

• Video compression targets 3 kinds of redundancy:
  – Spatial: the correlation that exists between (groups of) pixels;
  – Temporal: similarity between video frames;
  – Perceptual: Human Visual System (HVS) is less sensitive to high-frequency information.
• Lossy compression throws information away as part of these processes
• Remaining information is encoded losslessly using entropy coding




Redundancy in Video Sequences

• Spatial redundancy:
  – Transform data to be encoded into a new representation where data is less correlated;
  – Leads to a more compact representation.
• Temporal redundancy:
  – Only encode difference between 2 video frames (lower entropy);
  – Form prediction of frame to be encoded and encode prediction residual;
• Perceptual redundancy:
  – Suppress/remove high frequency components corresponding to fine image detail.




Coding Modes

• INTRA:
  – Encode a frame completely independently (i.e. with no reference to previous/future frames);
  – Forms random access point in bitstream, resets encoding, limits error propagation;
  – Equivalent to having a JPEG-encoded still image at periodic intervals in bitstream.

[Figure: INTRA frames, starting at Frame 0, inserted every N frames]




Coding Modes

• INTER:
  – Use a previous/future frame (termed reference frame) as the basis for a prediction of the current frame;
  – Could just simply subtract reference frame from current frame;
  – Or use a more sophisticated prediction method;
  – Need to use reconstructed frame as basis for prediction so that encoder/decoder stay synchronised.

[Figure: prediction from reference Frame 0]




Coding Unit

• Break image/frame up into 16 x 16 “macro-blocks”:
• For YUV (4:2:0 sampling):
  – 4 8x8 luminance pixel blocks;
  – 2 8x8 chrominance pixel blocks.
• Coding decisions made on macro-block basis:
  – INTRA/INTER coding mode;
  – prediction method if INTER;
  – Loss introduced.
• Decisions flagged in bitstream syntax.




Generic Codec Structure




Discrete Cosine Transform (DCT)

• Why DCT?
• What is it?
• How does it work?
• How is it computed (in reality)?
• Adoption and variations
• What about the DWT?
• Quantisation




Why DCT?

• Neighbouring pixels are likely to be similar
• The same is true for prediction residual data
• Want to exploit this spatial correlation
• We want a transform that:
  – Removes correlation from data
  – Packs signal energy into as few coefficients as possible
• Coefficients suitable for entropy coding




Why DCT?

• Optimal solution
  – Use eigenvectors of the covariance matrix of the input pixel data
  – Order based on size of eigenvalue
  – Based on theory of principal component analysis (PCA)
  – Referred to as the Karhunen-Loeve Transform (KLT) [rao90]
    • Achieves complete de-correlation
    • Packs most energy into fewest coefficients
    • Minimises MSE for a given number of coefficients (Quantisation)
    • Minimises the entropy
  – Disadvantages:
    • Very computationally demanding
    • Transform kernel is data dependent
    • Kernel must be sent to decoder also!
    • Not practical in a real compression system
• Compromise => The DCT




What is the DCT?

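• Treat frame as a grid of 8x8 pixel blocks
  – Pixel data (intra block)
  – Prediction Residual (inter block)
• Compute 8x8 2D DCT on each block
• Formula:
  $$F(u,v) = \frac{1}{4} C_u C_v \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y) \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16}$$
  $$C_u, C_v = \begin{cases} \frac{1}{\sqrt{2}} & \text{for } u, v = 0 \\ 1 & \text{otherwise} \end{cases}$$
• Basis functions derived using Fourier theory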




What is the DCT?

• Fourier’s theorem and the Nyquist sampling criterion mean only certain discrete frequencies can be present in an 8x8 block of sampled data.
• DCT coefficients tell us “how much” of a particular frequency is present in a particular block
  – Very crude explanation!
• Inverse DCT (IDCT) reverses this process
  – Essentially Fourier synthesis
  $$f(x,y) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C_u C_v F(u,v) \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16}$$
  $$C_u, C_v = \begin{cases} \frac{1}{\sqrt{2}} & \text{for } u, v = 0 \\ 1 & \text{otherwise} \end{cases}$$
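
The 8x8 DCT/IDCT pair can be evaluated directly from its definition. The sketch below is a deliberately slow, literal Python transcription meant only to show that the two transforms invert each other; the function names and the test block are illustrative assumptions, and real codecs use the fast algorithms discussed later.

```python
import math

N = 8

def c(k):
    """Normalisation factor: 1/sqrt(2) for the zero-frequency term, 1 otherwise."""
    return 1 / math.sqrt(2) if k == 0 else 1.0

def cos_term(s, k):
    return math.cos((2 * s + 1) * k * math.pi / 16)

def dct2(f):
    """Forward 8x8 DCT: f[x][y] (samples) -> F[u][v] (coefficients)."""
    return [[0.25 * c(u) * c(v) * sum(f[x][y] * cos_term(x, u) * cos_term(y, v)
                                       for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]

def idct2(F):
    """Inverse 8x8 DCT: F[u][v] (coefficients) -> f[x][y] (samples)."""
    return [[0.25 * sum(c(u) * c(v) * F[u][v] * cos_term(x, u) * cos_term(y, v)
                        for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]

# A flat block has all of its energy in the single DC coefficient F[0][0].
flat = [[100] * N for _ in range(N)]
flat_coeffs = dct2(flat)

# An arbitrary block survives the forward/inverse round trip.
block = [[(x * 8 + y * 3) % 256 for y in range(N)] for x in range(N)]
restored = idct2(dct2(block))
print(flat_coeffs[0][0])
```

The flat block illustrates the energy compaction argument: 64 identical samples become one non-zero coefficient and 63 zeros.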




How does the DCT work?

• DCT does not compress anything in isolation!
• Compression is achieved by the quantiser and entropy coding
• DCT output easier to compress though
• Most natural video dominated by low frequencies
• Human eye less sensitive to high frequencies
  – Use a quantiser whose step size depends on frequency
  – Effectively discard perceptually unimportant data
  – After quantisation there will be many zero valued coeffs
    • Typically only 5 or 6 non-zero valued coeffs [xanthopoulos99]
    • Suitable for run length and entropy coding
• Zig-zag scan
  – Keep statistically related coeffs together
  – Better run-length coding
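
A zig-zag scan followed by run-length coding can be sketched as follows. The helper names are hypothetical, the sample block is invented, and the DC coefficient is treated like any other coefficient for simplicity, which real codecs usually do not do.

```python
def zigzag_order(n=8):
    """(row, col) positions of an n x n block in zig-zag scan order."""
    def key(pos):
        r, c = pos
        d = r + c
        # Odd anti-diagonals are walked with the row increasing, even ones with it decreasing.
        return (d, r if d % 2 else -r)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)

def run_level_pairs(block):
    """Scan a quantised block in zig-zag order into (zero_run, level) pairs."""
    pairs, run = [], 0
    for r, c in zigzag_order(len(block)):
        level = block[r][c]
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))
            run = 0
    return pairs  # trailing zeros would be signalled by an end-of-block marker

order = zigzag_order()

# A typical quantised block: a few low-frequency values, zeros elsewhere.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[2][0] = 12, -3, 2
pairs = run_level_pairs(block)
print(order[:6], pairs)
```

Because the scan visits low frequencies first, the non-zero values cluster at the start and the long tail of zeros costs almost nothing to code.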




How is the DCT Computed?

• Most implementations exploit the fact that the 2D DCT is separable
  – Compute 1D DCT on each column
  – Compute 1D DCT on each resultant row
  – 16 x 1D 8-point DCTs in total
• Need efficient implementation of 1D 8-point DCT
  – 30 years of research in this field
  – Basic implementation: 64 multiplications, 56 additions (64* 56+)
  – Fast implementation [loeffler89] (11* 29+)
  – Video codec optimised implementation “AAN” [arai89] (5* 29+)
  – Arithmetic precision a vital decision
• If constraint is 1920x1080 @ 30Hz
  – 32,400 8x8 blocks per frame => 972,000 8x8 blocks per second (luminance only)
  – Need at least (171x10^6* 451x10^6+) per second using Loeffler!




How is the DCT Computed?

• Sometimes dedicated hardware needed
  – Performance and/or power reasons
• Hardware architecture taxonomy

[Figure: taxonomy of DCT implementations. Node labels: Software, Hardware, Fast Algorithm, Distributed Arithmetic, ROM Based, Adder Based, Systolic Array, Recursive, CORDIC Based, Approx, Integer Encoding]




Adoption and Variations

• 8x8 DCT
  – Used in JPEG, H.261, H.263, MPEG-1, MPEG-2, MPEG-4 with specific quality requirements
• Shape Adaptive DCT
  – Used in MPEG-4 Advanced Coding Efficiency (ACE) profile
  – Kernel basis functions determined by object shape
• Integer DCT Approximation
  – Used in H.264
  – Block size of 4x4 and 8x8 depending on mode
  – Avoids the “IDCT mismatch” problem
  – Less computationally demanding (16bit integer arith)
  – More features (can discuss later if necessary)




What about the DWT

• Discrete Wavelet Transform (DWT)
• Used by JPEG-2000
• MPEG-4 uses SA-DWT (for static shape textures)
• Why? “Better than Fourier analysis for non-stationary data”
• Inherently scalable
  – Involves successive LPF and HPF of data and subsampling
• More efficient at very low bit rates
  – DCT and coarse Q => Blocking artefacts
  – DWT and coarse Q => Blurring/smearing (much less perceptible)
• More computationally demanding than DCT




What is Quantisation?

• A lossy process
• Get rid of information
  – Gives compression gain
  – Try to minimise distortion
  – Try to reduce entropy
• Two primary types
  – Scalar quantiser (one to one)
  – Vector quantiser (many to one)




Scalar Quantiser

• Need to find optimal values for
  – Decision levels d_i
  – Reconstruction levels r_i
• Difficult in general!




Scalar Quantiser

• Aim to minimise distortion
  – Minimise MSE => Lloyd-Max quantiser
• A good quantiser design depends on probability distribution of the input data
  – Want less error for more probable inputs
• Case 1: Uniform distribution
  – Decision bands all same width: $$d_i = d_{i-1} + \Delta$$
  – Reconstruction levels equally spaced: $$r_i = d_i + \frac{\Delta}{2}$$
  – Referred to as a “linear quantiser”
  – Used frequently for simplicity




Scalar Quantiser

• Case 2: Piecewise constant distribution
  – Used when # of decision levels N is large
  – Decision level solution difficult (Use numerical methods for Lagrange multipliers)
  – Reconstruction levels: $$r_i = \frac{d_i + d_{i+1}}{2}$$




Scalar Quantiser

• Case 3: Nonuniform distribution
  – Need numerical methods for d_i and r_i
  – Tables available for standard distributions (Gaussian, Laplacian, Rayleigh,…) for popular N
  – This is a true Lloyd-Max quantiser (or optimum mean square quantiser)
• Case 4: Uniform quantiser
  – Uniform refers to equal spacing between decision levels regardless of distribution
  – Similar structure to ‘Case 1’ but different performance because distribution not uniform
  – Commonly used (e.g. in JPEG,…)




Scalar Quantiser Performance

• MSE correlates well with subjective degradation
• Don’t rely on MSE minimisation in isolation though
• Need to consider overall rate-distortion
  – Measures MSE as a function of number of bits n:
    $$f(n) = a \, 2^{-bn}$$
  – Constants a and b depend on distribution
  – When designing a quantiser for each DCT coefficient i need to know n_i
  – 64 quantisers:
    $$f(n_i) = a \, 2^{-b n_i}, \quad 0 \le i \le 63$$
• How to determine n_i (number of bits per coefficient)?
  – Depends on variance of coefficient i relative to others and specified average bitrate n_av
  – Bit allocation algorithm paradigm




Bit allocation algorithms

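• Try to keep the distortion of each coefficient constant:
  $$D_i(n_i) = \sigma_i^2 f(n_i)$$
• As variance increases, distortion is decreased by using more bits
• Optimal allocation for N coefficients:
  $$n_i = n_{av} + \frac{1}{b} \log_2 \frac{\sigma_i^2}{\left( \prod_{j=0}^{N-1} \sigma_j^2 \right)^{1/N}}$$
• Often a rate controller after entropy encoder with feedback path to quantiser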
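
The allocation rule can be tried on a few variances. The sketch below assumes the form n_i = n_av + (1/b) log2(variance_i / geometric mean of the variances) and, purely for illustration, b = 2; the function name and the four variances are invented. It returns fractional bit counts, whereas a practical allocator would also have to round them and clamp negative values.

```python
import math

def allocate_bits(variances, n_av, b=2.0):
    """Bits per coefficient: the average, plus a bonus for above-average variance."""
    n = len(variances)
    # log2 of the geometric mean is the arithmetic mean of the log2 variances.
    log_geo_mean = sum(math.log2(v) for v in variances) / n
    return [n_av + (math.log2(v) - log_geo_mean) / b for v in variances]

variances = [256.0, 64.0, 16.0, 4.0]
bits = allocate_bits(variances, n_av=3)
print(bits, sum(bits) / len(bits))
```

High-variance (typically low-frequency) coefficients receive more bits, while the mean allocation stays at the requested average.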




Scalar Quantiser Summary

• Uniform quantiser most commonly used
• In fact, rather than transmitting a quantised coefficient, usually transmit the quantisation index
• This has much lower entropy
  $$I(u,v) = \mathrm{round}\left( \frac{F(u,v)}{\Delta(u,v)} \right)$$
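
Transmitting an index rather than a value can be sketched as below. The rounding rule (round to nearest), the function names, the coefficient values and the step sizes are all illustrative assumptions rather than values taken from the slides or from any standard.

```python
import math

def quantise(coefficient, step):
    """Quantisation index: the coefficient divided by the step size, rounded to nearest."""
    return math.floor(coefficient / step + 0.5)

def dequantise(index, step):
    """Decoder side: scale the index back up to a reconstruction level."""
    return index * step

coefficients = [620, -37, 12, 4, -3, 1]
steps = [16, 11, 10, 16, 24, 40]   # coarser steps for higher frequencies

indices = [quantise(f, q) for f, q in zip(coefficients, steps)]
reconstructed = [dequantise(i, q) for i, q in zip(indices, steps)]
print(indices, reconstructed)
```

The indices are small integers with many zeros, which is why they have much lower entropy than the coefficients themselves; the difference between `coefficients` and `reconstructed` is the loss introduced.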




Vector Quantiser

• Quantise blocks of samples together
  – Each block assigned a single code
• A code book used to find code for block
• Code book can be dynamic or pre-defined
• Each pattern has specific encoding
• Can give very good performance
• Quite computationally expensive
• Difficult to design tables
• Used by GIF standard




Demo

Compression gain

Perceptual quality




Motion Estimation & Compensation

• Exploiting temporal redundancy
• Motion Estimation
  – Block matching algorithm overview
    • Matching Criteria
    • Selection of Search Strategies
  – More advanced motion estimation techniques
• Software / Hardware Considerations
• Motion Compensation
• Adoption in standards discussed later




Exploiting Temporal Redundancy

[Figure: A) Frame number 1; B) Frame number 2; C) Residual = frame1 - frame2; D) Scaled residual (ease of viewing)]

• Very slight change between successive frames (e.g. A & B)
  – Camera & Object Motion
• Temporal prediction model at encoder & decoder provides compression if:
  – model parameters + correction terms < raw pixel information
• e.g. Frame differencing (C)
  – Entropy
    • B = 7.15 bits/pixel
    • C = 4.38 bits/pixel
• More complex models can reduce entropy further
  – Computational expense, memory and prediction performance trade off
• Temporal Prediction model
  – Motion estimation
  – Motion compensation




Taxonomy of Motion Estimation Algorithms

• Good Motion Estimation reviews: [Mitchell96][Furht97][Kuhn99]

[Figure: taxonomy of motion estimation algorithms]
• Time Domain
  – Gradient descent algorithms: pel recursive, block recursive
  – Matching Algorithms: Feature Matching, Block Matching
• Frequency Domain
  – Wavelet based matching, Phase correlation, DCT based matching
• Block Matching
  – Search Strategy: Search space reduction, Fast heuristic search strategies, Block Subsampling/Hierarchical, Prediction
  – Matching Criteria: Mean Squared Error, Mean Absolute Error, Sum of absolute difference, Binary Block Matching, SAD summation truncation, SAD estimation, Reduced Bit Mean Absolute Difference, Minimised Maximum Error function, Pixel Difference Classification, Different Pixel Count, Adaptive Bit Truncation, Mean Absolute Difference of Means
  – Other Issues: Block Size (Fixed, Variable), Number of reference frames
  – Optimisations: Rate / distortion, Complexity / distortion




Block Matching Algorithm

• For each MxN block in the current frame, find the associated best matching block within a predetermined or adaptive ±S pel search range in one or more reference frames
  – Estimates motion of a group of pixels
  – Assumes translational motion only
  – Typically operates on luminance component only
  – Good trade off between computational complexity & prediction accuracy
• Motion vector (relative offsets to the best match) undergoes VLC
• Prediction Residual undergoes further processing (DCT, VLC, etc)




Matching Criteria

• At each MxN block search position a matching criterion is evaluated
• Wide variety of matching criteria:
  – Mean Squared Error:
    $$MSE = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( B_{curr}(i,j) - B_{ref}(i,j) \right)^2$$
  – Mean Absolute Differences:
    $$MAD = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left| B_{curr}(i,j) - B_{ref}(i,j) \right|$$
  – Sum of Absolute Differences:
    $$SAD = \sum_{i=1}^{M} \sum_{j=1}^{N} \left| B_{curr}(i,j) - B_{ref}(i,j) \right|$$
• Reduced complexity matching criteria
  – Binary Block Match (on binarised blocks):
    $$BBM = \sum_{i=1}^{M} \sum_{j=1}^{N} B_{curr}(i,j) \oplus B_{ref}(i,j)$$
• Others
  – Cross correlation
  – SAD summation truncation
  – SAD estimation
  – Reduced Bit Mean Absolute Difference
  – Minimised Maximum Error function
  – Etc
• Choice of matching criterion is a complexity/prediction performance trade off
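
The matching criteria translate directly into code. The following Python sketch assumes blocks are given as equally sized lists of rows; the function names and the 2x2 sample blocks are illustrative, and the binary block match assumes the blocks have already been reduced to one bit per pixel.

```python
def sad(cur, ref):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(c - r) for row_c, row_r in zip(cur, ref) for c, r in zip(row_c, row_r))

def mad(cur, ref):
    """Mean absolute difference: SAD normalised by the number of pixels."""
    return sad(cur, ref) / (len(cur) * len(cur[0]))

def mse(cur, ref):
    """Mean squared error between two equally sized blocks."""
    total = sum((c - r) ** 2 for row_c, row_r in zip(cur, ref) for c, r in zip(row_c, row_r))
    return total / (len(cur) * len(cur[0]))

def bbm(cur_bits, ref_bits):
    """Binary block match: count of positions where the one-bit pixels differ (XOR)."""
    return sum(c ^ r for row_c, row_r in zip(cur_bits, ref_bits) for c, r in zip(row_c, row_r))

cur = [[10, 20], [30, 40]]
ref = [[12, 18], [30, 44]]
print(sad(cur, ref), mad(cur, ref), mse(cur, ref))
print(bbm([[1, 0], [1, 1]], [[1, 1], [0, 1]]))
```

SAD avoids both the multiplication of MSE and the division of MAD, which is one reason it is popular in hardware, while BBM replaces the subtraction with a single XOR per pixel.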




Search Strategies (1/4)

• Many possible search strategies!
• Full Search: search every position
  – Best results, but very computationally expensive
  – Operations required to generate 1 MV for 1 current block:
    • (2S+1)^2 block matches
    • For each pixel in a M * N block match: subtract, absolute, accumulate
    • After each block match, minimum SAD comparison
    • Therefore total operations:
      » (2S+1)^2 * (M * N * 3 + 1), e.g. S=8, 289 * (M * N * 3 + 1)
• Reduce computational expense
  – Logarithmic: reduces number of search positions
    • Assumes matching criterion monotonically increases moving away from minimum point – iteratively converge to minimum point
    • Possibility of getting stuck in local minimum
      » Yields higher energy prediction residual
• Pseudocode for the Three Step Search
  – 1: R = 2^(log2(S) - 1);
  – 2: Search positions within the search window defined using R
  – 3: R = R/2;
  – 4: if R<1 finished, else go to 2.
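
A full search is the simplest strategy to write down. The sketch below uses SAD as the matching criterion; the function names, the frame layout (lists of rows), the synthetic frames and the decision to skip candidates outside the reference frame are all assumptions made for illustration.

```python
def block_sad(cur, ref, bx, by, dx, dy, m, n):
    """SAD between the current block at (bx, by) and the reference block displaced by (dx, dy)."""
    return sum(abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
               for j in range(n) for i in range(m))

def full_search(cur, ref, bx, by, m, n, s):
    """Test every displacement within +/- s pels; return (best SAD, dx, dy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-s, s + 1):
        for dx in range(-s, s + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= bx + dx and bx + dx + m <= width and
                    0 <= by + dy and by + dy + n <= height):
                continue
            cost = block_sad(cur, ref, bx, by, dx, dy, m, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best

# Synthetic test: the current frame is the reference displaced by (+3, -2) pels.
W = H = 32
ref = [[x + 64 * y for x in range(W)] for y in range(H)]
clamp = lambda v, hi: max(0, min(v, hi))
cur = [[ref[clamp(y - 2, H - 1)][clamp(x + 3, W - 1)] for x in range(W)] for y in range(H)]

best = full_search(cur, ref, bx=12, by=12, m=8, n=8, s=4)
ops = (2 * 8 + 1) ** 2 * (16 * 16 * 3 + 1)   # slide's cost model for S=8, 16x16 blocks
print(best, ops)
```

The search recovers the displacement exactly, at the price of evaluating every candidate; the operation count shows why the reduced-position strategies are attractive.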




Search Strategies (2/4)

• Logarithmic searches contd.
  – Three Step Search [Koga81]
    • S = 8, initial R=4
    • Search positions defined using R:
      – (x-R,y-R), (x,y-R), (x+R,y-R) ….(x,y),…(x+R,y+R)
    • Operations required to generate 1 MV
      – (9+8+8) * (M * N * 3 + 1)
  – Variants:
    • 2-D logarithmic [Jain81], Parallel 1-D [Chen91], CDS [Rao83], N3SS [Li94], 4SS [Po96]
• Hierarchical Search Strategies
  – Search fewer positions & use fewer pixels in the matching criteria
    • Achieved via sub-sampling current & reference frames
    • Disadvantage: increased memory
  – Best match in lower resolution seeds search for subsequent resolutions
  – Can help to avoid local minima due to low pass filtering effect
  – Local minima still possible for small regions which disappear during sub-sampling





• 3 Level Hierarchical Search Example:
  – Level 1: Original
  – Level 1 sub-sampled by factor of 2 generating level 2
  – Level 1 sub-sampled by 4 generating level 3
  – Motion Estimation starts at level 3
    • block size: N/4 X M/4
    • Search window ±S/4
    • FS or TSS employed within this window
    • Produces motion vector (Vx3, Vy3)
  – Motion Estimation level 2
    • block size: N/2 X M/2
    • Centered on (x/2+2*Vx3, y/2+2*Vy3)
    • Search window ±1 around this point
    • Produces motion vector (Vx2, Vy2)
  – Motion Estimation level 1
    • Centered on (x+2*Vx2, y+2*Vy2)
    • Search window ±1 around this point
    • Produces final motion vector (Vx1, Vy1)
• Operations required to generate 1 MV using a FS at level 3
  – (2*(S/4)+1)^2 * (M/4 * N/4 * 3 + 1) + 9*(M/2 * N/2 * 3 + 1) + 9*(M * N * 3 + 1)
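
The operation counts quoted for the different strategies can be compared side by side. This is a small sketch under the slides' cost model of three operations per pixel plus one comparison per block match; the function names are hypothetical, and the level 2 block is taken as M/2 x N/2, matching the level 2 block size given in the example.

```python
def ops_per_match(m, n):
    """Subtract, absolute, accumulate per pixel, plus one minimum-SAD comparison."""
    return m * n * 3 + 1

def full_search_ops(m, n, s):
    return (2 * s + 1) ** 2 * ops_per_match(m, n)

def three_step_ops(m, n):
    # 9 positions in the first step, 8 new positions in each of the next two.
    return (9 + 8 + 8) * ops_per_match(m, n)

def hierarchical_ops(m, n, s):
    level3 = (2 * (s // 4) + 1) ** 2 * ops_per_match(m // 4, n // 4)  # full search
    level2 = 9 * ops_per_match(m // 2, n // 2)                        # +/-1 refinement
    level1 = 9 * ops_per_match(m, n)                                  # +/-1 refinement
    return level3 + level2 + level1

M = N = 16
S = 8
for name, ops in [("full search", full_search_ops(M, N, S)),
                  ("three step", three_step_ops(M, N)),
                  ("hierarchical", hierarchical_ops(M, N, S))]:
    print(name, ops, round(ops / (M * N)))
```

Expressed per pixel of the current block, this gives roughly 868 operations for the full search, 75 for the three step search and 39 for the three level hierarchical search.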





• Scene adaptive search area
  – Zone based search strategies
    • Can employ stopping threshold in each zone
    • Advantageous in a rate/distortion sense
    • [chan95][Jung96][Zhe97]
  – Spiral Search
  – Dynamic search window size
    • Many techniques used to adjust range:
      – Spatial correlation of MV [Chain95][In97]
• Gradient based methods
  – Block based gradient descent search [Liu96]
    • Stops after 4 steps
  – Diamond search [Cote97]
• Early stopping technique
  – Skip to next block match when the minimum SAD has been exceeded
  – Successive elimination algorithm [Li95]
  – Conservative block SAD [Do98]

[Figures: Spiral search based Motion Estimation; Zone-based Motion Estimation]




Different Search Strategy Performance*

• Frame Differencing
  – “0” Motion Vector
  – Entropy: 4.38 bits/pixel
  – 1 operation/pixel (subtraction)
• Full Search
  – Block size 16x16
  – Search range ±8
  – Entropy: 2.61 bits/pixel
  – ~868 operations/pixel
• Hierarchical Search
  – Block size 4x4, 8x8, 16x16
  – Search window ±2, ±4, ±8
  – Entropy: 3.08 bits/pixel
  – ~39 operations/pixel
• Hierarchical Search
  – Block size 4x4, 16x16, 32x32
  – Search window ±2, ±4, ±8
  – Entropy: 2.91 bits/pixel
  – ~35 operations/pixel




More advanced techniques (1/2)

• Bi-directional (Forward and Reverse) Prediction
  – Termed B-frames
  – Needs future frames, so adds delay: not feasible for low-latency real-time systems
• Multiple Reference Frames
  – Improves prediction
  – Increases computational expense & memory requirements
• Unrestricted Motion Vectors
  – Allow block matches outside the reference frame
  – Pixel padding used to extend beyond frame boundaries
• Predictive Motion Vectors
  – Rather than start at collocated block use a MV predictor
    • Temporal and/or Spatial prediction [Lee97][Kos97][Zheng97]
    • Can improve prediction residual quality
    • Can employ thresholds to “gate-off” motion estimation
    • H/W: Reduces pixel reusability between current block positions
• Global Motion Compensation
  – “Default motion” for the frame/object

[Figure: frame boundary, with the original search window, the MV predictor and the predicted search window]




More advanced techniques (2/2)

• Sub-pel Motion Estimation
  – Real motion is not constrained to integer pixel amounts
  – Half-pel & quarter-pel frequently used
  – But memory increases
  – H.264:
    • 6-tap FIR filter for ½ pel
    • Bilinear for ¼ pel
• Variable Block Size Motion
  – Smaller block size will lead to smaller residual
  – But number of motion vectors & signalling info increases
    • 41 MV per 16x16 block in H.264
  – MPEG-4 & H.263 Advanced Prediction Motion Estimation (4MV)
  – H.264:
    • Dynamically adapts between multiple block sizes (16x16, 16x8, 8x16, 8x8, 8x4, 4x8, 4x4)
    • Rate/Distortion Optimised
• Motion Vector Coding Prediction
  – Adding MVs to bitstream can be costly, particularly if block size is small
  – DPCM used to exploit spatial MV redundancies

[Figure: H.264 macro-block partitions: 16x16 block, 8x16 blocks, 16x8 blocks, 8x8 blocks, 8x4 blocks, 4x8 blocks, 4x4 blocks]
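
Sub-pel interpolation can be illustrated in one dimension. The sketch below uses the 6-tap kernel (1, -5, 20, 20, -5, 1)/32 that H.264 specifies for half-sample luma positions, followed by a rounded average for the quarter-sample position; the coefficients come from the standard rather than from these slides, and the function names, the 1D simplification and the sample rows are assumptions.

```python
TAPS = (1, -5, 20, 20, -5, 1)

def half_pel(row, i):
    """Half-sample value between row[i] and row[i + 1]; needs 2 samples to the left, 3 to the right."""
    acc = sum(t * row[i - 2 + k] for k, t in enumerate(TAPS))
    # Round, divide by 32 and clip to the 8-bit sample range.
    return min(255, max(0, (acc + 16) >> 5))

def quarter_pel(row, i):
    """Quarter-sample value between row[i] and the half-sample position (rounded average)."""
    return (row[i] + half_pel(row, i) + 1) >> 1

ramp = [0, 10, 20, 30, 40, 50]
edge = [0, 0, 0, 255, 255, 255]
print(half_pel(ramp, 2), quarter_pel(ramp, 2), half_pel(edge, 2))
```

The longer filter follows a sharp edge more faithfully than simple averaging would, which is part of why sub-pel prediction in H.264 costs more memory bandwidth and computation.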




ME Software/Hardware considerations

• Software algorithmic complexity (simplified analysis)
  – To support 1920x1280 @ 30Hz: 9600 16x16 blocks/frame x 30 = 288K 16x16 blocks/sec
  – ±8 Search Window = 289 block matches per current block
  – Total block matches: 289 * 288K = 83,232,000 matches/sec
  – Operations = 83,232,000 * (256 pixels * 3 + 1) ~= 64 GOPS
• Hardware implementations can be attractive
  – Systolic Array (1D/2D) approaches typically employed
    • Memory bandwidth efficient & high throughput
• Full Search commonly used
  – Architectures also available for heuristic search strategies
• Architectures for H.264 Variable Block Size emerging
  – Ball park figures for H.264 VBSME core:
    • 1-D 16 PE SA:
      – Area: 40-60K gates; Memory Bandwidth: ~3 pixels per clock cycle
      – 1 16x16 block match every 4096 clock cycles (±8 search range)
    • 2-D 256 PE SA:
      – Area: 100-200K gates; Memory Bandwidth: ~48 pixels per clock cycle
      – 1 16x16 block match every 256 clock cycles (±8 search range)
• To support 1920x1280: 9600 x 30 = 288K 16x16 blocks/sec
  – 256 PE 2D SA requires a clock frequency ~= 75MHz
  – For higher throughput: Arrays of 1-D/2-D modules required




Motion Compensation

• Straightforward relative to motion estimation
  – Reconstructed MB = Residual + Mot. Comp. MB (pointed to by MVs)
• Copy block of pixels from displaced block in the reference frame into the current frame
  – Reference frame must be stored in decoder
  – For encoder and decoder to remain synchronised
    • Encoder also needs to do motion compensation
• Considerations:
  – Additional frame memory at the decoder
  – Low computational requirements
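
Motion compensation itself is little more than a displaced copy plus an addition. The sketch below is illustrative only: the function names, the tiny 4x4 reference frame, integer-pel vectors and clipping to the 8-bit range are assumptions.

```python
def motion_compensate(ref, bx, by, mv, m, n):
    """Copy the m x n block that the motion vector points to in the reference frame."""
    dx, dy = mv
    return [[ref[by + dy + j][bx + dx + i] for i in range(m)] for j in range(n)]

def reconstruct(residual, prediction):
    """Reconstructed block = decoded residual + motion compensated prediction, clipped to 0..255."""
    return [[min(255, max(0, r + p)) for r, p in zip(res_row, pred_row)]
            for res_row, pred_row in zip(residual, prediction)]

ref = [[x + 10 * y for x in range(4)] for y in range(4)]
prediction = motion_compensate(ref, bx=0, by=0, mv=(1, 2), m=2, n=2)
block = reconstruct([[1, -1], [0, 2]], prediction)
print(prediction, block)
```

There is no search here, only memory reads and one addition per pixel, which is why the cost at the decoder is dominated by reference frame storage rather than computation.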




Lossy Compression

Standards




Standards Evolution

[Figure: timeline of standards from 1984 to 2004. JPEG standards: JPEG, JPEG2000. MPEG standards: MPEG-1, MPEG-4. ITU / MPEG standards: H.262/MPEG-2, H.26L (H.264 / MPEG-4 v10). ITU standards: H.261, H.263, H.263+, H.263++.]




JPEG

• Flexible image coding standard
• 4 Modes of operation
  – Lossless encoding (earlier)
  – Baseline sequential encoding
  – Progressive encoding
  – Hierarchical encoding (towards JPEG-2000)
• Motion JPEG
  – Baseline encoding of each frame
  – No motion estimation
  – Not properly standardised




JPEG-2000

• JPEG not optimised for a wide range of apps
• JPEG-2000 even more flexible
• Interesting features:
  – Uses DWT instead of DCT
  – Region of Interest (ROI) coding
  – Scalability
    • Spatial scalability
    • SNR scalability
  – More resilient to channel errors
    • Individual quality packets independently decoded
  – Also supports lossless coding
• Added flexibility comes at computational cost




JPEG/JPEG-2000 Summary

• JPEG capable of average compression of 15:1 for subjectively transparent quality
• JPEG-2000 better compression @ fixed rate
  – For ‘Foreman’:
    • Gain of 1.54 dB for range of 1.2 to 0.12 bpp
• Applications
  – Internet
  – Digital photography
  – Many more




ITU-T H.261

• ITU-T: narrow bandwidth real-time apps
• H.261 (p x 64) kb/s over ISDN (1≤p≤30)
• CIF and QCIF resolution
• Real time video telephony/conferencing
• Up to 3 frames interpolated by decoder
  – Supports framerates of 30Hz, 15Hz, 10Hz, 7.5Hz
• Video compression tools
  – 8x8 DCT
  – Uniform scalar quantiser (rate control optional)
  – Entropy coder is modified run length and Huffman
  – Motion Estimation
    • Only forward direction
    • Search window limited to ±15
    • Integer pixel accuracy only
  – Motion Compensation is optional
  – Loop filter (alleviate blocking)




ISO/IEC MPEG-1

• Storage of AV content for delivery at ~1.5Mb/s
• Flexible
  – Resolutions typically ≤768x586
  – Framerate typically ≤30Hz
• H.261 was starting point for the standard
• Compression gain at expense of latency
• Specific features
  – Standard VLCs determined by Huffman coding
  – DCT DC coeffs are differentially predicted
  – Bi-directional prediction (I,P,B frames)
  – Motion compensation with half-pixel accuracy
  – Maximum MV range of (-512,+511.5) for half pixel and (-1024,+1023) for integer pixel
  – Weighted quantisation (H.261 does not have this)
  – Random access to bitstream, FF, FR




ISO/IEC MPEG-1




ISO/IEC MPEG-2

• High quality video @ 4-15Mb/s
  – VOD, Broadcast TV, DVD, HDTV, Satellite TV
• Major differences w.r.t. MPEG-1
  – More resolutions, framerates, qualities and bitrates
    • SIF (352x288@25Hz) to HDTV (1920x1250@60Hz)
    • Profiles and levels
  – Has interlaced/progressive option
    • Frame/Field based ME, MC and DCT
  – Scalability (temporal, spatial, SNR)
• Minor differences
  – More bits for quantisation
  – Alternate scan (as well as zigzag)




ITU-T H.263

• Very low bitrate apps (< 64kb/s)
  – Video telephony over PSTN, mobile telephony
  – Recommended resolutions: subQCIF, QCIF, CIF, 4CIF, 16CIF
  – Non-interlaced @ 29.97Hz
• Similar to H.261
• Extensions (Some optional in Annex but included in H.264)
  – MVs differentially encoded
  – Half-pixel accurate motion estimation
    • Extensions support quarter and one eighth
  – Unrestricted motion vector mode
    • MVs can point outside image, edge pixels form prediction
  – Advanced prediction mode
    • MB can have 4 MVs associated with it
  – Syntax-based arithmetic encoding (SAC)
    • Optional mode to replace VLCs with arithmetic encoding
  – “PB” frames
  – Error resilience
    • Synchronisation markers
    • Reversible VLCs
    • More suggested in technical annex to standard




ISO/IEC MPEG-4

• An all encompassing standard!
  – Improved compression at 5kb/s to 1Gb/s
  – Resolutions of sub-QCIF to studio
  – Content-based interactivity (semantic ‘objects’)
  – Universal access (scalability, error resilience)
  – Synthetic and natural hybrid coding (SNHC)




ISO/IEC MPEG-4

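[Figure: MPEG-4 object-based encoder block diagram. Video In (Current Frame) minus the Prediction gives the Prediction Residual, which passes through the SA-DCT, Quantiser and Entropy Encoder to the Bitstream. The Inverse Quantiser and SA-IDCT give the Decoded Prediction Residual, which is added to the Prediction to form the Reconstruction held in Frame Memory as the Reference Frame for Motion Estimation (producing Motion Vectors) and Motion Compensation. Shape In (Current Frame Shape) passes through the Shape Coder, and the Shape Decoder supplies the Decoded Current Frame Shape.]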




ISO/IEC MPEG-4

• Video coding tools
  – Integer, half and quarter pixel ME
  – Boundary MB ME: padding or polygon matching
  – Global ME
  – Shape Adaptive DCT
  – AC/DC intra prediction
  – Enhanced scalability: FGS
  – Still texture coding (uses SA-DWT)
• Shape Coding tools
  – Context-based arithmetic encoding (CAE)
    • Compute context
    • Index into LUT for probability of 0,1
    • Drive arithmetic encoder




ITU-T H.264 or ISO/IEC MPEG-4 Part 10 (AVC)

• Targets enhanced compression for wide range of apps
• Improved prediction
  – Variable block-size MC with small block sizes
  – Up to quarter-pixel MC
  – Unrestricted motion vector mode
  – Multiple reference picture MC
  – Weighted prediction (generalised B-pictures)
  – Directional intra prediction (9 4x4 modes, 4 16x16 modes)
  – In the loop adaptive deblocking filter
• Improved coding efficiency tools
  – Small block size transform
  – Hierarchical block transform
  – Short word length transform (16 bit integer arith)
  – Exact match inverse transform
  – CAVLC, CABAC
• Enhanced error robustness and network friendliness









• H.264 Version 1 has 3 profiles
  – Baseline
  – Main
  – Extended
• Fidelity Range Extension (FRExt) Amendment
  – High Profile
  – High 10 Profile
  – High 4:2:2 Profile
  – High 4:4:4 Profile
    • Up to 12 bits per sample
    • Supports lossless region coding
    • Codes RGB to avoid colour space transformation error




Comparing Standards

• Video conferencing applications
  – Low latency real-time requirement
  – H.264/AVC MP would improve by further 10-20%
    • Using low delay bi-prediction, CABAC




Comparing Standards

• Video streaming applications
  – Less of delay constraint




Comparing Standards

• Entertainment-quality applications
  – High resolution, delay tolerable




Comparing Standards

• Professional motion picture production
  – Random access to individual frames
  – Up to HDTV, H.264/AVC MP comparable or better than Motion-JPEG2000




Comparing Standards

• PSNR, while useful, does not take into account the intricacies of the human eye
  – Need subjective video tests
  – Other metrics
    • MPQM,…
• Experiments show that H.264 gives lowest bitrate for subjectively equivalent video over a range of apps
• Improved performance comes at the cost of computational complexity
  – Main bottleneck is ME (very memory intensive)




Image Analysis

Visual Feature Extraction




Visual Features - Still Images

• What features are important?
  – Colour
  – Texture
    • The feel, appearance, consistency of a surface
• In an image:
  – Distribution over the entire image?
  – Of specific parts of the image?

[Figure: example images, from “No texture” to “Highly textured”]




Visual Features - Colour

• Colour is visually important to humans
• Colour features and similarity metrics easy to compute
  – Histogram [Swain and Ballard, 1992]
    • Most commonly used structure to represent global image features.
    • Invariant to translation and rotation and can be made invariant to scale by normalisation
• MPEG-7 Scalable Colour Description:
  – H (16 levels) S (4 levels) V (4 levels) – histogram encoded with a Haar transform for efficiency & scaling




Visual Features - Texture

• Simple texture descriptors [Pratt, 1991]:
  – Autocorrelation function
  – Co-occurrence matrices
  – Edge frequency
  – Primitive length
• More sophisticated (based on transforms and/or filtering)
  – Wavelet [Mallat, 1990], Haar [Theodoridis, 1999], Gabor [Bovis, 1990]
• Others:
  – Mathematical morphology
  – Fractals




Visual Features - Texture

• Example: MPEG-7 Edge Histogram
  – Represents the global (and possibly local - [Won, 2002]) spatial distribution of edges
• Need to first generate edge map
  – Roberts, Sobel and Prewitt, Canny, …
• Build histogram based on 5 edge types




Change Detection

• Compare 2 temporally adjacent images and determine how different they are

• Why?
  – Surveillance-type applications
    • Assume static camera & background
    • Anything changing between one frame and the next must be an object!
    • In fact, this is naïve but it is the starting point of many object segmentation techniques
  – Temporal video structuring
    • Breaking video up into “chunks” for non-linear browsing: shots, scenes, events, story-lines




Temporal Video Structuring

• Shot boundary detection

[Figure: a video document is reduced to a set of keyframes, which are presented in keyframe-based video browsers]







Temporal Video Structuring

• Shot boundary detection
  – A shot is a continuous piece of video taken with one camera
  – A shot cut is the abrupt or gradual transition between two shots
• Uncompressed domain:
  – Calculate colour histogram for each frame
  – Calculate difference between histograms using suitable metric: L1 (city-block), L2 (Euclidean), Mahalanobis, etc
  – Threshold
• Compressed domain:
  – Parse features directly from bitstream:
    • E.g. use DCT coefficients for each frame to reconstruct approximation of image
    • E.g. use motion vectors for each pair of frames and detect changes in global statistics
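
The uncompressed-domain approach can be sketched as follows. For brevity the sketch uses a grey-level histogram in place of a colour histogram, the L1 metric, and an arbitrary threshold of 0.5; the function names, bin count, threshold and synthetic frames are all illustrative assumptions, and a fixed threshold like this would only catch abrupt cuts, not gradual transitions.

```python
def histogram(frame, bins=8):
    """Normalised grey-level histogram of a frame given as rows of 0..255 values."""
    counts = [0] * bins
    total = 0
    for row in frame:
        for pixel in row:
            counts[pixel * bins // 256] += 1
            total += 1
    return [count / total for count in counts]

def l1_distance(h1, h2):
    """City-block distance between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def shot_boundaries(frames, threshold=0.5):
    """Indices of frames whose histogram differs from the previous frame by more than the threshold."""
    hists = [histogram(frame) for frame in frames]
    return [i for i in range(1, len(hists))
            if l1_distance(hists[i - 1], hists[i]) > threshold]

dark = [[10] * 4 for _ in range(4)]
bright = [[200] * 4 for _ in range(4)]
cuts = shot_boundaries([dark, dark, bright, bright])
print(cuts)
```

Because the histogram discards spatial layout, the measure is insensitive to object and camera motion within a shot, which is what makes a simple threshold workable for abrupt cuts.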
