Principles of Data Compression: Theory and Applications

Dr. Daniel Leon-Salas

Tutorial - Intercon 2014


Page 1: Tutorial - Intercon 2014

Principles of Data Compression: Theory and Applications

Dr. Daniel Leon-Salas

Page 2: Tutorial - Intercon 2014

Motivation

The Information Revolution


Page 3: Tutorial - Intercon 2014

Motivation

• Consider a 3-minute song: assuming two channels, 16-bit resolution, and a sampling rate of 48 kHz, it will take 33 MB of disk space to store the song.

• Consider a 5-megapixel camera: assuming an 8-bit resolution per pixel, it will take 5 MB of disk space to store one picture.

• One second of video using the CCIR 601 format (720×485) needs more than 30 megabytes of storage space.
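A quick back-of-the-envelope check of these figures (a minimal Python sketch; the 30 frames/s rate and the 24 bits per pixel assumed for CCIR 601 video are not stated above):

```python
MiB = 2**20  # sizes below are reported in binary megabytes

# 3-minute song: two channels, 16 bits/sample, 48 kHz
song_bytes = 2 * (16 // 8) * 48_000 * 3 * 60
print(f"song:  {song_bytes / MiB:.1f} MiB")   # ~33 MiB

# 5-megapixel picture, 8 bits/pixel
image_bytes = 5_000_000 * 1
print(f"image: {image_bytes / MiB:.1f} MiB")  # ~4.8 MiB (5 MB in decimal units)

# 1 second of CCIR 601 video (720x485), assuming 24 bits/pixel and 30 frames/s
video_bytes = 720 * 485 * 3 * 30
print(f"video: {video_bytes / MiB:.1f} MiB")  # ~30 MiB
```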


Page 4: Tutorial - Intercon 2014

Introduction

• If data generation is growing at an explosive rate, why not focus on improving transmission and storage technologies?

• Transmission and storage technologies are improving but not at the same rate as data is generated.

• This is especially true for wireless communications where the radio spectrum is limited.


Page 5: Tutorial - Intercon 2014

Introduction

• Data compression is the art or science of representing information in a compact form.

• Data compression is performed by identifying and exploiting structure and redundancies in the data.

• Data can be samples of audio, images, or text files; it can also be generated by sensors or scientific instruments, social networks, markets, etc.


Page 6: Tutorial - Intercon 2014

Introduction

• Consider Morse code, developed in the 19th century, in which letters are encoded with dots and dashes.

Some letters (e and a) occur more often than others (q and j).

Letters that occur more frequently are encoded using shorter sequences: e → .   a → .-

Letters that occur less frequently are encoded using longer sequences: q → --.-   j → .---

• In this case the statistical structure of the data was exploited.
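As a small illustration, the sketch below encodes a string with the four Morse codewords quoted above (a toy table for illustration only; real Morse code covers the full alphabet):

```python
# Variable-length coding in miniature: frequent letters get short codewords.
MORSE = {"e": ".", "a": ".-", "q": "--.-", "j": ".---"}

def encode(text: str) -> str:
    """Encode text with the (partial) Morse table, separating letters with spaces."""
    return " ".join(MORSE[ch] for ch in text)

print(encode("aqea"))   # .- --.- . .-
```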


Page 7: Tutorial - Intercon 2014

Introduction

• There are many other types of structure in data that can be exploited to achieve compression.

• In speech, the physical structure of our vocal tract determines the kinds of sounds that we can produce; instead of sending speech samples, we can send information about the state of the vocal tract to the receiver.

• We can also exploit characteristics of the end user of the data.


Page 8: Tutorial - Intercon 2014

Introduction

• In many cases, when transmitting images or audio, the end user is a human.

• Humans have limited hearing and vision abilities.

• We can exploit the limitations of human perception to discard irrelevant information and obtain higher compression.


Page 9: Tutorial - Intercon 2014

Compression and Reconstruction


[Figure: the original data is passed through a compression algorithm; a reconstruction (decompression) step produces the reconstructed data.]

Page 10: Tutorial - Intercon 2014

Lossless Compression


• Lossless compression involves no loss of information.

• The recovered data is an exact copy of the original.

• Useful in applications that cannot tolerate any difference:

medical images, scientific data, financial records, computer programs

Page 11: Tutorial - Intercon 2014

Lossy Compression


• In lossy compression some loss of information is tolerated.

• The original data cannot be recovered exactly, but higher compression ratios can be achieved.

• Useful in applications where some loss of information is not critical:

speech coding, telephone communications, video coding, digital photography

Page 12: Tutorial - Intercon 2014

Compression Performance


• Compression ratio (CR):

CR = (number of bits required to represent the data without compression) / (number of bits required to represent the data with compression)

• Distortion (for lossy compression):

\mathrm{MSE} = \frac{1}{N} \lVert X - \hat{X} \rVert_2^2

\mathrm{PSNR}\ \mathrm{(dB)} = 10 \log_{10} \frac{X_{\max}^2}{\mathrm{MSE}}

• Rate: average number of bits per sample or symbol
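A minimal sketch of these three measures (NumPy-based; the peak value X_max = 255 used in the example is an assumption for 8-bit data, not something stated above):

```python
import numpy as np

def compression_ratio(bits_original: int, bits_compressed: int) -> float:
    """CR = bits without compression / bits with compression."""
    return bits_original / bits_compressed

def mse(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Mean squared error between the original X and the reconstruction X_hat."""
    return float(np.mean((x.astype(float) - x_hat.astype(float)) ** 2))

def psnr(x: np.ndarray, x_hat: np.ndarray, x_max: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB, with x_max the peak value of the signal."""
    return 10.0 * np.log10(x_max**2 / mse(x, x_hat))

# Example: a tiny 8-bit image reconstructed with small errors
x = np.array([[52, 55], [61, 59]], dtype=np.uint8)
x_hat = np.array([[50, 56], [60, 58]], dtype=np.uint8)
print(compression_ratio(65536, 16384))   # 4.0
print(mse(x, x_hat), psnr(x, x_hat))     # 1.75, ~45.7 dB
```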

Page 13: Tutorial - Intercon 2014

Example 1


Let’s consider the following input sequence:

𝑋 = [9, 11, 11, 11, 14, 13, 15, 17, 16, 17, 20, 21]

To encode this sequence using plain binary code, we would need to use 5 bits per number and a total of 60 bits.

K. Sayood, Introduction to Data Compression, 2nd edition, Morgan Kaufmann

Page 14: Tutorial - Intercon 2014

Example 1


If we use the model X̂[n] = n + 8 and compute the residual

e = X − X̂ = [0, 1, 0, −1, 1, −1, 0, 1, −1, −1, 1, 1]

then the residual consists of only three numbers {−1, 0, 1}, which can be encoded using 2 bits per number for a total of 24 bits.
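A quick check of this example (plain Python; the 24-bit total assumes 2 bits per residual value and ignores the cost of sending the model parameters):

```python
X = [9, 11, 11, 11, 14, 13, 15, 17, 16, 17, 20, 21]

# Model: X_hat[n] = n + 8 with n = 1, 2, ..., 12
X_hat = [n + 8 for n in range(1, len(X) + 1)]
e = [x - xh for x, xh in zip(X, X_hat)]   # prediction residual

print(X_hat)   # [9, 10, 11, ..., 20]
print(e)       # [0, 1, 0, -1, 1, -1, 0, 1, -1, -1, 1, 1]
print(5 * len(X), "bits without the model,", 2 * len(e), "bits for the residual")
```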

Page 15: Tutorial - Intercon 2014

Example 2


• Input sequence: a_barayaran_array_ran_far_faar_faaar_away

• The sequence is made of eight different characters (symbols):

a, b, f, n, r, w, y, _

• Hence, we can use three bits per symbol to encode the sequence resulting in a total of 41×3=123 bits for the entire sequence.

• However, we can use fewer bits if we realize that some symbols occur more frequently than others.

• We can use fewer bits to encode the more frequent symbols.

K. Sayood, Introduction to Data Compression, 2nd edition, Morgan Kaufmann

Page 16: Tutorial - Intercon 2014

Example 2


Using variable-length codes we can encode the sequence using only 97 bits.

Input character | Frequency | Variable-length code | Fixed-length code
a | 16 | 1     | 000
_ |  7 | 001   | 001
b |  1 | 01100 | 010
f |  3 | 0100  | 011
n |  2 | 0111  | 100
r |  6 | 000   | 101
w |  1 | 01101 | 110
y |  3 | 0101  | 111

Input sequence: a_barayaran_array_ran_far_faar_faaar_away

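The 97-bit figure can be reproduced from the table above (plain Python, taking the listed frequencies and codewords as given):

```python
# Frequencies and variable-length codewords from the table above.
table = {
    "a": (16, "1"),   "_": (7, "001"),  "b": (1, "01100"), "f": (3, "0100"),
    "n": (2, "0111"), "r": (6, "000"),  "w": (1, "01101"), "y": (3, "0101"),
}

variable_bits = sum(freq * len(code) for freq, code in table.values())
print(variable_bits)   # 97 bits, compared with 41 * 3 = 123 bits for the fixed-length code
```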

Page 17: Tutorial - Intercon 2014

Statistical Redundancy


• Statistical redundancy was employed in Example 2 to build a code to encode the input sequence.

• When compressing text, statistical redundancy can be exploited not only at the level of characters but also at the level of words and phrases (the dictionary technique).

• Examples of compression solutions that use the dictionary technique include the Lempel-Ziv (LZ) family of algorithms (e.g., LZ77) and formats and tools such as gzip, Zip, PNG, and PKZip.

Page 18: Tutorial - Intercon 2014

Information and Entropy


• Information can be defined as a message that helps to resolve uncertainty.

• In Information Theory information is taken as a sequence of symbols from an alphabet.

• Entropy is a measure of information.

[Figure: a source with alphabet A = {a1, a2, …, an} emits a message, i.e., a sequence of symbols: a1 a2 a3 a6 a8 a5 a3 a4]

First-order entropy of the source:

H(A) = -\sum_{i=1}^{n} P(a_i) \log P(a_i)

Page 19: Tutorial - Intercon 2014

Entropy


• If the base of the logarithm is 2 the units of entropy are bits. If the base is 10 the units are hartleys. If the base is e the units are nats.

• The first-order entropy assumes that the symbols occur independently of each other.

• The entropy is a measure of the average number of bits needed to encode the output of the source.

• Claude Shannon showed that the best rate that a lossless compression algorithm can achieve is equal to the entropy of the source.

• Example: Let’s consider a source with an alphabet consisting of four symbols, a1, a2, a3, a4, with

P(a1) = 1/2, P(a2) = 1/4, P(a3) = 1/8, P(a4) = 1/8

Applying H(A) = -\sum_{i=1}^{n} P(a_i) \log_2 P(a_i):

H = -(1/2 log2(1/2) + 1/4 log2(1/4) + 1/8 log2(1/8) + 1/8 log2(1/8)) = 1.75 bits/symbol.
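The 1.75 bits/symbol figure can be checked with a small helper (a minimal sketch using base-2 logarithms, so the result is in bits):

```python
import math

def entropy(probabilities):
    """First-order entropy H = -sum(p_i * log2(p_i)); terms with p_i = 0 are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1/2, 1/4, 1/8, 1/8]))   # 1.75
```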

Page 20: Tutorial - Intercon 2014

Coding


• Coding is the process of assigning binary sequences to symbols of an alphabet.

• Example: Let’s consider a source with a four-symbol alphabet such that P(a1) = 1/2, P(a2) = 1/4, P(a3) = 1/8, P(a4) = 1/8, so that H = 1.75 bits/symbol.

Symbol | Probability | Code 1 | Code 2 | Code 3 | Code 4
a1 | 0.5   | 0  | 0  | 0   | 0
a2 | 0.25  | 0  | 1  | 10  | 01
a3 | 0.125 | 1  | 00 | 110 | 011
a4 | 0.125 | 10 | 11 | 111 | 0111
Average length | | 1.125 bits | 1.25 bits | 1.75 bits | 1.875 bits

Only Codes 3 and 4 are uniquely decodable.

Page 21: Tutorial - Intercon 2014

Prefix Codes


Consider two codewords, C1 (k bits long) and C2 (n bits long, with n > k).

IF the first k bits of C2 are identical to C1, then we say that C1 is a prefix of C2, and the remaining n − k bits of C2 are called the dangling suffix.

• If the dangling suffix is itself a codeword, the code is not uniquely decodable.

• A prefix code is a code in which no codeword is a prefix of another codeword.

• Prefix codes are uniquely decodable.
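A small helper that tests the prefix property of a set of codewords (a sketch; testing full unique decodability would require iterating on the dangling suffixes, as in the Sardinas-Patterson procedure):

```python
def is_prefix_code(codewords) -> bool:
    """Return True if no codeword is a prefix of another codeword."""
    for c1 in codewords:
        for c2 in codewords:
            if c1 != c2 and c2.startswith(c1):
                return False
    return True

print(is_prefix_code(["0", "10", "110", "111"]))    # True  (Code 3 from the previous slide)
print(is_prefix_code(["0", "01", "011", "0111"]))   # False (Code 4: "0" is a prefix of "01")
```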

Page 22: Tutorial - Intercon 2014

Huffman Coding


• Huffman coding is an algorithm for building optimum prefix codes.

• It was developed by David Huffman as a class assignment in the first course on information theory, taught by Robert Fano at MIT in 1950.

• Huffman coding assumes that the probabilities of the source are known.

• Huffman coding is based on the following observations about optimum prefix codes:

- Symbols with higher probability have shorter codewords than less probable symbols.

- The two symbols with the lowest probabilities have codewords of the same length (proof by contradiction).

- In a Huffman code, the codewords corresponding to the two symbols with the lowest probabilities differ only in the last bit.

Page 23: Tutorial - Intercon 2014

Huffman Coding


Example: Let’s build a Huffman code for a source with a four-symbol alphabet such that P(a1) = 0.5, P(a2) = 0.25, P(a3) = 0.125, P(a4) = 0.125.

[Figure: steps 1 and 2 of the tree construction. The symbols a1, a2, a3, a4 are listed with probabilities 0.5, 0.25, 0.125, 0.125; the two least probable symbols, a3 and a4, are merged into a node with probability 0.25, and the two branches are labeled 0 and 1.]

Page 24: Tutorial - Intercon 2014

Huffman Coding


[Figure: steps 2 and 3 of the tree construction. The merged node {a3, a4} (probability 0.25) is combined with a2 (probability 0.25) into a node with probability 0.5; again the two branches are labeled 0 and 1.]

Page 25: Tutorial - Intercon 2014

Huffman Coding


[Figure: step 4 of the tree construction. The node with probability 0.5 is combined with a1 (probability 0.5) to form the root with probability 1.0; reading the 0/1 branch labels from the root to each leaf gives the codewords below.]

Symbol | Probability | Codeword
a1 | 0.5   | 0
a2 | 0.25  | 10
a3 | 0.125 | 110
a4 | 0.125 | 111

Average codeword length: lavg = 0.5×1 + 0.25×2 + 0.125×3 + 0.125×3 = 1.75 bits

It can be shown that for Huffman codes:

H(S) ≤ lavg ≤ H(S)+1
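A compact sketch of the construction illustrated above, using a binary heap to repeatedly merge the two least probable nodes (the tie-breaking counter is an implementation detail; relabeling the 0/1 branches could produce a different but equally optimal code):

```python
import heapq

def huffman_code(probabilities: dict) -> dict:
    """Build a Huffman code for {symbol: probability}; returns {symbol: codeword}."""
    # Each heap entry: (probability, tie-breaker, {symbol: partial codeword})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)   # pop the two least probable nodes
        p2, _, code2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code1.items()}        # prefix 0 for one subtree
        merged.update({s: "1" + c for s, c in code2.items()})  # prefix 1 for the other
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {"a1": 0.5, "a2": 0.25, "a3": 0.125, "a4": 0.125}
code = huffman_code(probs)
print(code)   # {'a1': '0', 'a2': '10', 'a3': '110', 'a4': '111'}
print(sum(p * len(code[s]) for s, p in probs.items()))   # 1.75 bits, equal to H here
```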

Page 26: Tutorial - Intercon 2014

Decoding Huffman Codes


[Figure: the Huffman code tree from the previous example (a1 = 0, a2 = 10, a3 = 110, a4 = 111).]

Example: Decode the following message using the Huffman code from the previous example: 0110101110

0 → a1
0 110 → a1 a3
0 110 10 → a1 a3 a2
0 110 10 111 → a1 a3 a2 a4
0 110 10 111 0 → a1 a3 a2 a4 a1

Encoded message: 0110101110    Decoded message: a1 a3 a2 a4 a1
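The same walk down the code tree can be written as a short loop over the code table (a sketch that relies on the code being a prefix code, so every match is unambiguous):

```python
def huffman_decode(bits: str, code: dict) -> list:
    """Decode a bit string using a prefix code given as {symbol: codeword}."""
    inverse = {cw: sym for sym, cw in code.items()}
    symbols, current = [], ""
    for b in bits:
        current += b
        if current in inverse:   # a complete codeword has been read
            symbols.append(inverse[current])
            current = ""
    return symbols

code = {"a1": "0", "a2": "10", "a3": "110", "a4": "111"}
print(huffman_decode("0110101110", code))   # ['a1', 'a3', 'a2', 'a4', 'a1']
```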

Page 27: Tutorial - Intercon 2014

Adaptive Huffman Codes


• Huffman coding requires knowledge of the probabilities of the source.

• If this knowledge is not available, Huffman coding becomes a two-pass procedure:

- a first pass to compute the probabilities,

- a second pass to encode the output of the source.

• The adaptive Huffman coding algorithm converts this two-pass procedure into a single-pass procedure.

• In adaptive Huffman coding, the transmitter and the receiver start with a code tree that has a single node corresponding to all the symbols not yet transmitted (NYT).

• As transmission progresses, nodes corresponding to transmitted symbols are added to the tree.

• The first time a symbol is transmitted, the code for NYT is transmitted first followed by a non-adaptive code agreed by the transmitter and the receiver before transmission starts.

Page 28: Tutorial - Intercon 2014

Golomb-Rice Codes


• The Golomb-Rice codes are a family of codes commonly used in data compression applications due to their low complexity and good compression performance.

• The JPEG committee and the Consultative Committee for Space Data Systems (CCSDS), for instance, have adopted the Golomb-Rice codes as part of their standards.

• Golomb-Rice codes are also used in video coding standards such as H.264 (as exp-Golomb codes) and in many commercial lossless audio compression programs.

• The Golomb-Rice codes have their origin in the pioneering work of Golomb, who proposed a method to encode the run lengths produced by a binary source when p0^m = 1/2, where p0 is the probability of the more frequent symbol and m is an integer.

Page 29: Tutorial - Intercon 2014

Golomb-Rice Codes


[Figure: a binary source with alphabet A = {0, 1} produces the sequence 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 …; the run lengths of 0s between 1s are 4, 3, 4, 7, 3, 2, … (non-negative integers). With p0 the probability of a 0 and m an integer chosen so that p0^m = 1/2, the run lengths n follow a geometric distribution P(n).]

Page 30: Tutorial - Intercon 2014

Golomb-Rice Codes


The Golomb-Rice codes consider the special case m = 2^k (k ≥ 0).

Encoding procedure: the quotient ⌊n / 2^k⌋ is encoded in unary (a run of ones terminated by a zero), followed by the remainder (n mod 2^k) encoded as a natural binary number using k bits.

Example: n = 17 (00010001 in binary)

k = 0: codeword = 111111111111111110
k = 1: codeword = 1111111101
k = 2: codeword = 1111001
k = 3: codeword = 110001
k = 4: codeword = 100001
k = 5: codeword = 010001
k = 6: codeword = 0010001
k = 7: codeword = 00010001
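A minimal encoder matching the procedure above; running it for n = 17 reproduces the codewords listed on this slide:

```python
def golomb_rice_encode(n: int, k: int) -> str:
    """Golomb-Rice codeword for a non-negative integer n with parameter k (m = 2**k):
    the quotient n >> k in unary (ones terminated by a zero), then the k low-order bits of n."""
    quotient = n >> k
    remainder = n & ((1 << k) - 1)
    unary = "1" * quotient + "0"
    binary = format(remainder, f"0{k}b") if k > 0 else ""
    return unary + binary

for k in range(4):
    print(k, golomb_rice_encode(17, k))
# 0 111111111111111110
# 1 1111111101
# 2 1111001
# 3 110001
```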

Page 31: Tutorial - Intercon 2014

Golomb-Rice Codes


[Figure: a double-sided distribution P(n) over …, −3, −2, −1, 0, 1, 2, 3, …]

Practical sources produce positive and negative numbers (a double-sided distribution).

Use the following mapping:

M(n) = 2n        if n ≥ 0
M(n) = 2|n| − 1  if n < 0

The mapping sends non-negative input numbers to even integers and negative input numbers to odd integers.
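A one-line version of this mapping, together with its inverse (the inverse is added here for completeness; only the forward direction is shown above):

```python
def map_signed(n: int) -> int:
    """n >= 0 maps to 2n (even); n < 0 maps to 2|n| - 1 (odd)."""
    return 2 * n if n >= 0 else 2 * (-n) - 1

def unmap_signed(m: int) -> int:
    """Inverse mapping, used by the decoder to recover the signed value."""
    return m // 2 if m % 2 == 0 else -(m + 1) // 2

print([map_signed(n) for n in [0, -1, 1, -2, 2, -3]])   # [0, 1, 2, 3, 4, 5]
print([unmap_signed(m) for m in range(6)])              # [0, -1, 1, -2, 2, -3]
```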

Page 32: Tutorial - Intercon 2014

Adaptive Golomb-Rice Codes


[Figure: adaptive Golomb-Rice coder. The source output is passed through the mapping M and then through the G-R coder, which produces the codewords; an adaptive algorithm adjusts the coding parameter k.]

Page 33: Tutorial - Intercon 2014

Adaptive Golomb-Rice Codes


1) Initialize k to kini;

2) Reset counter;

3) Read input n and encode it using parameter k;

4) If (unary part of the codeword ≥ 1) increment counter;

5) If (unary part of the codeword = 0) decrement counter;

6) If (counter value ≥ M) k++; Goto 2;

7) If (counter value ≤ -M) k--; Goto 2;
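A sketch of this adaptation loop, built on the golomb_rice_encode sketch given earlier (the initial k, the threshold M, and the clamping of k to the range 0..k_max are illustrative choices; the slide leaves them as parameters):

```python
def adaptive_gr_encode(samples, k_ini: int = 2, M: int = 4, k_max: int = 15) -> list:
    """Adaptive Golomb-Rice coding following steps 1-7 above."""
    k, counter, codewords = k_ini, 0, []                  # steps 1 and 2
    for n in samples:
        codewords.append(golomb_rice_encode(n, k))        # step 3
        quotient = n >> k                                  # number of ones in the unary part
        counter += 1 if quotient >= 1 else -1              # steps 4 and 5
        if counter >= M and k < k_max:                     # step 6: codewords too long, grow k
            k, counter = k + 1, 0
        elif counter <= -M and k > 0:                      # step 7: k is wasteful, shrink it
            k, counter = k - 1, 0
    return codewords

print(adaptive_gr_encode([3, 2, 1, 40, 45, 60, 52, 48]))
```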

Page 34: Tutorial - Intercon 2014

Entropy Coding


[Figure: source → entropy encoder → compressed output, for a source with a narrow distribution P(n).]

If the source has a narrow distribution, an entropy encoder (Huffman, Golomb-Rice, arithmetic) can be used directly.

Otherwise, a decorrelation step might be necessary:

[Figure: source → decorrelation (predictive coding, transform coding, sub-band coding) → entropy encoder → compressed output.]

Page 35: Tutorial - Intercon 2014

Predictive Coding Decorrelation


[Figure: a block of pixel values X (values such as 61, 63, 58, 69, …), the pixel prediction X̂ formed from neighboring pixels (e.g., X̂ = 64), and the prediction residual e = X − X̂, whose values are much smaller than the original pixel values.]

In an image, a pixel generally has a value close to one of its neighbors.
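A minimal sketch of this idea for a single image row, predicting each pixel from its left neighbor (the sample values are illustrative; the neighborhood-based predictor discussed on the next slides is more elaborate):

```python
import numpy as np

def residual_row(row: np.ndarray) -> np.ndarray:
    """Predict each pixel by its left neighbor and return the prediction residual.
    The first pixel has no left neighbor, so it is left unpredicted."""
    prediction = np.concatenate(([0], row[:-1]))
    return row.astype(int) - prediction.astype(int)

row = np.array([61, 63, 58, 69, 64, 60, 57, 59], dtype=np.uint8)
print(residual_row(row))   # [61  2 -5 11 -5 -4 -3  2]  (small values after the first)
```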

Page 36: Tutorial - Intercon 2014

Predictive Coding Decorrelation


[Figure: histogram of the original image and histogram of the prediction residual.]

Page 37: Tutorial - Intercon 2014

Context Adaptive Lossless Image Compression (CALIC)


[Figure: pixel neighborhood. The current pixel X has W and WW to its left, NW, N, NE in the row above, and NN, NNE two rows above.]

The neighboring pixels N, W, NE, NW, NN, WW, NNE are available to both the encoder and the decoder (assuming a raster scan).

1. To get an idea of the boundaries present in the neighborhood, compute horizontal and vertical gradient estimates:

d_h = |W − WW| + |N − NW| + |NE − N|
d_v = |W − NW| + |N − NN| + |NNE − NE|

2. Initial pixel prediction X̂:

if d_h − d_v > 80:       X̂ ← N
else if d_v − d_h > 80:  X̂ ← W
else {
    X̂ ← (N + W)/2 + (NE − NW)/4
    if d_h − d_v > 32:       X̂ ← (X̂ + N)/2
    else if d_v − d_h > 32:  X̂ ← (X̂ + W)/2
    else if d_h − d_v > 8:   X̂ ← (3X̂ + N)/4
    else if d_v − d_h > 8:   X̂ ← (3X̂ + W)/4
}

3. The initial prediction is refined based on the relationships of the pixels in the neighborhood (contexts). For each context we keep track of how much prediction error is generated and use it to refine the initial prediction.
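A sketch of steps 1 and 2 above (the gradient test and the initial prediction only; the context-based refinement of step 3 and the handling of image borders are omitted):

```python
def calic_initial_prediction(N, W, NE, NW, NN, WW, NNE):
    """Gradient-adjusted initial prediction of pixel X from its causal neighbors."""
    dh = abs(W - WW) + abs(N - NW) + abs(NE - N)    # horizontal gradient estimate
    dv = abs(W - NW) + abs(N - NN) + abs(NNE - NE)  # vertical gradient estimate
    if dh - dv > 80:        # strong horizontal variation: predict from the pixel above
        return N
    if dv - dh > 80:        # strong vertical variation: predict from the pixel to the left
        return W
    x_hat = (N + W) / 2 + (NE - NW) / 4
    if dh - dv > 32:
        x_hat = (x_hat + N) / 2
    elif dv - dh > 32:
        x_hat = (x_hat + W) / 2
    elif dh - dv > 8:
        x_hat = (3 * x_hat + N) / 4
    elif dv - dh > 8:
        x_hat = (3 * x_hat + W) / 4
    return x_hat

print(calic_initial_prediction(N=60, W=62, NE=58, NW=61, NN=59, WW=63, NNE=57))   # 60.25
```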

Page 38: Tutorial - Intercon 2014

Transform Coding


• In transform coding the input sequence is transformed into another sequence in which most of the information is contained in only a few elements.

• For a 1D signal x, such as audio or speech, the forward transform is defined as

θ = A x

and the inverse transform is defined as

x = B θ

The transforms used are orthonormal: B = A^{-1} = A^T.

• For 2D signals such as images, a two-dimensional separable transform is used: a 1D transform is applied along one dimension, followed by another 1D transform along the other dimension.

• In matrix notation, the forward transform is

Θ = A X A^T

and the inverse transform is given by

X = B Θ B^T

Page 39: Tutorial - Intercon 2014

Transform Coding


• In the JPEG standard, the forward transform is the Discrete Cosine Transform (DCT) and the inverse transform is the Inverse Discrete Cosine Transform (IDCT).

• The DCT transform matrix A is defined as:

A_{i,j} = \sqrt{1/N} \cos\frac{(2j+1) i \pi}{2N}   for i = 0 and j = 0, 1, …, N−1

A_{i,j} = \sqrt{2/N} \cos\frac{(2j+1) i \pi}{2N}   for i = 1, 2, …, N−1 and j = 0, 1, …, N−1

[Figure: JPEG encoder block diagram. Input image → DCT → quantization (using a quantization table) → the DC coefficient is coded with DPCM and the AC coefficients with run-length coding (RLC) → entropy encoder → compressed image.]
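The transform matrix defined above can be generated and checked directly (NumPy sketch; this is the orthonormal DCT matrix, so the inverse is simply the transpose):

```python
import numpy as np

def dct_matrix(N: int) -> np.ndarray:
    """N x N DCT transform matrix A as defined above."""
    A = np.zeros((N, N))
    for i in range(N):
        scale = np.sqrt(1.0 / N) if i == 0 else np.sqrt(2.0 / N)
        for j in range(N):
            A[i, j] = scale * np.cos((2 * j + 1) * i * np.pi / (2 * N))
    return A

A = dct_matrix(8)
print(np.allclose(A @ A.T, np.eye(8)))   # True: A is orthonormal, so B = A^T

# 2D separable transform of an 8x8 block X: Theta = A X A^T
X = np.random.randint(0, 256, (8, 8)).astype(float)
Theta = A @ X @ A.T
print(np.allclose(A.T @ Theta @ A, X))   # True: the transform itself is lossless
```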

Page 40: Tutorial - Intercon 2014

Transform Coding - DCT


8×8 block of pixel values:

183 177 147 79 41 34 35 43
189 153 63 39 38 37 39 44
187 99 37 38 42 41 46 46
101 42 36 39 61 63 59 44
41 41 38 45 57 73 52 47
44 49 49 50 54 60 58 54
51 58 55 50 55 57 58 54
44 50 52 54 55 59 67 63

8×8 block of DCT coefficients (the top-left value is the DC coefficient; the remaining 63 are the AC coefficients):

502.0 119.5 83.8 48.3 6.0 0.0 -0.1 -0.3
88.6 173.4 90.9 22.5 11.5 -1.8 -0.2 -0.8
62.0 78.7 22.2 -44.9 -19.8 -9.4 -7.3 -1.1
12.2 4.7 -37.1 -44.6 -30.2 -12.2 5.0 -3.0
3.5 -22.5 -36.9 -20.3 -13.0 4.1 11.5 5.1
12.1 9.7 -7.0 -6.6 2.6 11.3 8.5 11.5
9.2 7.9 3.7 -6.4 6.3 10.1 3.8 1.8
2.6 9.8 1.4 -2.0 0.3 -1.2 2.3 -5.1

Page 41: Tutorial - Intercon 2014

Quantization of DCT Coefficients


DCT coefficients (Θ):

502.0 119.5 83.8 48.3 6.0 0.0 -0.1 -0.3
88.6 173.4 90.9 22.5 11.5 -1.8 -0.2 -0.8
62.0 78.7 22.2 -44.9 -19.8 -9.4 -7.3 -1.1
12.2 4.7 -37.1 -44.6 -30.2 -12.2 5.0 -3.0
3.5 -22.5 -36.9 -20.3 -13.0 4.1 11.5 5.1
12.1 9.7 -7.0 -6.6 2.6 11.3 8.5 11.5
9.2 7.9 3.7 -6.4 6.3 10.1 3.8 1.8
2.6 9.8 1.4 -2.0 0.3 -1.2 2.3 -5.1

Quantized coefficients (Θ̂):

496 121 80 48 0 0 0 0
84 168 84 19 0 0 0 0
56 78 16 -48 0 0 0 0
14 0 -44 -58 -51 0 0 0
0 -22 -37 0 0 0 0 0
24 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0

Quantization table (Q):

16 11 10 16 24 40 51 61
12 12 14 19 26 58 60 55
14 13 16 24 40 57 69 56
14 17 22 29 51 87 80 62
18 22 37 56 68 109 103 77
24 35 55 64 81 104 113 92
49 64 78 87 103 121 120 101
72 92 95 98 112 100 103 99

\hat{\Theta} = \mathbf{Q} \cdot \mathrm{round}(\Theta / \mathbf{Q})   (applied element-wise)

After quantization, the DCT coefficients are transmitted following a zig-zag pattern. The coefficients are encoded using a Huffman code.
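The quantization rule above is a single element-wise operation (NumPy sketch using the luminance table Q listed on this slide; the example reproduces the first row of the quantized block):

```python
import numpy as np

# Quantization table Q from the slide (the standard JPEG luminance table)
Q = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
])

def quantize(theta: np.ndarray) -> np.ndarray:
    """Apply the rule from the slide, Theta_hat = Q * round(Theta / Q), element-wise."""
    return Q * np.round(theta / Q)

# First row of the DCT coefficient block from the previous slide (remaining rows zero-filled)
theta = np.zeros((8, 8))
theta[0] = [502.0, 119.5, 83.8, 48.3, 6.0, 0.0, -0.1, -0.3]
print(quantize(theta)[0].astype(int))   # [496 121 80 48 0 0 0 0]
```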

Page 42: Tutorial - Intercon 2014

Transform Coding - DCT


[Figure: the original image and the image coded using the DCT.]

Page 43: Tutorial - Intercon 2014

Sub-band Coding


• In sub-band coding the input signal is decomposed into several sub-bands using an analysis filter bank.

• Depending on the signal different sub-bands will contain different amounts of information.

• Sub-bands with lots of information are encoded using more bits while sub-bands with little information are encoded using fewer bits.

• At the decoder side, the signal is reconstructed using a bank of synthesis filters.

[Figure: the signal spectrum divided into sub-bands f1, f2, f3, …, fM.]

Page 44: Tutorial - Intercon 2014

Subband Coding


[Figure: sub-band coding system. The input is fed to a bank of M analysis filters; each filter output is downsampled by M and passed to its own entropy encoder. At the decoder, each sub-band is entropy decoded, upsampled by M and filtered by the corresponding synthesis filter, and the filter outputs are combined to form the output.]
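A minimal two-band (M = 2) version of this system using Haar analysis and synthesis filters; the per-band entropy coders are omitted, and the Haar filters are just one simple choice of analysis/synthesis bank:

```python
import numpy as np

def haar_analysis(x: np.ndarray):
    """Split an even-length signal into low-pass and high-pass sub-bands, each downsampled by 2."""
    x = x.astype(float)
    low = (x[0::2] + x[1::2]) / np.sqrt(2)    # averages: most of the signal energy
    high = (x[0::2] - x[1::2]) / np.sqrt(2)   # differences: usually small and cheap to code
    return low, high

def haar_synthesis(low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Recombine the two sub-bands (upsampling plus synthesis filtering)."""
    x = np.empty(2 * len(low))
    x[0::2] = (low + high) / np.sqrt(2)
    x[1::2] = (low - high) / np.sqrt(2)
    return x

x = np.array([61, 63, 58, 69, 64, 60, 57, 59], dtype=float)
low, high = haar_analysis(x)
print(np.round(low, 2), np.round(high, 2))        # the energy concentrates in the low band
print(np.allclose(haar_synthesis(low, high), x))  # True: perfect reconstruction
```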

Page 45: Tutorial - Intercon 2014

Further Reading


• Khalid Sayood, Introduction to Data Compression, 4th edition, Morgan Kaufmann, San Francisco, 2012.

• G. Held and T. R. Marshall, Data Compression, 3rd edition, John Wiley and Sons, New York, 1991.

• N. S. Jayant and P. Noll, Digital Coding of Waveforms, Prentice Hall, Englewood Cliffs, 1984.

• B. E. Usevitch, “A tutorial on modern lossy wavelet image compression: foundations of JPEG 2000,” IEEE Signal Processing Magazine, vol. 18, no. 5, 2001.

• D. Pan, “Digital audio compression,” Digital Technical Journal, vol. 5, no. 2, 1993.

• M. Hans and R. W. Schafer, “Lossless compression of digital audio,” IEEE Signal Processing Magazine, vol. 18, no. 4, 2001.

• G. E. Blelloch, Introduction to Data Compression, course notes, Computer Science Department, Carnegie Mellon University