
Chapter 3: Compression (Part 1)


Storage size for media types (uncompressed)

Text: -ASCII -EBCDIC -BIG5 -GB; about 2 KB per page.

Image: -bitmapped graphics -still photos -fax; a simple 640x480 image takes about 1 MB.

Audio: stream of digitized audio or voice; voice/phone at 8 kHz / 8 bits (mono) is 8 KB/s; audio CD at 44.1 kHz / 16 bits (stereo) is 176 KB/s.

Animation: synched image and music stream at 15-19 frames/s; 6.5 MB/s for 320x640 pixels/frame at 16-bit color and 16 frames/s.

Video: digital video at 25-30 frames/s; 29.7 MB/s for 720x480 pixels/frame at 24-bit color and 30 frames/s.


A typical encyclopedia

50,000 pages of text (2 KB/page): 0.1 GB
3,000 color pictures (on average 640 x 480 x 24 bits = 1 MB/picture): 3 GB
500 maps/drawings/illustrations (on average 640 x 480 x 16 bits = 0.6 MB/map): 0.3 GB
60 minutes of stereo sound (176 KB/s): 0.6 GB
30 animations, on average 2 minutes in duration (640 x 320 x 16 bits/frame = 6.5 MB/s): 23.4 GB
50 digitized movies, on average 1 min. in duration (29.7 MB/s): 87 GB

TOTAL: 114 GB (or 180 CDs, or a stack of CDs about 2 feet high)
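As a quick sanity check, the arithmetic behind these figures can be reproduced directly; this is a minimal sketch using the per-item rates listed above:

# sizes in megabytes (MB), using the per-item figures listed above
text       = 50_000 * 2 / 1024          # 2 KB per page
pictures   = 3_000 * 1.0                # ~1 MB per 640x480x24 picture
maps       = 500 * 0.6                  # ~0.6 MB per 640x480x16 map
sound      = 60 * 60 * 176 / 1024       # 176 KB/s stereo audio
animations = 30 * 2 * 60 * 6.5          # 6.5 MB/s animation
movies     = 50 * 1 * 60 * 29.7         # 29.7 MB/s digital video

total_gb = (text + pictures + maps + sound + animations + movies) / 1024
print(round(total_gb))                  # about 114 (GB, uncompressed)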


Compressing an encyclopedia

Text 1:2

Color Images 1:15

Maps 1:10

Stereo Sound 1:6

Animation 1:50

Motion Video 1:50

[Bar chart: uncompressed vs. compressed size (GB, log scale from 0.01 to 100) for text, color images, maps, stereo sound, animation, and video]

With these ratios, the encyclopedia shrinks from 114 GB to about 2.5 GB.


Compression and encoding

We can view compression as an encoding method which transforms a sequence of bits/bytes into an artificial format such that the output (in general) takes less space than the input.

[Diagram: a compressor/encoder takes "fat data", squeezes out the fat (redundancy), and outputs "slim data"]


Redundancy

Compression is made possible by exploiting redundancy in the source data.

For example, redundancy is found in:
- character distribution
- character repetition
- high-usage patterns
- positional redundancy

(These categories overlap to some extent.)


Symbols and Encoding

We assume a piece of source data that is subject to compression (such as text, image, etc.) is represented by a sequence of symbols. Each symbol is encoded in a computer by a code (or codeword, value), which is a bit string.

Examples:
- English text: characters 'a', 'b', 'c', ... (symbols), ASCII (coding)
- Chinese text: 多媒體 (symbols), BIG5 (coding)
- Image: colors (symbols), RGB values (coding)


Character distribution

Some symbols are used more frequently than others; in English text, 'e' and the space character occur most often.

Fixed-length encoding: use the same number of bits to represent each symbol. With fixed-length encoding, to represent n distinct symbols we need ⌈log2n⌉ bits for each code.


Character distribution

Fixed-length encoding may not be the most space-efficient way to represent an object digitally.

Example: "abacabacabacabacabacabacabac". With fixed-length encoding we need 2 bits for each of the 3 symbols. Alternatively, we could use '1' to represent 'a', and '00' and '01' to represent 'b' and 'c', respectively. How much saving do we achieve? (See the worked check below.)
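One way to answer the question is simply to count bits; a minimal sketch, where the string and the codebook are the ones from this slide:

text = "abac" * 7                             # the 28-symbol example string
codes = {"a": "1", "b": "00", "c": "01"}      # the variable-length codes above

fixed = 2 * len(text)                             # 2 bits for each of 3 symbols
variable = sum(len(codes[ch]) for ch in text)     # 14*1 + 7*2 + 7*2 = 42 bits
print(fixed, variable)                            # 56 vs. 42 bits: a 25% saving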


Character distribution

Technique: whenever symbols occur with different frequencies, use variable-length encoding.

For your interest: in 1838, Samuel Morse and Alfred Vail designed the "Morse code" for the telegraph. Each letter of the alphabet is represented by dots (electric current of short duration) and dashes (three times the duration of a dot), e.g., E ".", Z "--..". Morse and Vail determined the codes by counting the letters in each bin of a printer's type box.


Morse code table

SOS = "... --- ..."


Character distribution

Principle: use fewer bits to represent frequently occurring symbols, and more bits to represent rarely occurring ones.


Character repetition

A string of repetitions of a single symbol occurs, e.g., "aaaaaaaaaaaaaabc".

Example: in graphics, it is common for a whole line of pixels to use the same color.

Technique: run-length encoding. Instead of specifying the exact value of each repeating symbol, we specify the value once, together with the number of times it repeats.


High usage patterns

Certain patterns (i.e., sequences of symbols) occur very often. Example: "qu", "th", ... in English text; "begin", "end", ... in Pascal source code.

Technique: use a single special symbol to represent a common sequence of symbols, e.g., one new symbol standing for 'th'. That is, instead of using 2 codes for 't' and 'h', use only 1 code for 'th'.


Positional redundancy

Sometimes, given the context, symbols can be predicted.

Technique: differential encoding. Predict a value and code only the prediction error; example: differential pulse code modulation (DPCM).

Original samples: 198, 199, 200, 200, 201, 200, 200, 198
Differences:      198,  +1,  +1,   0,  +1,  -1,   0,  -2

Small numbers require fewer bits to represent.
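A minimal sketch of this idea (previous-sample prediction, as in the example above; the function names are illustrative):

def dpcm_encode(samples):
    """Code each sample as the difference from the previous one."""
    prev = 0
    out = []
    for s in samples:
        out.append(s - prev)   # prediction error (previous sample is the prediction)
        prev = s
    return out

def dpcm_decode(diffs):
    """Rebuild the samples by accumulating the differences."""
    prev = 0
    out = []
    for d in diffs:
        prev += d
        out.append(prev)
    return out

print(dpcm_encode([198, 199, 200, 200, 201, 200, 200, 198]))
# [198, 1, 1, 0, 1, -1, 0, -2]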


Categories of compression techniques

Entropy encoding: regards the data stream as a simple digital sequence; its semantics are ignored. Lossless compression.

Source encoding: takes into account the semantics of the data and the characteristics of the specific medium to achieve a higher compression ratio. Lossy compression.


Categories of compression techniques

Hybrid encoding: a combination of the above two.


A categorization

Entropy Encoding
- Huffman Code, Shannon-Fano Code
- Run-Length Encoding
- LZW

Source Coding
- Prediction: DPCM, DM
- Transformation: DCT
- Vector Quantization

Hybrid Coding
- JPEG
- MPEG
- H.261


Information theory

Entropy is a quantitative measure of uncertainty or randomness: the more disordered (less predictable) a source is, the higher its entropy.

According to Shannon, the entropy of an information source S is defined as

H(S) = Σi pi log2(1/pi)

where pi is the probability that symbol Si in S occurs. For notational convenience, we use a set of probabilities to denote S, e.g., S = {0.2, 0.3, 0.5} for a source S of 3 symbols.
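A small sketch of this definition (illustrative; the probability sets are examples used elsewhere in this chapter):

from math import log2

def entropy(probs):
    """Shannon entropy H(S) = sum_i p_i * log2(1/p_i), in bits per symbol."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(entropy([0.2, 0.3, 0.5]))    # about 1.485 bits/symbol
print(entropy([0.5, 0.5]))         # 1.0 bit/symbol (the black/white image later)
print(entropy([0.25, 0.25, 0.5]))  # 1.5 bits/symbol (the grayscale example later)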


Entropy

For any source S of n symbols, 0 = H({1, 0, ..., 0}) ≤ H(S) ≤ H({1/n, ..., 1/n}) = log2n. That is, the entropy of S is bounded below and above by 0 and log2n, respectively.

The lower bound is attained when one symbol occurs with probability 1 and all the others with probability 0; the upper bound is attained when all symbols occur with equal probability.


Self-information

I(Si) = log2(1/pi) is called the self-information associated with the symbol Si.

It indicates the amount of information (surprise or unexpectedness) contained in Si or the number of bits needed to encode it.

Note that entropy can be interpreted as the average self-information of a symbol of the source.


Entropy and Compression

In relation to compression and coding, one can consider an object (such as an image, a piece of text, etc.) as an information source, and the elements in the object (such as pixels, alpha-numeric characters, etc.) as the symbols.

You may also think of the relative occurrence frequency of an element (e.g., a pixel value) as the element's occurrence probability.


Example

Abracadabra!

Symbol Freq. Prob.

A 1 1/12

b 2 2/12

r 2 2/12

a 4 4/12

c 1 1/12

d 1 1/12

! 1 1/12

A piece of text

Corresponding statistics


Example

Symbol (pixel value)  Freq.  Prob.
(value 1)              27    0.21
(value 2)              44    0.34
(value 3)              28    0.22
(value 4)              25    0.20
(value 5)               4    0.03

An image and its corresponding statistics (the five symbols are the distinct pixel values shown on the slide).


Information theory

The entropy is the lower bound for lossless compression: if the probability of each symbol is fixed, then on average a symbol requires at least H(S) bits to encode.

A guideline: the code lengths for different symbols should vary according to the information carried by the symbol. Coding techniques based on that are called entropy encoding.


Information theory

For example, consider an image in which half of the pixels are white and the other half are black: p(white) = 0.5, p(black) = 0.5.

H(S) = 0.5·log2(1/0.5) + 0.5·log2(1/0.5) = 1

Information theory says that each pixel requires 1 bit to encode on average.


Information theory

Q: How about an image in which white occurs 1/4 of the time, black 1/4 of the time, and gray 1/2 of the time?
Ans: Entropy = 0.25·log2(4) + 0.25·log2(4) + 0.5·log2(2) = 1.5. Information theory says that it takes at least 1.5 bits on average to encode a pixel. How: white (00), black (01), gray (1).

Entropy encoding is about wisely using just enough bits to represent the occurrence of the various patterns.


Variable-length encoding

A problem with variable-length encoding is ambiguity in the decoding process. E.g., with 4 symbols a ('0'), b ('1'), c ('01'), d ('10'), the bit stream '0110' can be interpreted as abba, or abd, or cba, or cd.


Prefix code

We wish to encode each symbol into a sequence of 0’s and 1’s so that no code for a symbol is the prefix of the code of any other symbol – the prefix property.

With the prefix property, we are able to decode a sequence of 0’s and 1’s into a sequence of symbols unambiguously.


Average code length

Given:
- a symbol set S = {S1, ..., Sn}, with the probability of Si being pi
- a coding scheme C such that the codeword of Si is ci, with length (in bits) li

The average code length (ACL) is given by

ACL = Σi pi li


Average code length

Example: S = {a, b, c, d, e, f, g, h} with probabilities {0.1, 0.1, 0.3, 0.35, 0.05, 0.03, 0.02, 0.05}.

Coding scheme:

Symbol: a    b    c   d   e    f      g      h
Code:   100  101  00  01  110  11100  11101  1111

ACL = 3(0.1) + 3(0.1) + 2(0.3) + 2(0.35) + 3(0.05) + 5(0.03) + 5(0.02) + 4(0.05) = 2.5
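The same computation as a short sketch, reusing the probabilities and codewords above:

probs = {"a": 0.1, "b": 0.1, "c": 0.3, "d": 0.35,
         "e": 0.05, "f": 0.03, "g": 0.02, "h": 0.05}
codes = {"a": "100", "b": "101", "c": "00", "d": "01",
         "e": "110", "f": "11100", "g": "11101", "h": "1111"}

# average code length: sum over symbols of probability * codeword length
acl = sum(p * len(codes[s]) for s, p in probs.items())
print(acl)   # 2.5 bits per symbol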


Optimal code

An optimal code is one that has the smallest ACL for a given probability distribution of the symbols.


Theorem

For any prefix binary coding scheme C for a symbol set S, we have

ACL = Σi pi li ≥ H(S)

Entropy is thus a lower bound for lossless compression.


Shannon-Fano algorithm

Example: 5 characters are used in a text, with the following statistics:

Symbol: A   B  C  D  E
Count:  15  7  6  6  5

Algorithm (top-down approach):
1. Sort the symbols according to their probabilities.
2. Divide the symbols into two half-sets, each with roughly equal counts.
3. All symbols in the first group are given '0' as the first bit of their codes; '1' for the second group.
4. Recursively perform steps 2-3 on each group, appending the additional bits.
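A minimal Python sketch of this top-down procedure (not the slides' code; it picks the split in step 2 that makes the two groups' counts as equal as possible):

def shannon_fano(counts):
    """Shannon-Fano codes from a {symbol: count} dict (top-down algorithm above)."""
    codes = {}

    def split(items, prefix):
        if len(items) == 1:
            codes[items[0][0]] = prefix or "0"
            return
        total = sum(c for _, c in items)
        # choose the split point that makes the two groups' totals as equal as possible
        best_cut, best_diff, left = 1, None, 0
        for i in range(1, len(items)):
            left += items[i - 1][1]
            diff = abs(2 * left - total)
            if best_diff is None or diff < best_diff:
                best_cut, best_diff = i, diff
        split(items[:best_cut], prefix + "0")   # first group gets '0'
        split(items[best_cut:], prefix + "1")   # second group gets '1'

    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    split(ordered, "")
    return codes

print(shannon_fano({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}))
# {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'} -- the codes derived below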


Shannon-Fano algorithm

Example: G1 = {A, B}; G2 = {C, D, E}
Subdivide G1: G11 = {A}; G12 = {B}
Subdivide G2: G21 = {C}; G22 = {D, E}
Subdivide G22: G221 = {D}; G222 = {E}

Codes after each step:
Step 1:  A: 0..  B: 0..  C: 1..  D: 1..  E: 1..
Step 2:  A: 00   B: 01   C: 1..  D: 1..  E: 1..
Step 3:  A: 00   B: 01   C: 10   D: 11.  E: 11.
Step 4:  A: 00   B: 01   C: 10   D: 110  E: 111


Shannon-Fano algorithm

Result:

Symbol  Count  log2(1/p)  Code  Bits used
A       15     1.38       00    30
B        7     2.48       01    14
C        6     2.70       10    12
D        6     2.70       110   18
E        5     2.96       111   15
Total   39                      89

ACL = 89/39 = 2.28 bits per symbol; H(S) = 2.19


Shannon-Fano algorithm

The coding scheme in a tree representation:

[Code tree: the root (weight 39) splits 0/1 into a node of weight 22 holding A (15) and B (7), and a node of weight 17 holding C (6) and a subtree of weight 11 holding D (6) and E (5)]

The total number of bits used for encoding is 89. A fixed-length coding scheme would need 3 bits per symbol, i.e., a total of 117 bits.


Shannon-Fano algorithm

It can be shown that the ACL of a code generated by the Shannon-Fano algorithm is between H(S) and H(S) + 1. Hence it is fairly close to the theoretical optimal value.


Huffman code

Huffman coding is used in many compression programs, e.g., zip.

Algorithm (bottom-up approach), given n symbols:
- Start with a forest of n trees; each tree has only one node, corresponding to one symbol.
- Define the weight of a tree as the sum of the probabilities of all the symbols in the tree.
- Repeat the following until the forest has only one tree left:
  - pick the 2 trees of lightest weight, say T1 and T2;
  - replace T1 and T2 by a new tree T such that T1 and T2 are T's children.
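A minimal Python sketch of this bottom-up procedure (illustrative, not the slides' own code; frequencies are taken from the running A-E example):

import heapq
from itertools import count

def huffman_codes(freqs):
    """Build Huffman codes from a {symbol: frequency} dict."""
    tick = count()  # unique tie-breaker so the heap never compares tree nodes
    # a tree is (weight, order, symbol, left, right); leaves carry a symbol
    heap = [(w, next(tick), sym, None, None) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        t1 = heapq.heappop(heap)                              # two lightest trees...
        t2 = heapq.heappop(heap)
        heapq.heappush(heap, (t1[0] + t2[0], next(tick), None, t1, t2))  # ...merged
    root = heap[0]

    codes = {}
    def walk(node, prefix):
        _, _, sym, left, right = node
        if sym is not None:               # leaf: record its code
            codes[sym] = prefix or "0"
        else:
            walk(left, prefix + "0")      # left branch -> '0'
            walk(right, prefix + "1")     # right branch -> '1'
    walk(root, "")
    return codes

print(huffman_codes({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}))
# {'A': '0', 'E': '100', 'C': '101', 'D': '110', 'B': '111'}
# same code lengths as the tree on the later slides; labels within a length may differ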


Huffman code

Building the tree for the A-E example (weights are counts out of 39):

Start:   a(15/39)  b(7/39)  c(6/39)  d(6/39)  e(5/39)
Step 1:  merge the two lightest trees, e(5/39) and d(6/39), into a tree of weight 11/39
Step 2:  merge the next two lightest, c(6/39) and b(7/39), into a tree of weight 13/39


Huffman code

Step 3:  merge the trees of weights 11/39 and 13/39 into a tree of weight 24/39
(The final step merges this 24/39 tree with a(15/39) to give the complete tree of weight 39/39.)


Huffman code

The resulting code tree for the previous example (shown on the slide): 'a' sits directly under the root, while 'b', 'c', 'd', 'e' are leaves at depth 3.


Huffman code

The code of a symbol is obtained by traversing the tree from the root to the symbol, taking a ‘0’ when a left branch is followed; and a ‘1’ when a right branch is followed.


Huffman code

Symbol  Code  Bits/symbol  # occurrences  # bits used
A       0     1            15             15
B       100   3             7             21
C       101   3             6             18
D       110   3             6             18
E       111   3             5             15
                            Total:        87 bits

"ABAAAE" -> "0100000111"



Huffman code (advantages)

Decoding is trivial as long as the coding table is stored before the (coded) data. Overhead (i.e., the coding table) is relatively negligible if the data file is big.

Prefix code.


Huffman code (advantages)

"Optimal code size" (among all coding schemes that use a fixed code for each symbol).

Entropy:
H(S) = (15·1.38 + 7·2.48 + 6·2.70 + 6·2.70 + 5·2.96) / 39 = 85.26/39 = 2.19 bits/symbol

Average bits/symbol needed by the Huffman code: 87/39 = 2.23


Huffman code (disadvantages)

Need to know the symbol frequency distribution in advance. This might be acceptable for English text, but not for arbitrary data files; it may require two passes over the data.

The frequency distribution may also change over the course of the source.


Run-length encoding (RLE)

Image, video and audio data streams often contain sequences of the same bytes or symbols, e.g., background in images, and silence in sound files.

Substantial compression can be achieved by replacing a repeated symbol sequence with the symbol and its number of occurrences.


Run-length encoding

For example, if a symbol occurs at least 4 consecutive times, the run can be encoded as the symbol, followed by a special flag (i.e., a special symbol), followed by the number of occurrences, e.g.,

Uncompressed: ABCCCCCCCCDEFGGGHI
Compressed:   ABC!8DEFGGGHI

If 4 Cs (CCCC) are encoded as C!0, 5 Cs as C!1, and so on, then runs of 4 to 259 symbols compress into 2 symbols plus a count.
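A minimal sketch of this flavor of RLE (assumptions: '!' is the escape flag, runs shorter than 4 are copied verbatim, and the count is stored directly rather than offset by 4):

def rle_encode(text, flag="!", min_run=4):
    """Replace runs of >= min_run identical symbols with symbol + flag + count."""
    out = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                       # extend the current run
        run = j - i
        if run >= min_run:
            out.append(f"{text[i]}{flag}{run}")
        else:
            out.append(text[i] * run)    # short runs are copied verbatim
        i = j
    return "".join(out)

print(rle_encode("ABCCCCCCCCDEFGGGHI"))  # ABC!8DEFGGGHI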


Run-length encoding

Problems:
- Ineffective with text (except, perhaps, "Ahhhhhhhhhhhhhhhhh!" or "Noooooooooo!").
- Needs an escape symbol ('!'): OK for text, but not for arbitrary binary data, where every byte value may occur. Solution?

RLE is used in ITU-T Group 3 fax compression, in combination with Huffman coding, achieving compression ratios of 10:1 or better.


Lempel-Ziv-Welch algorithm

Method due to Ziv, Lempel in 1977, 1978. Improvement by Welch in 1984. Hence, named LZW compression.

used in UNIX compress, GIF, and V.42bis compression modems.


Lempel-Ziv-Welch algorithm

Motivation: suppose we want to encode a piece of English text. We first encode Webster's English Dictionary, which contains about 159,000 entries, so 18 bits suffice to index any word. Then why don't we just represent each word in the text as an 18-bit number, and so obtain a compressed version of the text?


LZW

Dictionary (excerpt):  a: 34   simple: 278   example: 1022

Input:  "A simple example"  (16 symbols)
Output: 34 278 1022

Using ASCII: 16 symbols, 7 bits each = 112 bits
Using the dictionary approach: 3 codes, 18 bits each = 54 bits


LZW

Problems:
- Need to preload a copy of the dictionary before encoding and decoding.
- The dictionary is huge: 159,000 entries.
- Most of the dictionary entries are not used, and many words in the text cannot be found in the dictionary.

Solution: build the dictionary on-line, while encoding and decoding. The dictionary is built adaptively: words not used in the text are never incorporated into the dictionary.


LZW

Example: encode “^WED^WE^WEE^WEB^WET” using LZW.

Assume each symbol (character) is encoded by an 8-bit code (i.e., the codes for the characters run from 0 to 255). We assume that the dictionary is loaded with these 256 codes initially (the base dictionary).

While scanning the string, sub-string patterns are added to the dictionary. Patterns that appear again later are then substituted by their index into the dictionary.


Example 1 (compression)

Input: ^WED^WE^WEE^WEB^WET

Base dictionary (partial):  ^ = 16,  W = 23,  E = 40,  D = 77,  B = 254,  T = 95

Buffer contents                    New dict entry  Index  Output
^, ^W                              ^W              256    16 (^)
W, WE                              WE              257    23 (W)
E, ED                              ED              258    40 (E)
D, D^                              D^              259    77 (D)
^, ^W (seen at 256), ^WE           ^WE             260    256 (^W)
E, E^                              E^              261    40 (E)
^, ^W, ^WE (seen at 260), ^WEE     ^WEE            262    260 (^WE)
E, E^ (seen at 261), E^W           E^W             263    261 (E^)
W, WE (seen at 257), WEB           WEB             264    257 (WE)
B, B^                              B^              265    254 (B)
^, ^W, ^WE (seen at 260), ^WET     ^WET            266    260 (^WE)
T (end of input)                                          95 (T)


LZW (encoding rules)

Rule 1: whenever the pattern in the buffer is in the dictionary, we try to extend it.

Rule 2: whenever a new pattern is formed in the buffer, we add the pattern to the dictionary.

Rule 3: whenever a pattern is added to the dictionary, we output the codeword of the pattern before the last extension.


Example 1

Original: 19 symbols, 8-bit codes: 19 × 8 = 152 bits
LZW'ed: 12 codes, 9 bits each (9 bits because the dictionary grows beyond 256 entries): 12 × 9 = 108 bits

Some compression is achieved; the effect is more substantial for long messages.

The dictionary is not explicitly stored or sent to the decoder. In a sense, the compressor encodes the dictionary in the output stream implicitly.


LZW decompression

Input: "{16}{23}{40}{77}{256}{40}{260}{261}{257}{254}{260}{95}"

Assume the decoder also knows the first 256 entries of the dictionary (the base dictionary):  ^ = 16,  W = 23,  E = 40,  D = 77,  B = 254,  T = 95

While scanning the code stream, the dictionary is rebuilt. When an index is read, the corresponding dictionary entry is retrieved and spliced into position.

The decoder deduces what the encoder's buffer has gone through and rebuilds the dictionary accordingly.


Example 1 (decompression)

Input codes: 16 23 40 77 256 40 260 261 257 254 260 95

Codes in buffer  New dict entry  Index  Output
16                                      ^
16, 23           ^W              256    W
23, 40           WE              257    E
40, 77           ED              258    D
77, 256          D^              259    ^W
256, 40          ^WE             260    E
40, 260          E^              261    ^WE
260, 261         ^WEE            262    E^
261, 257         E^W             263    WE
257, 254         WEB             264    B
254, 260         B^              265    ^WE
260, 95          ^WET            266    T


LZW (decoding rule)

Whenever there are two codes C1, C2 in the buffer, add a new entry to the dictionary: the symbols of C1 followed by the first symbol of C2.


Compression code:

w = NIL;
while ( read a symbol k ) {
    if ( wk exists in dictionary )
        w = wk;                    // extend the current pattern
    else {
        output code for w;         // emit the longest known pattern
        add wk to dictionary;      // remember the new pattern
        w = k;                     // restart from the last symbol read
    }
}

Decompression code:

read a code k;
output symbol represented by k;
w = k;
while ( read a code k ) {
    ent = dictionary entry for k;
    output symbol(s) in entry ent;
    add w + ent[0] to dictionary;  // w plus the first symbol of ent
    w = ent;
}
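For reference, a runnable Python sketch of the same two procedures (an illustration rather than the slides' code, using the full 8-bit character set as the base dictionary; it also handles the corner case where the decoder sees a code that is not yet in its dictionary, which the pseudocode above glosses over):

def lzw_compress(text):
    """LZW compression over single characters; returns a list of integer codes."""
    dictionary = {chr(i): i for i in range(256)}      # base dictionary
    next_code = 256
    w, out = "", []
    for k in text:
        if w + k in dictionary:
            w = w + k                                 # keep extending the pattern
        else:
            out.append(dictionary[w])                 # emit longest known pattern
            dictionary[w + k] = next_code             # learn the new pattern
            next_code += 1
            w = k
    if w:
        out.append(dictionary[w])
    return out

def lzw_decompress(codes):
    """Rebuild the text from LZW codes, reconstructing the dictionary on the fly."""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    w = dictionary[codes[0]]
    out = [w]
    for k in codes[1:]:
        if k in dictionary:
            ent = dictionary[k]
        else:                                         # code not in the dictionary yet
            ent = w + w[0]
        out.append(ent)
        dictionary[next_code] = w + ent[0]            # w + first symbol of ent
        next_code += 1
        w = ent
    return "".join(out)

codes = lzw_compress("^WED^WE^WEE^WEB^WET")
print(codes)                 # 12 codes; indexes differ from the slides' toy base dictionary
print(lzw_decompress(codes)) # ^WED^WE^WEE^WEB^WET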


Example 2

ababcbababaaaaaaa


Example 2 (compression)

Input: ababcbababaaaaaaa$  (base dictionary: a = 1, b = 2, c = 3; $ marks the end of input)

Buffer contents            New dict entry  Index  Output
a, ab                      ab              4      1 (a)
b, ba                      ba              5      2 (b)
a, ab, abc                 abc             6      4 (ab)
c, cb                      cb              7      3 (c)
b, ba, bab                 bab             8      5 (ba)
b, ba, bab, baba           baba            9      8 (bab)
a, aa                      aa              10     1 (a)
a, aa, aaa                 aaa             11     10 (aa)
a, aa, aaa, aaaa           aaaa            12     11 (aaa)
a, a$ (end of input)                              1 (a)


LZW: Pros and Cons

LZW (advantages):
- effective at exploiting character-repetition redundancy and high-usage-pattern redundancy
- an adaptive algorithm (no need to learn the statistical properties of the symbols beforehand)
- a one-pass procedure


LZW: Pros and Cons

LZW (disadvantage): messages must be long for effective compression, since it takes time for the dictionary to build up.


LZW (Implementation)

Decompression is generally faster than compression. Why?

Implementations usually use 12-bit codes, i.e., 4096 table entries.

Compression needs a hash table to look up patterns quickly.


LZW and GIF

For a 320 x 200 image that uses 11 colors, create a color map of 11 entries:

Color             Code
[r0, g0, b0]      0
[r1, g1, b1]      1
[r2, g2, b2]      2
...               ...
[r10, g10, b10]   10

Convert the image into a sequence of color codes, one per pixel, and apply LZW to the code stream.


GIF: 5,280 bytes (0.66 bpp)
JPEG: 16,327 bytes (2.04 bpp)

(bpp = bits per pixel, for the 320 x 200 image above)


GIF image


JPEG image