Compression - ITUitu.dk/people/pagh/ads11/12-compression.pdf · • No normal compression method ﬁnds this pattern • Compression models all based on repetition and/or ... C 010

Data Compression
Guest lecture, SGDS Fall 2011

1

Basics

Alphabet compaction

Compression is impossible

Compression is possible

RLE

Variable-length codes

Using phrases

Lossy/lossless

Undecidable

Pigeon-holes

Patterns

Randomness

Huffman

Arithmetic coding

Dynamic context

Ziv-Lempel

Burrows-Wheeler

Suffix sorting

Data compression is not a traditional algorithms course topic, but it is interesting both in itself and as an application of algorithms and data structures. The book covers fragments, not that well chosen from a compression expert's view. This lecture gives a fuller view, with connections to what you learned in the course.

2

Basic model

Basic model for data compression

original bitstream B (0110110101...) → Compress → compressed version C(B) (1101011111...) → Expand → bitstream B (0110110101...)

Original message, consisting of characters, pixels, sound samples, or whatever. In much of the lecture we assume that it consists of characters. But more generally, we can view it as just a stream of bits, because all data representations can be broken down into bits. A compression method consists of two algorithms: compress and expand. It seems impossible that you could get the original back; you would have to throw away some data. And sometimes you do.

3

Lossy

Compressed message

Images, video, sound, …

Compress Expand

If we accept loss, which we can do for some kinds of data, itʼs more believable that we can compress.

4

Lossless

Compressed message

Anything, including text, machine code, …

This lecture (and book): lossless only

Compress Expand

But there are also lossless methods, which reproduce the original exactly. Lossless techniques are also useful in lossy methods: even when accepting loss, you want to represent the exact information that remains as compactly as possible. One case where it is fairly easy to accept that lossless compression is possible is when there are unused bits in B, i.e., it does not store the data as compactly as it could.

5

Easy: alphabet compaction

Genome: string over the alphabet { A, C, T, G }
Encode N-character genome: ATAGATGCATAG…

ASCII bytes (8 bits per char):

01000001010101000100000101000111
01000001010101000100011101000011
01000001010101000100000101000111…

char  encoding
A     01000001
C     01000011
T     01010100
G     01000111

2-bit encoding (2 bits per char):

001000110010110100100011…

char  encoding
A     00
C     01
T     10
G     11

Thatʼs nice, but in general, there are not unused bits.
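The 2-bit table above can be tried out with a small sketch. The function names `pack_genome` and `unpack_genome` are made up for this illustration; only the codeword table comes from the slide, and the number of characters is assumed to be stored separately.

```python
# 2-bit codewords from the slide; the helper names are hypothetical.
CODE = {"A": 0b00, "C": 0b01, "T": 0b10, "G": 0b11}
CHARS = "ACTG"  # position in this string equals the 2-bit codeword


def pack_genome(genome: str) -> bytes:
    """Pack a string over {A, C, T, G} into 2 bits per character."""
    bits = 0
    for ch in genome:
        bits = (bits << 2) | CODE[ch]
    nbytes = (2 * len(genome) + 7) // 8
    # Left-align, so the first character sits in the top bits of byte 0.
    bits <<= 8 * nbytes - 2 * len(genome)
    return bits.to_bytes(nbytes, "big")


def unpack_genome(data: bytes, n: int) -> str:
    """Recover the first n characters from packed data."""
    bits = int.from_bytes(data, "big")
    total = 8 * len(data)
    return "".join(
        CHARS[(bits >> (total - 2 * (i + 1))) & 0b11] for i in range(n)
    )


packed = pack_genome("ATAGATGCATAG")
print(len(packed), "bytes instead of", len("ATAGATGCATAG"))
print(unpack_genome(packed, 12))
```

The 12-character example shrinks from 12 bytes to 3, a factor of four, simply because ASCII wastes six of every eight bits on this alphabet.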

6

But, in general…

• Any representable data may appear

• No superfluous bits to remove

7

Computational formulation

• Compress
  Input: N-bit message B
  Output: Smallest possible program, C(B), that produces B as output (when given no input)

• Expand
  Run C(B), get B.

• Length of C(B) is the Kolmogorov complexity of B

UNDECIDABLE

The most general kind of code is a programming language. Let's say that C(B) is a program that produces B. Let's find the smallest such program. Undecidable: there is no, can be no, algorithm that computes it in general. Generally, one should not be too discouraged. Sometimes a non-general algorithm is useful. But let's make this easier, by requiring not that C(B) is the smallest possible, but just that it is smaller than B.

8

New attempt: skip “smallest possible”

• Compress
  Input: N-bit message B
  Output: N′-bit message C(B), N′ < N

• Expand
  Input: N′-bit message C(B)
  Output: N-bit message B

IMPOSSIBLE

Why is this impossible? The pigeon hole principle applies.

9

B: 2^N possibilities    C(B): 2^N′ possibilities

Compress

Compression means mapping each dot on the left to some dot on the right. Since there are fewer possibilities for C(B) than for B, there are some B1 and B2 for which C(B1) = C(B2). This is easy to see when N is 2 or 3 or so, but don't get fooled: it applies even if N is billions.

10

B: 2^N possibilities    C(B): 2^N′ possibilities

Expand

?

Expand cannot choose between B1 and B2.
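The pigeon-hole argument can be checked by brute force for small N. The sketch below counts messages and candidate outputs, and searches for a collision under one hypothetical "compressor" (dropping the last bit); `find_collision` is a name invented for this example.

```python
from itertools import product


def count_bitstrings(n: int) -> int:
    """Number of distinct bit strings of length n, counted by enumeration."""
    return sum(1 for _ in product("01", repeat=n))


def find_collision(compress, n: int):
    """Return two different n-bit messages with the same compressed form."""
    seen = {}
    for bits in product("01", repeat=n):
        message = "".join(bits)
        compressed = compress(message)
        if compressed in seen:
            return seen[compressed], message
        seen[compressed] = message
    return None


N = 3
messages = count_bitstrings(N)                         # 2^N inputs
shorter = sum(count_bitstrings(k) for k in range(N))   # all outputs with N' < N
print(messages, "messages,", shorter, "shorter bit strings")
print(find_collision(lambda b: b[:-1], N))
```

Even allowing every length N′ < N at once, there are only 2^N − 1 shorter strings, one too few, so any compressor that always shortens must send two messages to the same output.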

11

So, we give up?

Some of the 2^N messages may be illegal. No need to encode them. Even if they are all legal, some are more probable than others.

12

Modified goal

• Compress
  Input: N-bit message B
  Output: N′-bit message C(B)
  N′ < N for most common instances of B
  For less common B: ok if N′ > N

• Expand
  Input: N′-bit message C(B)
  Output: N-bit message B

A little vague, not really a mathematical definition. Need some more information theory to make a formal definition, which is beyond the scope of this lecture.

13

Example

“Mary had a little lamb.”

Text (upper/lower case letters + punctuation), 6 bits/char

23 · 6 bits = 138 bits

Compressed, text can use fewer, say 2.5 bits/char, because text patterns are predictable.

23 · 2.5 bits = 57.5 bits

“hsY, iiMlh kWVsadjh h.j”

23 · 6 bits = 138 bits

“Compressed,” this data (with no predictable patterns) will use more, say 6.8 bits/char.

23 · 6.8 bits = 156.4 bits

LEFT: Without compression, just alphabet compaction, we get 138 bits. Compressed, we might get, e.g., 57.5 bits.

RIGHT: Uncompressed, the same length. “Compressed”, we can allow this to get a little longer.

So, how then? We have to find predictable patterns.

14

141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067982148086513282306647093844609550582231725359408128481117450284102701938521105559644622948954930381964428810975665933446128475648233786783165271201909145648566923460348610454326648213393607260249141273724587006606315588174881520920962829254091715364367892590360011330530548820466521384146951941511609433057270365759591953092186117381932611793105118548074462379962749567351885752724891227938183011949129833673362440656643086021394946395224737190 70217986094370277053921717

“570 first decimals of π”

• No normal compression method finds this pattern

• Compression models all based on repetition and/or skewed distribution

If we donʼt have special knowledge (of π in this case), the message looks random.

15

Randomness

• Message that looks random will not be compressed

• Sequence that is truly random cannot be compressed (pigeon-holes again)

• Maximum-compressed data looks random

“Looking random” depends on the model used. Every compression method has one, explicit or implicit. Now let's look at a message where we can easily see some pattern.

16

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1

“15×0 7×1 7×0 11×1”

1111 0111 0111 1011

• Run-length encoding (RLE)

• How to compress 1 1 1 1 0 0 0 1 1 ?

• How to compress 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ?

If you were just describing this bit sequence, how would you do it? Let's use that as a compression format. To make it into a bit string, we need to encode the numbers in binary too (next slide). This compressed 40 bits into 16 bits. Runs alternate and always start with a run of 0s, so 111100011 becomes 0000 0100 0011 0010. With a maximum run length of 15, 00000000000000000 becomes 1111 0000 0010. What compression do we generally get with this method?
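A minimal sketch of this format, assuming 4-bit run lengths and runs that alternate starting with 0s, as in the examples above. The function names are chosen for this illustration, and the spaces between 4-bit groups are only there for readability.

```python
def rle_compress(bits: str, count_bits: int = 4) -> str:
    """Encode alternating runs (starting with 0s) as fixed-width counts."""
    if set(bits) - {"0", "1"}:
        raise ValueError("input must consist of 0s and 1s only")
    max_run = (1 << count_bits) - 1
    groups = []
    expected = "0"
    i = 0
    while i < len(bits):
        run = 0
        while i < len(bits) and bits[i] == expected and run < max_run:
            run += 1
            i += 1
        # A run of length 0 is emitted when the other bit value continues.
        groups.append(format(run, f"0{count_bits}b"))
        expected = "1" if expected == "0" else "0"
    return " ".join(groups)


def rle_expand(code: str) -> str:
    """Inverse of rle_compress: rebuild the runs from their counts."""
    out = []
    bit = "0"
    for group in code.split():
        out.append(bit * int(group, 2))
        bit = "1" if bit == "0" else "0"
    return "".join(out)


message = "0" * 15 + "1" * 7 + "0" * 7 + "1" * 11
print(rle_compress(message))
print(rle_compress("111100011"))
print(rle_compress("0" * 17))
```

The zero-length run is what lets the format handle a message that starts with 1 and a run longer than 15, at the cost of 4 extra bits each time.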

17

Decimal  Binary
0        0000
1        0001
2        0010
3        0011
4        0100
5        0101
6        0110
7        0111
8        1000
9        1001
10       1010
11       1011
12       1100
13       1101
14       1110
15       1111

18

RLE compression efficiency

What sequence gives best compression?

“15×0 15×1 15×0 15×1 15×0 15×1…”

4/15 ≈ 0.26667 bits/bit

Worst compression?

“0101010101010101010101010101…”

4/1 = 4 bits/bit

More (than 4) bits for lengths: better best case, worse worst case

Used as a component in some systems, but not a good general compression scheme. Let's look at a text example, with a more intricate pattern.

19

“ABRACADABRA!”

First attempt: alphabet compaction

char encoding

A 000

B 001

C 010

D 011

R 100

! 101

12 · 3 bits = 36 bits

Encoding: 000001100000010000011000001100000101. But do we have to have the same number of bits for all characters?

20

“ABRACADABRA!”

Can a variable-length code work?

char encoding

A 0

B 1

C 01

D 10

R 00

! 11

Encoding: 01000010100100011

Won't work! (Why not?)

Yes, a variable-length code can work, if it is prefix-free.

21

“ABRACADABRA!”

Try variable lengths, with short codewords for common characters.

30 bits total, less than 36!

char encoding

A 0

B 1111

C 110

D 100

R 1110

! 101

So, we seem to have found a trick. Letʼs look at a more intuitive way to represent this code.
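To see why the first variable-length table fails and the second works, here is a small sketch. The two tables are the ones from the slides (with B = 1111, the codeword consistent with the 30-bit total); `is_prefix_free` and `count_parses` are names invented for this example.

```python
AMBIGUOUS = {"A": "0", "B": "1", "C": "01", "D": "10", "R": "00", "!": "11"}
PREFIX_FREE = {"A": "0", "B": "1111", "C": "110", "D": "100", "R": "1110", "!": "101"}


def is_prefix_free(code: dict) -> bool:
    """True if no codeword is a prefix of another codeword."""
    words = sorted(code.values())
    # In sorted order, a word that prefixes any other word also
    # prefixes its immediate successor, so adjacent pairs suffice.
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))


def count_parses(bits: str, code: dict) -> int:
    """Number of different ways to split bits into codewords."""
    ways = [1] + [0] * len(bits)
    for end in range(1, len(bits) + 1):
        for word in code.values():
            if len(word) <= end and bits[end - len(word):end] == word:
                ways[end] += ways[end - len(word)]
    return ways[-1]


print(is_prefix_free(AMBIGUOUS), count_parses("01", AMBIGUOUS))
encoded = "".join(PREFIX_FREE[ch] for ch in "ABRACADABRA!")
print(is_prefix_free(PREFIX_FREE), len(encoded), count_parses(encoded, PREFIX_FREE))
```

With the first table, the two bits 01 can be read as either "AB" or "C", so expand cannot know which was meant. With the prefix-free table every bit string has at most one reading.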

22

Tree representation

Two prefix-free codes

Code 1 (30 bits)

Codeword table:

key  value
!    101
A    0
B    1111
C    110
D    100
R    1110

Compressed bitstring:

011111110011001000111111100101
A B R A C A D A B R A !

[Figure: trie representation of code 1]

Code 2 (29 bits)

Codeword table:

key  value
!    101
A    11
B    00
C    010
D    100
R    011

Compressed bitstring:

11000111101011100110001111101
A B R A C A D A B R A !

[Figure: trie representation of code 2]

Compress
Start at leaf; follow path up to the root; print bits in reverse.

Expand
Start at root. Go left if bit is 0; go right if 1. If leaf node, print char and return to root.

But: How do we find the best code?

Code in the book. Now, how do we make best use of this trick?
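The expand procedure on the trie can be sketched as follows. This is not the book's code: the trie is a plain nested dictionary, and compress uses a table lookup instead of walking up from a leaf, which produces the same bits. The two tables are the 30-bit and 29-bit codes from the slide.

```python
CODE_30 = {"!": "101", "A": "0", "B": "1111", "C": "110", "D": "100", "R": "1110"}
CODE_29 = {"!": "101", "A": "11", "B": "00", "C": "010", "D": "100", "R": "011"}


def build_trie(code: dict) -> dict:
    """Nested dicts keyed by '0'/'1'; a leaf is the character itself."""
    root = {}
    for ch, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = ch
    return root


def compress(text: str, code: dict) -> str:
    return "".join(code[ch] for ch in text)


def expand(bits: str, root: dict) -> str:
    out = []
    node = root
    for bit in bits:
        node = node[bit]           # go left on 0, right on 1
        if isinstance(node, str):  # leaf: print char, return to root
            out.append(node)
            node = root
    return "".join(out)


for code in (CODE_30, CODE_29):
    bits = compress("ABRACADABRA!", code)
    print(len(bits), bits, expand(bits, build_trie(code)))
```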

23

Huffman code

Count frequencies of characters
Make a set with one node for each letter
Extract two nodes with smallest frequency
Combine them, with new node as root
Add new root node to set
Repeat, until only one node

(Optimality proof: see book.)
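The steps above can be sketched with a binary min-heap from Python's standard library. This is an illustration rather than the book's implementation; the tie-breaking counter is an implementation detail added here so that the heap never has to compare tree nodes, and different tie-breaking can give different (equally good) codes.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(text: str) -> dict:
    """Build a Huffman code (char -> bit string) for the given text."""
    freq = Counter(text)
    if not freq:
        return {}
    tie = count()  # unique sequence numbers break frequency ties
    heap = [(f, next(tie), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate one-symbol message
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two smallest frequencies
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    code = {}

    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a character
            code[node] = prefix

    walk(heap[0][2], "")
    return code


text = "ABRACADABRA!"
code = huffman_code(text)
print(code)
print(sum(len(code[ch]) for ch in text), "bits")
```

For "ABRACADABRA!" the total is 28 bits, one less than the 29-bit code shown earlier.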

24

Huffman code

Huffman code construction for A B R A C A D A B R A !

[Figure: Huffman trie with leaves A, B, C, D, R, !; each node is labeled with its frequency and each edge with 0 or 1.]

char freq

A 5

B 2

C 1

D 1

R 2

! 1

Little red numbers are frequencies.

25

Huffman code

• Compress N characters, alphabet size R

• Data structure(s)?

• Time complexity?

Count frequencies: N

Build binary min-heap, based on freq.: R

(R−1) steps, each extracting two nodes and inserting one: R lg R

Alt.: use two FIFO queues Q1, Q2
Sort on freq., insert into Q1 in freq. order: sort-time(R, values 0…N)
Min-freq. node is always next to get from either Q1 or Q2
Insert new nodes in Q2
sort-time(R, values 0…N) = R lg R? No, key-indexed sorting can normally get it down to R.
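A sketch of the two-queue alternative, under the assumption that merged nodes are created in non-decreasing frequency order, so Q2 stays sorted without any extra work. For brevity the leaves are sorted with Python's comparison sort here rather than key-indexed counting, and the function returns codeword lengths only; the name is invented for this example.

```python
from collections import Counter, deque


def huffman_lengths_two_queues(text: str) -> dict:
    """Codeword length per character, using two FIFO queues."""
    freq = Counter(text)
    if not freq:
        return {}
    q1 = deque(sorted((f, ch) for ch, f in freq.items()))  # leaves by freq.
    q2 = deque()                                           # merged nodes

    def pop_min():
        # The minimum-frequency node is at the front of Q1 or of Q2.
        if not q2 or (q1 and q1[0][0] <= q2[0][0]):
            return q1.popleft()
        return q2.popleft()

    while len(q1) + len(q2) > 1:
        f1, a = pop_min()
        f2, b = pop_min()
        q2.append((f1 + f2, (a, b)))

    lengths = {}

    def walk(node, depth):
        if isinstance(node, tuple):
            walk(node[0], depth + 1)
            walk(node[1], depth + 1)
        else:
            lengths[node] = depth

    walk((q1 or q2)[0][1], 0)
    return lengths


text = "ABRACADABRA!"
lengths = huffman_lengths_two_queues(text)
print(lengths, sum(Counter(text)[ch] * n for ch, n in lengths.items()), "bits")
```

After the initial sort, each of the R−1 merge steps takes constant time, which is where the improvement over the heap comes from.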

But how does expand know what the encoding is?

26

Compressed message must include code

• Codeword of each character (book)

• … or, frequency of each character. Expand builds tree in same way as compress

• … or, the length of the codeword for each character. Enough info to rebuild tree

• Note: Huffman can automatically compact alphabet

No problem if the alphabet is relatively small. If we don't include characters with zero frequency in the code, we get natural compaction. Many descriptions stop here, as if we had found the optimal way to compress. But that is far from the case.
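The third option, sending only codeword lengths, works if compress and expand agree on a rule for turning lengths into codewords. One common rule is a canonical code; the sketch below assumes that rule (symbols ordered by length, then by character) and uses the lengths of the 30-bit code from the earlier slide. The function name is invented for this example.

```python
def canonical_code(lengths: dict) -> dict:
    """Assign codewords from lengths alone, in (length, symbol) order."""
    table = {}
    code = 0
    prev_len = 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len   # append zeros when the length grows
        table[sym] = format(code, f"0{length}b")
        code += 1
        prev_len = length
    return table


lengths = {"A": 1, "B": 4, "C": 3, "D": 3, "R": 4, "!": 3}
print(canonical_code(lengths))
```

The resulting codewords differ from the ones on the slide, but every character keeps its length, so the compressed size is unchanged. The lengths are assumed to come from a valid prefix-free code.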

27

The curse of whole-bit codewords

char freq encoding

A 990/1000 0

B 7/1000 10

C 3/1000 11

Huffman-encoding characters is not always the best we can do

Example: 1000-char message with highly skewed distribution

Total: 990 · 1 + 7 · 2 + 3 · 2 = 1010 bits

RLE would do better!

How can we do better? One way is to use another alphabet.

28

Use double characters

char  freq (computed)                    encoding
AA    (990/1000)^2 ≈ .98                 0
AB    (990/1000)·(7/1000) ≈ .0069        10
AC    (990/1000)·(3/1000) ≈ .0030        1110
BA    (7/1000)·(990/1000) ≈ .0069        110
BB    (7/1000)^2 ≈ .000049               111110
BC    (7/1000)·(3/1000) ≈ .000021        1111110
CA    (3/1000)·(990/1000) ≈ .0030        11110
CB    (3/1000)·(7/1000) ≈ .000021        11111110
CC    (3/1000)^2 ≈ .000009               11111111

Total: ca 600 bits
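The total can be estimated from the table as an expected value. The sketch below assumes, as the computed frequencies do, that characters are independent, and it uses the pair codewords exactly as listed; `expected_bits` is a name invented for this example.

```python
from fractions import Fraction
from itertools import product

P = {"A": Fraction(990, 1000), "B": Fraction(7, 1000), "C": Fraction(3, 1000)}
SINGLE_CODE = {"A": "0", "B": "10", "C": "11"}
PAIR_CODE = {
    "AA": "0", "AB": "10", "AC": "1110",
    "BA": "110", "BB": "111110", "BC": "1111110",
    "CA": "11110", "CB": "11111110", "CC": "11111111",
}


def expected_bits(n_chars: int, code: dict, group: int) -> float:
    """Expected output size when characters are coded in groups."""
    per_group = Fraction(0)
    for chars in product(P, repeat=group):
        prob = Fraction(1)
        for ch in chars:
            prob *= P[ch]            # independence assumption
        per_group += prob * len(code["".join(chars)])
    return float(per_group * (n_chars // group))


print(expected_bits(1000, SINGLE_CODE, 1))  # single characters
print(expected_bits(1000, PAIR_CODE, 2))    # double characters
```

Under these assumptions the pair code needs roughly 521 bits in expectation for 1000 characters, about half of the 1010 bits for single characters. An actual message can deviate from the expectation.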

29

Keep expanding alphabet…

• Combining three characters, to alphabet size 27, improves precision further. Etc.

• Finally: combining N characters. Message is one single character ⇒ “Arithmetic coding”

• Arithmetic encoder takes one freq. interval at a time, outputs bits as they can be determined.

We do not go into details for how to do arithmetic coding in practice. Just please accept that the problem has a solution.

30

Entropy coding

• Huffman, Shannon-Fano, canonical code, arithmetic coding…

• … techniques exist to output right number of bits, with sufficient precision

• For details, see e.g. Witten, Moffat, & Bell, Managing Gigabytes

31

But, wait a minute…

char  freq (computed)
AA    (990/1000)^2 ≈ .98
AB    (990/1000)·(7/1000) ≈ .0069
AC    (990/1000)·(3/1000) ≈ .0030
BA    (7/1000)·(990/1000) ≈ .0069
BB    (7/1000)^2 ≈ .000049
BC    (7/1000)·(3/1000) ≈ .000021
CA    (3/1000)·(990/1000) ≈ .0030
CB    (3/1000)·(7/1000) ≈ .000021
CC    (3/1000)^2 ≈ .000009

These are clearly not the best frequency estimates

For instance, in English, “Th” is more common than “hT”

We can get more data from the original message

32

Idea 1: Statistics with context

• Example: in English, the letter “u” is not among the most common few…

• … except after “q”, where it is by far the most common!

• Idea: use different frequency tables based on the previous character

33

“ABRACADABRA!”

After A

char  freq
A     0
B     2
C     1
D     1
R     0
!     1

After B

char  freq
A     0
B     0
C     0
D     0
R     2
!     0

After C

char  freq
A     1
B     0
C     0
D     0
R     0
!     0

…

Build, e.g., different Huffman codes for each context
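The per-context frequency tables can be collected in one pass over the message. A minimal sketch, using the previous character as the context; `context_tables` is a name invented for this example.

```python
from collections import Counter, defaultdict


def context_tables(text: str) -> dict:
    """Map each character to a Counter of the characters that follow it."""
    tables = defaultdict(Counter)
    for prev, ch in zip(text, text[1:]):
        tables[prev][ch] += 1
    return tables


tables = context_tables("ABRACADABRA!")
for context in "ABC":
    print("After", context, dict(tables[context]))
```

In "ABRACADABRA!", B is always followed by R and C is always followed by A, so in those contexts the next character is fully predictable, while after A four different characters occur.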

34

More detailed contexts

• Example, after “compres”, “s” is overrepresented

• Use longer strings as context: those significant in message

• Problem: lots of codes! Need to be included in compressed message?

• Solution: dynamic contexts

35

Context tree for string letlettertele

[Figure: context tree; annotation: context “et” has appeared 2 times]

36

Dynamic context modeling

• Start with just one, or R, contexts. Entries in frequency tables equal

• Add contexts and update statistics by one character at a time

• Build exactly the same way in expand as in compress. No code needs to be included in the compressed message!

• Prediction by partial matching (PPM), Dynamic Markov Compression (DMC)

• Good compression properties, but take much computation in both compress and expand
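The simplest case of the bullets above, a single context whose frequency table starts with equal entries, can be sketched without a full coder by summing the ideal code length −lg p of each character under the current statistics. That an entropy coder gets close to this sum is an assumption of the sketch, and the function name is invented for this example.

```python
from math import log2


def adaptive_cost_bits(text: str, alphabet: str) -> float:
    """Ideal code length under a one-context model updated per character."""
    counts = {ch: 1 for ch in alphabet}   # equal entries to start with
    total_bits = 0.0
    for ch in text:
        p = counts[ch] / sum(counts.values())
        total_bits += -log2(p)            # cost of ch under current model
        counts[ch] += 1                   # expand makes the same update
    return total_bits


print(round(adaptive_cost_bits("ABRACADABRA!", "ABCDR!"), 2))
```

Expand can mirror every update because it decodes each character before the table changes, so only the alphabet has to be agreed on in advance. For this message the cost is about 32.5 bits, with no code table to transmit.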

37

Idea 2: Build dictionaries

• Instead of individual characters, encode “phrases”

• Computationally simpler than statistical modeling

• Less sensitive to lack of precision in bit codes (alphabet is large)

• Dictionary methods are equivalent to (weird) special cases of statistical models

38

LZ77

Compressed message consists of triples <pos, length, next>

pos: position (counting backwards) of phrase
length: number of characters in phrase
next: first character after phrase

39

<0,0,a> <0,0,b> <2,1,a> <3,2,b> <5,3,b> <6,6,b>

Expand: abaababaabbabaabbb

Considered impractical for years, because scanning for the longest string during compression takes N^2 time…

… but does it?

Design a compression algorithm! Data structures? Time complexity?
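As a baseline for the exercise, here is a sketch of expand together with the naive quadratic compress that the slide calls impractical. It assumes an unbounded window and greedy longest match, and that every triple carries a next character; practical LZ77 variants differ in these details.

```python
def lz77_expand(triples) -> str:
    out = []
    for pos, length, nxt in triples:
        start = len(out) - pos
        for k in range(length):
            # Copy one character at a time, so a phrase may overlap itself.
            out.append(out[start + k])
        out.append(nxt)
    return "".join(out)


def lz77_compress_naive(text: str) -> list:
    """Greedy longest match by scanning every earlier start position."""
    triples = []
    n = len(text)
    i = 0
    while i < n:
        best_pos, best_len = 0, 0
        for j in range(i):
            k = 0
            # Stop one short of the end: a next character must remain.
            while i + k < n - 1 and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_pos, best_len = i - j, k
        triples.append((best_pos, best_len, text[i + best_len]))
        i += best_len + 1
    return triples


slide = [(0, 0, "a"), (0, 0, "b"), (2, 1, "a"), (3, 2, "b"), (5, 3, "b"), (6, 6, "b")]
print(lz77_expand(slide))
print(lz77_compress_naive("abaababaabbabaabbb"))
```

The inner scan over all earlier positions is the N^2 part; replacing it with a suitable string data structure is the point of the exercise.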

40

Idea 3: Block sorting

• Group characters in the output according to their contexts

• More similar contexts, closer together

• Generates repetitions that are easier to compress

41

Idea 3: Block sorting

• In chunk of message, sort all strings (contexts)

• Encode characters in their sorted-context order, lots of repetition

• Then compress with RLE and/or move to front

• Remarkably, it’s easy to get original order back!

• Burrows–Wheeler transform (BWT)

Contexts are strings, so we can use string sorting for grouping/ordering.
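The move-to-front step mentioned above can be sketched as follows. It turns local repetition into small numbers, which a later RLE or entropy coding stage can exploit. The initial table order (here the sorted alphabet) is an assumption that compress and expand must share.

```python
def mtf_encode(text: str, alphabet: str) -> list:
    table = list(alphabet)
    out = []
    for ch in text:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))   # move the character to the front
    return out


def mtf_decode(indices: list, alphabet: str) -> str:
    table = list(alphabet)
    out = []
    for i in indices:
        ch = table.pop(i)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


print(mtf_encode("caraab", "abcr"))
print(mtf_encode("aaaaabbaabbaaac$ab", "$abc"))
```

A character that repeats immediately is always encoded as 0, so runs in the block-sorted output become runs of zeros.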

42

Note on backward contexts

• String after a character works as context (just as well as string before)

• After “compres”, “s” is overrepresented…

• … before “ompress”, “c” is overrepresented

43

“abraca”

Sort rotations
Encode row of original message
Encode last characters in rows

Transformed message: <1, “caraab”>

row  sorted rotation  last character
0    aabrac           c
1    abraca           a
2    acaabr           r
3    bracaa           a
4    caabra           a
5    racaab           b
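The three steps can be written down directly. This sketch builds all rotations explicitly, which takes quadratic space and is only meant to mirror the slide; the message is assumed to be non-empty.

```python
def bwt(message: str):
    """Return (row of the original message, last characters of sorted rows)."""
    n = len(message)
    rotations = sorted(message[i:] + message[:i] for i in range(n))
    row = rotations.index(message)
    last = "".join(r[-1] for r in rotations)
    return row, last


print(bwt("abraca"))
```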

44

Expand

Input: <1, “caraab”>. The received characters are the last column; sorting them gives the first column.

row  first  last
0    a      c
1    a      a
2    a      r
3    b      a
4    c      a
5    r      b

Rotated (last character followed by first character of each row): ca aa ra ab ac br

These pairs are sorted on their second character.

T = 4 0 5 1 2 3 (T[i] is the row that begins with the last character of row i)

Starting at row 1 and following T, the message is rebuilt back to front:

a
ca
aca
raca
braca
abraca

Expand is quick, linear time. Compress is heavier, because of rotation sorting.
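A sketch of the expand step for the example <1, "caraab">. The vector T is computed with a counting pass instead of an explicit sort; the function names are invented for this example.

```python
def bwt_t_vector(last: str) -> list:
    """T[i] = row that begins with the last character of row i."""
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    starts, total = {}, 0
    for ch in sorted(counts):      # first column = sorted last column
        starts[ch] = total
        total += counts[ch]
    seen = {}
    t = []
    for ch in last:
        # Equal characters keep their relative order (stable).
        t.append(starts[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    return t


def inverse_bwt(row: int, last: str) -> str:
    t = bwt_t_vector(last)
    out = []
    for _ in range(len(last)):
        out.append(last[row])      # characters come out back to front
        row = t[row]
    return "".join(reversed(out))


print(bwt_t_vector("caraab"))
print(inverse_bwt(1, "caraab"))
```

Apart from sorting the distinct characters of the alphabet, both passes are linear in the message length.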

54

Rotation sorting ≈ suffix sorting

• Add implicit last character $, smallest in alphabet

• Sorting rotations of abraca$ = sorting suffixes of abraca$

55

Suffix sorting

Input (positions 0 to 17): b a b a a a a b c b a b a a a a a $

Suffixes:

0   b a b a a a a b c b a b a a a a a $
1   a b a a a a b c b a b a a a a a $
2   b a a a a b c b a b a a a a a $
3   a a a a b c b a b a a a a a $
4   a a a b c b a b a a a a a $
5   a a b c b a b a a a a a $
6   a b c b a b a a a a a $
7   b c b a b a a a a a $
8   c b a b a a a a a $
9   b a b a a a a a $
10  a b a a a a a $
11  b a a a a a $
12  a a a a a $
13  a a a a $
14  a a a $
15  a a $
16  a $
17  $

Sorted suffixes:

17  $
16  a $
15  a a $
14  a a a $
13  a a a a $
12  a a a a a $
3   a a a a b c b a b a a a a a $
4   a a a b c b a b a a a a a $
5   a a b c b a b a a a a a $
10  a b a a a a a $
1   a b a a a a b c b a b a a a a a $
6   a b c b a b a a a a a $
11  b a a a a a $
2   b a a a a b c b a b a a a a a $
9   b a b a a a a a $
0   b a b a a a a b c b a b a a a a a $
7   b c b a b a a a a a $
8   c b a b a a a a a $

BWT output (the character preceding each sorted suffix):

a a a a a b b a a b b a a a c $ a b

taken from positions 16 15 14 13 12 11 2 3 4 9 0 5 10 1 8 17 6 7

Space is linear, but sorting “sees” quadratic data. Comparisons take linear time. So a comparison-based algorithm has worst-case order of growth N^2 lg N.
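The naive comparison-based approach fits in a few lines. The sketch assumes the text ends with a single $ that compares smaller than every other character (true for letters in ASCII order), and reads the BWT output off as the character before each sorted suffix.

```python
def suffix_array_naive(text: str) -> list:
    """Start positions of all suffixes, in sorted order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text: str, sa: list) -> str:
    # text[i - 1] wraps around to the final '$' for the suffix at 0.
    return "".join(text[i - 1] for i in sa)


text = "babaaaabcbabaaaaa$"
sa = suffix_array_naive(text)
print(sa)
print(bwt_from_suffix_array(text, sa))
```

Each comparison here may scan a long common prefix, and in this Python version every key is a copied suffix, so neither the time nor the space bound of a careful implementation is met.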

56

Suffix sorting time complexity

• Naive: at least N^2 in the worst case

• Prefix doubling: N lg N

• Suffix tree, recursive: N

Suffix sorting is the computationally heaviest part of BWT. Specialized methods exist that improve on the worst case.
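Prefix doubling can be sketched as below: suffixes are ranked by their first k characters, and two such ranks combine into a rank for the first 2k characters, so lg N rounds suffice. This version uses Python's comparison sort in each round, which gives N lg^2 N overall; reaching N lg N would need radix sorting of the rank pairs, which is left out here.

```python
def suffix_array_doubling(text: str) -> list:
    n = len(text)
    if n == 0:
        return []
    rank = [ord(ch) for ch in text]   # rank by first character
    sa = list(range(n))
    k = 1
    while True:
        def key(i):
            # Rank of the first k chars, then of the next k chars.
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for a, b in zip(sa, sa[1:]):
            new_rank[b] = new_rank[a] + (key(a) != key(b))
        rank = new_rank
        if rank[sa[-1]] == n - 1:     # all ranks distinct: done
            return sa
        k *= 2


print(suffix_array_doubling("babaaaabcbabaaaaa$"))
```

Comparing two integer pairs takes constant time, so no round ever looks at more than two ranks per suffix, regardless of how long the common prefixes are.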

57
