
CPSC 335

Intermediate Information Structures

LECTURE 11 Compression and Huffman Coding

Jon Rokne

Computer Science

University of Calgary

Canada

Modified from Marina’s lectures

Lecture Overview

- Codes and Optimal Codes
- Huffman Coding
- Non-determinism of the algorithm
- Implementations:
  - Singly-linked list
  - Doubly-linked list
  - Recursive top-down
  - Using a heap
- Adaptive Huffman coding


CODES

Consider the following fixed-length code for the letters A through H:

A 000   B 001   C 010   D 011   E 100   F 101   G 110   H 111

With this code, the message BACADAEAFABBAAAGAH is encoded as the string of 54 bits

001000010000011000100000101000001001000000000110000111

It is sometimes advantageous to use variable-length codes, in which different symbols may be represented by different numbers of bits. For example, Morse code does not use the same number of dots and dashes for each letter of the alphabet. In particular, E, the most frequent letter, is represented by a single dot. In general, if our messages are such that some symbols appear very frequently and some very rarely, we can encode data more efficiently (i.e., using fewer bits per message) if we assign shorter codes to the frequent symbols. Consider the following alternative code for the letters A through H:

A 0   B 100   C 1010   D 1011   E 1100   F 1101   G 1110   H 1111

With this code, the same message is encoded as the string

100010100101101100011010100100000111001111

This string contains 42 bits, so it saves more than 20% in space compared with the fixed-length code shown above.
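As a quick check of the bit counts above, here is a small Python sketch; the code tables and the message are taken from the example, and the function name encode is just illustrative:

# Sketch: compare fixed-length and variable-length encodings of the example message.
fixed = {'A': '000', 'B': '001', 'C': '010', 'D': '011',
         'E': '100', 'F': '101', 'G': '110', 'H': '111'}
variable = {'A': '0',    'B': '100',  'C': '1010', 'D': '1011',
            'E': '1100', 'F': '1101', 'G': '1110', 'H': '1111'}

message = "BACADAEAFABBAAAGAH"

def encode(msg, code):
    # Concatenate the codeword of every symbol in the message.
    return ''.join(code[ch] for ch in msg)

print(len(encode(message, fixed)))     # 54 bits
print(len(encode(message, variable)))  # 42 bits (more than 20% smaller)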

Optimal codes

An optimal code minimizes the expected number of bits per encoded symbol: frequent symbols receive short codewords and rare symbols receive longer ones. (The formal treatment on these slides follows Cormen et al.)

Huffman Coding

- The algorithm assigns a codeword to each character in the text according to the characters' frequencies. A codeword is usually represented as a bit string.
- The algorithm starts with a set of single-node trees, one per character, sorted in order of increasing character probability.
- The two trees with the smallest probabilities are then selected and made the left and right subtrees of a new parent node, which combines their probabilities. This step repeats until only one tree remains.
- Finally, 0 is assigned to every left branch of the tree and 1 to every right branch, and the codewords for all leaves (characters) are read off the tree (a short sketch follows below).
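A minimal Python sketch of this construction, assuming the frequencies are given as a dictionary; it keeps the list of partial trees sorted in increasing order and repeatedly combines the two smallest, as described above. The names build_huffman and assign_codes are illustrative, not from the slides:

from itertools import count

def build_huffman(freq):
    tie = count()  # tiebreaker so equal frequencies never compare trees directly
    # Start with one single-node tree per character, sorted by increasing frequency.
    trees = sorted((f, next(tie), ch) for ch, f in freq.items())
    while len(trees) > 1:
        f1, _, t1 = trees.pop(0)      # smallest probability
        f2, _, t2 = trees.pop(0)      # second smallest
        trees.append((f1 + f2, next(tie), (t1, t2)))   # combined tree: (left, right)
        trees.sort()                  # keep the list in increasing order
    return trees[0][2]

def assign_codes(tree, prefix='', codes=None):
    # 0 for left branches, 1 for right branches; leaves are single characters.
    if codes is None:
        codes = {}
    if isinstance(tree, str):
        codes[tree] = prefix or '0'
        return codes
    left, right = tree
    assign_codes(left, prefix + '0', codes)
    assign_codes(right, prefix + '1', codes)
    return codes

codes = assign_codes(build_huffman({'A': 8, 'B': 3, 'C': 1, 'D': 1}))
print(codes)   # the most frequent symbol ('A') gets the shortest codeword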

Pages copied from Cormen et al.

Huffman tree building exercise

Huffman Code Construction (example from UWisc)

Character counts in the text (838 characters in total):

Char:   E    T    A    O    I    N    S    R    H    L    D    C    U
Freq:  125   93   80   76   73   71   65   61   55   41   40   31   27

At each step the two trees with the smallest frequencies are merged into a new tree whose frequency is the sum of its children's:

 1. C(31)  + U(27)     =  58
 2. D(40)  + L(41)     =  81
 3. H(55)  + CU(58)    = 113
 4. R(61)  + S(65)     = 126
 5. N(71)  + I(73)     = 144
 6. O(76)  + A(80)     = 156
 7. DL(81) + T(93)     = 174
 8. CUH(113) + E(125)  = 238
 9. RS(126) + NI(144)  = 270
10. AO(156) + TDL(174) = 330
11. CUHE(238) + RSNI(270)     = 508
12. AOTDL(330) + CUHERSNI(508) = 838 (the root)

Labelling left branches with 0 and right branches with 1 yields the codewords below. The fixed-length code uses 4 bits for every character; the Huffman code uses 3036 bits for the whole text, i.e. 3036/838, about 3.62 bits per character.

Char  Freq  Fixed  Huffman
E     125   0000   110
T      93   0001   011
A      80   0010   000
O      76   0011   001
I      73   0100   1011
N      71   0101   1010
S      65   0110   1001
R      61   0111   1000
H      55   1000   1111
L      41   1001   0101
D      40   1010   0100
C      31   1011   11100
U      27   1100   11101

Total: 838 characters, 4.00 bits/character (fixed) vs. 3.62 bits/character (Huffman).
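As a quick arithmetic check of the 3.62 figure, the frequencies and Huffman codewords from the table above give:

# Verify the average code length claimed above, using the table's codes.
freq = {'E': 125, 'T': 93, 'A': 80, 'O': 76, 'I': 73, 'N': 71, 'S': 65,
        'R': 61, 'H': 55, 'L': 41, 'D': 40, 'C': 31, 'U': 27}
huff = {'E': '110', 'T': '011', 'A': '000', 'O': '001', 'I': '1011',
        'N': '1010', 'S': '1001', 'R': '1000', 'H': '1111', 'L': '0101',
        'D': '0100', 'C': '11100', 'U': '11101'}

total_chars = sum(freq.values())                          # 838
total_bits = sum(freq[c] * len(huff[c]) for c in freq)    # 3036
print(total_bits / total_chars)                           # ~3.62 bits per character (vs. 4 fixed)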

Non-determinism of the Huffman Coding

When two or more trees in the queue have equal probability, either one may be picked first, so different runs (or different tie-breaking rules) can produce different trees and different codewords. All of the resulting codes, however, have the same average codeword length, so they are equally optimal.

For another example, let's encode an excerpt from Michael Jackson's song Bad:

Because I'm bad, I'm bad -- come on
Bad, bad -- really, really bad
You know I'm bad, I'm bad -- you know it
Bad, bad -- really, really bad
You know I'm bad, I'm bad -- come on, you know
Bad, bad -- really, really bad

Thanks to Jeff Boyd, who pointed me to the paper PVRG-MPEG CODEC 1.1 by Andy C. Hung, from which 4 slides have been taken.


The frequency of words in the song Bad.


The Huffman tree for the lyrics to Bad


The Huffman codes for the words in Bad.

Huffman Algorithm Implementation – Linked List

- The implementation depends on how the priority queue is represented; the queue must support removing the two smallest probabilities and inserting the new probability in the proper position.
- The first way to implement the priority queue is a singly linked list of references to trees, which resembles the algorithm presented on the previous slides.
- The tree with the smallest probability is replaced by the newly created tree.
- Among trees with the same probability, the first trees encountered are chosen (a sketch follows below).
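A rough sketch of this variant, with an ordinary Python list standing in for the singly linked list of tree references; the layout and the name huffman_list are illustrative:

# Sketch: the smallest tree's slot is reused for the new tree, the other is removed.
def huffman_list(freq):
    nodes = [(f, ch) for ch, f in freq.items()]   # (probability, tree) pairs
    while len(nodes) > 1:
        # Stable sort of indices: on equal probabilities the first trees encountered win.
        order = sorted(range(len(nodes)), key=lambda k: nodes[k][0])
        i, j = order[0], order[1]                 # indices of the two smallest trees
        combined = (nodes[i][0] + nodes[j][0], (nodes[i][1], nodes[j][1]))
        nodes[i] = combined                       # smallest tree is replaced by the new tree
        nodes.pop(j)                              # the other one is removed from the list
    return nodes[0][1]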

Doubly Linked List

- All probability nodes are kept ordered, so the first two trees are always the ones removed.
- The new tree is inserted toward the end of the list, at the position that keeps the list sorted.
- A doubly linked list of references to trees, with immediate access to both the beginning and the end of the list, is used (sketched below).

Doubly Linked-List implementation
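A possible sketch of this variant, using collections.deque as a stand-in for a list with access to both ends; the first two trees are always removed from the front, and the new tree is inserted by searching from the back (names are illustrative):

from collections import deque

def huffman_doubly(freq):
    nodes = deque(sorted((f, ch) for ch, f in freq.items()))   # ascending order
    while len(nodes) > 1:
        f1, t1 = nodes.popleft()          # the first two trees are always removed
        f2, t2 = nodes.popleft()
        new = (f1 + f2, (t1, t2))
        # Insert the new tree so the list stays sorted, searching from the end.
        i = len(nodes)
        while i > 0 and nodes[i - 1][0] > new[0]:
            i -= 1
        nodes.insert(i, new)
    return nodes[0][1]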

Recursive Implementation

- A top-down approach builds the tree starting from the highest probability: the root's probability is known once the lower probabilities in its children have been determined, and those are known once still lower probabilities have been computed, and so on.
- Thus, a recursive algorithm can be used.

HEAP

- A binary tree has the heap property iff it is empty, or the key in the root is larger than the key in either child and both subtrees have the heap property.
- A binary tree is complete if all the leaves are on the same level or on two adjacent levels, and all nodes at the lowest level are as far to the left as possible.


If we number the nodes from 1 at the root and place
  - the left child of node k at position 2k, and
  - the right child of node k at position 2k+1,
then the 'fill from the left' nature of the complete tree ensures that the heap can be stored in consecutive locations in an array.

INSERT into HEAP
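A small sketch of inserting into an array-based heap, using the 1-based positions described above (children of node k at 2k and 2k+1) and the max-heap property from the earlier definition; the new key is appended at the next free position and sifted up toward the root:

# Sketch: array-based max-heap insert with 1-based indexing (slot 0 unused).
def heap_insert(heap, key):
    heap.append(key)              # place the new key at the next free leaf position
    k = len(heap) - 1             # its 1-based index
    while k > 1 and heap[k] > heap[k // 2]:
        heap[k], heap[k // 2] = heap[k // 2], heap[k]   # swap with the parent
        k //= 2                   # continue sifting up toward the root

heap = [None]                     # index 0 is a placeholder so children of k sit at 2k and 2k+1
for key in [80, 81, 55, 93]:
    heap_insert(heap, key)
print(heap[1:])                   # [93, 81, 55, 80]: the largest key has sifted up to the root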

Implementation using Heap

- A min-heap of the probabilities is built, so the smallest probability is at the root.
- The smallest probability is removed, and the heap property is restored.
- The probability now at the root (the second smallest) is set to the sum of the two smallest probabilities, and the heap property is restored again.
- Processing is complete when there is only one node left in the heap (a sketch follows under the next heading).

Huffman implementation with a heap
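A sketch of the heap-based construction using Python's heapq module as the min-heap; the counter is only a tie-breaker so that trees with equal probabilities are never compared directly. The slides describe replacing the root in place; pushing and popping with heapq is an equivalent way to express the same steps.

import heapq
from itertools import count

# Sketch: Huffman construction driven by a min-heap of probabilities.
def huffman_heap(freq):
    tie = count()
    heap = [(f, next(tie), ch) for ch, f in freq.items()]
    heapq.heapify(heap)                       # build the min-heap: smallest probability at the root
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)       # remove the smallest probability
        f2, _, t2 = heapq.heappop(heap)       # and the second smallest
        heapq.heappush(heap, (f1 + f2, next(tie), (t1, t2)))   # push their sum back
    return heap[0][2]                         # only one node left: the Huffman tree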

Huffman Coding for pairs of characters

Adaptive Huffman Coding

- Devised by Robert Gallager and improved by Donald Knuth.
- The algorithm is based on the sibling property: if each node (except the root) has a sibling, and a breadth-first right-to-left tree traversal generates a list of nodes with non-increasing frequency counters, then the tree is a Huffman tree.
- In adaptive Huffman coding, the tree includes a counter for each symbol that is updated every time the corresponding symbol is coded.
- Checking whether the sibling property holds ensures that the tree under construction is still a Huffman tree; if the sibling property is violated, the tree is restructured to restore it. (A small sketch of the sibling-property check follows this list.)
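A small sketch of the sibling-property check alone (the tree-restructuring part of adaptive Huffman coding is omitted); nodes are assumed to be dictionaries with 'count', 'left' and 'right' entries, which is an illustrative layout rather than the textbook's:

from collections import deque

# Sketch: check the sibling property by a breadth-first, right-to-left traversal.
def has_sibling_property(root):
    counts, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        counts.append(node['count'])
        # Enqueue the right child before the left child to get a right-to-left level order.
        for child in (node.get('right'), node.get('left')):
            if child is not None:
                queue.append(child)
    # The listed counters must be non-increasing for a Huffman tree.
    return all(a >= b for a, b in zip(counts, counts[1:]))

leaf_a = {'count': 3}
leaf_b = {'count': 2}
root = {'count': 5, 'left': leaf_b, 'right': leaf_a}
print(has_sibling_property(root))   # True: 5 >= 3 >= 2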


Sources

- Web links:
  - MP3 Converter: http://www.mp3-onverter.com/mp3codec/huffman_coding.htm
  - Practical Huffman Coding: http://www.compressconsult.com/huffman/
- Drozdek textbook, Chapter 11

Shannon-Fano

- In the field of data compression, Shannon-Fano coding, named after Claude Shannon and Robert Fano, is a technique for constructing a prefix code based on a set of symbols and their probabilities (estimated or measured).
- It is suboptimal in the sense that it does not achieve the lowest possible expected codeword length, as Huffman coding does; however, unlike Huffman coding, it does guarantee that all codeword lengths are within one bit of their theoretical ideal, -log2 P(x).

1. For a given list of symbols, develop a corresponding list of probabilities or frequency counts so that each symbol's relative frequency of occurrence is known.

2. Sort the list of symbols according to frequency, with the most frequently occurring symbols at the left and the least common at the right.

3. Divide the list into two parts, with the total frequency count of the left part as close as possible to the total of the right part.

4. Assign the binary digit 0 to the left part of the list and the digit 1 to the right part. This means that the codes for the symbols in the first part will all start with 0, and the codes in the second part will all start with 1.

5. Recursively apply steps 3 and 4 to each of the two halves, subdividing groups and adding bits to the codes until each symbol has become a corresponding code leaf on the tree (sketched below).
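A sketch of steps 1-5 in Python; the split point is chosen so that the left part's total is as close as possible to the right part's (the name shannon_fano is illustrative):

# Sketch of Shannon-Fano coding following steps 1-5 above.
def shannon_fano(freq):
    symbols = sorted(freq, key=freq.get, reverse=True)   # most frequent first
    codes = {s: '' for s in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(freq[s] for s in group)
        # Find the split where the two parts' totals are as close as possible.
        best_i, best_diff, running = 1, float('inf'), 0
        for i in range(1, len(group)):
            running += freq[group[i - 1]]
            diff = abs(total - 2 * running)               # |left total - right total|
            if diff < best_diff:
                best_i, best_diff = i, diff
        left, right = group[:best_i], group[best_i:]
        for s in left:
            codes[s] += '0'                               # left part gets a 0
        for s in right:
            codes[s] += '1'                               # right part gets a 1
        split(left)
        split(right)

    split(symbols)
    return codes

print(shannon_fano({'A': 15, 'B': 7, 'C': 6, 'D': 6, 'E': 5}))
# {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}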

Shannon-Fano Coding

Shannon-Fano example

Shannon-Fano References

- Shannon, C.E. (July 1948). "A Mathematical Theory of Communication". Bell System Technical Journal 27: 379–423. http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
- Fano, R.M. (1949). "The Transmission of Information". Technical Report No. 65. Cambridge, MA: Research Laboratory of Electronics at MIT.