2. Text Compression


Lecture notes (week 2)

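Why compression is needed

• Advances in computer hardware versus the growth rate of the amount of data that is needed
• Internet home pages
• New applications: multimedia, genome data, digital libraries, e-commerce, intranets
• Compressed data can also be processed faster!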

History

• 1950s: Huffman coding
• 1970s: Ziv-Lempel, arithmetic coding
• English text
  - Huffman (5 bits/character)
  - Ziv-Lempel (4 bits/character)
  - Arithmetic coding (2 bits/character)
• PPM: Prediction by Partial Matching
  - Slow and requires a large amount of memory

Lecture outline

• Models
• Adaptive models
• Coding
• Symbolwise models
• Dictionary models
• Synchronization
• Performance comparison

Symbol-wise methods

- Estimate the probabilities of symbols
- Huffman coding or arithmetic coding
- Modeling: estimating probabilities
- Coding: converting the probabilities into bitstreams

Dictionary methods

• Code references to entries in the dictionary
• Several symbols are output as one codeword
• Statistical method
• Ziv-Lempel coding adapts by referencing (pointing to) previous occurrences of strings
• Hybrid schemes: compression is not as good as with symbolwise schemes, but speed improves

Models

• A model is used for prediction (Fig. 2.1)
• Information content: I(s) = -log2 Pr[s] (bits)
• Entropy of the probability distribution: H = Σ Pr[s]·I(s) = -Σ Pr[s]·log2 Pr[s], summed over all symbols s
• When prediction is very good, Huffman coding performs poorly!

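Pr[]

• If the probability is 1, nothing needs to be transmitted
• If the probability is 0, the symbol cannot be coded
• If 'u' has probability 2%, it needs 5.6 bits
• If 'u' follows 'q' with probability 95%, it needs 0.074 bits
• A wrong prediction costs extra bits!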
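The two figures above follow directly from I(s) = -log2 Pr[s]. A minimal sketch that reproduces them; the function names `information_content` and `entropy` are illustrative and not taken from the slides:

```python
import math

def information_content(p):
    """Bits needed to code a symbol of probability p (0 < p <= 1)."""
    return -math.log2(p)

def entropy(probs):
    """Average information content of a probability distribution, in bits."""
    return sum(p * information_content(p) for p in probs if p > 0)

print(round(information_content(0.02), 1))   # 'u' with probability 2%
print(round(information_content(0.95), 3))   # 'u' after 'q' with probability 95%
print(entropy([0.5, 0.5]))                   # a fair coin carries one bit
```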


Models

• Finite-context model of order m
  - predicts using the m symbols that came before
• Finite-state model [Figure 2.2]

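Modeling methods

• Static modeling
  - always uses the same model, regardless of the content of the text
• Semi-static modeling
  - uses a new model for each file
  - the model must be transmitted in advance!
• Adaptive modeling
  - the probability distribution changes every time a new symbol is encountered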

Adaptive models

• Zero-order model: character by character
• Zero-frequency problem
  - arises when a character has never appeared so far
  - 1/(46*(768,078+1)) or 1/(768,079+128)?
• Higher-order models
  - first-order model: 37,526 occurrences of 'h', of which 1,139 are followed by 't', so 1,139/37,526 ≈ 3.0% (ignoring zero probabilities)
  - second-order model: 't' after 'gh' (64%, 0.636 bits)
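One common way around the zero-frequency problem is to start every symbol with a count of 1, which corresponds to the second expression above (one extra count for each of the 128 possible characters). A minimal sketch of such an adaptive zero-order model; the class name and its methods are hypothetical, and the initial count of 1 is an assumption rather than the only option:

```python
from fractions import Fraction

class AdaptiveZeroOrderModel:
    """Zero-order adaptive model: counts are updated after every symbol."""

    def __init__(self, alphabet):
        # Assumption: every symbol starts at 1 so that nothing has probability 0.
        self.counts = {s: 1 for s in alphabet}
        self.total = len(self.counts)

    def probability(self, symbol):
        return Fraction(self.counts[symbol], self.total)

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1

model = AdaptiveZeroOrderModel("ab")
print(model.probability("a"))   # before seeing anything
model.update("a")
print(model.probability("a"))   # after one 'a'
```

Encoder and decoder must apply the same updates in the same order, which is why an adaptive model needs no separate transmission of the model.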


Adaptive modeling

• Advantages
  - Robust, reliable, flexible
• Disadvantages
  - Random access is impossible
  - Fragile when communication errors occur
• Good for general compression utilities but not good for full-text retrieval

Coding

• Role of coding
  - decides how to represent each symbol, based on the probability distribution supplied by the model
• Points to watch when coding
  - short codewords for likely symbols
  - long codewords for rare symbols
  - speed
• Huffman coding
• Arithmetic coding

Huffman Coding

• Encoding and decoding are fast when a static model is used
• Adaptive Huffman coding
  - needs a lot of memory or time
• Useful for full-text retrieval applications
• Random access is easy

Examples

Symbol  Codeword  Probability
a       0000      0.05
b       0001      0.05
c       001       0.1
d       01        0.2
e       10        0.3
f       110       0.2
g       111       0.1

• eefggfed
• 10101101111111101001
• Prefix(-free) code
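Because no codeword is a prefix of another, the bit string can be decoded left to right without separators. A minimal sketch using the codewords from the table; the names `CODE`, `encode` and `decode` are illustrative:

```python
# Codewords taken from the example table.
CODE = {"a": "0000", "b": "0001", "c": "001", "d": "01",
        "e": "10", "f": "110", "g": "111"}

def encode(text):
    """Concatenate the codeword of each symbol."""
    return "".join(CODE[ch] for ch in text)

def decode(bits):
    """Read bits until they form a codeword, emit its symbol, then start over."""
    inverse = {codeword: symbol for symbol, codeword in CODE.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)

print(encode("eefggfed"))
print(decode("10101101111111101001"))
```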


Algorithm

• Explanation of Fig. 2.6
• Fast for both encoding and decoding
• Adaptive Huffman coding also exists, but arithmetic coding is actually the better choice in that case
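The construction repeatedly merges the two least probable items. The sketch below is not a transcription of Fig. 2.6; it is a small heap-based version that returns only the codeword lengths, with integer weights standing in for the probabilities of the earlier example (5 for 0.05, and so on). The function name and the tie-breaking counter are assumptions.

```python
import heapq
from itertools import count

def huffman_code_lengths(weights):
    """Return {symbol: codeword length} for a Huffman code over the weights."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(w, next(tiebreak), [s]) for s, w in weights.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in weights}
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)
        w2, _, group2 = heapq.heappop(heap)
        # Every symbol under the merged node moves one level deeper in the tree.
        for s in group1 + group2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), group1 + group2))
    return lengths

weights = {"a": 5, "b": 5, "c": 10, "d": 20, "e": 30, "f": 20, "g": 10}
lengths = huffman_code_lengths(weights)
print(lengths)
print(sum(weights[s] * lengths[s] for s in weights) / 100)  # average bits per symbol
```

Ties between equal weights can be broken in different ways, so individual lengths may differ from the example table, while the average codeword length stays the same.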


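Canonical Huffman Coding I

• Uses codewords of the same lengths as the Huffman code
• Codewords are stored starting from the longest
• Words that occur with the same frequency are kept in alphabetical order
• Encoding is easy: it only needs the code length, the relative position among the codes of that length, and the first code of that length
• Example: in Table 2.2, 'said' is the 10th of the 7-bit codes and the first 7-bit code is '1010100', so '1010100' + '1001' = '1011101'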
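The 'said' example is plain binary addition: the 10th code lies 9 places after the first one, and 9 is 1001 in binary. A minimal sketch of that calculation; the function name `canonical_codeword` is illustrative:

```python
def canonical_codeword(first_code, position):
    """Codeword of the symbol at the given 1-based position among the
    codes that have the same length as first_code."""
    length = len(first_code)
    value = int(first_code, 2) + (position - 1)
    return format(value, "0{}b".format(length))

print(canonical_codeword("1010100", 10))   # 'said' in the example
```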


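Canonical Huffman Coding II

• Decoding: store the symbols in codeword order, plus the first code of each length
• Example: input 1100000101…, first 7-bit code '1010100', first 6-bit code '110001'
• The codeword is the 7-bit code 12 places after the first 7-bit code: 'with'
• No decoding tree is used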

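Canonical Huffman Coding III

• The code is unique once the words and their probabilities are fixed
• See Table 2.3
• A canonical Huffman code may not be one that the Huffman algorithm itself would produce!
• Going by Huffman's description, the algorithm has to change: into one that computes code lengths!
  - for n symbols there are 2^(n-1) codes
  - one of them is the canonical Huffman code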

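Canonical Huffman code IV

• No tree has to be built, which saves memory
• No tree has to be searched, which saves time
• The code lengths are known first; the position is then calculated and the code value assigned (method explained in class)
  - start from the longest codes, add 1 each time, and truncate to the required length
  - [first code of the next longer length + number of codes of that length + 1], truncated to the length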
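The assignment rule can be sketched as follows, assuming the common convention that the first code of length l is (first code of length l+1 + number of codes of length l+1) divided by 2 and rounded up. The code lengths are those of the earlier seven-symbol example; all function and variable names are illustrative.

```python
def canonical_codes(lengths):
    """Assign canonical codewords, longest first, from {symbol: length}."""
    max_len = max(lengths.values())
    by_len = {l: sorted(s for s, n in lengths.items() if n == l)
              for l in range(1, max_len + 1)}
    first_code = {max_len: 0}
    for l in range(max_len - 1, 0, -1):
        # Step past the longer codes, then drop one bit (rounding up).
        first_code[l] = (first_code[l + 1] + len(by_len[l + 1]) + 1) // 2
    codes = {s: format(first_code[l] + i, "0{}b".format(l))
             for l, syms in by_len.items() for i, s in enumerate(syms)}
    return codes, first_code, by_len

def decode(bits, first_code, by_len):
    """Decode without a tree: extend the value until it reaches first_code[l]."""
    out, stream = [], iter(bits)
    for bit in stream:
        value, l = int(bit), 1
        while value < first_code[l]:
            value = 2 * value + int(next(stream))
            l += 1
        out.append(by_len[l][value - first_code[l]])
    return "".join(out)

lengths = {"a": 4, "b": 4, "c": 3, "d": 2, "e": 2, "f": 3, "g": 3}
codes, first_code, by_len = canonical_codes(lengths)
print(codes)
print(decode("".join(codes[ch] for ch in "eefggfed"), first_code, by_len))
```

The decoder only keeps the first code of each length and the symbols in codeword order, which is where the memory and time savings come from.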


Algorithm

• Simply building a tree takes 24n bytes
  - a value and 2 pointers per node
  - intermediate nodes + leaf nodes: 2n
• 8n-byte algorithm
  - uses a heap
  - an array of 2n integers
  - the algorithm is explained by writing it out in class

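Arithmetic Coding

• Achieves high compression by using complex models
  - codes at a length close to the entropy
• A symbol can be represented in less than 1 bit, which helps most when one symbol occurs with high probability
• Needs little memory because no tree is stored
• Slower than Huffman coding in static or semi-static applications
• Random access is difficult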

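Huffman Code and Arithmetic Code

Huffman coding
• Favors static models
• However probable a symbol is, it cannot be compressed below one bit (remedy: blocking, which is hard to implement)
• Needs more memory: the decoding tree is stored
• Fast: probabilities are computed in advance and codewords are fixed in advance
• Random access is possible
• Used for text compression in full-text retrieval

Arithmetic coding
• Favors adaptive models
• High-probability symbols can be represented in few bits
• Needs little memory: no tree is stored
• Slow: probabilities and the range are computed on the fly
• Random access is difficult
• Used for image compression in full-text retrieval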

Transmission of output

• low = 0.6334, high = 0.6667
  - transmit '6'; what remains is 0.334 and 0.667
• Using 32-bit precision does not reduce compression much

Arithmetic Coding (Static Model)

Compress "bbaa" over the set { a, b, EOF }, with Pr[a] = 0.4, Pr[b] = 0.5, Pr[EOF] = 0.1.

Step  Input  low   high  Prefix / output
1     b      0.4   0.9
2     b      0.6   0.85
3     a      0.6   0.7
4     a      0.6   0.64  Prefix = 6, Output: 6
5     EOF    0.36  0.4   Output: 36

[Figure: interval subdivisions a | b | EOF at each step: 0.0, 0.4, 0.9, 1.0; 0.4, 0.6, 0.85, 0.9; 0.6, 0.7, 0.825, 0.85; 0.6, 0.64, 0.69, 0.7; after the prefix 6 is output: 0.0, 0.16, 0.36, 0.4]
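The interval narrowing in this example can be checked with exact fractions. The sketch below is a simplified illustration, not a practical coder: it keeps the whole interval instead of shifting out leading digits as the slide does, and the names `MODEL`, `subintervals`, `encode` and `decode` are assumptions.

```python
from fractions import Fraction

# Static model from the example: a -> [0, 0.4), b -> [0.4, 0.9), EOF -> [0.9, 1.0)
MODEL = [("a", Fraction(4, 10)), ("b", Fraction(5, 10)), ("EOF", Fraction(1, 10))]

def subintervals(low, high):
    """Split [low, high) among the symbols in proportion to their probabilities."""
    start = low
    for symbol, p in MODEL:
        end = start + (high - low) * p
        yield symbol, start, end
        start = end

def encode(symbols):
    """Return the final interval for the symbols followed by EOF."""
    low, high = Fraction(0), Fraction(1)
    for s in list(symbols) + ["EOF"]:
        low, high = next((lo, hi) for sym, lo, hi in subintervals(low, high) if sym == s)
    return low, high

def decode(value):
    """Recover the symbols from any number inside the final interval.
    The value is assumed to lie in [0, 1)."""
    low, high = Fraction(0), Fraction(1)
    out = []
    while True:
        sym, low, high = next((sym, lo, hi) for sym, lo, hi in subintervals(low, high)
                              if lo <= value < hi)
        if sym == "EOF":
            return out
        out.append(sym)

low, high = encode("bbaa")
print(float(low), float(high))        # the interval starts with the digits 0.63...
print(decode(Fraction(636, 1000)))
```

The final interval is [0.636, 0.64), which agrees with the digits 6 and 36 emitted in the example.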


Decoding (Static Model)

• Input: 6, then 36
• Output: b, b, a, a

[Figure: the same interval subdivisions as in the encoding example: 0.0, 0.4, 0.9, 1.0; 0.4, 0.6, 0.85, 0.9; 0.6, 0.7, 0.825, 0.85; 0.6, 0.64, 0.69, 0.7; 0.0, 0.16, 0.36, 0.4]

Arithmetic Coding (Adaptive Model)

Compress "bbaa" over the set { a, b, EOF }. The initial probabilities are Pr[a] = 0.333, Pr[b] = 0.333, Pr[EOF] = 0.333.

Step  Input  Model after the input (Pr[a], Pr[b], Pr[EOF])  Output
1     b      1/4, 2/4, 1/4
2     b      1/5, 3/5, 1/5
3     a      2/6, 3/6, 1/6                                   Output: 4
4     a      3/7, 3/7, 1/7
5     EOF                                                    Output: 28

[Figure: interval boundaries at each step: 0.0, 0.333, 0.666, 1.0; 0.333, 0.4165, 0.5832, 0.666; 0.4165, 0.4498, 0.5498, 0.5832; after the prefix 4 is output: 0.165, 0.2959, 0.4425, 0.498; 0.165, 0.2211, 0.2772, 0.2959]
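With an adaptive model the counts change after every symbol, and the decoder repeats the same updates. A minimal sketch with exact fractions, assuming each symbol starts with a count of 1 (which gives the initial 1/3 probabilities above); all names are illustrative, and because the arithmetic is exact the endpoints can differ slightly from the rounded decimals on the slide.

```python
from fractions import Fraction

SYMBOLS = ["a", "b", "EOF"]

def subintervals(low, high, counts):
    """Split [low, high) among the symbols in proportion to their current counts."""
    total = sum(counts.values())
    start = low
    for symbol in SYMBOLS:
        end = start + (high - low) * Fraction(counts[symbol], total)
        yield symbol, start, end
        start = end

def encode(symbols):
    low, high = Fraction(0), Fraction(1)
    counts = {s: 1 for s in SYMBOLS}      # assumption: initial count of 1 each
    for s in list(symbols) + ["EOF"]:
        low, high = next((lo, hi) for sym, lo, hi in subintervals(low, high, counts)
                         if sym == s)
        counts[s] += 1                    # adapt the model after coding the symbol
    return low, high

def decode(value):
    """The value is assumed to lie in [0, 1)."""
    low, high = Fraction(0), Fraction(1)
    counts = {s: 1 for s in SYMBOLS}
    out = []
    while True:
        sym, low, high = next((sym, lo, hi)
                              for sym, lo, hi in subintervals(low, high, counts)
                              if lo <= value < hi)
        if sym == "EOF":
            return out
        out.append(sym)
        counts[sym] += 1                  # same update as the encoder

low, high = encode("bbaa")
print(float(low), float(high))
print(decode(low))
```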


Decoding (Adaptive Model)

• Input: 4, then 28
• Output: b, b, a, a, EOF

[Figure: the same interval boundaries as in the adaptive encoding example: 0.0, 0.333, 0.666, 1.0; 0.333, 0.4165, 0.5832, 0.666; 0.4165, 0.4498, 0.5498, 0.5832; 0.165, 0.2959, 0.4425, 0.498; 0.165, 0.2211, 0.2772, 0.2959]

Cumulative Count Calculation

• Method explained in class
  - Heap
  - Encoding 101101: 101101, 1011, 101, 1
  - Rule explained in class

Symbolwise models

Symbolwise model + coder (arithmetic, Huffman)

Three Approaches

- PPM( Prediction by Partial Matching )

- DMC(Dynamic Markov Compression )

- Word-based compression


PPM ( Prediction by Partial Matching )

finite-context models of characters

variable-length code

partial matching against the previously coded text

zero-frequency problem

- Escape symbol

- The escape symbol is counted as 1 (PPMA)

Escape method

• Escape method A (PPMA): the escape symbol has a count of 1
• Exclusion
• Method C: escape probability r/(n+r), with total count n and r distinct symbols; symbol probability ci/(n+r)
• Method D: r/(2n)
• Method X: t1 = number of symbols of frequency 1; (t1+1)/(n+t1+1)
• PPMZ, Swiss Army Knife Data Compression (SAKDC)
• Figure 2.24
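The escape estimates above differ only in how they use the counts seen in a context. A minimal sketch that evaluates them for one context; the function name and the sample counts are made up for illustration, and method A is taken as 1/(n+1), which follows from giving the escape symbol a count of 1:

```python
from fractions import Fraction

def escape_probabilities(counts):
    """Escape probability under methods A, C, D and X for one context.
    counts maps each symbol seen in the context to its frequency."""
    n = sum(counts.values())                         # total count
    r = len(counts)                                  # distinct symbols
    t1 = sum(1 for c in counts.values() if c == 1)   # symbols seen exactly once
    return {
        "A": Fraction(1, n + 1),
        "C": Fraction(r, n + r),
        "D": Fraction(r, 2 * n),
        "X": Fraction(t1 + 1, n + t1 + 1),
    }

print(escape_probabilities({"a": 3, "b": 1, "c": 1}))
```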


Block-sorting compression


DMC (Dynamic Markov Compression)

finite-state model

adaptive model
- both the probabilities and the structure of the finite-state machine adapt (Figure 2.13)

avoids the zero-frequency problem

Figure 2.14

Cloning
- heuristic
- adapts the structure of a DMC model

Word-based Compression

parse a document into "words" and "nonwords"

Textual and non-textual parts are compressed separately
- Textual: zero-order model

suitable for large full-text databases

Low-frequency words
- inefficient
- e.g. runs of digits, page numbers
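A minimal sketch of the parsing step, assuming a "word" is a maximal run of ASCII letters and digits and a "nonword" is everything in between; the definition of a word and the function name `parse` are assumptions, not taken from the slides:

```python
import re

TOKEN = re.compile(r"(?P<word>[A-Za-z0-9]+)|(?P<nonword>[^A-Za-z0-9]+)")

def parse(text):
    """Split text into (kind, token) pairs, kind being 'word' or 'nonword'."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

tokens = parse("Hello, world!")
print(tokens)
print("".join(tok for _, tok in tokens))   # concatenating the tokens restores the text
```

Each token stream would then be coded with its own model, which is why a long run of digits such as a page number turns into a costly low-frequency "word".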


Dictionary Models

Principle: replace substrings in a text with codewords

Adaptive dictionary compression model : LZ77, LZ78

Approaches

- LZ77

- Gzip

- LZ78

- LZW


Dictionary Model - LZ77

adaptive dictionary model

characteristics
- easy to implement
- quick decoding
- uses a small amount of memory

Figure 2.16

Triples

< offset, length of phrase, character >
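A minimal sketch of coding with such triples, using a brute-force search of the window; it is meant to show the data flow rather than an efficient implementation, and the window size and function names are assumptions:

```python
def lz77_encode(text, window=4096):
    """Return a list of (offset, length, character) triples."""
    out, i = [], 0
    while i < len(text):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # A match may overlap the current position, but one character
            # must be left over to serve as the explicit character.
            while i + k < len(text) - 1 and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    """Copy `length` characters from `offset` positions back, then add the character."""
    s = []
    for offset, length, ch in triples:
        for _ in range(length):
            s.append(s[-offset])
        s.append(ch)
    return "".join(s)

triples = lz77_encode("abaababaabb")
print(triples)
print(lz77_decode(triples))
```

Decoding is only a sequence of copies, which matches the "quick decoding" characteristic noted above.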


Dictionary Model - LZ77 (continued)

Improvements

- offset: shorter codewords for recent matches

- match length: variable-length code

- character: included only when necessary (sent as raw data)

Figure 2.17

Dictionary Model - Gzip

based on LZ77

hash table

Tuples

< offset, matched length >

Using Huffman code

- semi-static / canonical Huffman code

- 64K Blocks

- Code table: stored at the start of each block

Dictionary Model - LZ78

adaptive dictionary model

parsed phrase reference

Tuples

- < phrase number, character >

- phrase 0 : empty string

Figure 2.19

Figure 2.18
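A minimal sketch of LZ78 parsing into < phrase number, character > tuples, with phrase 0 as the empty string. The dictionary here is a plain Python dict rather than a trie or hash table, and the handling of a leftover phrase at the end of the input is an assumption of this sketch:

```python
def lz78_encode(text):
    """Return a list of (phrase number, character) tuples."""
    dictionary = {"": 0}               # phrase 0 is the empty string
    out, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch               # keep extending the current phrase
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                         # input ended inside a known phrase
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out

def lz78_decode(tuples):
    phrases, out = [""], []
    for number, ch in tuples:
        phrase = phrases[number] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

tuples = lz78_encode("abaab")
print(tuples)
print(lz78_decode(tuples))
```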


Dictionary Model - LZ78(continue)

characteristic

- hash table : simple, fast

- encoding : fast

- decoding : slow

- trie: uses a lot of memory

Dictionary Model - LZW

variant of LZ78

encode only the phrase number

does not have explicit characters in the output

appending the first character of the next phrase

Figure 2.20

characteristic

- good compression

- easy to implement
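A minimal sketch of LZW, assuming the dictionary is initialised with the 128 ASCII characters; only phrase numbers are output, and each new phrase is the previous phrase plus the first character of the next one. The function names are illustrative.

```python
def lzw_encode(text):
    """Return a list of phrase numbers (no explicit characters)."""
    dictionary = {chr(i): i for i in range(128)}   # assumption: ASCII input
    out, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch
        else:
            out.append(dictionary[phrase])
            dictionary[phrase + ch] = len(dictionary)
            phrase = ch
    if phrase:
        out.append(dictionary[phrase])
    return out

def lzw_decode(codes):
    phrases = [chr(i) for i in range(128)]
    if not codes:
        return ""
    prev = phrases[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # A code that is not in the table yet can only be prev + prev[0].
        cur = phrases[code] if code < len(phrases) else prev + prev[0]
        phrases.append(prev + cur[0])
        out.append(cur)
        prev = cur
    return "".join(out)

codes = lzw_encode("abababab")
print(codes)
print(lzw_decode(codes))
```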


Synchronization

random access

- variable-length code

- adaptive model

synchronization point

synchronization with adaptive model

- large file -> break into small sections

impossible random access


Creating synchronization point

main text: consists of a number of documents

- extra bits at the start or end of each document indicate its length

- bit offset

- byte offset

- end of document symbol

- length of each document at its beginning

- end of file
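One simple way to obtain synchronization points is to compress each document separately and keep a table of byte offsets. The sketch below uses Python's standard `zlib` module only as a stand-in compressor; the functions `build_store` and `fetch` are hypothetical helpers, not part of any library.

```python
import zlib

def build_store(documents):
    """Compress each document on its own and record where it starts."""
    blob, offsets = bytearray(), []
    for doc in documents:
        offsets.append(len(blob))                  # byte offset of this document
        blob += zlib.compress(doc.encode("utf-8"))
    offsets.append(len(blob))                      # end marker for the last document
    return bytes(blob), offsets

def fetch(blob, offsets, i):
    """Decode document i without touching the documents before it."""
    return zlib.decompress(blob[offsets[i]:offsets[i + 1]]).decode("utf-8")

blob, offsets = build_store(["first document", "second one", "third"])
print(fetch(blob, offsets, 1))
```

The price of random access is that each document starts from a fresh model, so very short documents compress less well than they would inside one long stream.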


Self-synchronizing codes

not useful for full-text retrieval

- to decode from the middle of the compressed text, find the synchronizing cycle and decode from there

- helps when part of the text is corrupted or the beginning is missing

motivation

fixed-length code: cannot self-synchronize (Table 2.3)

Figure 2.22

Performance comparisons

consideration

- compression speed

- compression performance

- computing resource

Table 2.4


Compression Performance

Calgary corpus

- English text, program source code, bilevel facsimile image

- geological data, program object code

Figure 2.24

Bits per character


Compression speed

speed dependency

- method of implementation

- architecture of the machine

- compiler

Better compression means a slower program

Ziv-Lempel based methods: decoding is faster than encoding

Table 2.6

Other Performance considerations

memory usage

- adaptive model: uses a lot of memory

- Ziv-Lempel << Symbolwise model

Random access

- synchronization point
