23
Satoshi Yoshida* and Takuya Kida* *Hokkaido University Graduate school of Information Science and Technology Division of Computer Science 1 ※ESKM2012(9月20⽇〜9月22⽇)にて発表予定

Satoshi Yoshida* and Takuya Kida*

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Satoshi Yoshida* and Takuya Kida*

Satoshi Yoshida* and Takuya Kida**Hokkaido University Graduate school of Information Science and

Technology Division of Computer Science

1

※ESKM2012(9月20⽇〜9月22⽇)にて発表予定

Page 2: Satoshi Yoshida* and Takuya Kida*

2

Input Text

Fixed Length Variable Length

CompressedText

FixedLength

FF Code(Fixed length to Fixed

length code)

VF Code(Variable length to Fixed

length code)

VariableLength

FV Code(Fixed length to

Variable length code)

VV Code(Variable length to Variable

length code)

Huffman Huffman Code

TunstallTunstallCode

Program Searching on Compressed Data

011101011101110100101001011101011101110100101001・・・

Compressed DataSearch Directly

Page 3: Satoshi Yoshida* and Takuya Kida*

3

Tunstall code [Tunstall, 1967]

AIVF code using multiple parse trees[Yamamoto and Yokoo, 2001]

AIVF code [Yamamoto and Yokoo, 2001]

utilizing unused codewords

utilizing context between blocks

YY codeYY code

Improves

considerably

Improves compression ratio

considerably

Page 4: Satoshi Yoshida* and Takuya Kida*

4

Multiple parse trees

001011010010010001…

VMA tree

Huge time Huge time and memory

Proposing methodProposing method

relatively to the number of kind of characters (k) in the input text

k-1 parse trees

Page 5: Satoshi Yoshida* and Takuya Kida*

� We can reduce the total number of nodes in comparison to YY code by using a VMA tree.

� We can reduce the total number of nodes and compression time considerably also in experiments.

� We found an upper bound and a lower bound of the number of nodes in a VMA tree.

5

Page 6: Satoshi Yoshida* and Takuya Kida*

Number of nodes in VMA tree

Number of nodes reduced byintegration

Upper bound �� �1

2�� �

1

2� � � � 1

1

2� � 1 � � 2

� �� � � � 1

� � 1� � � 2

Lower bound1

� � 1� �� � �

1

2�� �

1

2� � 3

6

k : alphabet sizeM : number of codewords

Page 7: Satoshi Yoshida* and Takuya Kida*

� Let Σ be a finite alphabet.

� Elements in alphabet are sorted in descending

order of their probabilities.

� We assume information source is memoryless.

� We simply say “encoding” encoding to the sequence

on {0, 1}.

7

aabbbbc Σ={a, b, c}

Page 8: Satoshi Yoshida* and Takuya Kida*

8

a

a

b

b

c

c a001

010

000

100

101

110011

111

a

a

Incomplete internal nodeAn internal node which doesn’thave k children.

input text:aabbbbc

compressed data sequence:001 101 101101 101 111

Each branch is labeled by a symbol

in Σ={a, b, c}.

Each leaf node and incomplete internal node has a codeword.

to the block.

Input text is parsed into blocks and we output

codeword corresponding to the block.

Page 9: Satoshi Yoshida* and Takuya Kida*

9

aa

b

b

c

ca

001 010000

100

101

111

011

a110

b c

a

a

b

b

c

c a001

010

000

100

101

110011

111

a

a

We switch multiple parse trees according to the context.

Page 10: Satoshi Yoshida* and Takuya Kida*

� Problem: time and space consuming◦ We construct k – 1 parse trees.

� We can reduce total number of nodes by sharing nodes.◦ We need to mark each node n in a VMA tree in order to tell which trees the node belongs to.◦ We can realize that by holding the least i such that nbelongs to Ti.

10

…0T1T

2−kT2T

VMA treemultiple parse trees

Page 11: Satoshi Yoshida* and Takuya Kida*

� Theorem

11

)(

1

i

iS +

)(

2

i

kS −

)(

1

i

kS −)(i

kS

)1(

2

+

+

i

iS)1(

2

+

i

kS)1(

1

+

i

kS)1( +i

kS

)(

2

i

iS +

Ti Ti +1

Let be the subtree of that consists of all the nodes under the node corresponding with , which is direct child of the root. Then completely covers . We denote this relation by .

)(i

jS iT

ja)1( +

+

i

jiS)(i

jiS +

)1()( +

++

i

ji

i

ji SS p

From this theorem, we can tell which trees a node belongs to easily.

Page 12: Satoshi Yoshida* and Takuya Kida*

12

0T 1T 2−kT

VT

)0(

1S)0(

2−kS)0(

1−kS

)0(

kS

)1(

2S)1(

2−kS)1(

1−kS)1(

kS

)2(

1

k

kS

)2( −k

kS

kS1−kS1S 2−kS

.SSS

,SSS

,SSS

,SSSS

,SSS

,SS

k

k

kk

k

k

kk

k

k

kk

=

=

=

=

=

=

−−

−−

)2()0(

1

)2(

1

)0(

1

2

)3(

2

)0(

2

3

)2(

3

)1(

3

)0(

3

2

)1(

2

)0(

2

1

)0(

1

pLp

pLp

pLp

M

pp

p

 

 

 

     

 

 

 

From the theorem, we have:

VMA tree

multiple parse trees

We call the integrated parse tree a VMA tree.

Page 13: Satoshi Yoshida* and Takuya Kida*

�Theorem

The number of reduced nodes is not less

than .

13

1

2�� �

1

2� � 3

Page 14: Satoshi Yoshida* and Takuya Kida*

1

2�� �

1

2� � 3

14

……

0T1T

2−kT

VT

)0(

1S

)0(

2−kS

)0(

1−kS

)0(

kS

)1(

2S

)1(

2−kS)1(

1−kS)1(

kS

)2(

1

k

kS

)2( −k

kS

)2( −k

kS)2(

1

k

kS)0(

1S)3(

2

k

kS

3−kT

)3(

2

k

kS

)3( −k

kS)3(

1

k

kS

)1(

3S)0(

2S

)1(

2S

Each reduced subtree has at least one node.

1−k 2−k 2

Summation:

Page 15: Satoshi Yoshida* and Takuya Kida*

�Theorem

The number of nodes in a VMA tree is

not less than .

15

1

� � 1� �� � �

Page 16: Satoshi Yoshida* and Takuya Kida*

16

��

�� �� �� �

� ����� ����

� ���

���� ���� ���� � �� �

Largest subtree and it remains in VMA tree.

Pr �� � Pr �� � ⋯ � Pr ��

The VMA tree is smallest when Pr �� � Pr �� � ⋯ � Pr �� .

Page 17: Satoshi Yoshida* and Takuya Kida*

17

VT

)2( −k

kS)2(

1

k

kS)0(

1S)3(

2

k

kS)1(

2S

#of codewords:

#of nodes:

� � 1

3…

2

2

���

��

�� 1

� � 1

���

����

�� 1

� � 1

���

��

�� 1

� � 1

���

��

�� 1

� � 1

���

��

�� 1

� � 1

Summation is not less than �

���� �� � �.

Page 18: Satoshi Yoshida* and Takuya Kida*

� Comparison◦ Total number of nodes (Exp1)◦ Compression times (Exp2)◦ Total number of nodes and upper bound and lower bound of nodes in VMA tree on random sequence (Exp3)

� Algorithms◦ YY coding (YY)◦ Encoding using a VMA tree (VMA)

� Environments◦ CPU: Intel Pentium® 4 Processor 3.0GHz Hyper Threading◦ Memory: 2GB◦ OS: Debian GNU/Linux 5.0◦ Language: C++◦ Compiler: g++4.3

� Codeword length: 12 bits

18

Page 19: Satoshi Yoshida* and Takuya Kida*

� The Canterbury Corpus†

19

file size (in bytes) k content

alice29.txt 152,089 74 English text

asyoulik.txt 125,179 68 Shakespeare

cp.html 24,603 86 HTML source

fields.c 11,150 90 C source

grammar.lsp 3,721 76 LISP source

kennedy.xls 1,029,744 256 Excel Spreadsheet

lcet10.txt 424,754 84 Technical writing

plrabn12.txt 481,861 81 Poetry

ptt5 513,216 159 CCITT test set

sum 38,240 255 SPARC Executable

xargs.1 4,227 74 GNU manual page†http://corpus.canterbury.ac.nz/descriptions

Page 20: Satoshi Yoshida* and Takuya Kida*

20

0

200

400

600

800

1000

1200

tota

l am

ou

nt o

f no

des

tota

l am

ou

nt o

f no

des

tota

l am

ou

nt o

f no

des

tota

l am

ou

nt o

f no

des

VMA

YY

103

k : 74 68 86 90 76 256 84 81 159 255 74

7 times

32 times

Page 21: Satoshi Yoshida* and Takuya Kida*

21

0

10

20

30

40

50

60

70

co

mp

ressio

n tim

e (s

ec)

co

mp

ressio

n tim

e (s

ec)

co

mp

ressio

n tim

e (s

ec)

co

mp

ressio

n tim

e (s

ec)

VMA

YY

k : 74 68 86 90 76 256 84 81 159 255 74

3 times

18 times

Page 22: Satoshi Yoshida* and Takuya Kida*

22

0

200

400

600

800

1000

1200

0 32 64 96 128 160 192 224 256

tota

l n

um

be

r o

f n

od

es

alphabet size

the number of nodes

on uniform distribution

VMA

YY

upper bound

lower bound

0

200

400

600

800

1000

1200

0 32 64 96 128 160 192 224 256

tota

l n

um

be

r o

f n

od

es

alphabet size

the number of nodes

on Zipf distribution

VMA

YY

upper bound

lower bound

Page 23: Satoshi Yoshida* and Takuya Kida*

� We showed that we can reduce the total numberof nodes in comparison to YY code by using a VMAtree.

� We also showed that we can reduce the totalnumber of nodes and times considerably inexperiments.

� We found an upper bound and a lower bound of thenumber of nodes in a VMA tree.

23

� Applying this idea to STVF coding [Kida, DCC2009]

� Finding tighter bounds