Satoshi Yoshida* and Takuya Kida*

Satoshi Yoshida* and Takuya Kida**Hokkaido University Graduate school of Information Science and

Technology Division of Computer Science

1

※ESKM2012（9月20⽇〜9月22⽇）にて発表予定

2

Input Text

Fixed Length Variable Length

CompressedText

FixedLength

FF Code(Fixed length to Fixed

length code)

VF Code(Variable length to Fixed

length code)

VariableLength

FV Code(Fixed length to

Variable length code)

VV Code(Variable length to Variable

length code)

Huffman Huffman Code

TunstallTunstallCode

Program Searching on Compressed Data

011101011101110100101001011101011101110100101001・・・

Compressed DataSearch Directly

3

Tunstall code [Tunstall, 1967]

AIVF code using multiple parse trees[Yamamoto and Yokoo, 2001]

AIVF code [Yamamoto and Yokoo, 2001]

utilizing unused codewords

utilizing context between blocks

YY codeYY code

Improves

considerably

Improves compression ratio

considerably

4

Multiple parse trees

001011010010010001…

VMA tree

Huge time Huge time and memory

Proposing methodProposing method

relatively to the number of kind of characters (k) in the input text

k-1 parse trees

� We can reduce the total number of nodes in comparison to YY code by using a VMA tree.

� We can reduce the total number of nodes and compression time considerably also in experiments.

� We found an upper bound and a lower bound of the number of nodes in a VMA tree.

5

Number of nodes in VMA tree

Number of nodes reduced byintegration

Upper bound �� 1

2��

1

2� � � � 1

1

2� � 1 � � 2

� �� 1

� � 1� � � 2

Lower bound1

� � 1� ��

1

2��

1

2� � 3

6

k : alphabet sizeM : number of codewords

� Let Σ be a finite alphabet.

� Elements in alphabet are sorted in descending

order of their probabilities.

� We assume information source is memoryless.

� We simply say “encoding” encoding to the sequence

on {0, 1}.

7

aabbbbc Σ={a, b, c}

8

a

a

b

b

c

c a001

010

000

100

101

110011

111

a

a

Incomplete internal nodeAn internal node which doesn’thave k children.

input text:aabbbbc

compressed data sequence:001 101 101101 101 111

Each branch is labeled by a symbol

in Σ={a, b, c}.

Each leaf node and incomplete internal node has a codeword.

to the block.

Input text is parsed into blocks and we output

codeword corresponding to the block.

9

aa

b

b

c

ca

001 010000

100

101

111

011

a110

b c

a

a

b

b

c

c a001

010

000

100

101

110011

111

a

a

We switch multiple parse trees according to the context.

� Problem: time and space consuming◦ We construct k – 1 parse trees.

� We can reduce total number of nodes by sharing nodes.◦ We need to mark each node n in a VMA tree in order to tell which trees the node belongs to.◦ We can realize that by holding the least i such that nbelongs to Ti.

10

…0T1T

2−kT2T

VMA treemultiple parse trees

� Theorem

11

…

…

)(

1

i

iS +

)(

2

i

kS −

)(

1

i

kS −)(i

kS

)1(

2

+

+

i

iS)1(

2

+

−

i

kS)1(

1

+

−

i

kS)1( +i

kS

)(

2

i

iS +

Ti Ti +1

Let be the subtree of that consists of all the nodes under the node corresponding with , which is direct child of the root. Then completely covers . We denote this relation by .

)(i

jS iT

ja)1( +

+

i

jiS)(i

jiS +

)1()( +

++

i

ji

i

ji SS p

From this theorem, we can tell which trees a node belongs to easily.

12

…

…

…

…

…

…

…

0T 1T 2−kT

VT

)0(

1S)0(

2−kS)0(

1−kS

)0(

kS

)1(

2S)1(

2−kS)1(

1−kS)1(

kS

)2(

1

−

−

k

kS

)2( −k

kS

kS1−kS1S 2−kS

.SSS

,SSS

,SSS

,SSSS

,SSS

,SS

k

k

kk

k

k

kk

k

k

kk

=

=

=

=

=

=

−

−

−

−−

−

−

−−

)2()0(

1

)2(

1

)0(

1

2

)3(

2

)0(

2

3

)2(

3

)1(

3

)0(

3

2

)1(

2

)0(

2

1

)0(

1

pLp

pLp

pLp

M

pp

p

　

　

　

　　　　　

　

　

　

From the theorem, we have:

VMA tree

multiple parse trees

We call the integrated parse tree a VMA tree.

�Theorem

The number of reduced nodes is not less

than .

13

1

2��

1

2� � 3

1

2��

1

2� � 3

14

…

…

…

……

…

…

0T1T

2−kT

VT

)0(

1S

)0(

2−kS

)0(

1−kS

)0(

kS

)1(

2S

)1(

2−kS)1(

1−kS)1(

kS

)2(

1

−

−

k

kS

)2( −k

kS

)2( −k

kS)2(

1

−

−

k

kS)0(

1S)3(

2

−

−

k

kS

3−kT

)3(

2

−

−

k

kS

)3( −k

kS)3(

1

−

−

k

kS

)1(

3S)0(

2S

)1(

2S

Each reduced subtree has at least one node.

1−k 2−k 2

Summation:

�Theorem

The number of nodes in a VMA tree is

not less than .

15

1

� � 1� ��

16

…

…

��

��

� ��

� ��

��

Largest subtree and it remains in VMA tree.

Pr �� Pr �� ⋯ � Pr ��

The VMA tree is smallest when Pr �� Pr �� ⋯ � Pr �� .

17

…

…

VT

)2( −k

kS)2(

1

−

−

k

kS)0(

1S)3(

2

−

−

k

kS)1(

2S

#of codewords:

#of nodes:

�

�

�

� � 1

�

3…

�

2

�

2

��

��

�� 1

� � 1

��

��

�� 1

� � 1

��

��

�� 1

� � 1

��

��

�� 1

� � 1

��

��

�� 1

� � 1

Summation is not less than �

�� .

� Comparison◦ Total number of nodes (Exp1)◦ Compression times (Exp2)◦ Total number of nodes and upper bound and lower bound of nodes in VMA tree on random sequence (Exp3)

� Algorithms◦ YY coding (YY)◦ Encoding using a VMA tree (VMA)

� Environments◦ CPU: Intel Pentium® 4 Processor 3.0GHz Hyper Threading◦ Memory: 2GB◦ OS: Debian GNU/Linux 5.0◦ Language: C++◦ Compiler: g++4.3

� Codeword length: 12 bits

18

� The Canterbury Corpus†

19

file size (in bytes) k content

alice29.txt 152,089 74 English text

asyoulik.txt 125,179 68 Shakespeare

cp.html 24,603 86 HTML source

fields.c 11,150 90 C source

grammar.lsp 3,721 76 LISP source

kennedy.xls 1,029,744 256 Excel Spreadsheet

lcet10.txt 424,754 84 Technical writing

plrabn12.txt 481,861 81 Poetry

ptt5 513,216 159 CCITT test set

sum 38,240 255 SPARC Executable

xargs.1 4,227 74 GNU manual page†http://corpus.canterbury.ac.nz/descriptions

20

0

200

400

600

800

1000

1200

tota

l am

ou

nt o

f no

des

tota

l am

ou

nt o

f no

des

tota

l am

ou

nt o

f no

des

tota

l am

ou

nt o

f no

des

VMA

YY

103

k : 74 68 86 90 76 256 84 81 159 255 74

7 times

32 times

21

0

10

20

30

40

50

60

70

co

mp

ressio

n tim

e (s

ec)

co

mp

ressio

n tim

e (s

ec)

co

mp

ressio

n tim

e (s

ec)

co

mp

ressio

n tim

e (s

ec)

VMA

YY

k : 74 68 86 90 76 256 84 81 159 255 74

3 times

18 times

22

0

200

400

600

800

1000

1200

0 32 64 96 128 160 192 224 256

tota

l n

um

be

r o

f n

od

es

alphabet size

the number of nodes

on uniform distribution

VMA

YY

upper bound

lower bound

0

200

400

600

800

1000

1200

0 32 64 96 128 160 192 224 256

tota

l n

um

be

r o

f n

od

es

alphabet size

the number of nodes

on Zipf distribution

VMA

YY

upper bound

lower bound

� We showed that we can reduce the total numberof nodes in comparison to YY code by using a VMAtree.

� We also showed that we can reduce the totalnumber of nodes and times considerably inexperiments.

� We found an upper bound and a lower bound of thenumber of nodes in a VMA tree.

23

� Applying this idea to STVF coding [Kida, DCC2009]

� Finding tighter bounds

Documents

Satoshi Yoshida* and Takuya Kida*