A Modified Burrows-Wheeler Transformation for Case-insensitive Search with Application to Suffix Array Compression. Kunihiko Sadakane, Department of Information Science, University of Tokyo

Page 1

A Modified Burrows-Wheeler Transformation for

Case-insensitive Search with Application to

Suffix Array Compression

Kunihiko Sadakane

Department of Information Science

University of Tokyo

Page 2

Promising Techniques

Block Sorting Compression [Burrows, Wheeler 94]

• faster than PPMs
• decoding is much faster
• comparable performance with PPMs

Suffix Array [Manber, Myers 93]

• search data structure
• can find any substring
• more memory efficient than suffix trees

We unify compression and search by using them.

Key: the Burrows-Wheeler Transformation (BWT)

Page 3

Block Sorting Compression

• Burrows-Wheeler Transformation (BWT) permutes the text symbols according to the lexicographic order of their suffixes.

• Permuted text becomes more compressible.

Page 4

Novel Feature of the Block Sorting

• BWT is defined by the suffix array (sorted indexes of suffixes)

• The suffix array is recovered from the compressed text

Suffix array can be compressed by the Block Sorting!

However, this suffix array cannot be used for case-insensitive search.
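To make the recovery of the suffix array concrete, here is a minimal Python sketch of a reverse BWT that returns the suffix array as a by-product of decoding. It assumes a BWT built from sorted rotations and assumes that the row of the original text (called `primary` here) is stored next to the BWTed text; the function name and that parameter are illustrative and do not come from the slides.

```python
def inverse_bwt(last, primary):
    """Decode a rotation-sorted BWT; return (text, suffix_array).

    last    -- the BWTed text (last column of the sorted rotations)
    primary -- row of the unrotated text (an assumed side input)
    """
    n = len(last)
    # A stable sort of the last column gives the first column.
    # order[f_row] is the row in `last` that holds the same symbol occurrence.
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row          # row of the rotation starting with last[l_row]

    text = [""] * n
    sa = [0] * n
    row = primary
    for pos in range(n - 1, -1, -1):   # walk the text from its end to its start
        sa[row] = (pos + 1) % n        # this row holds the rotation starting at pos + 1
        text[pos] = last[row]          # the symbol that precedes that rotation
        row = lf[row]
    return "".join(text), sa


text, sa = inverse_bwt("cCABAb", 1)
print(text, sa)
```

With the BWTed text of the input `AbcABC` and `primary = 1`, the walk visits every row once, so the suffix array costs no extra pass over the data.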

Page 5

Our Contribution

• propose Modified Burrows-Wheeler Transformation
  – used for compressing text and its suffix array

• Decoded suffix array can be used for case-insensitive search.

• Any unification function can be used (not only for case-insensitive search).
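As an illustration of how a suffix array over the unified text supports case-insensitive search, the following Python sketch runs two binary searches over the array. The function name `find_all` is hypothetical, and the sketch assumes the unification function maps each symbol to exactly one symbol, so that positions in the unified text equal positions in the original text.

```python
def find_all(text, sa, pattern, unify=str.lower):
    """Return all start positions where `pattern` matches `text` after unification.

    sa -- suffix array of unify(text), assumed to be already decoded
    """
    u = unify(text)
    p = unify(pattern)
    m = len(p)

    # First row whose suffix, truncated to m symbols, is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if u[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First row whose truncated suffix is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if u[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


# Suffix array of the unified text "abcabc" for the input "AbcABC".
print(find_all("AbcABC", [3, 0, 4, 1, 5, 2], "BC"))
```

Both `bc` at position 1 and `BC` at position 4 are reported, because the rows for equivalent spellings are adjacent in the suffix array of the unified text.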

Page 6

An Application

Distributed Web Search Robots

[Figure: search robots collect text from Web sites (sample contents "xyz XYZ" and "Abc ABC"), compress the collected text by Block Sorting, and transfer it via network.]

Page 7

Search Server

[Figure: data transferred via network is decoded into text (samples "ABCAbc" and "XYZxyz") and suffix array, which are merged into the database, a suffix array on disk. Sample index values shown: 3 10 8 5 2 7 ...14 2 8 3 9 5 10 ... and 8 4 100 251 58 ...]

Page 8

The original BWT

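Input text: AbcABC

Rotations of the input text (index, rotation):

0 AbcABC
1 bcABCA
2 cABCAb
3 ABCAbc
4 BCAbcA
5 CAbcAB

sorting

Sorted rotations (index, rotation, last symbol set apart):

3 ABCAb c
0 AbcAB C
4 BCAbc A
5 CAbcA B
1 bcABC A
2 cABCA b

suffix array: 3 0 4 5 1 2

BWTed text (last column): cCABAb

First column (sorted symbols): AABCbc

BWT maps the input text to the BWTed text; reverse BWT maps it back.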
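The example on this slide can be reproduced with a short Python sketch. It sorts rotations naively, which is enough for illustration but far slower than a real block sorter; the function name `bwt` is illustrative only.

```python
def bwt(text):
    """Return (BWTed text, suffix array) using sorted rotations.

    Symbols are compared by code point, so capital letters sort
    before small letters, as in the slide's example.
    """
    n = len(text)
    # Sort rotation start indexes by the rotation they denote.
    sa = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # The last symbol of the rotation starting at i is text[i - 1];
    # for i == 0 the index -1 wraps around to the final symbol.
    last = "".join(text[i - 1] for i in sa)
    return last, sa


last, sa = bwt("AbcABC")
print(last, sa)
```

Because `A` and `a` are far apart in code-point order, the rows for `ABC...` and `Abc...` are interleaved with other rows, which is why this suffix array does not directly serve case-insensitive search.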

Page 9

Unification

• unify capital/small letters (tolower)
  DCC = dcc

• unify double-byte codes and single-byte codes in Japanese EUC code
  ＡＢＣ (a3c1 a3c2 a3c3) = ABC (41 42 43)

• unify Japanese Hiragana and Katakana
  あいうえお = アイウエオ

We identify character equivalence.
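A unification function of this kind can be sketched in Python as a per-symbol mapping. This sketch works on Unicode code points rather than on EUC bytes, which is an assumption made for brevity; the names `unify_char` and `unify` are hypothetical.

```python
def unify_char(ch):
    """Map one symbol to the representative of its equivalence class."""
    code = ord(ch)
    if 0xFF01 <= code <= 0xFF5E:
        # Full-width ASCII forms (e.g. the double-byte letters) -> single-byte ASCII.
        ch = chr(code - 0xFEE0)
    elif 0x30A1 <= code <= 0x30F6:
        # Katakana -> Hiragana (the two blocks are offset by 0x60).
        ch = chr(code - 0x60)
    # Capital -> small letters.
    return ch.lower()


def unify(text):
    """Apply the per-symbol unification to a whole text."""
    return "".join(unify_char(ch) for ch in text)


print(unify("DCC"), unify("ＡＢＣ"), unify("アイウエオ"))
```

Keeping the mapping per symbol means the unified text has the same length as the input, so suffix array positions stay valid for the original text.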

Page 10

Modified BWT

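Input text: AbcABC

unify: AbcABC → abcabc

Suffixes of the unified text (index, suffix):

0 abcabc$
1 bcabc$
2 cabc$
3 abc$
4 bc$
5 c$

sorting

Sorted suffixes (index, suffix, preceding symbol of the input text):

3 abc$ c
0 abcabc$ C
4 bc$ A
1 bcabc$ A
5 c$ B
2 cabc$ b

suffix array: 3 0 4 1 5 2

MBWTed text: cCAABb

MBWT permutes symbols by suffix array of unified text.

reverse MBWT: MBWTed text → unify → ccaabb → sorting → aabbcc → reverse BWT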
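The modified transformation and its reverse can be sketched in Python as follows. This is one possible reading of the slide, not the author's implementation: it assumes the unification function works symbol by symbol, that the row of suffix 0 (called `primary` here) is stored with the MBWTed text, and that the symbol in that row is ranked first among equivalent symbols because it precedes the end marker `$`. The function names are illustrative.

```python
def mbwt(text, unify=str.lower):
    """Return (MBWTed text, suffix array of the unified text)."""
    u = unify(text)
    # A shorter suffix sorts before any suffix it prefixes, like a '$' terminator.
    sa = sorted(range(len(text)), key=lambda i: u[i:])
    # Emit the ORIGINAL symbol that precedes each suffix (index -1 wraps around).
    last = "".join(text[i - 1] for i in sa)
    return last, sa


def inverse_mbwt(last, primary, unify=str.lower):
    """Decode an MBWTed text; return (original text, suffix array)."""
    n = len(last)
    # Sort rows by unified symbol. The symbol in row `primary` is the final
    # symbol of the text, so its suffix is the smallest one for that symbol.
    order = sorted(range(n), key=lambda i: (unify(last[i]), i != primary, i))
    lf = [0] * n
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row

    text = [""] * n
    sa = [0] * n
    row = primary
    for pos in range(n - 1, -1, -1):   # rebuild the text from its end
        sa[row] = (pos + 1) % n
        text[pos] = last[row]          # original symbol, case preserved
        row = lf[row]
    return "".join(text), sa


last, sa = mbwt("AbcABC")
print(last, sa)
print(inverse_mbwt(last, sa.index(0)))
```

Only the sort order uses the unified symbols; the emitted symbols are the original ones, so decoding restores the exact input while the recovered suffix array follows the unified order.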

Page 11

Compression Ratio and Speed

unification func.   identical (BWT)   normal (MBWT)   LSB4     MSB4     zero (no BWT)
comp. ratio         1.743             1.764           2.523    2.707    5.772
comp. time (s)      363.58            363.41          443.89   438.04   411.74

HTML files (total 90 Mbytes)
Block size: 9 Mbytes

• small difference between BWT and MBWT
• MBWT provides case-insensitive searches.