1
A Modified Burrows-Wheeler Transformation for
Case-insensitive Search with Application to
Suffix Array Compression
Kunihiko Sadakane
Department of Information Science
University of Tokyo
2
Promising Techniques
Block Sorting Compression [Burrows, Wheeler 94]
• faster than PPMs
• decoding is much faster
• comparable performance with PPMs
Suffix Array [Manber, Myers 93]
• search data structure
• can find any substring
• more memory efficient than suffix trees
We unify compression and search by using them.
Key: the Burrows-Wheeler Transformation (BWT)
3
Block Sorting Compression
• The Burrows-Wheeler Transformation (BWT) permutes the text symbols according to the lexicographic order of the suffixes that follow them.
• Permuted text becomes more compressible.
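A minimal Python sketch of this permutation, assuming the rotation-sorting formulation of the BWT and plain code-point ordering of symbols (so capital letters sort before small ones). The function name `bwt` and its return values are illustrative choices, not taken from the slides.

```python
def bwt(text):
    """Return (bwt_text, suffix_array, primary) for a non-empty text.

    primary is the row of the sorted table that holds the unrotated text;
    a decoder needs it to know where to start.
    """
    n = len(text)
    # All cyclic rotations; rotation i starts at text[i].
    rotations = [text[i:] + text[:i] for i in range(n)]
    # Sorting the start positions by rotation gives the suffix array.
    suffix_array = sorted(range(n), key=lambda i: rotations[i])
    # The BWT output is the last symbol of each sorted rotation.
    bwt_text = "".join(rotations[i][-1] for i in suffix_array)
    primary = suffix_array.index(0)
    return bwt_text, suffix_array, primary


if __name__ == "__main__":
    print(bwt("AbcABC"))  # ('cCABAb', [3, 0, 4, 5, 1, 2], 1)
```

Building every rotation costs quadratic memory; the sketch favours readability over the efficient suffix sorting a real compressor would use.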
4
Novel Feature of the Block Sorting
• BWT is defined by the suffix array (sorted indexes of suffixes)
• The suffix array is recovered from the compressed text
Suffix array can be compressed by the Block Sorting!
But, it cannot be used for case-insensitive search.
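The recovery of the suffix array during decoding can be sketched as follows. This is an illustrative Python version, assuming the decoder receives the BWTed text together with the row index of the original text (called `primary` here, a hypothetical name) and that the text is not a repetition of a shorter string.

```python
def inverse_bwt(bwt_text, primary):
    """Recover (text, suffix_array) from a rotation-based BWT."""
    n = len(bwt_text)
    # A stable sort of the rows by their last symbol yields the first column.
    # order[j] = i means the j-th symbol of the first column is bwt_text[i].
    order = sorted(range(n), key=lambda i: bwt_text[i])
    lf = [0] * n
    for j, i in enumerate(order):
        lf[i] = j  # row i, moved back by one symbol, is row j
    text = [""] * n
    suffix_array = [0] * n
    row = primary  # this row starts at text position 0
    for pos in range(n - 1, -1, -1):
        text[pos] = bwt_text[row]  # last symbol of the row precedes its start
        row = lf[row]              # row of the rotation starting at pos
        suffix_array[row] = pos
    return "".join(text), suffix_array


if __name__ == "__main__":
    print(inverse_bwt("cCABAb", 1))  # ('AbcABC', [3, 0, 4, 5, 1, 2])
```

The suffix array falls out of the same backward walk that rebuilds the text, so no extra sorting is needed on the decoding side.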
5
Our Contribution
• propose Modified Burrows-Wheeler Transformation
  – used for compressing text and its suffix array
• Decoded suffix array can be used for case-insensitive search.
• Any unification function is available. (not only case-insensitive search)
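To illustrate why a suffix array of the unified text supports case-insensitive search, here is a small Python sketch. It assumes `str.lower` as the unification function and a naive suffix sort; the names `unified_suffix_array` and `find_all` are hypothetical, and the unification function is assumed to preserve the text length.

```python
def unified_suffix_array(text, unify=str.lower):
    """Suffix array of the unified text (naive sort, for illustration only)."""
    u = unify(text)
    return sorted(range(len(u)), key=lambda i: u[i:])


def find_all(text, suffix_array, pattern, unify=str.lower):
    """Return sorted start positions where pattern occurs, up to unification."""
    u, p = unify(text), unify(pattern)
    m = len(p)
    # Lower bound: first suffix whose m-symbol prefix is >= p.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if u[suffix_array[mid]:suffix_array[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-symbol prefix is > p.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if u[suffix_array[mid]:suffix_array[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    text = "AbcABC"
    sa = unified_suffix_array(text)
    print(find_all(text, sa, "BC"))   # [1, 4]
    print(find_all(text, sa, "aBc"))  # [0, 3]
```

With a suffix array of the original (non-unified) text, "Abc" and "ABC" would sit in different regions of the array, so a single binary search could not find both.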
6
An Application
Distributed Web Search Robots
[Figure: each search robot collects text from Web sites (e.g. "xyz XYZ", "Abc ABC"), compresses it by Block Sorting, and transfers it via the network.]
7
Search Server
[Figure: the search server receives the compressed blocks via the network, decodes each one ("ABCAbc", "XYZxyz") into its text and suffix array, and merges them into the database, a suffix array on disk.]
8
The original BWT
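Input text: AbcABC

Rotations of the input text:
0 AbcABC
1 bcABCA
2 cABCAb
3 ABCAbc
4 BCAbcA
5 CAbcAB

Sorted rotations (index, rotation, last symbol set apart):
3 ABCAb c
0 AbcAB C
4 BCAbc A
5 CAbcA B
1 bcABC A
2 cABCA b

Sorted symbols (first column): AABCbc
BWTed text (last column): cCABAb
Suffix array: 3 0 4 5 1 2
BWT maps the input text to the BWTed text; reverse BWT maps it back.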
9
Unification
• unify capital/small letters (tolower): DCC = dcc
• unify double-byte codes and single-byte codes in Japanese EUC code: ＡＢＣ (a3c1 a3c2 a3c3) = ABC (41 42 43)
• unify Japanese Hiragana and Katakana: あいうえお = アイウエオ
We identify character equivalence.
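The three unifications above can be sketched as per-symbol functions. This Python illustration works on Unicode code points rather than EUC bytes, which is an assumption made for brevity; all function names are hypothetical, and each function is assumed to map one symbol to exactly one symbol.

```python
def unify_case(ch):
    """Map capital letters to small letters (tolower)."""
    return ch.lower()


def unify_width(ch):
    """Map full-width (double-byte) ASCII variants to single-byte ASCII."""
    code = ord(ch)
    if 0xFF01 <= code <= 0xFF5E:  # full-width '!' .. '~'
        return chr(code - 0xFEE0)
    return ch


def unify_kana(ch):
    """Map Katakana to the corresponding Hiragana."""
    code = ord(ch)
    if 0x30A1 <= code <= 0x30F6:  # Katakana small 'a' .. small 'ke'
        return chr(code - 0x60)
    return ch


def unify_text(text, *funcs):
    """Apply the given unification functions, in order, to every symbol."""
    out = []
    for ch in text:
        for f in funcs:
            ch = f(ch)
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(unify_text("DCC", unify_case))        # dcc
    print(unify_text("ＡＢＣ", unify_width))      # ABC
    print(unify_text("アイウエオ", unify_kana))  # あいうえお
```

Two symbols are treated as equivalent exactly when they unify to the same symbol, so several functions can be chained to combine equivalences.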
10
Modified BWT
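Input text: AbcABC
Unified text: abcabc

Suffixes of the unified text:
0 abcabc$
1 bcabc$
2 cabc$
3 abc$
4 bc$
5 c$

Sorted suffixes (index, suffix, preceding symbol of the input text):
3 abc$ c
0 abcabc$ C
4 bc$ A
1 bcabc$ A
5 c$ B
2 cabc$ b

MBWTed text: cCAABb
MBWTed text after unification: ccaabb
Sorted unified symbols: aabbcc
Suffix array: 3 0 4 1 5 2
MBWT permutes symbols by the suffix array of the unified text; reverse MBWT recovers the input text and the suffix array.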
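A hedged Python sketch of the MBWT and its reverse, reconstructed from the example on this slide. The handling of the row that starts at position 0 (stored as `primary` and ranked first among equal unified symbols during decoding) is an assumption of this sketch; the slides do not say how the original method treats it. The unification function is assumed to map one symbol to one symbol.

```python
def mbwt(text, unify=str.lower):
    """Return (mbwt_text, primary, suffix_array) for a non-empty text."""
    u = "".join(unify(ch) for ch in text)
    # Sort suffixes of the unified text; a shorter suffix sorts first,
    # which matches appending a smallest terminator '$'.
    suffix_array = sorted(range(len(u)), key=lambda i: u[i:])
    # Output the ORIGINAL symbol preceding each suffix (cyclic for suffix 0).
    mbwt_text = "".join(text[i - 1] for i in suffix_array)
    return mbwt_text, suffix_array.index(0), suffix_array


def inverse_mbwt(mbwt_text, primary, unify=str.lower):
    """Recover (text, suffix_array of the unified text)."""
    n = len(mbwt_text)
    # Rank rows by unified symbol. The primary row wraps around to the last
    # suffix, which is the smallest suffix starting with its symbol.
    order = sorted(range(n),
                   key=lambda i: (unify(mbwt_text[i]), i != primary, i))
    lf = [0] * n
    for j, i in enumerate(order):
        lf[i] = j
    text = [""] * n
    suffix_array = [0] * n
    row = primary
    for pos in range(n - 1, -1, -1):
        text[pos] = mbwt_text[row]  # original-case symbol is preserved
        row = lf[row]
        suffix_array[row] = pos
    return "".join(text), suffix_array


if __name__ == "__main__":
    encoded, primary, sa = mbwt("AbcABC")
    print(encoded, primary, sa)            # cCAABb 1 [3, 0, 4, 1, 5, 2]
    print(inverse_mbwt(encoded, primary))  # ('AbcABC', [3, 0, 4, 1, 5, 2])
```

The decoded suffix array orders suffixes of the unified text, which is what case-insensitive search needs, while the decoded text keeps its original capitalization.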
11
Compression Ratio and Speed
unification func.   identical (BWT)   normal (MBWT)   LSB4     MSB4     zero (no BWT)
comp. ratio         1.743             1.764           2.523    2.707    5.772
comp. time (s)      363.58            363.41          443.89   438.04   411.74

HTML files (total 90 Mbytes)
Block size: 9 Mbytes
• small difference between BWT and MBWT
• MBWT provides case-insensitive searches.