Shift-And Approach to Pattern Matching
in LZW Compressed Text
Takuya KIDA
Department of Informatics, Kyushu University, Japan
Masayuki TAKEDA
Ayumi SHINOHARA
Setsuo ARIKAWA
<2/32>
Address book
Schedule
Dictionary
Phone numbers
Memo
Electronic book
Database
The available storage devices are limited! I am eager to store as much information as possible! I want to do pattern matching as fast as possible!
Motivation
...Yes! Data compression!
...but a suffix trie is very large...
<3/32>
Our goal
[Figure: the conventional route decompresses the compressed text into the original text and then runs a pattern matching machine on it; our goal is a new machine that runs directly on the compressed text.]
<4/32>
Previous research
year  researchers                         compression method
1988  Eliam-Tsoreff and Vishkin           run-length
1992  Amir, Landau, and Vishkin           two-dimensional run-length
1995  Farach and Thorup                   LZ77
1996  Amir, Benson, and Farach            LZW
1997  Karpinski, Rytter, and Shinohara    straight-line programs
1996  Gasieniec, et al.                   LZ77
1997  Miyazaki, Shinohara, and Takeda     straight-line programs
1992  Amir and Benson                     two-dimensional run-length
1994  Amir, Benson, and Farach            two-dimensional run-length
1997  Takeda                              finite state encoding
1998  Shibata                             byte pair encoding
1994  Manber                              original compression scheme
1998  Fukamachi, Shinohara, and Takeda    Huffman encoding
1998  Kida, et al.                        LZW (AC automaton, DCC’98)
<5/32>
Recent research
year  researchers                                   compression method
1999  Kida, Takeda, Shinohara, and Arikawa          LZW (Shift-And algorithm)
1999  Shibata, et al.                               Byte pair encoding
1999  Kida, et al.                                  Dictionary based methods (Collage system)
1999  Navarro and Raffinot                          LZ family
1999  Shibata, Takeda, Shinohara, and Arikawa       Antidictionaries
1998  de Moura, Navarro, Ziviani, and Baeza-Yates   Word based encoding
Venue labels on the slide: CPM’99 (three entries), SPIRE’99 (one entry).
<6/32>
Our main results
The new algorithm scans a compressed text in O(n+r) time using O(|D|) space and reports all occurrences of the pattern, after O(m+|Σ|) time and O(|Σ|) space preprocessing.
The algorithm is about 1.3 times faster than our previous one, which simulates the AC automaton.
The algorithm is about 1.5 times faster than decompression followed by a simple search using the Shift-And algorithm.
|D| : size of the dictionary trie
n : compressed text length
m : pattern length
r : number of pattern occurrences
|Σ| : alphabet size
<8/32>
Lempel-Ziv-Welch (LZW) compression
Original text:   a  b  ab  ab  ba  b  c  aba  bc  abab
Compressed text: 1  2  4   4   5   2  3  6    9   11
Dictionary trie (node number: string spelled from the root 0):
  1: a   2: b   3: c   4: ab   5: ba   6: aba   7: abb   8: bab   9: bc   10: ca   11: abab   12: bca
O(|D|) = O(n)
<9/32>
How to compress a text
Original text:   a  b  ab  ab  ba  b  c  aba  bc  abab
Compressed text: 1  2  4   4   5   2  3  6    9   11
[Animation: the dictionary trie starts with the root 0 and the nodes 1 (a), 2 (b), 3 (c), and grows node by node up to node 12 as the text is parsed.]
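The compression walk-through above can be reproduced with a few lines of Python. This is a minimal sketch, not the implementation used in the talk: the function name `lzw_compress` and the use of a plain dict in place of a trie are choices made here for brevity, and codes 1 to |Σ| are assumed to be pre-assigned to the single characters, as on the slide (a = 1, b = 2, c = 3).

```python
def lzw_compress(text, alphabet):
    """Return (codes, dictionary) for `text`; every character must be in `alphabet`."""
    # Codes 1..|alphabet| stand for the single characters (a=1, b=2, c=3 on the slide).
    dictionary = {ch: i + 1 for i, ch in enumerate(alphabet)}
    next_code = len(alphabet) + 1
    codes = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                          # keep extending the current phrase
        else:
            codes.append(dictionary[phrase])      # emit the longest known phrase
            dictionary[phrase + ch] = next_code   # new trie node: phrase + ch
            next_code += 1
            phrase = ch
    if phrase:
        codes.append(dictionary[phrase])          # flush the last phrase
    return codes, dictionary


codes, dictionary = lzw_compress("abababbabcababcabab", "abc")
print(codes)  # [1, 2, 4, 4, 5, 2, 3, 6, 9, 11]
```

The dictionary ends with 12 entries, matching nodes 1 to 12 of the trie on the slide (for example, 6 is aba and 11 is abab).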
<10/32>
How to decompress a compressed text
Compressed text: 1  2  4   4   5   2  3  6    9   11
Original text:   a  b  ab  ab  ba  b  c  aba  bc  abab
[Animation: the same dictionary trie (nodes 0 to 12) is rebuilt while the codes are read.]
Time bounds shown on the slide: O(n) time and O(N) time.
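Decompression can be sketched in the same spirit. The following is an illustrative Python version, assuming the same code assignment as the slide (a = 1, b = 2, c = 3); the name `lzw_decompress` is hypothetical, and phrases are stored as strings rather than trie nodes, so this sketch does not reproduce the time bounds stated on the slide.

```python
def lzw_decompress(codes, alphabet):
    """Rebuild the original text from LZW codes (codes 1..|alphabet| are single characters)."""
    phrases = {i + 1: ch for i, ch in enumerate(alphabet)}
    next_code = len(alphabet) + 1
    out = []
    prev = None
    for code in codes:
        if code in phrases:
            cur = phrases[code]
        elif code == next_code and prev is not None:
            # The code refers to the phrase that is being defined right now.
            cur = prev + prev[0]
        else:
            raise ValueError(f"invalid code {code}")
        if prev is not None:
            phrases[next_code] = prev + cur[0]   # same entry the compressor added
            next_code += 1
        out.append(cur)
        prev = cur
    return "".join(out)


print(lzw_decompress([1, 2, 4, 4, 5, 2, 3, 6, 9, 11], "abc"))  # abababbabcababcabab
```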
Compressed Pattern Matching in LZW Compressed Text
with Shift-And approach
<12/32>
Shift-And approach to pattern matching
(Baeza-Yates and Gonnet [1992], Wu and Manber [1992])
text:      aabaacaabacab
pattern:   aabac
mask bits: a = 11010, b = 00100, c = 00001
State after each text character:
  a  10000
  a  11000
  b  00100
  a  10010
  a  11000
  c  00000
  a  10000
  a  11000
  b  00100
  a  10010
  c  00001   Pattern was found!
  a  10000
  b  00000
One step: the state 10000 is shifted and its first bit is set, giving 11000; the AND with the mask bits of the next character a (11010) gives the new state 11000.
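The example above can be replayed with a short Python sketch of the Shift-And method on uncompressed text. The names `shift_and` and `bits` are introduced here for illustration. Note one assumption about layout: the code keeps pattern position 1 in the least significant bit, so `bits` mirrors the integer to print it in the slide's left-to-right order.

```python
def shift_and(text, pattern):
    """Return (end_positions, states); positions are 1-based, states are integers."""
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)   # bit i-1 <-> pattern position i
    accept = 1 << (m - 1)                          # position m: whole pattern matched
    state, ends, states = 0, [], []
    for pos, ch in enumerate(text, start=1):
        # bit shift, OR with position 1, AND with the mask bits of the character
        state = ((state << 1) | 1) & masks.get(ch, 0)
        states.append(state)
        if state & accept:
            ends.append(pos)
    return ends, states


def bits(state, m):
    """Render a state with position 1 on the left, as on the slide."""
    return format(state, f"0{m}b")[::-1]


ends, states = shift_and("aabaacaabacab", "aabac")
print(ends)                                # [11]
print([bits(s, 5) for s in states[:4]])    # ['10000', '11000', '00100', '10010']
```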
<13/32>
Properties of Shift-And approach
Simple, but very fast when the pattern length m is not greater than the word length of typical computers (32 or 64).
Assuming m ≤ 32 (or 64) and that bit-shift operations and bitwise logical operations on integers can be performed in constant time, it runs in O(n) time.
This method has many variations: generalized pattern matching, pattern matching with k mismatches, and pattern matching for multiple patterns.
<14/32>
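Basic idea of our algorithm
text:      aabaacaabacab
pattern:   aabac
mask bits: a = 11010, b = 00100, c = 00001
[Figure: the Shift-And states 10000, 11000, 00100, 10010, 11000, 00000, 10000, 11000, 00100, 10010, 00001, 10000, 00000 are shown under the text. Instead of visiting every state, the machine reads one code of the compressed text at a time and jumps ("Jump! Jump!") directly to the state at the end of the corresponding phrase. Can each jump be done in O(1) time?]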
<15/32>
Basic idea of our algorithm
text:      aabaacaabacab
pattern:   aabac
mask bits: a = 11010, b = 00100, c = 00001
[Figure: the same state sequence and jumps over the compressed text. A jump over a whole phrase skips the intermediate states, including the state 00001 where the pattern was found.]
We need a mechanism for reporting all pattern occurrences.
<16/32>
Technical details
Lemma 1 (Realization of ‘Jump’). The state transition function can be realized in O(|D|+m) time using O(|D|) space, and it returns its value in O(1) time.
Lemma 2 (Realization of ‘Output’). The procedure which enumerates the pattern occurrences can be realized in O(|D|+m) time using O(|D|) space, and it runs in O(r) time.
|D| : size of the dictionary trie
m : pattern length
r : number of pattern occurrences
<17/32>
Overview of the algorithm
Input. Pattern P and an LZW compressed text $u_1, u_2, \dots, u_n$.
Output. All occurrences of the pattern.

Construct mask bits from P.
Initialize the dictionary trie, $\hat{M}$, U, and V;
l := 0; S := ∅;
for i := 1 to n do begin
    for each d ∈ Output(S, $u_i$) do
        report ‘pattern occurs at position l+d’;
    S := $\hat{f}$(S, $u_i$);   /* Jump the state! */
    l := l + |$u_i$|;   /* increment the offset */
    Update the dictionary trie, $\hat{M}$, U, and V;
end
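The loop above can be turned into a compact Python sketch. Several points are assumptions of this sketch rather than details confirmed by the slides: codes 1 to |Σ| denote the single characters; the dictionary trie is updated decoder-style just before each code is processed; U(u) is stored mirrored so that its intersection with the state is a single AND; and V(u) is represented by a link `occ` to the nearest node on the root path where a whole occurrence ends, which is only one possible way to enumerate it. All identifiers (`search_lzw`, `u_rev`, `occ`) are hypothetical.

```python
def search_lzw(codes, pattern, alphabet):
    """Return the 1-based end positions of `pattern` in the text encoded by `codes`."""
    m = len(pattern)
    top = 1 << (m - 1)                    # bit for position m (pattern fully matched)
    mask = {ch: 0 for ch in alphabet}     # mask bits M(a); bit i-1 stands for position i
    for i, ch in enumerate(pattern):
        mask[ch] |= 1 << i                # assumes every pattern character is in `alphabet`

    # Dictionary trie as parallel lists indexed by code; node 0 is the root (empty string).
    parent, first, depth = [0], [None], [0]
    m_hat = [(1 << m) - 1]                # M^ of the empty string is {1..m}
    u_rev = [0]                           # U(u) mirrored: j in U(u) <-> bit m-j-1 is set
    occ = [0]                             # nearest node on the root path where a whole
                                          # occurrence ends (0 = none); stands in for V(u)

    def add_node(p, ch):
        node, d = len(parent), depth[p] + 1
        mh = ((m_hat[p] << 1) | 1) & mask[ch]      # M^(ua) from M^(u)
        ends = bool(mh & top)                      # does an occurrence (or suffix) end here?
        parent.append(p)
        first.append(ch if p == 0 else first[p])
        depth.append(d)
        m_hat.append(mh)
        u_rev.append(u_rev[p] | (1 << (m - d - 1)) if ends and d < m else u_rev[p])
        occ.append(node if ends and d >= m else occ[p])
        return node

    for ch in alphabet:                   # codes 1..|alphabet| are the single characters
        add_node(0, ch)

    found, state, offset, prev = [], 0, 0, None
    for code in codes:
        if prev is not None:              # decoder-style dictionary update
            ch = first[code] if code < len(parent) else first[prev]
            add_node(prev, ch)
        # Output(S, u), part 1: occurrences that start before u and end inside it.
        hits = state & u_rev[code]
        while hits:
            low = hits & -hits
            found.append(offset + m - low.bit_length())
            hits ^= low
        # Output(S, u), part 2: occurrences lying entirely inside u.
        node = occ[code]
        while node:
            found.append(offset + depth[node])
            node = occ[parent[node]]
        # Jump: state := ((S shifted by |u|) OR {1..|u|}) AND M^(u).
        d = depth[code]
        state = ((state << d) | ((1 << d) - 1)) & m_hat[code]
        offset += d
        prev = code
    return sorted(found)


# The text aabaacaabacab is encoded by LZW as 1 1 2 4 3 4 6 8 2 (a=1, b=2, c=3).
print(search_lzw([1, 1, 2, 4, 3, 4, 6, 8, 2], "aabac", "abc"))  # [11]
```

Python integers are unbounded, so the shift by |u| needs no special care here; in a fixed-width language the case |u| ≥ word length would have to be handled separately.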
<19/32>
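Detail of ‘Jump’
For $a \in \Sigma$, $u \in \Sigma^*$, and $S \subseteq \{1, \dots, m\}$:
\[ f(S, a) := ((S \oplus 1) \cup \{1\}) \cap M(a), \qquad M(a) := \{\, 1 \le i \le m \mid \mathrm{Pattern}[i] = a \,\}, \]
where $S \oplus k = \{\, i + k \mid i \in S \,\}$. On bit vectors, $\oplus 1$ is a bit shift, $\cup \{1\}$ is an OR, and $\cap M(a)$ is an AND.
Example (pattern aabac): mask bits M(a) = {1, 2, 4} = 11010, M(b) = {3} = 00100, M(c) = {5} = 00001; the state S = {1, 3} is the bit vector 10100.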
<20/32>
Detail of ‘Jump’
For $a \in \Sigma$, $u \in \Sigma^*$, and $S \subseteq \{1, \dots, m\}$, let
\[ f(S, a) := ((S \oplus 1) \cup \{1\}) \cap M(a), \qquad M(a) := \{\, 1 \le i \le m \mid \mathrm{Pattern}[i] = a \,\}, \]
where $S \oplus k = \{\, i + k \mid i \in S \,\}$. Define $\hat{f}$ recursively:
\[ \hat{f}(S, \varepsilon) := S, \qquad \hat{f}(S, ua) := f(\hat{f}(S, u), a), \]
and let
\[ \hat{M}(u) := \hat{f}(\{1, \dots, m\}, u). \]
Then
\[ \hat{f}(S, u) = ((S \oplus |u|) \cup \{1, \dots, |u|\}) \cap \hat{M}(u), \]
which can be evaluated in O(1) time.
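The identity for the jump can be checked by brute force on small inputs. The sketch below is illustrative only: the helper names (`build_masks`, `f`, `f_hat`, `jump`, `check`) are introduced here, position 1 is kept in the least significant bit, and the value of M̂(u) is recomputed on the fly, whereas the algorithm would look it up in the dictionary trie.

```python
from itertools import product


def build_masks(pattern):
    """Bit i-1 of masks[c] is set iff pattern position i holds character c."""
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    return masks


def f(state, ch, masks):
    """One Shift-And step: bit shift, OR with position 1, AND with M(ch)."""
    return ((state << 1) | 1) & masks.get(ch, 0)


def f_hat(state, u, masks):
    """Recursive definition: apply f once per character of u."""
    for ch in u:
        state = f(state, ch, masks)
    return state


def jump(state, u, masks, m):
    """Closed form: ((S shifted by |u|) OR {1..|u|}) AND M^(u)."""
    m_hat = f_hat((1 << m) - 1, u, masks)   # would be stored in the trie node for u
    k = len(u)
    return ((state << k) | ((1 << k) - 1)) & m_hat


def check(pattern, alphabet, max_len):
    """Compare both definitions for every state and every string up to max_len."""
    masks, m = build_masks(pattern), len(pattern)
    for k in range(max_len + 1):
        for u in map("".join, product(alphabet, repeat=k)):
            for state in range(1 << m):
                if jump(state, u, masks, m) != f_hat(state, u, masks):
                    return False
    return True


print(check("aabac", "abc", 4))  # True
```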
<21/32>
Move of $\hat{f}(S, u)$
[Figure: text aabaacaabacab, pattern aabac. The current state is 10000 and the next phrase is u = aba, with $\hat{M}(u)$ = 10010. The state is shifted by |u| = 3 positions, the first three bits are set to 111, and the AND with $\hat{M}(u)$ gives the new state 10010.]
<22/32>
Move of $\hat{f}(S, u)$
[Figure: text aabaacaabacab, pattern aabac. The current state is 10010 and the next phrase is u = acaabac, with $\hat{M}(u)$ = 00001. After the shift by |u| positions and the setting of the leading bits, the AND with $\hat{M}(u)$ gives the new state 00001.]
<23/32>
How to calculate $\hat{M}(u)$
\begin{align*}
\hat{M}(ua) &= \hat{f}(\{1, \dots, m\}, ua) \\
            &= f(\hat{f}(\{1, \dots, m\}, u), a) \\
            &= f(\hat{M}(u), a) \\
            &= ((\hat{M}(u) \oplus 1) \cup \{1\}) \cap M(a)
\end{align*}
[Figure: in the dictionary trie D, the node u stores $\hat{M}(u)$, and its child ua, reached by the edge labeled a, stores $\hat{M}(ua)$.]
Each update takes O(1) time; total: O(|D|) time and space.
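As a small worked instance of this recurrence, consider the pattern aabac from the earlier slides, with M(a) = {1, 2, 4} and M(b) = {3}, and follow the path a, ab, aba in a dictionary trie. The path is chosen here for illustration, and $S \oplus 1$ denotes the set S with every element increased by one.

```latex
\begin{align*}
\hat{M}(a)   &= ((\{1,\dots,5\} \oplus 1) \cup \{1\}) \cap M(a) = \{1,\dots,6\} \cap \{1,2,4\} = \{1,2,4\},\\
\hat{M}(ab)  &= ((\{1,2,4\} \oplus 1) \cup \{1\}) \cap M(b) = \{1,2,3,5\} \cap \{3\} = \{3\},\\
\hat{M}(aba) &= ((\{3\} \oplus 1) \cup \{1\}) \cap M(a) = \{1,4\} \cap \{1,2,4\} = \{1,4\}.
\end{align*}
```

As a bit vector, {1, 4} is 10010, which agrees with the value shown for the phrase aba in the ‘Jump’ example.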
<24/32>
How to enumerate the occurrences
\[ \mathrm{Output}(S, u) = \{\, 1 \le j \le |u| \mid m \in \hat{f}(S, u[1..j]) \,\}, \qquad S \in 2^{\{1, \dots, m\}},\ u \in D. \]
[Figure: the text read so far ends with the length-i prefix of the pattern for the largest i ∈ S, followed by the phrase u. Pattern occurrences end at positions 2 and 11 of u, so Output(S, u) = {2, 11}.]
<25/32>
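Realization of Output(S, u)
Two subsets U and V:
\[ U(u) := \{\, 1 \le i \le |u| \mid i < m \text{ and } u[1..i] = \mathrm{Pattern}[m-i+1..m] \,\} \]
\[ V(u) := \{\, 1 \le i \le |u| \mid i \ge m \text{ and } u[i-m+1..i] = \mathrm{Pattern} \,\} \]
\[ \mathrm{Output}(S, u) = ((m \ominus S) \cap U(u)) \cup V(u), \qquad m \ominus S = \{\, m - i \mid i \in S \,\}. \]
The term $(m \ominus S) \cap U(u)$ is dependent on S; V(u) is independent of S.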
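The decomposition of Output(S, u) into an S-dependent part and an S-independent part can be checked directly from the definitions. The sketch below uses Python sets instead of bit vectors to stay close to the notation; the function names are introduced here for illustration, and m ⊖ S is taken to mean the set {m − i | i ∈ S}.

```python
from itertools import combinations


def f(S, a, pattern):
    """One step of the state transition on sets: ((S + 1) ∪ {1}) ∩ M(a)."""
    m = len(pattern)
    shifted = {i + 1 for i in S} | {1}
    return {i for i in shifted if i <= m and pattern[i - 1] == a}


def output_by_definition(S, u, pattern):
    """Positions j of u at which the state reached after u[1..j] contains m."""
    m, out = len(pattern), set()
    for j, a in enumerate(u, start=1):
        S = f(S, a, pattern)
        if m in S:
            out.add(j)
    return out


def U(u, pattern):
    """Prefixes of u (shorter than m) that are suffixes of the pattern."""
    m = len(pattern)
    return {i for i in range(1, len(u) + 1) if i < m and u[:i] == pattern[m - i:]}


def V(u, pattern):
    """End positions of whole pattern occurrences inside u."""
    m = len(pattern)
    return {i for i in range(1, len(u) + 1) if i >= m and u[i - m:i] == pattern}


def output_by_formula(S, u, pattern):
    m = len(pattern)
    return ({m - i for i in S} & U(u, pattern)) | V(u, pattern)


def all_states(m):
    """Every subset of {1..m}."""
    items = range(1, m + 1)
    for k in range(m + 1):
        for combo in combinations(items, k):
            yield set(combo)


pattern = "aabac"
ok = all(
    output_by_definition(S, u, pattern) == output_by_formula(S, u, pattern)
    for u in ["ca", "bac", "acaabac", "aabacaabac", "abab"]
    for S in all_states(len(pattern))
)
print(ok)  # True
```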
<26/32>
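How to calculate U(u) and V(u)
if $m \in \hat{M}(ua)$ and $|ua| < m$ then $U(ua) = U(u) \cup \{|ua|\}$ else $U(ua) = U(u)$;
(the condition $|ua| < m$ comes from the definition of U). Each update takes O(1) time.
V(u) can be handled in the same way as in [DCC’98].
[Figure: in the dictionary trie D, the node u stores U(u) and V(u), and its child ua, reached by the edge labeled a, stores U(ua) and V(ua).]
total: O(|D|) time and space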
-- Is this really practical? --
But... is it really fast? Uhmm....
<28/32>
Experimental Comparisons
◆ Method 1: Compressed Text → Decompress! → Shift-And
◆ Method 2: Compressed Text → Our previous algorithm (DCC’98)
◆ Method 3: Compressed Text → Our new algorithm
<29/32>
Experimental Comparisons
Original Text: "The Brown corpus", 6.8 Mbytes
Compressed Text: 3.4 Mbytes, compressed with compress (UNIX command)
Language: C (with gcc compiler)
Machine: Sun SPARCstation 20 with remote disk storage
File transfer rate: 0.96 Mbyte/sec
<30/32>
Experimental results
Method                            CPU time (s)   elapsed time (s)
Shift-And with decompression      7.52           8.16
Our previous algorithm (DCC’98)   6.57           7.31
New algorithm                     5.15           6.05
Elapsed time = CPU time + File I/O time.
The new algorithm is 1.3 times faster than our previous algorithm and 1.5 times faster than Shift-And with decompression (by CPU time).
<31/32>
Experimental results
Method                            CPU time (s)   elapsed time (s)
Shift-And in original text        3.09           9.36
Shift-And with decompression      7.52           8.16
Our previous algorithm (DCC’98)   6.57           7.31
New algorithm                     5.15           6.05
<32/32>
Conclusion
The proposed algorithm scans an LZW compressed text in O(n+r) time using O(|D|) space and reports all occurrences of the pattern, after O(m+|Σ|) time and O(|Σ|) space preprocessing.
We implemented the algorithm and showed that it is approximately 1.3 times faster than our previous algorithm.
Our new algorithm has several extensions: generalized pattern matching, pattern matching with k mismatches, and pattern matching for multiple patterns.