Efficient LZ78 factorization of grammar compressed text
Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda
Kyushu University, Japan
SPIRE 2012 @ Cartagena, Colombia
Outline
• Background
  • LZ78 Factorization
  • Straight Line Programs (SLP)
• Algorithms
  • LZ78 factorization using suffix trees
  • SLP to LZ78
  • Improvements
Background
This work: LZ78 factorization of grammar-compressed strings
Compressed String Processing (CSP)
• Compress a string for storage … but don’t decompress all of it when using it!
• Processing the compressed representation directly (pattern matching, edit distance, pattern mining, etc.) can be faster than processing the uncompressed text, by exploiting the regularities identified by compression.
• Regard compression as a generic preprocessing step!
[Figure: a BIG string and its compressed representation, which is processed directly]
LZ78 Factorization [Ziv & Lempel ’78]
The LZ78 factorization of a string S is a factorization S = f1 f2 ... fm,
where fi is the longest prefix of fi ... fm such that
fi = fj c for some 0 ≤ j < i and some character c (let f0 = ε).
Example: S = a l a b a r a l a l a b a r d a $
f1 = (0, a) = a
f2 = (0, l) = l
f3 = (1, b) = ab
f4 = (1, r) = ar
f5 = (1, l) = al
f6 = (5, a) = ala
f7 = (0, b) = b
f8 = (4, d) = ard
f9 = (1, $) = a$
[Figure: LZ78 trie of S, with root 0 and one node per factor, numbered 1 to 9]
O(N log σ) time, O(m) space
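The definition above can be turned into a short trie-based routine. The following is a minimal Python sketch, not the authors' implementation: the function names `lz78_factorize` and `factor_strings` are hypothetical, and the trie uses hash maps, so the running time is expected O(N) rather than the O(N log σ) of a comparison-based trie. If the input ends in the middle of an existing factor, the sketch emits a final pair with an empty character, which is an assumption made here for completeness.

```python
def lz78_factorize(s):
    """Return the LZ78 factorization of s as (j, c) pairs, meaning f_i = f_j + c."""
    trie = [{}]          # trie[k]: dict from character to child node id; node 0 is f_0 = ε
    factors = []
    node = 0             # trie node matched so far for the current factor
    for ch in s:
        child = trie[node].get(ch)
        if child is not None:
            node = child             # keep extending the longest previous factor
            continue
        trie[node][ch] = len(trie)   # new trie node = new factor
        trie.append({})
        factors.append((node, ch))
        node = 0
    if node != 0:
        factors.append((node, ""))   # assumption: input ended inside a known factor
    return factors


def factor_strings(factors):
    """Decode (j, c) pairs back into the factor strings f_1 ... f_m."""
    out = [""]                       # out[0] = f_0 = ε
    for j, c in factors:
        out.append(out[j] + c)
    return out[1:]


if __name__ == "__main__":
    print(factor_strings(lz78_factorize("aabaababaabab")))
    # ['a', 'ab', 'aa', 'b', 'aba', 'abab']
```

On the string aabaababaabab, which is the running example of the later slides, this yields the six factors a, ab, aa, b, aba, abab.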
Straight Line Programs
• A CFG in Chomsky normal form that derives a single string.
• Can efficiently model the outputs of many compression algorithms: Re-Pair, SEQUITUR, LZ78, etc.
Example: SLP with n = 7
X1 = a
X2 = b
X3 = X1 X2
X4 = X1 X3
X5 = X4 X3
X6 = X4 X5
X7 = X6 X5
S = a a b a a b a b a a b a b
[Figure: derivation tree of X7, whose leaves spell S]
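To make the example concrete, here is a small Python sketch of the SLP above. The dictionary encoding and the helper names `lengths` and `expand` are assumptions chosen for illustration; the sketch also assumes that the children of every variable have smaller indices than the variable itself, as in the example.

```python
# Each variable maps to a terminal character or to a pair (left, right) of variables.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def lengths(slp):
    """Compute |Xi| for every variable in O(n) time, without decompressing."""
    length = {}
    for i in sorted(slp):            # children are processed before their parents
        rule = slp[i]
        if isinstance(rule, str):
            length[i] = 1
        else:
            length[i] = length[rule[0]] + length[rule[1]]
    return length


def expand(slp, i):
    """Fully decompress Xi; takes time proportional to |Xi|, for checking only."""
    rule = slp[i]
    if isinstance(rule, str):
        return rule
    return expand(slp, rule[0]) + expand(slp, rule[1])


if __name__ == "__main__":
    print(expand(SLP, 7), lengths(SLP)[7])   # aabaababaabab 13
```

The lengths are what the later steps rely on: they can be computed from the grammar alone, whereas `expand` is exactly the full decompression that the SLP-based algorithms try to avoid.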
Problem: SLP to LZ78
Input: SLP
Output: LZ78 factorization (trie)
X1 = a
X2 = b
X3 = X1 X2
X4 = X1 X3
X5 = X4 X3
X6 = X4 X5
X7 = X6 X5
[Figure: LZ78 trie of the derived string, with root 0 and nodes 1 to 6]
Why “re-compress” a compressed representation?
• Convert the representation: some CSP algorithms require a specific compression format.
• Re-compress an SLP modified by ad-hoc edits: dynamic compressed texts.
• Compute the Normalized Compression Distance [Li et al. 2004]: clustering & classification w/o decompression, using C_LZ78(x), C_LZ78(y), C_LZ78(xy) computed from SLPs of x, y.
Make Sleeping Files Walk in their Sleep!
Our Results
Algorithms to compute LZ78 from SLP

Algorithm                       Time                  Space
Direct (uncompressed)           O(N log σ)            O(m)
Decompress + Direct             O(N log σ)            O(n + m)
SLP (partial decompressions)    O(nN^½ + m log N)     O(nN^½ + m)
SLP + Doubling                  O(nL + m log N)       O(nL + m)
SLP + Redundancy Reduction      O(Nα + m log N)       O(Nα + m)

N : length of uncompressed string S
n : size of SLP representing S
σ : alphabet size
m : # of LZ78 factors (O(N/log N) for constant σ)
L : length of longest LZ78 factor
Nα = N – α ≤ N, where α ≥ 0 is a quantity that represents the amount of redundancy in the string that is captured by the SLP
Suffix Tree & LZ78
The LZ78 trie can be superimposed on the suffix tree.
S = a a b a a b a b a a b a b (positions 1 to 13)
[Figure: suffix tree of S, with leaves labeled by suffix start positions 1 to 13, next to the LZ78 trie of S (root 0, nodes 1 to 6), and the trie superimposed on the suffix tree]
LZ78 Factorization on Suffix Tree
S = a a b a a b a b a a b a b (positions 1 to 13)
[Figure: suffix tree of S with the nodes of the LZ78 trie marked, and the current position i]
Build the LZ78 trie on top of the suffix tree ST; nodes corresponding to the LZ78 trie are marked. For each factor:
1. The next factor is a prefix of S[i:N]. Find the node in ST corresponding to S[i:N].
2. Find the longest prefix of S[i:N] in the LZ78 trie: O(1) time by dynamic nearest marked ancestor (nma) queries [Westbrook ’92].
3. Make a new node of the LZ78 trie on ST: O(1) time by a level ancestor (la) query on ST [Berkman & Vishkin ’94].
4. Compute the next position: i ← i + |fi|.
LZ78 factorization in O(m) time, given a suffix tree preprocessed for nma & la queries.
Our algorithm: SLP to LZ78
Key Observation
For any string of length N, the length of any LZ78 factor fi satisfies:
|fi| ≤ cN = (2N + ¼)^½ – ½ = O(N^½)
Main Idea
• We only need a suffix tree that contains all distinct substrings of S with length at most cN.
• Build a generalized suffix tree (GST) from a set of substrings of S that together contain all distinct length-cN substrings of S.
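The bound cN in the key observation can be seen as follows: every factor fi = fj c extends an earlier factor, so a factor of length k forces earlier, pairwise distinct factors of lengths 1, 2, ..., k – 1, and therefore 1 + 2 + ... + k = k(k + 1)/2 ≤ N. The Python sketch below computes the largest integer k satisfying this inequality, which equals the floor of (2N + ¼)^½ – ½, and compares it with the longest factor of two small strings. The function names `c_bound` and `longest_lz78_factor` are hypothetical.

```python
import math


def c_bound(n):
    """Largest integer k with k(k+1)/2 <= n, i.e. floor(sqrt(2n + 1/4) - 1/2)."""
    return (math.isqrt(8 * n + 1) - 1) // 2


def longest_lz78_factor(s):
    """Length of the longest LZ78 factor of s, using a hash-map trie."""
    trie, node, depth, best = [{}], 0, 0, 0
    for ch in s:
        depth += 1
        nxt = trie[node].get(ch)
        if nxt is None:
            trie[node][ch] = len(trie)   # factor ends here; add its trie node
            trie.append({})
            best = max(best, depth)
            node, depth = 0, 0
        else:
            node = nxt
    return max(best, depth)              # depth > 0 if the input ended mid-factor


if __name__ == "__main__":
    print(c_bound(13), longest_lz78_factor("aabaababaabab"))   # 4 4
    print(c_bound(10), longest_lz78_factor("a" * 10))          # 4 4
```

The string a^10 factorizes into a, aa, aaa, aaaa, so the bound is tight there.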
Important Concept: Stabbing
Xi stabs an interval [u:v] of S when it is the shortest variable that derives the interval (any interval is stabbed by a unique variable).
X1 = a
X2 = b
X3 = X1 X2
X4 = X1 X3
X5 = X4 X3
X6 = X4 X5
X7 = X6 X5
e.g.: aaba at [9:12] is stabbed by X5
[Figure: derivation tree of X7 over S = a a b a a b a b a a b a b, positions 1 to 13]
Substrings stabbed by Xi
All length-q substrings stabbed by Xi are contained in a string ti(q) of length at most 2(q – 1).
[Figure: Xi = Xl(i) Xr(i); ti(q) spans the last q – 1 characters of Xl(i) and the first q – 1 characters of Xr(i), and contains every length-q window that crosses the boundary]
Any length-q substring of S is stabbed by some unique variable Xi, and therefore is a substring of some ti(q).
{ ti(cN) : |Xi| ≥ cN, 1 ≤ i ≤ n } will contain all distinct length-cN substrings of S.
LZ78 Factorization from SLP
Algorithm:
1. Compute { ti(cN) : |Xi| ≥ cN, 1 ≤ i ≤ n }
2. Build the generalized suffix tree (GST) for the strings { ti(cN) : |Xi| ≥ cN, 1 ≤ i ≤ n }
3. Run the LZ78 factorization algorithm using the GST
O(ncN) time/space
Example N = 13, cN = 4, n = 7
{ t5(4), t6(4), t7(4) } = { aabab, aabaab, babaab }
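The strings ti(q) in this example can be computed bottom-up from the grammar, keeping only the first and last q – 1 characters of each variable. The following Python sketch is one way to do this in O(nq) time; the dictionary encoding of the SLP and the name `boundary_strings` are assumptions for illustration, and the sketch assumes q ≥ 2 and that children have smaller indices than their parents.

```python
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def boundary_strings(slp, q):
    """Return {i: t_i(q)} for every non-terminal variable Xi with |Xi| >= q (q >= 2)."""
    k = q - 1
    length, pre, suf, t = {}, {}, {}, {}
    for i in sorted(slp):                      # children before parents
        rule = slp[i]
        if isinstance(rule, str):
            length[i], pre[i], suf[i] = 1, rule, rule
            continue
        left, right = rule
        length[i] = length[left] + length[right]
        pre[i] = (pre[left] + pre[right])[:k]  # first min(k, |Xi|) characters of Xi
        suf[i] = (suf[left] + suf[right])[-k:] # last min(k, |Xi|) characters of Xi
        if length[i] >= q:
            t[i] = suf[left] + pre[right]      # window around the boundary of Xi
    return t


if __name__ == "__main__":
    print(boundary_strings(SLP, 4))
    # {5: 'aabab', 6: 'aabaab', 7: 'babaab'}
```

Every length-4 substring of S = aabaababaabab (aaba, abaa, baab, abab, baba) occurs in at least one of the three strings, and their total length is 17.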
[Figure: derivation tree of X7 over S = a a b a a b a b a a b a b, positions 1 to 13]
GST & LZ78 Factors
The LZ78 trie superimposed on the GST of {t5(4), t6(4), t7(4)}
S = a a b a a b a b a a b a b (positions 1 to 13)
t5(4) t6(4) t7(4) = aabab aabaab babaab (positions 1 to 17)
[Figure: GST of {t5(4), t6(4), t7(4)}, with leaves labeled by positions in the concatenation, next to the LZ78 trie of S (root 0, nodes 1 to 6)]
LZ78 Factorization on GST
S = a a b a a b a b a a b a b (positions 1 to 13), cN = 4
t5(4) t6(4) t7(4) = aabab aabaab babaab (positions 1 to 17)
[Figure: derivation tree of X7 over S and the GST of {t5(4), t6(4), t7(4)}, with LZ78 trie nodes marked and the current position i]
For each factor:
1. The next factor is a prefix of S[i:N]. Find the node in the GST corresponding to S[i:N]: O(log N) time with random access on the SLP [Bille et al. 2011].
2. Find the longest prefix of S[i:N] in the LZ78 trie: O(1) time with dynamic nma queries.
3. Make a new node for the LZ78 trie on the GST: O(1) time with dynamic nma queries.
4. Compute the next position: i ← i + |fi|.
LZ78 factorization can be computed in O(m log N) time, given the GST preprocessed for nma & la queries and the SLP preprocessed for random access queries.
Summary of Basic Algorithm

Algorithm                Time                 Space
Direct (uncompressed)    O(N log σ)           O(m)
Decompress + Direct      O(N log σ)           O(n + m)
SLP                      O(ncN + m log N)     O(ncN + m)

cN = O(N^½)
Extreme cases:
• If the string is compressible, n = O(log N), m = O(N^½), so O(ncN + m log N) = O(N^½ log N) = o(N).
• If the string is not compressible, n, m = O(N) and O(ncN + m log N) = O(N^1.5).
Can we do better than just reverting to decompress & process?
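The two extreme cases can be checked by substituting the stated bounds into the running time. The worked arithmetic below is a sketch that takes the values of n, m and cN exactly as given on this slide.

```latex
\begin{align*}
\text{compressible:}\quad
  n c_N + m \log N
  &= O(\log N)\cdot O(\sqrt{N}) + O(\sqrt{N})\cdot \log N
   = O(\sqrt{N}\,\log N) = o(N),\\
\text{not compressible:}\quad
  n c_N + m \log N
  &= O(N)\cdot O(\sqrt{N}) + O(N)\cdot \log N
   = O(N^{1.5}).
\end{align*}
```

In the second case the n cN term dominates, which is what the two improvements that follow target.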
(1) Improving ncN term to nL ≤ ncN
• Let L denote the length of the longest LZ78 factor of S.
• We built the GST for distinct substrings of length at most cN, but actually we only need substrings of length at most L.
• However, L is not known beforehand…
Doubling Technique:
• Assume L = 2 and run the algorithm.
• If the LZ78 trie expands beyond the GST, set L ← 2 × L, rebuild the GST and the LZ78 trie, and continue.
• Total time complexity for rebuilds: Σ_{i=1..log L} O(n 2^i + m) = O(nL + m log L)
Result: O(ncN + m log N) time, O(ncN + m) space → O(nL + m log N) time, O(nL + m) space
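The rebuild cost is a geometric sum. As a sketch, assume L ≥ 2, that the successive guesses for L are 2, 4, 8, ..., so the last guess is below 2L, and that a rebuild with guess 2^i costs O(n 2^i + m) as stated above:

```latex
\begin{align*}
\sum_{i=1}^{\lceil \log_2 L \rceil} \left( n\,2^{i} + m \right)
  &= n\left( 2^{\lceil \log_2 L \rceil + 1} - 2 \right) + m \lceil \log_2 L \rceil \\
  &< 4nL + m \lceil \log_2 L \rceil \\
  &= O(nL + m \log L).
\end{align*}
```

Since L ≤ N, adding the O(m log N) cost of the factorization itself gives the O(nL + m log N) total.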
(2) Improving ncN term to Nα ≤ N
We can replace the GST with the suffix tree of a trie for q = cN.
Lemma [Goto et al. CPM 2012]
Given an SLP for string S, the set of length-q substrings of S can be represented as paths in a reverse trie of size Nα = N – α(q) ≤ N, where
α(q) = Σ_{i : |Xi| ≥ q} (vOcc(Xi) – 1) (|ti(q)| – (q – 1)) ≥ 0
and vOcc(Xi) is the number of times Xi occurs in the derivation tree. The trie can be computed in time linear in its size.
Lemma [Shibuya 2003]
The suffix tree of a reverse trie can be constructed in linear time.
Nα = O(ncN)
Result: O(ncN + m log N) time, O(ncN + m) space → O(Nα + m log N) time, O(Nα + m) space
Example: Trie of size Nα for q = 4
S = a a b a a b a b a a b a b
[Figure: derivation tree of X7 over S, with the strings ti(4) aggregated into a trie]
Σ|ti(q)| : 17
Text size: 13
Trie size: 11
We can aggregate all ti(q) into a trie of size at most the text size.
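The trie size in this example can be reproduced from the formula Nα = N – α(q), with α(q) = Σ_{i : |Xi| ≥ q} (vOcc(Xi) – 1)(|ti(q)| – (q – 1)). The Python sketch below computes vOcc top-down and then evaluates the sum; the dictionary encoding of the SLP and the name `n_alpha` are assumptions for illustration, and the sketch assumes q ≥ 2 and that children have smaller indices than their parents.

```python
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def n_alpha(slp, root, q):
    """Return (alpha(q), N - alpha(q)) for the string derived by X_root (q >= 2)."""
    order = sorted(slp)
    length = {}
    for i in order:                            # children before parents
        rule = slp[i]
        length[i] = 1 if isinstance(rule, str) else length[rule[0]] + length[rule[1]]

    vocc = dict.fromkeys(order, 0)             # occurrences in the derivation tree
    vocc[root] = 1
    for i in reversed(order):                  # parents before children
        rule = slp[i]
        if not isinstance(rule, str):
            vocc[rule[0]] += vocc[i]
            vocc[rule[1]] += vocc[i]

    alpha = 0
    for i in order:
        rule = slp[i]
        if isinstance(rule, str) or length[i] < q:
            continue
        # |t_i(q)|: up to q-1 characters on each side of the boundary of Xi
        t_len = min(q - 1, length[rule[0]]) + min(q - 1, length[rule[1]])
        alpha += (vocc[i] - 1) * (t_len - (q - 1))
    return alpha, length[root] - alpha


if __name__ == "__main__":
    print(n_alpha(SLP, 7, 4))   # (2, 11)
```

Here only X5 occurs more than once (vOcc(X5) = 2) among the variables of length at least 4, and |t5(4)| = 5, so α(4) = (2 – 1)(5 – 3) = 2 and Nα = 13 – 2 = 11.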
Summary
• Showed an algorithm for SLP → LZ78 factorization
  • at least as fast as naïve decompress & process
  • better when the string is compressible

Algorithm                       Time                  Space
Direct (uncompressed)           O(N log σ)            O(m)
Decompress + Direct             O(N log σ)            O(n + m)
SLP (partial decompressions)    O(nN^½ + m log N)     O(nN^½ + m)
SLP + Doubling                  O(nL + m log N)       O(nL + m)
SLP + Redundancy Reduction      O(Nα + m log N)       O(Nα + m)

N : length of uncompressed string S
n : size of SLP representing S
σ : alphabet size
m : # of LZ78 factors (O(N/log N) for constant σ)
L : length of longest LZ78 factor
Nα = N – α(cN) ≤ N