Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda Kyushu University, Japan SPIRE 2012 @ Cartagena, Colombia

Efficient LZ78 factorization of grammar compressed text

Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda

Kyushu University, Japan

SPIRE 2012 @ Cartagena, Colombia


Outline
• Background
• LZ78 Factorization
• Straight Line Programs (SLP)
• Algorithms
  • LZ78 factorization using suffix trees
  • SLP to LZ78
  • Improvements


Background

Compressed String Processing (CSP)
• Compress the string for storage … but … don't decompress all of it when using it!
• Can be faster than processing the uncompressed text, by exploiting regularities identified by compression.
• Regard compression as generic preprocessing!

[Figure: a BIG String is compressed into a Compressed Representation of String, which is processed directly: Pattern Matching, Edit Distance, Pattern Mining, etc.]

This work: LZ78 factorization of grammar compressed strings


LZ78 Factorization [Ziv&Lempel ’78]

The LZ78-factorization of string S is a factorization S = f1 f2 ... fm, where fi is the longest prefix of fi ... fm such that fi = fj c for some 0 ≤ j < i and some character c (let f0 = ε).

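S = a l a b a r a l a l a b a r d a $

Factors as (index of earlier factor, new character) pairs:
f1 = (0,a)
f2 = (0,l)
f3 = (1,b)
f4 = (1,r)
f5 = (1,l)
f6 = (5,a)
f7 = (0,b)
f8 = (4,d)
f9 = (1,$)

[Figure: LZ78 trie of S, with root 0 and nodes 1 to 9 corresponding to f1 to f9; each edge is labeled with the new character of its factor]

O(N log σ) time, O(m) space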
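The definition above can be tried out with a minimal Python sketch. It is not the authors' code: the function name `lz78_factorize` is made up for illustration, and a hash map stands in for the trie, so the running time is expected O(N) rather than the O(N log σ) stated on the slide.

```python
def lz78_factorize(s):
    """Return the LZ78 factors of s as (parent_index, character) pairs."""
    trie = {}          # (parent node, character) -> child node
    factors = []
    node = 0           # node 0 is the root, i.e. f0 = empty string
    for ch in s:
        child = trie.get((node, ch))
        if child is not None:
            node = child                 # keep extending the current match
        else:
            factors.append((node, ch))   # new factor f_i = f_node followed by ch
            trie[(node, ch)] = len(factors)
            node = 0
    if node != 0:
        # Assumption of this sketch: if the input ends inside a known factor,
        # report the incomplete last factor with an empty character.
        factors.append((node, ""))
    return factors


if __name__ == "__main__":
    for k, factor in enumerate(lz78_factorize("alabaralalabarda$"), start=1):
        print(f"f{k} = {factor}")
```

Running it on the example string prints nine factors, one per trie node.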


Straight Line Programs

• CFG in Chomsky normal form that derives a single string.
• Can efficiently model outputs of many compression algorithms: Re-Pair, SEQUITUR, LZ78, etc.

Straight Line Program (SLP), n = 7

X1 = a
X2 = b
X3 = X1 X2
X4 = X1 X3
X5 = X4 X3
X6 = X4 X5
X7 = X6 X5

[Figure: derivation tree of X7, which derives S = a a b a a b a b a a b a b]
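As a rough illustration of the example SLP, the following Python sketch stores the rules in a dictionary and derives lengths, the full expansion, and single characters. The names `RULES`, `lengths`, `expand` and `char_at` are assumptions of this sketch. `char_at` simply walks down the derivation tree, so it costs time proportional to the tree height; it is not the O(log N) random access structure of Bille et al. mentioned later.

```python
# Rules of the example SLP: a character for a terminal rule,
# or a pair (left, right) of variable indices for a binary rule.
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def lengths(rules):
    """Length of the string derived by each variable.

    Assumes every rule only refers to variables with smaller indices.
    """
    size = {}
    for i in sorted(rules):
        rhs = rules[i]
        size[i] = 1 if isinstance(rhs, str) else size[rhs[0]] + size[rhs[1]]
    return size


def expand(rules, i):
    """Fully decompress X_i (for illustration only; this takes O(N) time)."""
    rhs = rules[i]
    if isinstance(rhs, str):
        return rhs
    return expand(rules, rhs[0]) + expand(rules, rhs[1])


def char_at(rules, size, i, pos):
    """Return the pos-th character (1-indexed) of X_i without decompressing."""
    while not isinstance(rules[i], str):
        left, right = rules[i]
        if pos <= size[left]:
            i = left
        else:
            pos -= size[left]
            i = right
    return rules[i]


if __name__ == "__main__":
    size = lengths(RULES)
    print(size[7], expand(RULES, 7))
```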


Problem: SLP to LZ78
Input: SLP
Output: LZ78 Factorization (Trie)

X1 = a
X2 = b
X3 = X1 X2
X4 = X1 X3
X5 = X4 X3
X6 = X4 X5
X7 = X6 X5

[Figure: LZ78 trie of the derived string, with root 0, nodes 1 to 6, and edges labeled a or b]

Why "re-compress" a compressed representation?
• Convert the representation
  • Some CSP algorithms require a specific compression
• Re-compress an SLP modified by ad-hoc edits
  • Dynamic compressed texts
• Compute Normalized Compression Distance [Li et al. 2004]
  • Clustering & classification w/o decompression
  • C_LZ78(x), C_LZ78(y), C_LZ78(xy) from SLPs of x, y

[Figure: cartoon of a Computer Scientist saying "Make Sleeping Files Walk in their Sleep!"]
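For the Normalized Compression Distance item, the three quantities C(x), C(y), C(xy) are combined as NCD(x, y) = (C(xy) – min(C(x), C(y))) / max(C(x), C(y)). The Python sketch below is only an illustration under two assumptions: the number of LZ78 factors is used as the compressed size C, and the inputs are plain strings rather than SLPs (avoiding that decompression is the point of this work). The names `lz78_count` and `ncd` are hypothetical.

```python
def lz78_count(s):
    """Number of LZ78 factors of s (an incomplete last factor counts too)."""
    trie = {}              # (parent node, character) -> child node
    node, count = 0, 0
    for ch in s:
        child = trie.get((node, ch))
        if child is not None:
            node = child
        else:
            count += 1
            trie[(node, ch)] = count
            node = 0
    return count + (1 if node != 0 else 0)


def ncd(x, y):
    """Normalized Compression Distance with C = number of LZ78 factors."""
    cx, cy, cxy = lz78_count(x), lz78_count(y), lz78_count(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    print(ncd("aaaa", "aaaa"), ncd("aaaa", "bbbb"))
```

On such tiny inputs the values are only indicative, but the identical pair already scores lower than the unrelated pair.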


Our Results

Algorithms to compute LZ78 from SLP

Algorithm                      Time                Space
Direct (uncompressed)          O(N log σ)          O(m)
Decompress + Direct            O(N log σ)          O(n + m)
SLP (partial decompressions)   O(nN^½ + m log N)   O(nN^½ + m)
SLP + Doubling                 O(nL + m log N)     O(nL + m)
SLP + Redundancy Reduction     O(Nα + m log N)     O(Nα + m)

N : length of uncompressed string S
σ : alphabet size
n : size of SLP representing S
L : length of longest LZ78 factor
m : # of LZ78 factors (O(N/log N) for constant σ)
Nα = N – α ≤ N, where α ≥ 0 is a quantity that represents the amount of redundancy in the string that is captured by the SLP


LZ78 Factorization using a Suffix Tree


Suffix Tree & LZ78

The LZ78 trie can be superimposed on the suffix tree

[Figure: suffix tree of S = a a b a a b a b a a b a b (positions 1 to 13), shown next to the LZ78 trie of S (root 0, nodes 1 to 6), and then with the LZ78 trie nodes marked on the suffix tree]

LZ78 Factorization on Suffix Tree

S = a a b a a b a b a a b a b (positions 1 to 13)

Build the LZ78 trie on top of the suffix tree ST: nodes corresponding to the LZ78 trie are marked.

For the factor starting at position i:
1. Next factor is a prefix of S[i:N]. Find the node in ST corresponding to S[i:N].
2. Find the longest prefix of S[i:N] in the LZ78 trie: O(1) time by dynamic nearest marked ancestor queries [Westbrook '92].
3. Make a new node of the LZ78 trie on ST: O(1) time by level ancestor query on ST [Berkman & Vishkin '94].
4. Compute the next position i ← i + |fi|.

LZ78 factorization in O(m) time, given suffix tree preprocessed for nma & la queries.
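The marking idea can be sketched in Python as follows. This is a deliberately naive stand-in, not the O(m) method of the slide: it builds an uncompacted suffix trie (O(N²) nodes) instead of a suffix tree, and answers nearest marked ancestor and level ancestor queries by walking parent pointers instead of using the cited constant-time structures. The function name is hypothetical.

```python
def lz78_via_suffix_trie(s):
    """LZ78 factors of s, found by marking nodes of a naive suffix trie."""
    children = [{}]    # children[v]: character -> child node; node 0 is the root
    parent = [None]
    depth = [0]
    locus = []         # locus[i]: node that spells the suffix s[i:]
    for i in range(len(s)):
        v = 0
        for ch in s[i:]:
            if ch not in children[v]:
                children[v][ch] = len(children)
                children.append({})
                parent.append(v)
                depth.append(depth[v] + 1)
            v = children[v][ch]
        locus.append(v)

    marked = {0}       # marked nodes form the LZ78 trie; the root is f0
    factors = []
    i = 0
    while i < len(s):
        leaf = locus[i]
        # Nearest marked ancestor = longest earlier factor that prefixes s[i:].
        u = leaf
        while u not in marked:
            u = parent[u]
        length = min(depth[u] + 1, len(s) - i)
        # Level ancestor of the leaf at depth `length` becomes the new trie node.
        w = leaf
        while depth[w] > length:
            w = parent[w]
        marked.add(w)
        factors.append(s[i:i + length])
        i += length    # next position: i <- i + |f_i|
    return factors


if __name__ == "__main__":
    print(lz78_via_suffix_trie("aabaababaabab"))
```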


SLP to LZ78


Our algorithm: SLP to LZ78

Key Observation
For any string of length N, the length of any LZ78 factor fi satisfies:
|fi| ≤ cN = (2N + ¼)^½ – ½ = O(N^½)

Main Idea
• We only need a suffix tree that contains all distinct substrings of S with length at most cN.
• Build a generalized suffix tree (GST) from a set of substrings of S that contain all distinct length-cN substrings of S.
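The bound cN follows from a counting argument: a factor of length L needs earlier factors of lengths 1, 2, ..., L – 1 (its own proper prefixes), so 1 + 2 + ... + L = L(L + 1)/2 ≤ N, and solving for L gives L ≤ (2N + ¼)^½ – ½. The Python sketch below evaluates the integer part of this bound exactly; the function name `max_factor_length` is an assumption of this sketch.

```python
import math


def max_factor_length(n):
    """Largest L with L(L + 1)/2 <= n, i.e. floor(sqrt(2n + 1/4) - 1/2).

    Uses sqrt(2n + 1/4) - 1/2 = (sqrt(8n + 1) - 1) / 2 and integer square
    roots to avoid floating point rounding.
    """
    return (math.isqrt(8 * n + 1) - 1) // 2


if __name__ == "__main__":
    for n in (13, 17, 100, 10**6):
        print(n, max_factor_length(n))
```

For N = 13 this gives 4, the value of cN used in the running example.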


Important Concept: Stabbing

Xi stabs an interval [u:v] of S when it is the shortest variable that derives the interval (any interval is stabbed by a unique variable).

X1 = a
X2 = b
X3 = X1 X2
X4 = X1 X3
X5 = X4 X3
X6 = X4 X5
X7 = X6 X5

e.g.: aaba at [9:12] is stabbed by X5

[Figure: derivation tree of X7 over positions 1 to 13, with the interval [9:12] and the node X5 that stabs it highlighted]
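A possible way to find the stabbing variable is to descend from the root of the derivation tree until the interval crosses the boundary between the two children. The Python sketch below does this for the example SLP; `RULES`, `lengths` and `stabbing_variable` are names chosen for this illustration, and positions are 1-indexed and inclusive as on the slide.

```python
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def lengths(rules):
    """Length of the string derived by each variable (children have smaller indices)."""
    size = {}
    for i in sorted(rules):
        rhs = rules[i]
        size[i] = 1 if isinstance(rhs, str) else size[rhs[0]] + size[rhs[1]]
    return size


def stabbing_variable(rules, root, u, v):
    """Index of the variable that stabs the interval [u:v] of the string of X_root."""
    size = lengths(rules)
    i = root
    while not isinstance(rules[i], str):
        left, right = rules[i]
        if v <= size[left]:
            i = left                 # interval lies entirely in the left child
        elif u > size[left]:
            u -= size[left]          # interval lies entirely in the right child
            v -= size[left]
            i = right
        else:
            return i                 # interval crosses the boundary: X_i stabs it
    return i                         # a single position ends at a terminal rule


if __name__ == "__main__":
    print(stabbing_variable(RULES, 7, 9, 12))
```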


Substrings stabbed by Xi

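All length-q substrings stabbed by Xi are contained in a string ti(q) of length at most 2(q – 1).

[Figure: for Xi = Xl(i) Xr(i), the string ti(q) consists of the last q – 1 characters of Xl(i) followed by the first q – 1 characters of Xr(i); every length-q substring stabbed by Xi lies inside it]

Any length-q substring of S is stabbed by some unique variable Xi, and therefore is a substring of some ti(q).

{ ti(cN) : |Xi| ≥ cN, 1 ≤ i ≤ n } will contain all distinct length-cN substrings of S.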
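The strings ti(q) can be computed by partial decompression only. The Python sketch below assumes ti(q) is the suffix of length at most q – 1 of the left child followed by the prefix of length at most q – 1 of the right child; the helper names (`prefix`, `suffix`, `t_strings`) are made up for this illustration.

```python
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def lengths(rules):
    """Length of the string derived by each variable (children have smaller indices)."""
    size = {}
    for i in sorted(rules):
        rhs = rules[i]
        size[i] = 1 if isinstance(rhs, str) else size[rhs[0]] + size[rhs[1]]
    return size


def prefix(rules, i, k):
    """First min(k, |X_i|) characters of X_i."""
    if k <= 0:
        return ""
    rhs = rules[i]
    if isinstance(rhs, str):
        return rhs
    head = prefix(rules, rhs[0], k)
    return head + prefix(rules, rhs[1], k - len(head))


def suffix(rules, i, k):
    """Last min(k, |X_i|) characters of X_i."""
    if k <= 0:
        return ""
    rhs = rules[i]
    if isinstance(rhs, str):
        return rhs
    tail = suffix(rules, rhs[1], k)
    return suffix(rules, rhs[0], k - len(tail)) + tail


def t_strings(rules, q):
    """Map i -> t_i(q) for every binary variable X_i with |X_i| >= q."""
    size = lengths(rules)
    return {
        i: suffix(rules, rhs[0], q - 1) + prefix(rules, rhs[1], q - 1)
        for i, rhs in rules.items()
        if not isinstance(rhs, str) and size[i] >= q
    }


if __name__ == "__main__":
    print(t_strings(RULES, 4))
```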


LZ78 Factorization from SLP

Algorithm:

1. Compute { ti(cN) : |Xi| ≥ cN, 1 ≤ i ≤ n }

2. Build generalized suffix tree (GST) for strings { ti(cN) : |Xi| ≥ cN, 1 ≤ i ≤ n }

3. Run LZ78 Factorization algorithm using GST

Steps 1 and 2 take O(ncN) time/space.


Example N = 13, cN = 4, n = 7

{ t5(4), t6(4), t7(4) } = { aabab, aabaab, babaab }

[Figure: derivation tree of X7 over S (positions 1 to 13), with the strings t5(4), t6(4) and t7(4) indicated]


GST & LZ78 Factors

The LZ78 trie superimposed on GST of {t5(4), t6(4), t7(4)}


[Figure: GST of {t5(4), t6(4), t7(4)} next to the LZ78 trie of S (root 0, nodes 1 to 6), and then with the LZ78 trie nodes marked on the GST. Leaves of the GST are labeled with positions 1 to 17 of the concatenation t5(4) t6(4) t7(4) = aabab aabaab babaab]


LZ78 Factorization on GST

For the factor starting at position i (cN = 4 in the example):
1. Next factor is a prefix of S[i:N]. Find the node in GST corresponding to S[i:N]: O(log N) time w/ random access on SLP [Bille et al. 2011].
2. Find the longest prefix of S[i:N] in the LZ78 trie: O(1) time w/ dynamic nma queries.
3. Make a new node for the LZ78 trie on the GST: O(1) time w/ dynamic nma queries.
4. Compute the next position i ← i + |fi|.

[Figure: three animation steps on the GST of {t5(4), t6(4), t7(4)} (leaf positions 1 to 17) and the derivation tree of X7 (positions 1 to 13), marking LZ78 trie nodes 1, 2 and 3 in turn as position i advances]

LZ78 factorization can be computed in O(m log N) time, given GST preprocessed for nma & la, and SLP preprocessed for random access queries.
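The whole pipeline can be sketched end to end in Python without ever decompressing S in full. This is a simplified illustration, not the authors' implementation: a plain trie of all substrings of length at most cN of the ti(cN) strings stands in for the GST, characters of S are fetched by walking down the derivation tree instead of the O(log N) random access structure, and the LZ78 trie is followed character by character instead of using nma and la queries. All names are assumptions of this sketch, and it assumes the SLP derives at least two characters.

```python
import math

RULES = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def lengths(rules):
    size = {}
    for i in sorted(rules):          # children have smaller indices
        rhs = rules[i]
        size[i] = 1 if isinstance(rhs, str) else size[rhs[0]] + size[rhs[1]]
    return size


def char_at(rules, size, i, pos):
    """pos-th character (1-indexed) of X_i, by descending the derivation tree."""
    while not isinstance(rules[i], str):
        left, right = rules[i]
        if pos <= size[left]:
            i = left
        else:
            pos -= size[left]
            i = right
    return rules[i]


def prefix(rules, i, k):
    if k <= 0:
        return ""
    rhs = rules[i]
    if isinstance(rhs, str):
        return rhs
    head = prefix(rules, rhs[0], k)
    return head + prefix(rules, rhs[1], k - len(head))


def suffix(rules, i, k):
    if k <= 0:
        return ""
    rhs = rules[i]
    if isinstance(rhs, str):
        return rhs
    tail = suffix(rules, rhs[1], k)
    return suffix(rules, rhs[0], k - len(tail)) + tail


def lz78_from_slp(rules, root):
    """LZ78 factors of the string of X_root as (start, length), 1-indexed."""
    size = lengths(rules)
    total = size[root]
    c = max(2, (math.isqrt(8 * total + 1) - 1) // 2)      # bound c_N
    # Step 1: the strings t_i(c_N) for variables with |X_i| >= c_N.
    texts = [suffix(rules, rhs[0], c - 1) + prefix(rules, rhs[1], c - 1)
             for i, rhs in rules.items()
             if not isinstance(rhs, str) and size[i] >= c]
    # Step 2: trie of all substrings of length <= c_N (stand-in for the GST).
    children = [{}]
    for text in texts:
        for start in range(len(text)):
            v = 0
            for ch in text[start:start + c]:
                if ch not in children[v]:
                    children[v][ch] = len(children)
                    children.append({})
                v = children[v][ch]
    # Step 3: factorize, marking the nodes that belong to the LZ78 trie.
    marked = {0}
    factors = []
    i = 1
    while i <= total:
        v, length = 0, 0
        while i + length <= total:
            v = children[v][char_at(rules, size, root, i + length)]
            length += 1
            if v not in marked:      # first unmarked node: new LZ78 trie node
                marked.add(v)
                break
        factors.append((i, length))
        i += length
    return factors


if __name__ == "__main__":
    print(lz78_from_slp(RULES, 7))
```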






Summary of Basic Algorithm

Algorithm               Time               Space
Direct (uncompressed)   O(N log σ)         O(m)
Decompress + Direct     O(N log σ)         O(n + m)
SLP                     O(ncN + m log N)   O(ncN + m)

cN = O(N^½)

Extreme Cases:
• If the string is compressible, n = O(log N), m = O(N^½), so O(ncN + m log N) = O(N^½ log N) = o(N)
• If the string is not compressible, n, m = O(N) and O(ncN + m log N) = O(N^1.5)

Can we do better than just reverting to decompress & process?


(1) Improving ncN term to nL ≤ ncN

O(ncN + m log N) time, O(ncN + m) space → O(nL + m log N) time, O(nL + m) space

• Let L denote the length of the longest LZ78 factor of S.
• We built the GST for distinct substrings of length at most cN, but actually we only need substrings of length at most L.
• However, L is not known beforehand…

Doubling Technique:
• Assume L = 2 and run the algorithm.
• If the LZ78 trie expands beyond the GST, L ← 2×L, rebuild the GST and LZ78 trie, and continue.
• Total time complexity for rebuild: Σ_{i=1..log L} O(n 2^i + m) = O(nL + m log L)
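The control flow of the doubling technique can be sketched in Python as below. Several simplifications are assumed: the input is a plain string rather than an SLP, a trie of all substrings of length at most the current limit stands in for the GST, and after doubling the sketch restarts the factorization from scratch instead of rebuilding the LZ78 trie and continuing. Function names are hypothetical.

```python
def build_trie(s, depth):
    """Trie of all substrings of s of length <= depth (stand-in for the GST)."""
    children = [{}]
    for start in range(len(s)):
        v = 0
        for ch in s[start:start + depth]:
            if ch not in children[v]:
                children[v][ch] = len(children)
                children.append({})
            v = children[v][ch]
    return children


def factorize_with_limit(s, limit):
    """LZ78 factors of s, or None if the LZ78 trie would outgrow the limit."""
    children = build_trie(s, limit)
    marked = {0}
    factors = []
    i = 0
    while i < len(s):
        v, length = 0, 0
        while i + length < len(s):
            nxt = children[v].get(s[i + length])
            if nxt is None:
                return None          # a factor longer than `limit` is needed
            v = nxt
            length += 1
            if v not in marked:
                marked.add(v)
                break
        factors.append(s[i:i + length])
        i += length
    return factors


def lz78_by_doubling(s):
    """Try limits 2, 4, 8, ... until the factorization succeeds."""
    limit = 2
    while True:
        factors = factorize_with_limit(s, limit)
        if factors is not None:
            return factors, limit
        limit *= 2


if __name__ == "__main__":
    print(lz78_by_doubling("aabaababaabab"))
```

Since the limits grow geometrically, the work spent building the tries is dominated by the last round, which is the reason the rebuild cost sums to O(nL + m log L) on the slide.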


(2) Improving ncN term to Nα ≤ N

O(ncN + m log N) time, O(ncN + m) space → O(Nα + m log N) time, O(Nα + m) space

We can replace the GST with the suffix tree of a trie for q = cN.

Lemma [Goto et al. CPM 2012]
Given an SLP for string S, the set of length-q substrings of S can be represented as paths in a reverse trie of size Nα = N – α(q) ≤ N, where
α(q) = Σ_{i : |Xi| ≥ q} (vOcc(Xi) – 1) (|ti(q)| – (q – 1)) ≥ 0
vOcc(Xi) : # of times Xi occurs in the derivation tree
The trie can be computed in time linear in its size.

Lemma [Shibuya 2003]
The suffix tree of a reverse trie can be constructed in linear time.

Nα = O(ncN)


Example: Trie of size Nα for q = 4

[Figure: derivation tree of X7, with the strings ti(4) aggregated into a single trie]

Σ|ti(q)| : 17
Text size: 13
Trie size: 11

We can aggregate all ti(q) into a trie of size at most the text size.
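The numbers in this example can be reproduced from the formula for α(q). The Python sketch below computes vOcc for every variable of the example SLP and then evaluates α(4); it assumes |ti(q)| = min(|Xl(i)|, q – 1) + min(|Xr(i)|, q – 1), and the function names are made up for illustration.

```python
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def lengths(rules):
    """Length of the string derived by each variable (children have smaller indices)."""
    size = {}
    for i in sorted(rules):
        rhs = rules[i]
        size[i] = 1 if isinstance(rhs, str) else size[rhs[0]] + size[rhs[1]]
    return size


def v_occ(rules, root):
    """Number of occurrences of each variable in the derivation tree of X_root."""
    occ = {i: 0 for i in rules}
    occ[root] = 1
    for i in sorted(rules, reverse=True):    # parents before children
        rhs = rules[i]
        if not isinstance(rhs, str):
            occ[rhs[0]] += occ[i]
            occ[rhs[1]] += occ[i]
    return occ


def alpha(rules, root, q):
    """alpha(q) = sum over |X_i| >= q of (vOcc(X_i) - 1) * (|t_i(q)| - (q - 1))."""
    size = lengths(rules)
    occ = v_occ(rules, root)
    total = 0
    for i, rhs in rules.items():
        if isinstance(rhs, str) or size[i] < q or occ[i] == 0:
            continue                         # skip terminals, short or unused variables
        t_len = min(size[rhs[0]], q - 1) + min(size[rhs[1]], q - 1)
        total += (occ[i] - 1) * (t_len - (q - 1))
    return total


if __name__ == "__main__":
    n_total = lengths(RULES)[7]
    print(v_occ(RULES, 7), alpha(RULES, 7, 4), n_total - alpha(RULES, 7, 4))
```

Only X5 occurs more than once among the variables of length at least 4, so it alone contributes to α(4).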


Summary
• Showed algorithm for SLP → LZ78 factorization
  • at least as fast as naïve decompress & process
  • better when string is compressible

Algorithm                      Time                Space
Direct (uncompressed)          O(N log σ)          O(m)
Decompress + Direct            O(N log σ)          O(n + m)
SLP (partial decompressions)   O(nN^½ + m log N)   O(nN^½ + m)
SLP + Doubling                 O(nL + m log N)     O(nL + m)
SLP + Redundancy Reduction     O(Nα + m log N)     O(Nα + m)

N : length of uncompressed string S
σ : alphabet size
n : size of SLP representing S
L : length of longest LZ78 factor
m : # of LZ78 factors (O(N/log N) for constant σ)
Nα = N – α(cN) ≤ N
