Everything is String.

Closed Factorization

Golnaz Badkobeh¹, Hideo Bannai², Keisuke Goto², Tomohiro I², Costas S. Iliopoulos³, Shunsuke Inenaga², Simon J. Puglisi⁴, and Shiho Sugimoto²

1. University of Sheffield, United Kingdom

2. Kyushu University, Japan

3. King’s College London, United Kingdom

4. University of Helsinki, Finland

Closed Strings

• A closed string is a string with a proper substring that occurs as both a prefix and a suffix but has no internal occurrences [Fici, 2011]. That substring is the closing border.
  – A string of length 1 is closed, where the closing border is the empty string ε.

• A closed string has a unique closing border.

[Figure: the closed string abcabacacbaabc with its closing border abc marked at both ends.]
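The definition can be checked directly. The following is a minimal Python sketch rather than anything taken from the slides: the names `closing_border` and `is_closed` are illustrative, and the code relies on the observation that a closing border, when one exists, must be the longest proper border of the string.

```python
def closing_border(s):
    """Return the closing border of s, or None if s is not closed.

    Brute-force check of the definition on a 0-indexed Python string.
    """
    n = len(s)
    if n == 0:
        return None            # the empty string is not treated here (assumption)
    if n == 1:
        return ""              # a single letter is closed with border epsilon
    for k in range(n - 1, 0, -1):          # candidate border lengths, longest first
        b = s[:k]
        if s.endswith(b):
            # b is the longest proper border; s is closed exactly when b
            # occurs only as the prefix and as the suffix.
            occurrences = [p for p in range(n - k + 1) if s.startswith(b, p)]
            return b if occurrences == [0, n - k] else None
    return None                # no non-empty border at all


def is_closed(s):
    return closing_border(s) is not None


print(closing_border("abcabacacbaabc"))   # abc
print(is_closed("abcaba"))                # False: the border "a" also occurs internally
```

Only the longest border needs testing: if a shorter border had no internal occurrence, no longer proper border could exist, which is also why the closing border is unique.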

Our Contribution

• We introduce the Longest Closed Factor Array of a string w and an algorithm that computes it in O(n log n / log log n) time and O(n) space, where n is the length of w.

• We introduce the Closed Factorization of a string w and an algorithm that computes it in O(n) time and space, where n is the length of w.

Definition of Longest Closed Factor Array

• The longest closed factor array of a string w of length n is an array A[1..n] of integers such that, for any 1 ≤ i ≤ n, A[i] is the length of the longest closed prefix of w[i..n].

i = 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
w = a  b  a  b  a  a  c  b  b  b  c  b  c  c  $
A = 5  4  3  5  2  1  6  3  2  4  3  1  2  1  1
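The array in the example can be reproduced straight from the definition. This is a deliberately slow brute-force sketch in Python, intended only as a reference for small inputs; the function names are illustrative and it is not the algorithm presented in the talk.

```python
def is_closed(s):
    """True if the non-empty string s is closed (brute force)."""
    n = len(s)
    if n == 1:
        return True
    for k in range(n - 1, 0, -1):              # longest proper border first
        if s[:k] == s[n - k:]:
            b = s[:k]
            # closed iff the longest border has no internal occurrence
            return all(not s.startswith(b, p) for p in range(1, n - k))
    return False


def longest_closed_factor_array(w):
    """A[i] = length of the longest closed prefix of w[i:] (list is 0-indexed)."""
    n = len(w)
    A = []
    for i in range(n):
        best = 1                                # a single letter is always closed
        for length in range(n - i, 1, -1):      # try the longest prefixes first
            if is_closed(w[i:i + length]):
                best = length
                break
        A.append(best)
    return A


print(longest_closed_factor_array("ababaacbbbcbcc$"))
# [5, 4, 3, 5, 2, 1, 6, 3, 2, 4, 3, 1, 2, 1, 1]
```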

Computing Longest Closed Factor Array

• Theorem 1. Given a string w of length n over an integer alphabet, the longest closed factor array of w can be computed in O(n log n / log log n) time and O(n) space.

• Lemma 1. The longest prefix of w[i..n] that has another occurrence to the right of i is the closing border of the longest closed factor starting at i.

i = 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
w = a  b  a  b  a  a  c  b  b  b  c  b  c  c  $

Outline of Our Algorithm

1. Construct and preprocess the suffix tree of w.

2. i ← 1.

3. Compute the closing border b_i starting at position i.
  – with the suffix tree constructed in Step 1

4. Find the leftmost occurrence j of b_i in w[i+1..n].
  – with a range successor query

5. A[i] ← j + |b_i| - i.

6. i ← i + 1.

7. Repeat Steps 3–6 until i = n.
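The control flow of the outline can be sketched in Python with naive stand-ins for the two data structures. Here plain `str.find` replaces both the suffix tree (Step 3) and the range successor query (Step 4), so this sketch does not achieve the stated time bound; the function name is illustrative.

```python
def lcf_array_naive(w):
    """Longest closed factor array via the outline's steps (0-indexed list)."""
    n = len(w)
    A = [0] * n
    for i in range(n):                      # Steps 2, 6 and 7: i visits every position
        # Step 3 (stand-in for the suffix tree): grow the longest prefix of
        # w[i:] that also starts at some position greater than i.
        k = 0
        while i + k < n and w.find(w[i:i + k + 1], i + 1) != -1:
            k += 1
        if k == 0:
            A[i] = 1                        # empty border: the factor is one letter
            continue
        # Step 4 (stand-in for the range successor query): leftmost
        # occurrence of the border strictly to the right of i.
        j = w.find(w[i:i + k], i + 1)
        # Step 5: the factor runs from i to the end of that occurrence.
        A[i] = j + k - i
    return A


print(lcf_array_naive("ababaacbbbcbcc$"))
# [5, 4, 3, 5, 2, 1, 6, 3, 2, 4, 3, 1, 2, 1, 1]
```

The value j + |b_i| - i is a difference of positions, so it is the same under 0-based and 1-based indexing.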

Step 1

• Construct the suffix tree of a given string w.

• Each leaf of the suffix tree stores the beginning position of the suffix corresponding to the leaf.

• Any internal node v of the suffix tree is labeled by the maximum leaf value in the subtree rooted at v.

[Figure: suffix tree of w = abaaba$ drawn above its suffix array (SA); the leaves store suffix start positions 1 to 7, and the internal nodes a, aba and ba are labeled 6, 4 and 5.]

Step 3

• Compute the closing border b_i starting at position i.
  – Find the highest node x labeled i.
  – The path label from the root to the parent of x is the closing border of the longest closed factor starting at position i.

[Figure: suffix tree of abaaba$ with its max-leaf labels, followed by a schematic suffix tree of w in which x is the highest node labeled i, u is the parent of x, and pathlabel(u) and pathlabel(x) are marked on w starting at position i.]

How do we find node x?

• Traverse the suffix tree from the root.
  – O(|x|) time for a constant alphabet.
  – O(|x| log n) time for an integer alphabet.

• An array P[1..n] enables us to find node x in O(1) time.
  – P[i] contains a pointer to the highest node x in the tree for which i is the maximum leaf value.
  – P can be computed in O(n) time with a pre-order traversal.
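The labeling of Step 1 and the pointer array P can be illustrated with a small Python sketch. As a loudly stated assumption, it uses an uncompressed suffix trie instead of a suffix tree, which costs O(n²) time and space but gives the same border lengths, because the parent of the highest node labeled i still spells the longest prefix of w[i..n] that recurs to the right of i. All class and function names are illustrative.

```python
class Node:
    def __init__(self, parent=None, depth=0):
        self.children = {}
        self.parent = parent
        self.depth = depth       # length of the path label from the root
        self.max_leaf = 0        # largest suffix start position passing through


def build_labeled_trie(w):
    """Suffix trie of w; every non-root node is labeled with its maximum leaf value."""
    root = Node()
    for i in range(1, len(w) + 1):            # 1-indexed suffix start positions
        node = root
        for c in w[i - 1:]:
            child = node.children.get(c)
            if child is None:
                child = Node(node, node.depth + 1)
                node.children[c] = child
            child.max_leaf = i    # i only increases, so the last write is the maximum
            node = child
    return root


def highest_nodes(root, n):
    """P[i] = highest non-root node whose label is i (P[0] is unused)."""
    P = [None] * (n + 1)
    stack = list(root.children.values())
    while stack:                  # parents are always visited before their children
        node = stack.pop()
        if P[node.max_leaf] is None:
            P[node.max_leaf] = node
        stack.extend(node.children.values())
    return P


def closing_border_lengths(w):
    """|b_i| for i = 1..n, read off as the depth of the parent of P[i]."""
    root = build_labeled_trie(w)
    P = highest_nodes(root, len(w))
    return [P[i].parent.depth for i in range(1, len(w) + 1)]


print(closing_border_lengths("abaaba$"))
# [3, 2, 1, 1, 0, 0, 0]
```

The nodes labeled i all lie on the path spelling w[i..n], so they form a chain and the first one met in a top-down traversal is the highest.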

Step 4

[Figure: schematic suffix tree of w in which x is the highest node labeled i and u is the parent of x; the leaves i and h both lie below u, and pathlabel(u) and pathlabel(x) are marked on w.]

h is the successor of i in the set of leaf values below u, that is, the leftmost occurrence j of b_i to the right of i.

• Compute the longest closed factor starting at position i.
  – Use a range successor query data structure for the suffix array [Yu et al., 2011].

• Each internal node v stores the beginning and ending positions of the corresponding range in the suffix array.

• A range successor query needs O(log n / log log n) time for each position i.

[Figure: suffix tree of abaaba$ (positions 1 to 7) with its max-leaf labels and suffix array.]
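What the query of Step 4 returns can be shown on the small example with a Python sketch. Everything here is a naive stand-in: the suffix array is built by sorting, the range of a border is found by a scan, and `range_successor` is a linear pass rather than the data structure of Yu et al., so none of the stated time bounds apply. The function names are illustrative.

```python
def suffix_array(w):
    """Start positions (1-indexed) of the suffixes of w in lexicographic order."""
    return sorted(range(1, len(w) + 1), key=lambda p: w[p - 1:])


def sa_range(w, sa, b):
    """Half-open range [lo, hi) of ranks whose suffixes start with b."""
    hits = [rank for rank, p in enumerate(sa) if w.startswith(b, p - 1)]
    return (hits[0], hits[-1] + 1) if hits else (0, 0)


def range_successor(sa, lo, hi, i):
    """Smallest value in sa[lo:hi] that is greater than i, or None."""
    larger = [p for p in sa[lo:hi] if p > i]
    return min(larger) if larger else None


w = "abaaba$"
sa = suffix_array(w)
print(sa)                                   # [7, 6, 3, 4, 1, 5, 2]

i, border = 1, "aba"                        # closing border at position 1
lo, hi = sa_range(w, sa, border)
j = range_successor(sa, lo, hi, i)
print((lo, hi), j, j + len(border) - i)     # (3, 5) 4 6
```

For position 1 the border aba occupies ranks 3 and 4 of the suffix array, its successor occurrence is at position 4, and the longest closed factor has length 4 + 3 - 1 = 6, namely abaaba.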

Our Result 1

• Given a string w of length n over an integer alphabet, the longest closed factor array of w can be computed in O(n log n / log log n) time and O(n) space.

Definition of Closed Factorization

• The closed factorization of a string w of length n is a sequence (G_0, G_1, …, G_k) of strings such that G_0 = ε, w = G_1…G_k and, for each 1 ≤ j ≤ k, G_j is the longest closed prefix of w[|G_1…G_{j-1}|+1..n].

i = 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
w = a  b  a  b  a  a  c  b  b  b  c  b  c  c  $
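Applied to the example string, the definition yields a greedy left-to-right decomposition. The sketch below is a brute-force Python reading of the definition, not the linear-time algorithm of the talk; the function names are illustrative, and the returned list holds G_1, …, G_k (the empty G_0 is omitted).

```python
def is_closed(s):
    """True if the non-empty string s is closed (brute force)."""
    n = len(s)
    if n == 1:
        return True
    for k in range(n - 1, 0, -1):              # longest proper border first
        if s[:k] == s[n - k:]:
            return all(not s.startswith(s[:k], p) for p in range(1, n - k))
    return False


def closed_factorization(w):
    """Greedy factorization of w into longest closed prefixes."""
    factors, i, n = [], 0, len(w)
    while i < n:
        # Longest closed prefix of the remaining suffix; length 1 always qualifies.
        length = next(l for l in range(n - i, 0, -1) if is_closed(w[i:i + l]))
        factors.append(w[i:i + length])
        i += length
    return factors


print(closed_factorization("ababaacbbbcbcc$"))
# ['ababa', 'a', 'cbbbcb', 'cc', '$']
```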

Computing Closed Factorization

• Theorem 2. Given a string w of length n over an integer alphabet, the closed factorization of w can be computed in O(n) time and space.

Outline of Our Algorithm

1. Construct and preprocess the suffix tree of w.

2. i ← 1.

3. Compute the closing border b_i starting at position i.
  – with the suffix tree constructed in Step 1

4. Find the leftmost occurrence j of b_i in w[i+1..n].
  – with the KMP algorithm
  – Stop the KMP algorithm as soon as j is found.

5. i ← j + |b_i|.

6. Repeat Steps 3–5 until i = n.
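The KMP step with early termination can be sketched in Python as follows. One assumption is stated loudly: the closing border in Step 3 is found here by a naive search with `str.find` rather than with the suffix tree, so only the search in Step 4 reflects the linear-time argument. The function names are illustrative.

```python
def failure_function(p):
    """Standard KMP failure table: fail[q] = longest proper border of p[:q+1]."""
    fail = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k and p[q] != p[k]:
            k = fail[k - 1]
        if p[q] == p[k]:
            k += 1
        fail[q] = k
    return fail


def kmp_first_occurrence(text, pattern, start):
    """First index >= start at which the non-empty pattern occurs, or -1."""
    fail = failure_function(pattern)
    k = 0
    for pos in range(start, len(text)):
        while k and text[pos] != pattern[k]:
            k = fail[k - 1]
        if text[pos] == pattern[k]:
            k += 1
        if k == len(pattern):
            return pos - k + 1          # stop as soon as the first match is found
    return -1


def closed_factorization_kmp(w):
    factors, i, n = [], 0, len(w)
    while i < n:
        # Step 3 (naive stand-in for the suffix tree): longest prefix of w[i:]
        # that also starts somewhere to the right of i.
        k = 0
        while i + k < n and w.find(w[i:i + k + 1], i + 1) != -1:
            k += 1
        if k == 0:
            end = i + 1                 # empty border: the factor is one letter
        else:
            j = kmp_first_occurrence(w, w[i:i + k], i + 1)   # Step 4
            end = j + k                                      # Step 5
        factors.append(w[i:end])
        i = end
    return factors


print(closed_factorization_kmp("ababaacbbbcbcc$"))
# ['ababa', 'a', 'cbbbcb', 'cc', '$']
```

Because the scan stops at the first match, the KMP search for a factor reads only the characters of that factor.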

Algorithm of Closed Factorization

• We can compute each factor G_j in O(|G_j|) time with the KMP algorithm.

• Because the sum of the lengths of all factors is n, the total time to compute the closed factorization is O(n).

Our Result 2

• Given a string w of length n over an integer alphabet, the closed factorization of w can be computed in O(n) time and space.

Conclusion

• We introduced the Longest Closed Factor Array of a string and proposed an algorithm that computes it in O(n log n / log log n) time and O(n) space.

• We introduced the Closed Factorization of a string and proposed an algorithm that computes it in O(n) time and space.

Open Problems

• Can we efficiently compute the longest closed factor array without range successor queries?

• Can we find the longest closed factor containing each position without the longest closed factor array?