Linear-time construction of CSA using o(n log n)-bit working space for large alphabets

Joong Chae Na
School of Computer Sci. & Eng., Seoul National University, Korea


Overview

Background: suffix arrays (SA), compressed suffix arrays (CSA)

Problem definition

Previous works

Our contributions

Description of our algorithm

Conclusions


Background (1)

Given a string T of length n over an alphabet Σ,

Suffix array (SA) of T [Manber&Myers ’93]

Lexicographically sorted list of the suffixes of T

T = b a b a a b b a $ (positions 1 to 9)

i  SA[i]  suffix T[SA[i]..n]
1  9      $
2  8      a$
3  4      aabba$
4  2      abaabba$
5  5      abba$
6  7      ba$
7  3      baabba$
8  1      babaabba$
9  6      bba$

Space: O(n log n) bits
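As a concrete check of the table above, the suffix array can be reproduced by sorting the suffixes directly. This is only a naive sketch: the function name `suffix_array` is made up for illustration, sorting whole suffixes is far from linear time, and the code assumes that `$` sorts before every letter (true in ASCII).

```python
def suffix_array(text):
    """Return the 1-based suffix array of text by sorting all suffixes (naive)."""
    n = len(text)
    # Sort the start positions 1..n by the suffix that begins there.
    # '$' (ASCII 36) sorts before lowercase letters, as the slides assume.
    return sorted(range(1, n + 1), key=lambda k: text[k - 1:])


T = "babaabba$"
print(suffix_array(T))  # [9, 8, 4, 2, 5, 7, 3, 1, 6]
```

The list stores n integers of about log n bits each, which is where the O(n log n)-bit figure comes from.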


Background (2)

Compressed suffix array (CSA) [Grossi&Vitter ’00]: a compressed version of SA with a space requirement of O(n log|Σ|) bits

FM-index [Ferragina&Manzini 2000]

T = b a b a a b b a $ (positions 1 to 9)

i  SA[i]  Ψ[i]  suffix T[SA[i]..n]
1  9      8     $
2  8      1     a$
3  4      5     aabba$
4  2      7     abaabba$
5  5      9     abba$
6  7      2     ba$
7  3      3     baabba$
8  1      4     babaabba$
9  6      6     bba$

Space: O(n log|Σ|) bits


Problem definition

Constructing SA, CSA, and FM-index using o(n log n) time and o(n log n)-bit working space

Working space: the temporary space required for executing an algorithm, not including the space for the input and output


Related works

Constructing SA and CSA

※ O(n log n)-bit working space
Manber & Myers [1993]: O(n log n) time
Kim et al. [2003]: O(n) time
Kärkkäinen & Sanders [2003]: O(n) time
Ko & Aluru [2003]: O(n) time

※ O(n log|Σ|)-bit working space
Lam et al. [COCOON 2002]: O(|Σ|n log n) time
Hon et al. [ISAAC 2003]: O(n log n) time

None of these algorithms satisfies both the time and the space requirements of our problem.


Previous results

Hon et al. [FOCS 2003]

An algorithm using O(n loglog|Σ|) time and O(n log|Σ|)-bit working space

The first algorithm using o(n log n) time and o(n log n)-bit working space

Follows ½-recursion (the odd-even scheme)


Our contributions

Another algorithm using o(n log n) time and o(n log n)-bit working space: O(n) time and O(n log|Σ| · (log_|Σ| n)^α)-bit working space, where α = log_3 2 ≈ 0.63

The first alphabet-independent linear-time algorithm for constructing SA, CSA, and FM-index using o(n log n)-bit working space

Follows ⅔-recursion (the skew scheme)


Hon et al. vs. Our results

              Hon et al.        Our results
Time          O(n loglog|Σ|)    O(n)
Space (bits)  O(n log|Σ|)       O(n log|Σ| · (log_|Σ| n)^α)
Scheme        ½-recursion       ⅔-recursion
Merging       complex           simple
Encoding*     implicit          implicit

* The encoding step is the most complex and time-consuming step in ⅔-recursion. However, neither algorithm needs to perform it explicitly.


Description of our algorithm


Overview

Preliminaries

Basic definitions and notations

Main technique

Outline of our algorithm


Preliminaries: Ψ function

Let T[k..n] be the lexicographically i-th smallest suffix of T. Then SA[i] = k.

T = b a b a a b b a $ (positions 1 to 9)

i  SA[i]  Ψ[i]  suffix T[SA[i]..n]
1  9      8     $
2  8      1     a$
3  4      5     aabba$
4  2      7     abaabba$
5  5      9     abba$
6  7      2     ba$
7  3      3     baabba$
8  1      4     babaabba$
9  6      6     bba$

Ψ[i] is the position in SA where T[k+1..n] is stored, for k = SA[i]:

\[
\Psi[i] =
\begin{cases}
SA^{-1}[SA[i] + 1] & \text{if } SA[i] \neq n, \\
SA^{-1}[1] & \text{if } SA[i] = n.
\end{cases}
\]
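The Ψ column of the example can be recomputed from SA alone. The sketch below is an illustration rather than the construction used in the talk: the helper name `psi_from_sa` is an assumption, and it builds the inverse suffix array explicitly, which costs O(n log n) bits. Following the example table, the entry for the suffix "$" (text position n) wraps around to the rank of T[1..n].

```python
def psi_from_sa(sa):
    """Compute the 1-based Psi array from a 1-based suffix array."""
    n = len(sa)
    inv = [0] * (n + 1)               # inv[k] = rank of the suffix T[k..n]
    for i, k in enumerate(sa, start=1):
        inv[k] = i
    # Psi[i] is the rank of the suffix one text position to the right;
    # the last suffix (k == n) wraps around to T[1..n].
    return [inv[k + 1] if k != n else inv[1] for k in sa]


SA = [9, 8, 4, 2, 5, 7, 3, 1, 6]
print(psi_from_sa(SA))  # [8, 1, 5, 7, 9, 2, 3, 4, 6]
```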


Preliminaries-Lemmas

Lemmas from Hon et al. [FOCS 2003]:

Text, Ψ → SA, CSA: O(n) time, O(n log|Σ|)-bit working space

Text, Ψ → C array (BWT) → FM-index: O(n) time, O(n log|Σ|)-bit working space

Note: the goal is therefore Text → Ψ
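To see why Ψ is enough to recover SA, one can walk through the text positions 1, 2, …, n by following Ψ. The sketch below is not the space-efficient procedure behind the lemma of Hon et al.: it writes SA into a plain array, and the name `sa_from_psi` is hypothetical. It assumes the text ends with a unique smallest character "$", so that rank 1 is the suffix T[n..n] and Ψ[1] is the rank of T[1..n].

```python
def sa_from_psi(psi):
    """Recover the 1-based suffix array from a 1-based Psi array."""
    n = len(psi)
    sa = [0] * n
    rank = psi[0]                 # rank of T[1..n], because SA[1] = n
    for k in range(1, n + 1):
        sa[rank - 1] = k          # the suffix T[k..n] has this rank
        rank = psi[rank - 1]      # move on to the rank of T[k+1..n]
    return sa


PSI = [8, 1, 5, 7, 9, 2, 3, 4, 6]
print(sa_from_psi(PSI))  # [9, 8, 4, 2, 5, 7, 3, 1, 6]
```

Each step is a single lookup in Ψ, so the walk takes O(n) time in total.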


Basic def. and not. (1)

Residue-1 suffixes of T: T[3i-2..n] for 1 ≤ i ≤ n/3, i.e., T[1..n], T[4..n], T[7..n], …

Residue-2 suffixes of T: T[3i-1..n] for 1 ≤ i ≤ n/3, i.e., T[2..n], T[5..n], T[8..n], …

Residue-3 suffixes of T: T[3i..n] for 1 ≤ i ≤ n/3, i.e., T[3..n], T[6..n], T[9..n], …

Example: T[1..n] = b a b a a b b a $ (positions 1 to 9)

Residue-1: babaabba$, aabba$, ba$
Residue-2: abaabba$, abba$, a$
Residue-3: baabba$, bba$, $


Basic def. and not. (2)

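T12[1..⅔n] = T[1..n] T[2..n] T[1], read as a string over the alphabet Σ^3 (one character per triple)
length: ⅔n, alphabet: Σ^3
SA12: suffix array of T12

T3[1..⅓n] = T[3..n] T[1] T[2], read as a string over the alphabet Σ^3
length: ⅓n, alphabet: Σ^3
SA3: suffix array of T3

Example (T over the alphabet Σ, positions 1 to 9):

T = b a b a a b b a $
T12 = bab aab ba$ aba abb a$b (text positions 1 2 3 | 4 5 6 | 7 8 9 | 2 3 4 | 5 6 7 | 8 9 1)
T3 = baa bba $ba (text positions 3 4 5 | 6 7 8 | 9 1 2)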
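The construction of T12 and T3 can be tried out on the running example. This is a minimal sketch under two assumptions: n is a multiple of 3, as the slides implicitly assume, and each triple is represented as a 3-character Python string. The names `build_t12_t3` and `sa_of` are invented for illustration, and `sa_of` is a naive sort, not a linear-time algorithm.

```python
def build_t12_t3(text):
    """Return T12 and T3 as lists of triples (characters over Σ^3)."""
    assert len(text) % 3 == 0, "sketch assumes n is a multiple of 3"

    def triples(s):
        return [s[j:j + 3] for j in range(0, len(s), 3)]

    t12 = triples(text + text[1:] + text[0])   # T[1..n] T[2..n] T[1]
    t3 = triples(text[2:] + text[:2])          # T[3..n] T[1] T[2]
    return t12, t3


def sa_of(seq):
    """Naive 1-based suffix array of a sequence of comparable items."""
    return sorted(range(1, len(seq) + 1), key=lambda j: seq[j - 1:])


T12, T3 = build_t12_t3("babaabba$")
print(T12)         # ['bab', 'aab', 'ba$', 'aba', 'abb', 'a$b']
print(T3)          # ['baa', 'bba', '$ba']
print(sa_of(T12))  # [6, 2, 4, 5, 3, 1]
print(sa_of(T3))   # [3, 1, 2]
```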


Main technique–Ψ’ function

Ψ’ is just like Ψ, but it is defined on SA12 and SA3.

Ψ’ points to the position in SA12 or SA3 where T[k+1..n] (the suffix following the current suffix T[k..n]) is stored.

※ Note that Ψ’ is not the Ψ-function of T12 or of T3.

The Ψ’-function consists of two parts, Ψ’T12 and Ψ’T3.


Ψ’ function

Ψ’T12 (residue-1 suffixes of T): Let T[3k-2..n] be a suffix stored in SA12[i]. Then Ψ’T12[i] is the position in SA12 where the next suffix T[3k-1..n] is stored.

Ψ’T12 (residue-2 suffixes of T): Let T[3k-1..n] be a suffix stored in SA12[i]. Then Ψ’T12[i] is the position in SA3 where the next suffix T[3k..n] is stored.

Ψ’T3 (residue-3 suffixes of T): Let T[3k..n] be a suffix stored in SA3[i]. Then Ψ’T3[i] is the position in SA12 where the next suffix T[3k+1..n] is stored.


Ψ’ function: example

T = b a b a a b b a $ (positions 1 to 9)
T12 = bab aab ba$ aba abb a$b
T3 = baa bba $ba

i  SA12[i]  Ψ’T12[i]  suffix
1  6        1         a$b
2  2        4         aab ba$
3  4        2         aba abb a$b
4  5        3         abb a$b
5  3        1         ba$
6  1        3         bab aab ba$

i  SA3[i]  Ψ’T3[i]  suffix
1  3       6        $ba
2  1       2        baa bba $ba
3  2       5        bba $ba

Here Ψ’T3[1] = 6 because the suffix after T[9..9] = $ wraps around to T[1..n], which is stored at position 6 of SA12.
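The Ψ’ values in the example can be recomputed directly from the definitions. The sketch below is a brute-force illustration, not the space-efficient computation of the talk: `psi_prime`, `sa_of`, and `chunk3` are hypothetical helper names, the suffix arrays are built by naive sorting, and n is assumed to be a multiple of 3. As in the example, the successor of the last text position wraps around to position 1.

```python
def sa_of(seq):
    """Naive 1-based suffix array of a sequence (illustration only)."""
    return sorted(range(1, len(seq) + 1), key=lambda j: seq[j - 1:])


def chunk3(s):
    """Split s into consecutive triples, i.e. characters over Σ^3."""
    return [s[j:j + 3] for j in range(0, len(s), 3)]


def psi_prime(text):
    n = len(text)
    m = n // 3                                   # assumes n is a multiple of 3
    sa12 = sa_of(chunk3(text + text[1:] + text[0]))
    sa3 = sa_of(chunk3(text[2:] + text[:2]))
    # Text position k of the suffix T[k..n] behind each entry of SA12 / SA3.
    pos12 = [3 * j - 2 if j <= m else 3 * (j - m) - 1 for j in sa12]
    pos3 = [3 * j for j in sa3]
    # rank[k] = 1-based position of T[k..n] inside its own array (SA12 or SA3).
    rank = {k: i for i, k in enumerate(pos12, start=1)}
    rank.update({k: i for i, k in enumerate(pos3, start=1)})

    def succ(k):
        return k % n + 1                         # next text position, n wraps to 1

    psi12 = [rank[succ(k)] for k in pos12]
    psi3 = [rank[succ(k)] for k in pos3]
    return sa12, psi12, sa3, psi3


print(psi_prime("babaabba$"))
# ([6, 2, 4, 5, 3, 1], [1, 4, 2, 3, 1, 3], [3, 1, 2], [6, 2, 5])
```

Which array a Ψ’ value points into is implied by the residue of the source suffix: residue-1 and residue-3 entries point into SA12, residue-2 entries point into SA3.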



Framework- outline

How to construct the Ψ function of T: bottom-up approach

step 0: T → ΨT (length n, alphabet Σ)
step 1: T12 → ΨT12 (length ⅔n, alphabet Σ^3)
…
step i: length (⅔)^i n, alphabet Σ^(3^i)
…
step h: use any linear-time construction algorithm, where h = log_3 log_|Σ| n
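The choice of h can be related to the space bound stated earlier. The derivation below is a plausible reading rather than something spelled out on the slides: it assumes the working space is dominated by step h, where a conventional linear-time algorithm needs O(m log m) ≤ O(m log n) bits on a string of m = (⅔)^h n characters.

```latex
\begin{align*}
3^h &= \log_{|\Sigma|} n
  && \text{(so one character at step $h$ takes } 3^h \log|\Sigma| = \log n \text{ bits)} \\
\left(\tfrac{2}{3}\right)^h &= \left(3^h\right)^{\log_3 2 - 1}
  = \left(\log_{|\Sigma|} n\right)^{\alpha - 1},
  && \alpha = \log_3 2 \\
m \log n &= n \left(\log_{|\Sigma|} n\right)^{\alpha - 1} \cdot \log|\Sigma| \cdot \log_{|\Sigma|} n
  = n \log|\Sigma| \cdot \left(\log_{|\Sigma|} n\right)^{\alpha}
\end{align*}
```

Under that assumption the bottom step already accounts for the O(n log|Σ| · (log_|Σ| n)^α)-bit bound.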


Step i - outline

Input: the string S of step i, and ΨS12 (from step i+1)

1. Compute Ψ’S12 and Ψ’S3 from S and ΨS12
2. Merge Ψ’S12 and Ψ’S3 to obtain ΨS


Merging step

i  SA12[i]  Ψ’T12[i]  suffix
1  6        1         a$b
2  2        4         aab ba$
3  4        2         aba abb a$b
4  5        3         abb a$b
5  3        1         ba$
6  1        3         bab aab ba$

i  SA3[i]  Ψ’T3[i]  suffix
1  3       6        $ba
2  1       2        baa bba $ba
3  2       5        bba $ba

Merged result:

i  SA[i]  Ψ[i]  suffix
1  9      8     $
2  8      1     a$
3  4      5     aabba$
4  2      7     abaabba$
5  5      9     abba$
6  7      2     ba$
7  3      3     baabba$
8  1      4     babaabba$
9  6      6     bba$

* Compare entries of SA12 with entries of SA3 in order
- Two suffixes are compared by following the Ψ’-function at most twice
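The comparison rule can be sketched as follows. This is an illustrative reconstruction in the spirit of the skew scheme, not the talk's actual procedure: all function names are hypothetical, SA12, SA3, and Ψ’ are built by brute force, the text is accessed directly for single characters, n is assumed to be a multiple of 3, and "$" is assumed to be unique and smallest. For readability the sketch outputs the merged suffix order (SA) rather than Ψ.

```python
def sa_of(seq):
    """Naive 1-based suffix array of a sequence (illustration only)."""
    return sorted(range(1, len(seq) + 1), key=lambda j: seq[j - 1:])


def merge_with_psi_prime(text):
    n = len(text)
    m = n // 3                                   # assumes n is a multiple of 3
    chunk3 = lambda s: [s[j:j + 3] for j in range(0, len(s), 3)]
    sa12 = sa_of(chunk3(text + text[1:] + text[0]))
    sa3 = sa_of(chunk3(text[2:] + text[:2]))
    # Text positions of the suffixes, listed in SA12 / SA3 order.
    p12 = [3 * j - 2 if j <= m else 3 * (j - m) - 1 for j in sa12]
    p3 = [3 * j for j in sa3]
    where = {k: i for i, k in enumerate(p12)}
    where.update({k: i for i, k in enumerate(p3)})
    psi12 = [where[k % n + 1] for k in p12]      # 0-based Psi' on SA12
    psi3 = [where[k % n + 1] for k in p3]        # 0-based Psi' on SA3

    def less(i, j):
        """True if the suffix at SA12 rank i precedes the one at SA3 rank j."""
        a, b = p12[i], p3[j]
        if text[a - 1] != text[b - 1]:
            return text[a - 1] < text[b - 1]
        i2, j2 = psi12[i], psi3[j]               # first hop along Psi'
        if a % 3 == 1:                           # residue-1: both successors are in SA12
            return i2 < j2
        # residue-2: successors are in SA3 (rank i2) and SA12 (rank j2)
        a2, b2 = p3[i2], p12[j2]
        if text[a2 - 1] != text[b2 - 1]:
            return text[a2 - 1] < text[b2 - 1]
        return psi3[i2] < psi12[j2]              # second hop: both in SA12

    merged, i, j = [], 0, 0
    while i < len(p12) and j < len(p3):
        if less(i, j):
            merged.append(p12[i]); i += 1
        else:
            merged.append(p3[j]); j += 1
    return merged + p12[i:] + p3[j:]


print(merge_with_psi_prime("babaabba$"))  # [9, 8, 4, 2, 5, 7, 3, 1, 6]
```

After one hop (residue-1) or two hops (residue-2) both suffixes land in SA12, where their relative order is already known, which is why following Ψ’ at most twice suffices.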


Conclusions & future works

We presented an alphabet-independent linear-time algorithm to construct SA, CSA, and FM-index using o(n log n)-bit working space

Future work: constructing SA, CSA, and FM-index optimally, i.e., using O(n) time and O(n log|Σ|)-bit working space