Linear-time construction of CSA using o(n log n)-bit working space for large alphabets Joong Chae Na School of Computer Sci. & Eng. Seoul National University, Korea

Page 1: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Linear-time construction of CSA using o(n log n)-bit working space for large alphabets

Joong Chae Na

School of Computer Sci. & Eng., Seoul National University, Korea

Page 2: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Overview

Background
  Suffix arrays (SA)
  Compressed suffix arrays (CSA)

Problem definition

Previous works

Our contributions

Description of our algorithm

Conclusions

Page 3: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Background (1)

Given a string T of length n over an alphabet Σ,

Suffix array (SA) of T [Manber & Myers ’93]
  Lexicographically sorted list of the suffixes of T
  Space requirement: O(n log n) bits

T : b a b a a b b a $

i  SA_T  suffix
1  9     $
2  8     a $
3  4     a a b b a $
4  2     a b a a b b a $
5  5     a b b a $
6  7     b a $
7  3     b a a b b a $
8  1     b a b a a b b a $
9  6     b b a $
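The table above can be reproduced with a few lines of Python. This is a minimal sketch that sorts the suffixes directly (quadratic or worse in the worst case), not one of the linear-time constructions cited in this talk; the function name `suffix_array` is chosen here for illustration only.

```python
def suffix_array(text):
    """Return the 1-based suffix array of text by sorting suffixes directly.

    Naive construction for illustration; it is far from linear time.
    """
    n = len(text)
    # Sort the 0-based starting positions by the suffix that begins there.
    order = sorted(range(n), key=lambda k: text[k:])
    # Shift to the 1-based positions used on the slides.
    return [k + 1 for k in order]


T = "babaabba$"  # '$' sorts before 'a' and 'b' in ASCII
SA = suffix_array(T)
for i, k in enumerate(SA, start=1):
    print(i, k, T[k - 1:])
```

Each entry of `SA` is a text position in the range 1..n, which is why a plain suffix array needs about log n bits per entry.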

Page 4: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Background (2)

Compressed suffix array (CSA) [Grossi & Vitter ’00]
  Compressed version of SA
  Space requirement: O(n log|Σ|) bits

FM-index [Ferragina & Manzini 2000]

T : b a b a a b b a $

i  SA_T  Ψ_T  suffix
1  9     8    $
2  8     1    a $
3  4     5    a a b b a $
4  2     7    a b a a b b a $
5  5     9    a b b a $
6  7     2    b a $
7  3     3    b a a b b a $
8  1     4    b a b a a b b a $
9  6     6    b b a $

Page 5: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Problem definition

Constructing SA, CSA and FM-index using o(n log n)-time and o(n log n)-bit working space

Working space
  Temporary space required for executing an algorithm
  Not including the space for the input and output

Page 6: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Related works

Constructing SA and CSA

※ O(n log n)-bit working space
  Manber & Myers [1993]: O(n log n)-time
  Kim et al. [2003]: O(n)-time
  Kärkkäinen & Sanders [2003]: O(n)-time
  Ko & Aluru [2003]: O(n)-time

※ O(n log|Σ|)-bit working space
  Lam et al. [COCOON 2002]: O(|Σ|n log n)-time
  Hon et al. [ISAAC 2003]: O(n log n)-time

None of these algorithms satisfies both the time and the space requirements of our problem.

Page 7: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Previous results

Hon et al. [FOCS 2003]

An algorithm using O(n log log|Σ|)-time and O(n log|Σ|)-bit working space

The first algorithm using o(n log n)-time and o(n log n)-bit working space

Following ½-recursion (the odd-even scheme)

Page 8: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Our contributions

Another algorithm using o(n log n)-time and o(n log n)-bit working space
  O(n)-time and O(n log|Σ| · log_|Σ|^α n)-bit working space, where α = log_3 2 ≈ 0.63

The first alphabet-independent linear-time algorithm for constructing SA, CSA, and FM-index using o(n log n)-bit working space

Following ⅔-recursion (the skew scheme)

Page 9: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Hon et al. vs. Our results

             Hon et al.        Our results
Time         O(n log log|Σ|)   O(n)
Space (bit)  O(n log|Σ|)       O(n log|Σ| · log_|Σ|^α n)
Scheme       ½-recursion       ⅔-recursion
(merging)    complex           simple
(encoding)*  implicit          implicit

* The encoding step is the most complex and time-consuming step in ⅔-recursion. However, neither algorithm needs the encoding step.

Page 10: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Description of our algorithm

Page 11: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Overview

Preliminaries

Basic definitions and notations

Main technique

Outline of our algorithm

Page 12: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Preliminaries - Ψ function

T[k..n] : lexicographically the i-th smallest suffix of T
  ■ SA[i] = k
  ■ Ψ[i] : the position in SA where T[k+1..n] is stored

\[
\Psi[i] =
\begin{cases}
SA^{-1}[1] & \text{if } SA[i] = n \\
SA^{-1}[SA[i] + 1] & \text{if } SA[i] \neq n
\end{cases}
\]

    1 2 3 4 5 6 7 8 9
T : b a b a a b b a $

i  SA_T  Ψ_T  suffix
1  9     8    $
2  8     1    a $
3  4     5    a a b b a $
4  2     7    a b a a b b a $
5  5     9    a b b a $
6  7     2    b a $
7  3     3    b a a b b a $
8  1     4    b a b a a b b a $
9  6     6    b b a $
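The definition of Ψ can be checked against the table with a short Python sketch. It builds the inverse suffix array explicitly, which is convenient for illustration but is exactly the kind of O(n log n)-bit structure the talk aims to avoid; `psi_from_sa` is a name chosen here, not one from the slides.

```python
def psi_from_sa(sa):
    """Compute Psi (1-based values) from a 1-based suffix array."""
    n = len(sa)
    inv = [0] * (n + 1)  # inv[k] = i such that SA[i] = k
    for i, k in enumerate(sa, start=1):
        inv[k] = i
    psi = []
    for k in sa:
        # The last suffix T[n..n] has no successor, so it wraps to T[1..n].
        nxt = 1 if k == n else k + 1
        psi.append(inv[nxt])
    return psi


SA = [9, 8, 4, 2, 5, 7, 3, 1, 6]
PSI = psi_from_sa(SA)
print(PSI)  # [8, 1, 5, 7, 9, 2, 3, 4, 6]
```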

Page 13: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Preliminaries-Lemmas

Text, Ψ → SA, CSA
  O(n) time, O(n log|Σ|)-bit working space

Text, Ψ → C array (BWT) → FM-index
  O(n) time, O(n log|Σ|)-bit working space

Note: the goal is Text → Ψ

Hon et al. [FOCS 2003]
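To see why Ψ alone determines SA, here is a minimal Python sketch that walks Ψ once from the suffix starting at position 1. It is only an illustration of the idea, not the procedure of Hon et al.: it writes a plain suffix array and makes no attempt to produce the compressed output. It assumes the text ends with a unique smallest sentinel, so that SA[1] = n and Ψ[1] is the index holding T[1..n].

```python
def sa_from_psi(psi):
    """Recover the 1-based suffix array from Psi (1-based values).

    Assumes a unique smallest sentinel at the end of the text, so
    SA[1] = n and Psi[1] is the rank of the suffix T[1..n].
    """
    n = len(psi)
    sa = [0] * n
    i = psi[0]              # index i with SA[i] = 1
    for k in range(1, n + 1):
        sa[i - 1] = k       # suffix T[k..n] has rank i
        i = psi[i - 1]      # move to the rank of T[k+1..n]
    return sa


PSI = [8, 1, 5, 7, 9, 2, 3, 4, 6]
SA_FROM_PSI = sa_from_psi(PSI)
print(SA_FROM_PSI)  # [9, 8, 4, 2, 5, 7, 3, 1, 6]
```

The walk visits every index exactly once, so it runs in O(n) time.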

Page 14: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Basic def. and not. (1)

Residue-1 suffixes of T
  T[3i-2..n] for 1 ≤ i ≤ n/3
  T[1..n], T[4..n], T[7..n], …

Residue-2 suffixes of T
  T[3i-1..n] for 1 ≤ i ≤ n/3
  T[2..n], T[5..n], T[8..n], …

Residue-3 suffixes of T
  T[3i..n] for 1 ≤ i ≤ n/3
  T[3..n], T[6..n], T[9..n], …

            1 2 3 4 5 6 7 8 9
T[1..n] =   b a b a a b b a $

Residue-1:  b a b a a b b a $
            a a b b a $
            b a $

Residue-2:  a b a a b b a $
            a b b a $
            a $

Residue-3:  b a a b b a $
            b b a $
            $

Page 15: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Basic def. and not. (2)

      1 2 3 4 5 6 7 8 9
T =   b a b a a b b a $          (alphabet Σ)

T12[1..⅔n] = T[1..n] T[2..n] T[1], read as triples
  text positions: 1 2 3 | 4 5 6 | 7 8 9 | 2 3 4 | 5 6 7 | 8 9 1
  T12 =           bab     aab     ba$     aba     abb     a$b
  length: ⅔n, alphabet: Σ^3
  SA12: suffix array of T12

T3[1..⅓n] = T[3..n] T[1] T[2], read as triples
  text positions: 3 4 5 | 6 7 8 | 9 1 2
  T3 =            baa     bba     $ba
  length: ⅓n, alphabet: Σ^3
  SA3: suffix array of T3
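The construction of T12 and T3 can be sketched in Python as follows. The helper names `triples` and `suffix_array` are illustrative, the suffix arrays are built by naive sorting rather than recursively, and the sketch assumes n is a multiple of 3 as in the running example.

```python
def triples(s):
    """Split s into consecutive 3-character blocks (one symbol of Sigma^3 each)."""
    return [s[p:p + 3] for p in range(0, len(s), 3)]


def suffix_array(seq):
    """Naive 1-based suffix array of a sequence of comparable symbols."""
    order = sorted(range(len(seq)), key=lambda p: seq[p:])
    return [p + 1 for p in order]


T = "babaabba$"                   # n = 9, a multiple of 3
T12 = triples(T + T[1:] + T[0])   # T[1..n] T[2..n] T[1]
T3 = triples(T[2:] + T[:2])       # T[3..n] T[1] T[2]
SA12 = suffix_array(T12)
SA3 = suffix_array(T3)

print(T12, SA12)
print(T3, SA3)
```

Positions 1..n/3 of T12 correspond to the residue-1 suffixes of T, positions n/3+1..2n/3 to the residue-2 suffixes, and the positions of T3 to the residue-3 suffixes.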

Page 16: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Main technique–Ψ’ function

Ψ’ is just like Ψ, but it is defined on SA12 and SA3.

Ψ’ points to the position in SA12 or SA3 where T[k+1..n] (the suffix that follows the current suffix T[k..n]) is stored.

※ Note that Ψ’ is not the Ψ function of T12 or T3.

The Ψ’ function consists of Ψ’_T12 and Ψ’_T3.

Page 17: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Ψ’ function (residue-1)

Ψ’_T12 (residue-1 suffixes of T)
  Let T[3k-2..n] be a suffix stored in SA12[i].
  Then, Ψ’_T12[i] is the position in SA12 where the next suffix T[3k-1..n] is stored.

Ψ’_T12 (residue-2 suffixes of T)
  Let T[3k-1..n] be a suffix stored in SA12[i].
  Then, Ψ’_T12[i] is the position in SA3 where the next suffix T[3k..n] is stored.

Ψ’_T3 (residue-3 suffixes of T)
  Let T[3k..n] be a suffix stored in SA3[i].
  Then, Ψ’_T3[i] is the position in SA12 where the next suffix T[3k+1..n] is stored.

Page 18: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Ψ’ function (residue-1)

    1 2 3 4 5 6 7 8 9
T = b a b a a b b a $

T12 = bab aab ba$ aba abb a$b   (text positions 1 2 3 | 4 5 6 | 7 8 9 | 2 3 4 | 5 6 7 | 8 9 1)
T3  = baa bba $ba               (text positions 3 4 5 | 6 7 8 | 9 1 2)

i  SA12  Ψ’_T12  suffix of T12
1  6     1       a$b
2  2     4       aab ba$
3  4     2       aba abb a$b
4  5     3       abb a$b
5  3     1       ba$
6  1     3       bab aab ba$

i  SA3  Ψ’_T3  suffix of T3
1  3    6      $ba
2  1    2      baa bba $ba
3  2    5      bba $ba
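The three cases of the Ψ’ definition can be traced on this example with the Python sketch below. It materializes SA12, SA3 and their inverses as plain arrays purely for illustration; that is a simplification made here, and the slides instead derive Ψ’ from the Ψ function of T12. The name `psi_prime` is hypothetical.

```python
def psi_prime(sa12, sa3):
    """Compute (Psi'_T12, Psi'_T3) from 1-based SA12 and SA3."""
    m = len(sa3)  # m = n/3; SA12 has 2m entries
    inv12 = {p: i for i, p in enumerate(sa12, start=1)}
    inv3 = {p: i for i, p in enumerate(sa3, start=1)}

    psi12 = []
    for p in sa12:
        if p <= m:
            # Residue-1 suffix T[3p-2..n]: the next suffix is residue-2,
            # which sits at position p + m of T12.
            psi12.append(inv12[p + m])
        else:
            # Residue-2 suffix T[3(p-m)-1..n]: the next suffix is
            # residue-3, which sits at position p - m of T3.
            psi12.append(inv3[p - m])

    psi3 = []
    for p in sa3:
        # Residue-3 suffix T[3p..n]: the next suffix is residue-1;
        # the last one, T[n..n], wraps around to T[1..n].
        nxt = p + 1 if p < m else 1
        psi3.append(inv12[nxt])
    return psi12, psi3


PSI12, PSI3 = psi_prime([6, 2, 4, 5, 3, 1], [3, 1, 2])
print(PSI12)  # [1, 4, 2, 3, 1, 3]
print(PSI3)   # [6, 2, 5]
```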

Page 23: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Framework- outline

How to construct the Ψ function of T: bottom-up approach

step     string   length       alphabet    output
step 0   T        n            Σ           Ψ_T
step 1   T12      (2/3) n      Σ^3         Ψ_T12
…
step i            (2/3)^i n    Σ^(3^i)
…
step h   Use any linear-time construction algorithm

h = log_3 log_|Σ| n
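The choice of h and the space bound quoted earlier in the talk can be related by a short calculation. This is a sketch under an assumption that is not stated on the slides: that the bottom level (step h) is built as a plain suffix array with log n bits per entry. Under that assumption, with α = log_3 2:

```latex
\[
|\Sigma|^{3^h} = n
\;\Longleftrightarrow\;
3^h = \log_{|\Sigma|} n
\;\Longleftrightarrow\;
h = \log_3 \log_{|\Sigma|} n
\]
\[
2^h = \left(3^h\right)^{\log_3 2} = \left(\log_{|\Sigma|} n\right)^{\alpha}
\quad\Longrightarrow\quad
\left(\tfrac{2}{3}\right)^{h} n = n \left(\log_{|\Sigma|} n\right)^{\alpha - 1}
\]
\[
\left(\tfrac{2}{3}\right)^{h} n \cdot \log n
= n \left(\log_{|\Sigma|} n\right)^{\alpha - 1} \cdot \log|\Sigma| \cdot \log_{|\Sigma|} n
= n \log|\Sigma| \cdot \log_{|\Sigma|}^{\alpha} n
\]
```

So h is the depth at which the alphabet of the recursive string reaches size n, and the string there is short enough that a plain suffix array fits in O(n log|Σ| · log_|Σ|^α n) bits.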

Page 24: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Step i - outline

S → S12, S3

Ψ_S12 (from step i+1) → Ψ’_S12, Ψ’_S3

Ψ’_S12, Ψ’_S3 → merge → Ψ_S

Page 25: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Merging step

i  SA12  Ψ’_T12  suffix of T12
1  6     1       a$b
2  2     4       aab ba$
3  4     2       aba abb a$b
4  5     3       abb a$b
5  3     1       ba$
6  1     3       bab aab ba$

i  SA3  Ψ’_T3  suffix of T3
1  3    6      $ba
2  1    2      baa bba $ba
3  2    5      bba $ba

i  SA_T  Ψ_T  suffix
1  9     8    $
2  8     1    a$
3  4     5    aabba$
4  2     7    abaabba$
5  5     9    abba$
6  7     2    ba$
7  3     3    baabba$
8  1     4    babaabba$
9  6     6    bba$

* Compare entries of SA12 with entries of SA3 in order
  - two suffixes are compared by following the Ψ’ function at most twice
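The comparison rule can be made concrete with the following Python sketch of the merge. It is an illustration under simplifying assumptions made here, not the space-efficient procedure of the talk: it reads characters directly from T, keeps SA12, SA3 and Ψ’ as plain arrays, outputs SA rather than Ψ, and relies on T ending with a unique sentinel. The names `merge`, `pos12` and `less` are hypothetical.

```python
def merge(T, sa12, sa3, psi12, psi3):
    """Merge SA12 and SA3 into the 1-based suffix array of T."""
    m = len(sa3)

    def pos12(p):
        # Text position of the suffix at position p of T12.
        return 3 * p - 2 if p <= m else 3 * (p - m) - 1

    def less(i, j):
        """True if the suffix at SA12[i] precedes the suffix at SA3[j]."""
        a, b = pos12(sa12[i - 1]), 3 * sa3[j - 1]
        if T[a - 1] != T[b - 1]:
            return T[a - 1] < T[b - 1]
        if sa12[i - 1] <= m:
            # Residue-1 vs residue-3: after one hop both successors
            # lie in SA12, so their ranks there decide the order.
            return psi12[i - 1] < psi3[j - 1]
        # Residue-2 vs residue-3: after one hop the successors lie in
        # SA3 (index i2) and SA12 (index j2), so compare one more char.
        i2, j2 = psi12[i - 1], psi3[j - 1]
        if T[a] != T[b]:
            return T[a] < T[b]
        # Second hop: both successors now lie in SA12.
        return psi3[i2 - 1] < psi12[j2 - 1]

    sa, i, j = [], 1, 1
    while i <= len(sa12) and j <= m:
        if less(i, j):
            sa.append(pos12(sa12[i - 1]))
            i += 1
        else:
            sa.append(3 * sa3[j - 1])
            j += 1
    sa += [pos12(p) for p in sa12[i - 1:]]
    sa += [3 * p for p in sa3[j - 1:]]
    return sa


MERGED = merge("babaabba$", [6, 2, 4, 5, 3, 1], [3, 1, 2],
               [1, 4, 2, 3, 1, 3], [6, 2, 5])
print(MERGED)  # [9, 8, 4, 2, 5, 7, 3, 1, 6]
```

Because the sentinel is unique, two characters can only be equal when neither is the sentinel, so the reads of the following character never run past the end of T.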

Page 26: Linear-time construction of CSA using o( n log n )-bit working space for large alphabets

Conclusions & future works

We presented an alphabet-independent linear-time algorithm to construct SA, CSA, and FM-index using o(n log n)-bit working space

Future work
  To construct SA, CSA, and FM-index optimally, i.e., using O(n)-time and O(n log|Σ|)-bit working space