Linear-time construction of CSA using o(n log n)-bit working space for large alphabets
Joong Chae Na
School of Computer Sci. & Eng., Seoul National University, Korea
Overview
Background: suffix arrays (SA), compressed suffix arrays (CSA)
Problem definition
Previous works
Our contributions
Description of our algorithm
Conclusions
Background (1)
Given a string T of length n over an alphabet Σ,
Suffix array (SA) of T [Manber & Myers ’93]
Lexicographically sorted list of the suffixes of T
i SAT
1 9 $
2 8 a $
3 4 a a b b a $
4 2 a b a a b b a $
5 5 a b b a $
6 7 b a $
7 3 b a a b b a $
8 1 b a b a a b b a $
9 6 b b a $
T : b a b a a b b a $
Space: O(n log n) bits
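As a minimal sketch of the definition above (not one of the construction algorithms discussed in this talk), the suffix array of the example string can be obtained by sorting the suffix start positions directly. The function name `suffix_array` is an illustrative assumption, and the sketch relies on `$` sorting before the letters, which holds for ASCII.

```python
def suffix_array(text):
    """Return the 1-based suffix array of `text` by sorting its suffixes.

    Illustration only: sorting whole suffixes is far from linear time.
    """
    return sorted(range(1, len(text) + 1), key=lambda k: text[k - 1:])


if __name__ == "__main__":
    # Example string from the slide; '$' is the unique, smallest end marker.
    print(suffix_array("babaabba$"))  # prints [9, 8, 4, 2, 5, 7, 3, 1, 6]
```

The printed list matches the SA column of the table on this slide.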
Background (2)
Compressed suffix array (CSA) [Grossi & Vitter ’00]
Compressed version of SA
Space requirement of O(n log|Σ|) bits
FM-index [Ferragina & Manzini 2000]
i SAT ΨT
1 9 8 $
2 8 1 a $
3 4 5 a a b b a $
4 2 7 a b a a b b a $
5 5 9 a b b a $
6 7 2 b a $
7 3 3 b a a b b a $
8 1 4 b a b a a b b a $
9 6 6 b b a $
T : b a b a a b b a $
Space: O(n log |Σ|) bits
Problem definition
Constructing SA, CSA and FM-index using o(n log n)-time and o(n log n)-bit working space
Working space: temporary space required for executing an algorithm, not including the space for the input and output
Related works
Constructing SA and CSA
※ O(n log n)-bit working space
Manber & Myers [1993]: O(n log n)-time
Kim et al. [2003]: O(n)-time
Kärkkäinen & Sanders [2003]: O(n)-time
Ko & Aluru [2003]: O(n)-time
※ O(n log |Σ|)-bit working space
Lam et al. [COCOON 2002]: O(|Σ|n log n)-time
Hon et al. [ISAAC 2003]: O(n log n)-time
None of these algorithms satisfies both the time and the space requirements of our problem.
Previous results
Hon et al. [FOCS 2003]
An algorithm using O(n log log |Σ|) time and O(n log|Σ|)-bit working space
The first algorithm using o(n log n) time and o(n log n)-bit working space
Follows ½-recursion (the odd-even scheme)
Our contributions
Another algorithm using o(n log n) time and o(n log n)-bit working space
O(n) time and O(n log|Σ| · (log_|Σ| n)^α)-bit working space, where α = log_3 2 ≈ 0.63
The first alphabet-independent linear-time algorithm for constructing SA, CSA, and FM-index using o(n log n)-bit working space
Follows ⅔-recursion (the skew scheme)
Hon et al. vs. our results
               Hon et al.          Our results
Time           O(n log log |Σ|)    O(n)
Space (bits)   O(n log|Σ|)         O(n log|Σ| · (log_|Σ| n)^α)
Scheme         ½-recursion         ⅔-recursion
(merging)      complex             simple
(encoding)*    implicit            implicit
*The encoding step is the most complex and time-consuming step in ⅔-recursion. However, neither algorithm needs an explicit encoding step.
Description of our algorithm
Overview
Preliminaries
Basic definitions and notations
Main technique
Outline of our algorithm
Preliminaries-Ψ function
T[k..n]: the lexicographically i-th smallest suffix of T
■ SA[i] = k
i SAT ΨT
1 9 8 $
2 8 1 a $
3 4 5 a a b b a $
4 2 7 a b a a b b a $
5 5 9 a b b a $
6 7 2 b a $
7 3 3 b a a b b a $
8 1 4 b a b a a b b a $
9 6 6 b b a $
T : b a b a a b b a $
Definition of Ψ:
$$\Psi[i] = \begin{cases} SA^{-1}[SA[i]+1] & \text{if } SA[i] \neq n \\ SA^{-1}[1] & \text{if } SA[i] = n \end{cases}$$
Ψ[i] is the position in SA where T[k+1..n] is stored.
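The definition of Ψ can be checked against the table with a short sketch. It assumes the suffix array is already available as a 1-based list, which is exactly what the small-space algorithm in this talk avoids, so it only illustrates the definition. The name `psi_from_sa` is hypothetical.

```python
def psi_from_sa(sa):
    """Compute the Ψ array (1-based values) from a 1-based suffix array."""
    n = len(sa)
    inv = [0] * (n + 1)                 # inv[k] = i  <=>  sa[i] = k (1-based)
    for i, k in enumerate(sa, start=1):
        inv[k] = i
    # Ψ[i] = SA^{-1}[SA[i] + 1], wrapping to SA^{-1}[1] when SA[i] = n.
    return [inv[k + 1] if k != n else inv[1] for k in sa]


if __name__ == "__main__":
    sa = [9, 8, 4, 2, 5, 7, 3, 1, 6]    # SA of "babaabba$" from the slide
    print(psi_from_sa(sa))              # prints [8, 1, 5, 7, 9, 2, 3, 4, 6]
```

The output agrees with the Ψ column of the table.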
Preliminaries-Lemmas
Text, Ψ → SA, CSA: O(n) time, O(n log|Σ|)-bit working space
Text, Ψ → C array (BWT) → FM-index: O(n) time, O(n log|Σ|)-bit working space
Note: the goal is therefore Text → Ψ
Hon et al. [FOCS 2003]
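To illustrate why Ψ is enough to recover SA, the sketch below walks Ψ once through the suffixes in text order. It is not the space-efficient procedure behind the lemma of Hon et al.; it only shows the idea. It assumes `$` is the unique smallest character, so row 1 holds the suffix `$` and Ψ[1] = SA⁻¹[1]. The name `sa_from_psi` is hypothetical.

```python
def sa_from_psi(psi):
    """Rebuild the 1-based suffix array from a 1-based Ψ array.

    Assumes row 1 is the suffix "$" (so SA[1] = n), hence Ψ[1] is the row
    holding T[1..n].
    """
    n = len(psi)
    sa = [0] * n
    i = psi[0]                  # row of the suffix starting at text position 1
    for k in range(1, n + 1):
        sa[i - 1] = k           # row i stores the suffix T[k..n]
        i = psi[i - 1]          # move to the row of T[k+1..n]
    return sa


if __name__ == "__main__":
    psi = [8, 1, 5, 7, 9, 2, 3, 4, 6]   # Ψ of "babaabba$" from the slides
    print(sa_from_psi(psi))             # prints [9, 8, 4, 2, 5, 7, 3, 1, 6]
```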
Basic def. and not. (1)
Residue-1 suffixes of T: T[3i-2..n] for 1 ≤ i ≤ n/3, i.e., T[1..n], T[4..n], T[7..n], …
Residue-2 suffixes of T: T[3i-1..n] for 1 ≤ i ≤ n/3, i.e., T[2..n], T[5..n], T[8..n], …
Residue-3 suffixes of T: T[3i..n] for 1 ≤ i ≤ n/3, i.e., T[3..n], T[6..n], T[9..n], …
Example:
position     1 2 3 4 5 6 7 8 9
T[1..n]  =   b a b a a b b a $
residue-1:   b a b a a b b a $
                   a a b b a $
                         b a $
residue-2:     a b a a b b a $
                     a b b a $
                           a $
residue-3:       b a a b b a $
                       b b a $
                             $
Basic def. and not. (2)
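T12[1..⅔n] = T[1..n] T[2..n] T[1], read as a string of character triples
  length: ⅔n, alphabet: Σ^3
  SA12: suffix array of T12
T3[1..⅓n] = T[3..n] T[1] T[2], read as a string of character triples
  length: ⅓n, alphabet: Σ^3
  SA3: suffix array of T3
Example (T over alphabet Σ):
position in T:  1 2 3 4 5 6 7 8 9
T   = b a b a a b b a $
T12 = bab aab ba$ aba abb a$b   (positions in T: 1 2 3 | 4 5 6 | 7 8 9 | 2 3 4 | 5 6 7 | 8 9 1)
T3  = baa bba $ba               (positions in T: 3 4 5 | 6 7 8 | 9 1 2)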
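The construction of T12 and T3 can be sketched in a few lines. The sketch assumes n is a multiple of 3, as in the example, and says nothing about how other lengths are handled. The name `split_triples` is hypothetical.

```python
def split_triples(text):
    """Return (T12, T3) as lists of 3-character symbols over Σ^3."""
    n = len(text)
    assert n % 3 == 0, "this sketch assumes len(text) is a multiple of 3"

    def triples(s):
        return [s[i:i + 3] for i in range(0, len(s), 3)]

    # T12 = T[1..n] followed by T[2..n] T[1]; T3 = T[3..n] T[1] T[2].
    t12 = triples(text) + triples(text[1:] + text[0])
    t3 = triples(text[2:] + text[:2])
    return t12, t3


if __name__ == "__main__":
    t12, t3 = split_triples("babaabba$")
    print(" ".join(t12))  # prints bab aab ba$ aba abb a$b
    print(" ".join(t3))   # prints baa bba $ba
```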
Main technique–Ψ’ function
Ψ’ is just like Ψ, but it is defined over SA12 and SA3
Ψ’ points to the position in SA12 or SA3 where T[k+1..n] (the suffix following the current suffix T[k..n]) is stored.
※ Note that Ψ’ is not the Ψ-function of T12 and T3.
The Ψ’ function consists of Ψ’T12 and Ψ’T3
Ψ’ function (residue-1, residue-2, residue-3)
Ψ’T12 (residue-1 suffixes of T): Let T[3k-2..n] be the suffix stored in SA12[i]. Then Ψ’T12[i] is the position in SA12 where the next suffix T[3k-1..n] is stored.
Ψ’T12 (residue-2 suffixes of T): Let T[3k-1..n] be the suffix stored in SA12[i]. Then Ψ’T12[i] is the position in SA3 where the next suffix T[3k..n] is stored.
Ψ’T3 (residue-3 suffixes of T): Let T[3k..n] be the suffix stored in SA3[i]. Then Ψ’T3[i] is the position in SA12 where the next suffix T[3k+1..n] is stored.
Ψ’ function (example)
position in T:  1 2 3 4 5 6 7 8 9
T   = b a b a a b b a $
T12 = bab aab ba$ aba abb a$b   (positions in T: 1 2 3 | 4 5 6 | 7 8 9 | 2 3 4 | 5 6 7 | 8 9 1)
T3  = baa bba $ba               (positions in T: 3 4 5 | 6 7 8 | 9 1 2)
i SA12 Ψ’T12
1 6 1 a$b
2 2 4 aab ba$
3 4 2 aba abb a$b
4 5 3 abb a$b
5 3 1 ba$
6 1 3 bab aab ba$
i SA3 Ψ’T3
1 3 6 $ba
2 1 2 baa bba $ba
3 2 5 bba $ba
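The two tables can be reproduced with a naive sketch that sorts suffixes directly. It only illustrates the definition of Ψ’ and is not the way the algorithm in this talk derives it. It assumes n is a multiple of 3 and that `$` is the unique smallest character, so sorting suffixes of T gives the same order as sorting suffixes of T12 and T3. The names `psi_prime`, `block`, and `next_pos` are hypothetical.

```python
def psi_prime(text):
    """Return (SA12, Ψ'T12, SA3, Ψ'T3) with 1-based values, built naively."""
    n = len(text)
    assert n % 3 == 0 and text.endswith("$") and text.count("$") == 1
    third = n // 3

    def by_suffix(p):
        return text[p - 1:]

    # Text positions (1-based) of residue-1/2 and residue-3 suffixes, sorted.
    pos12 = sorted((p for p in range(1, n + 1) if p % 3 != 0), key=by_suffix)
    pos3 = sorted((p for p in range(1, n + 1) if p % 3 == 0), key=by_suffix)
    rank12 = {p: i for i, p in enumerate(pos12, start=1)}
    rank3 = {p: i for i, p in enumerate(pos3, start=1)}

    def block(p):
        # Index of the triple starting at text position p inside T12 or T3.
        k = (p + 2) // 3
        return third + k if p % 3 == 2 else k

    def next_pos(p):
        # Position after p, wrapping from n back to 1 as in T3's last triple.
        return p + 1 if p < n else 1

    sa12 = [block(p) for p in pos12]
    sa3 = [block(p) for p in pos3]
    # Residue-1 suffixes point into SA12, residue-2 suffixes into SA3.
    psi12 = [rank12[p + 1] if p % 3 == 1 else rank3[p + 1] for p in pos12]
    # Residue-3 suffixes point into SA12.
    psi3 = [rank12[next_pos(p)] for p in pos3]
    return sa12, psi12, sa3, psi3


if __name__ == "__main__":
    for name, arr in zip(("SA12", "Psi'T12", "SA3", "Psi'T3"),
                         psi_prime("babaabba$")):
        print(name, arr)
```

For the example string this yields SA12 = [6, 2, 4, 5, 3, 1], Ψ’T12 = [1, 4, 2, 3, 1, 3], SA3 = [3, 1, 2], and Ψ’T3 = [6, 2, 5], as in the tables.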
Framework - outline
How to construct the Ψ function of T: bottom-up approach
step   string   result   length      alphabet
0      T        ΨT       n           Σ
1      T12      ΨT12     ⅔n          Σ^3
…
i                        (⅔)^i n     Σ^(3^i)
…
h      use any linear-time construction algorithm
h = log_3 log_|Σ| n
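One way to see how the choice of h relates to the space bound quoted earlier is the calculation below. It assumes that the conventional linear-time algorithm used at step h needs O(n_h log n) bits, where n_h is the string length at step h. This accounting is an assumption made for illustration and is not stated on the slides.

```latex
\begin{align*}
n_i &= \left(\tfrac{2}{3}\right)^{i} n, \qquad
  \text{bits per symbol at step } i = 3^{i}\log|\Sigma| \\
h &= \log_3 \log_{|\Sigma|} n \;\Rightarrow\;
  3^{h} = \log_{|\Sigma|} n, \quad
  2^{h} = \left(\log_{|\Sigma|} n\right)^{\log_3 2}
        = \left(\log_{|\Sigma|} n\right)^{\alpha} \\
n_h &= \frac{2^{h}}{3^{h}}\, n
     = n \left(\log_{|\Sigma|} n\right)^{\alpha-1} \\
n_h \log n &= n \left(\log_{|\Sigma|} n\right)^{\alpha-1}
              \cdot \log|\Sigma| \cdot \log_{|\Sigma|} n
            = n \log|\Sigma| \left(\log_{|\Sigma|} n\right)^{\alpha}
\end{align*}
```

Under that assumption, the bottom step uses O(n log|Σ| · (log_|Σ| n)^α) bits with α = log_3 2, which has the same form as the working-space bound given under "Our contributions".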
Step i - outline
S is split into S12 and S3
ΨS12 (from step i+1) → Ψ’S12, Ψ’S3
Ψ’S12, Ψ’S3 → (merge) → ΨS
Merging step
i SA12 Ψ’T12
1 6 1 a$b
2 2 4 aab ba$
3 4 2 aba abb a$b
4 5 3 abb a$b
5 3 1 ba$
6 1 3 bab aab ba$
i SA3 Ψ’T3
1 3 6 $ba
2 1 2 baa bba $ba
3 2 5 bba $ba
i SAT ΨT
1 9 8 $
2 8 1 a$
3 4 5 aabba$
4 2 7 abaabba$
5 5 9 abba$
6 7 2 ba$
7 3 3 baabba$
8 1 4 babaabba$
9 6 6 bba$
* Compare the entries of SA12 with the entries of SA3 in order
- Two suffixes are compared by following the Ψ’ function at most twice
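The comparison rule can be sketched as follows. This is an illustrative reading of the merging step, not the talk's implementation: it keeps SA12 and SA3 explicitly to look up characters and outputs SA of T rather than Ψ, whereas the actual algorithm works within the stated space bound. It assumes n is a multiple of 3 and `$` is the unique smallest character. The name `merge_sa12_sa3` and its helpers are hypothetical.

```python
def merge_sa12_sa3(text, sa12, psi12, sa3, psi3):
    """Merge SA12 and SA3 into the 1-based suffix array of `text`.

    All arrays hold 1-based values, as in the slide tables.
    """
    n = len(text)
    third = n // 3

    def pos12(b):  # text position of block b of T12
        return 3 * b - 2 if b <= third else 3 * (b - third) - 1

    def pos3(b):   # text position of block b of T3
        return 3 * b

    def ch(p):
        return text[p - 1]

    def less(i, j):
        """True if the suffix in SA12 row i precedes the one in SA3 row j."""
        p, q = pos12(sa12[i]), pos3(sa3[j])
        if ch(p) != ch(q):
            return ch(p) < ch(q)
        if p % 3 == 1:
            # One step: both successors are ranked in SA12.
            return psi12[i] < psi3[j]
        # Residue-2: successors lie in SA3 and SA12, so take a second step.
        i3 = psi12[i] - 1    # row in SA3 of T[p+1..n]
        i12 = psi3[j] - 1    # row in SA12 of T[q+1..n]
        a, b = ch(pos3(sa3[i3])), ch(pos12(sa12[i12]))
        if a != b:
            return a < b
        return psi3[i3] < psi12[i12]   # both ranks now refer to SA12

    out, i, j = [], 0, 0
    while i < len(sa12) and j < len(sa3):
        if less(i, j):
            out.append(pos12(sa12[i]))
            i += 1
        else:
            out.append(pos3(sa3[j]))
            j += 1
    out.extend(pos12(b) for b in sa12[i:])
    out.extend(pos3(b) for b in sa3[j:])
    return out


if __name__ == "__main__":
    print(merge_sa12_sa3("babaabba$",
                         [6, 2, 4, 5, 3, 1], [1, 4, 2, 3, 1, 3],
                         [3, 1, 2], [6, 2, 5]))
    # prints [9, 8, 4, 2, 5, 7, 3, 1, 6]
```

A residue-1 suffix needs one Ψ’ step after the first character comparison, and a residue-2 suffix needs two, which is consistent with the "at most twice" remark above.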
Conclusions & future works
We presented an alphabet-independent linear-time algorithm to construct SA, CSA, and FM-index using o(n log n)-bit working space
Future work: construct SA, CSA, and FM-index optimally, i.e., using O(n) time and O(n log|Σ|)-bit working space