1
Suffix tree and suffix array techniques for pattern analysis in strings
Esko Ukkonen, Univ Helsinki
Erice School 30 Oct 2005
Modified Alon Itai 2006
2
Pattern finding & synthesis problems
• $T = t_1 t_2 \dots t_n$ and $P = p_1 p_2 \dots p_m$ are strings of symbols over a finite alphabet
• Indexing problem: preprocess T (build an index structure) so that the occurrences of different patterns P can be found fast
  – static text, any given pattern P
• Pattern synthesis problem: learn from T new patterns that occur surprisingly often
• What is a pattern? Exact substring, approximate substring, with generalized symbols, with gaps, …
4
Suffix array: example
• suffix array = lexicographic order of the suffixes
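Suffixes of hattivatti in text order (left) and in lexicographic order with their start positions (right):

  i   suffix        sorted suffix   position
  1   hattivatti    ε               11
  2   attivatti     atti             7
  3   ttivatti      attivatti        2
  4   tivatti       hattivatti       1
  5   ivatti        i               10
  6   vatti         ivatti           5
  7   atti          ti               9
  8   tti           tivatti          4
  9   ti            tti              8
 10   i             ttivatti         3
 11   ε             vatti            6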
5
Suffix array
• suffix array SA(T) = an array giving the lexicographic order of the suffixes of T
• space requirement: 5|T| (why 5? presumably 4 bytes per suffix array entry plus 1 byte per text symbol)
• practitioners like suffix arrays (simplicity, space efficiency)
• theoreticians like suffix trees (explicit structure)
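The definition can be made concrete with a minimal sketch that builds the suffix array by plain sorting. This is not one of the constructions discussed in these slides; it only illustrates what the array contains. The names `naive_suffix_array`, `cmp_suffix` and `sa_text` are hypothetical, positions are 0-based (the example slide counts from 1), and the text is assumed to be a NUL-terminated C string.

```c
#include <stdlib.h>
#include <string.h>

/* Text shared with the qsort comparator (hypothetical helper state). */
static const char *sa_text;

/* Compare two suffixes, given by their start positions, with strcmp. */
static int cmp_suffix(const void *a, const void *b)
{
    return strcmp(sa_text + *(const int *)a, sa_text + *(const int *)b);
}

/* Fill sa[0..n] with the start positions of all n+1 suffixes of t
   (including the empty suffix at position n) in lexicographic order.
   Each comparison may scan O(n) symbols, so this is O(n^2 log n) in the
   worst case; it is a baseline, not a linear-time construction. */
void naive_suffix_array(const char *t, int *sa)
{
    int n = (int)strlen(t);
    for (int i = 0; i <= n; i++)
        sa[i] = i;
    sa_text = t;
    qsort(sa, (size_t)n + 1, sizeof sa[0], cmp_suffix);
}
```

For hattivatti this should give the positions of the example slide shifted down by one, because the slide numbers the suffixes from 1.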
6
Pattern search from suffix array
Binary search for the pattern att among the sorted suffixes of hattivatti; the suffixes that start with att are adjacent in the array:

 position   sorted suffix
 11         ε
  7         atti          <- att
  2         attivatti     <- att
  1         hattivatti
 10         i
  5         ivatti
  9         ti
  4         tivatti
  8         tti
  3         ttivatti
  6         vatti
7
• The search time is O(m log n), where
  m = length of the search string,
  n = length of the text (and size of the suffix array).
• With LCA (lowest common ancestor) queries, which give the longest common prefix of two suffixes, time = O(m + log n).
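[Figure: binary search steps comparing pat with the suffix at the midpoint m of the current interval (l, u); depending on the outcome, the next step sets l = m or u = m. In the figure m denotes the midpoint, not the pattern length.]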
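The O(m log n) search can be sketched as two binary searches over the suffix array, one for each end of the range of suffixes that start with the pattern. This is a minimal illustration, not the O(m + log n) variant; `sa_bound` and `sa_count` are hypothetical names, positions are 0-based, and the text is assumed to be a NUL-terminated C string.

```c
#include <string.h>

/* With upper == 0, return the first index in sa[0..len) whose suffix is
   >= p on its first m = strlen(p) symbols; with upper != 0, return the
   first index whose suffix is > p on those symbols. Each probe costs
   O(m), so one call costs O(m log n). */
int sa_bound(const char *t, const int *sa, int len, const char *p, int upper)
{
    size_t m = strlen(p);
    int lo = 0, hi = len;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        int c = strncmp(t + sa[mid], p, m);
        if (c < 0 || (upper && c == 0))
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Number of occurrences of p in t: the width of the matching range. */
int sa_count(const char *t, const int *sa, int len, const char *p)
{
    return sa_bound(t, sa, len, p, 1) - sa_bound(t, sa, len, p, 0);
}
```

The occurrence positions themselves are the entries of `sa` between the two bounds.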
8
Recent suffix array constructions
• Manber & Myers (1990): O(|T| log |T|)
• linear time via suffix tree
• January / June 2003: direct linear time construction of suffix array
  - Kim, Sim, Park, Park (CPM03)
  - Kärkkäinen & Sanders (ICALP03)
  - Ko & Aluru (CPM03)
9
Kärkkäinen-Sanders algorithm
1. Construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.
2. Construct the suffix array of the remaining suffixes using the result of the first step.
3. Merge the two suffix arrays into one.
10
Notation
• string $T = T[0,n) = t_0 t_1 \dots t_{n-1}$
• suffix $S_i = T[i,n) = t_i t_{i+1} \dots t_{n-1}$
• for $C \subseteq [0,n]$: $S_C = \{ S_i \mid i \in C \}$
• suffix array $SA[0,n]$ of T is a permutation of $[0,n]$ satisfying $S_{SA[0]} < S_{SA[1]} < \dots < S_{SA[n]}$, where $S_{SA[0]} = T[SA[0],n)$
11
Running example
• T[0,n) = y a b b a d a b b a d o with n = 12, followed by padding symbols 0 0
    i:    0  1  2  3  4  5  6  7  8  9 10 11 12 13
    t_i:  y  a  b  b  a  d  a  b  b  a  d  o  0  0
• SA = (12,1,6,4,9,3,8,2,7,5,10,11,0)
• suffixes in suffix array order (with padding):
    i    S_i
    12   0 0
     1   a b b a d a b b a d o 0 0
     6   a b b a d o 0 0
     4   a d a b b a d o 0 0
     9   a d o 0 0
     3   b a d a b b a d o 0 0
     8   b a d o 0 0
     2   b b a d a b b a d o 0 0
     7   b b a d o 0 0
     5   d a b b a d o 0 0
    10   d o 0 0
    11   o 0 0
     0   y a b b a d a b b a d o 0 0
12
Step 0: Construct a sample
• for k = 0,1,2: $B_k = \{ i \in [0,n] \mid i \bmod 3 = k \}$
• $C = B_1 \cup B_2$: sample positions
• $S_C$: sample suffixes
• Example: $B_1 = \{1,4,7,10\}$, $B_2 = \{2,5,8,11\}$, $C = \{1,4,7,10,2,5,8,11\}$
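As a small sketch of Step 0, the sample positions can be listed by walking the two residue classes in turn. The function name `sample_positions` is hypothetical, and the output order (all of B1, then all of B2) is chosen to match the example above.

```c
/* Write the sample positions C = B1 followed by B2 for positions [0,n]
   into c and return |C|. The caller provides room for every position in
   [0,n] that is not a multiple of 3. */
int sample_positions(int n, int *c)
{
    int count = 0;
    for (int k = 1; k <= 2; k++)          /* residue classes 1 and 2 */
        for (int i = k; i <= n; i += 3)
            c[count++] = i;
    return count;
}
```

With n = 12 this yields the eight positions of the running example.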
13
Step 1: Sort sample suffixes
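• for k = 1,2, construct
  $R_k = [t_k t_{k+1} t_{k+2}]\,[t_{k+3} t_{k+4} t_{k+5}] \dots [t_{\max B_k} t_{\max B_k + 1} t_{\max B_k + 2}]$
  $R = R_1 \circ R_2$ (concatenation of $R_1$ and $R_2$)
• Suffixes of R correspond to $S_C$: the suffix $[t_i t_{i+1} t_{i+2}] \dots$ of R corresponds to $S_i$.
• The correspondence is order preserving: let $R'_i$ and $R'_j$ be the suffixes of R corresponding to $S_i$ and $S_j$. Then $R'_i < R'_j$ iff $S_i < S_j$.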
14
Sort the suffixes of R
Radix sort the characters and rename with ranks to obtain R´.
Example:
  R = R1 º R2 = [abb][ada][bba][do0] [bba][dab][bad][o00]
  sorted distinct characters and their ranks:
      1     2     3     4     5     6     7
    [abb] [ada] [bad] [bba] [dab] [do0] [o00]
  R´ = (1,2,4,6,4,5,3,7)
If all characters are different, their order directly gives the order of suffixes.
Otherwise, sort the suffixes of R´ using Kärkkäinen-Sanders.
Note: |R´| = 2n/3.
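The renaming step can be sketched as follows. To keep the example short it sorts the triples with `qsort` instead of the radix sort named on the slide, so it does not run in linear time; the names `rename_triples`, `cmp_triple` and `tr_text` are hypothetical. The text is assumed to be followed by three 0 bytes of padding and to contain no 0 byte itself.

```c
#include <stdlib.h>
#include <string.h>

/* Text shared with the comparator (hypothetical helper state). */
static const unsigned char *tr_text;

/* Order positions by the three symbols that start there. */
static int cmp_triple(const void *a, const void *b)
{
    return memcmp(tr_text + *(const int *)a, tr_text + *(const int *)b, 3);
}

/* t holds n symbols followed by three 0 bytes. Writes R' into r, one
   name per sample position (all of B1, then all of B2), and returns its
   length, or -1 if memory runs out. Names start at 1 and equal triples
   share a name. */
int rename_triples(const unsigned char *t, int n, int *r)
{
    int m = 0, next = 0;
    int *c = malloc(((size_t)n + 1) * sizeof *c);           /* C = B1 then B2 */
    int *sorted = malloc(((size_t)n + 1) * sizeof *sorted); /* C ordered by triple */
    int *name = calloc((size_t)n + 1, sizeof *name);        /* name[i] for i in C */
    if (!c || !sorted || !name) {
        free(c); free(sorted); free(name);
        return -1;
    }
    for (int k = 1; k <= 2; k++)
        for (int i = k; i <= n; i += 3)
            c[m++] = i;
    memcpy(sorted, c, (size_t)m * sizeof *c);
    tr_text = t;
    qsort(sorted, (size_t)m, sizeof *sorted, cmp_triple);
    for (int j = 0; j < m; j++) {
        /* a new name whenever the triple differs from its predecessor */
        if (j == 0 || memcmp(t + sorted[j - 1], t + sorted[j], 3) != 0)
            next++;
        name[sorted[j]] = next;
    }
    for (int j = 0; j < m; j++)
        r[j] = name[c[j]];
    free(c); free(sorted); free(name);
    return m;
}
```

If the largest name equals the number of sample positions, all triples are distinct and no recursive call is needed.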
15
Step 1 (cont.)
• Once the sample suffixes are sorted, assign a rank to each: rank($S_i$) = the rank of $S_i$ in $S_C$; rank($S_{n+1}$) = rank($S_{n+2}$) = 0
• Example: R´ = (1,2,4,6,4,5,3,7). Suffixes of R´ in sorted order:
    0: ε           3: 37        6: 537
    1: 12464537    4: 4537      7: 64537
    2: 2464537     5: 464537    8: 7
  SA_R´ = (8,0,1,6,4,2,5,3,7) (the suffix array of R´)
  SA_R´^{-1} = (1,2,5,7,4,6,3,8)
    i:          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
    rank(S_i):  –  1  4  –  2  6  –  5  3  –  7  8  –  0  0
16
Step 2: Sort nonsample suffixes
• for each non-sample $S_i \in S_{B_0}$ (note that rank($S_{i+1}$) is always defined for $i \in B_0$):
  $S_i \le S_j \leftrightarrow (t_i, \mathrm{rank}(S_{i+1})) \le (t_j, \mathrm{rank}(S_{j+1}))$
• radix sort the pairs $(t_i, \mathrm{rank}(S_{i+1}))$
• Example: $S_{12} < S_6 < S_9 < S_3 < S_0$ because (0,0) < (a,5) < (a,7) < (b,2) < (y,1)
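Step 2 can be sketched by sorting the positions in B0 on the pair (first symbol, rank of the following sample suffix). The sketch uses `qsort` in place of the radix sort named above, so it is not linear time; `sort_nonsample`, `cmp_nonsample`, `ns_text` and `ns_rank` are hypothetical names. The array `rank` is assumed to hold rank(S_i) at every sample index i, with rank[n+1] = rank[n+2] = 0.

```c
#include <stdlib.h>

/* Shared with the comparator (hypothetical helper state). */
static const unsigned char *ns_text;
static const int *ns_rank;

/* Compare S_i and S_j for i, j in B0 through (t_i, rank(S_{i+1})). */
static int cmp_nonsample(const void *a, const void *b)
{
    int i = *(const int *)a, j = *(const int *)b;
    if (ns_text[i] != ns_text[j])
        return ns_text[i] < ns_text[j] ? -1 : 1;
    return (ns_rank[i + 1] > ns_rank[j + 1]) - (ns_rank[i + 1] < ns_rank[j + 1]);
}

/* t: n symbols followed by 0 padding; rank: n+3 entries as described
   above. Writes the positions of B0 in suffix order into out and returns
   how many there are. */
int sort_nonsample(const unsigned char *t, int n, const int *rank, int *out)
{
    int m = 0;
    for (int i = 0; i <= n; i += 3)
        out[m++] = i;
    ns_text = t;
    ns_rank = rank;
    qsort(out, (size_t)m, sizeof *out, cmp_nonsample);
    return m;
}
```

On the running example the ranks from Step 1 give the order 12, 6, 9, 3, 0.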
17
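Step 2 in more detail
Example: $S_{12} < S_6 < S_9 < S_3 < S_0$ because
  $S_0$ = yabbadabbado = y $S_1$, giving the pair (y, rank($S_1$)) = (y,1)
  $S_3$ = badabbado = b $S_4$, giving the pair (b, rank($S_4$)) = (b,2)
  $S_6$ = abbado = a $S_7$, giving the pair (a, rank($S_7$)) = (a,5)
  $S_9$ = ado = a $S_{10}$, giving the pair (a, rank($S_{10}$)) = (a,7)
  $S_{12}$ = ε, padded to 0 $S_{13}$, giving the pair (0, rank($S_{13}$)) = (0,0)
and (0,0) < (a,5) < (a,7) < (b,2) < (y,1)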
18
Step 3: Merge
• merge the two sorted sets of suffixes using a standard comparison-based merging
• to compare $S_i \in S_C$ with $S_j \in S_{B_0}$, distinguish two cases:
  • $i \in B_1$: $S_i \le S_j \leftrightarrow (t_i, \mathrm{rank}(S_{i+1})) \le (t_j, \mathrm{rank}(S_{j+1}))$
  • $i \in B_2$: $S_i \le S_j \leftrightarrow (t_i, t_{i+1}, \mathrm{rank}(S_{i+2})) \le (t_j, t_{j+1}, \mathrm{rank}(S_{j+2}))$
• note that the ranks are defined in all cases!
• Example: $S_1 < S_6$ as (a,4) < (a,5) (case $B_1$), and $S_3 < S_8$ as (b,a,6) < (b,a,7) (case $B_2$)
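The two-case comparison and the merge can be sketched as below. The names `sample_leq` and `merge_suffixes` are hypothetical; `rank` is assumed to hold rank(S_i) at every sample index with rank[n+1] = rank[n+2] = 0, and the text is assumed to be followed by 0 padding and to contain no 0 byte itself.

```c
/* Return nonzero if the sample suffix S_i (i mod 3 != 0) is <= the
   nonsample suffix S_j (j mod 3 == 0), looking only at one or two
   symbols and at ranks of sample suffixes. */
int sample_leq(const unsigned char *t, const int *rank, int i, int j)
{
    if (t[i] != t[j])
        return t[i] < t[j];
    if (i % 3 == 1)                  /* i in B1: i+1 and j+1 are sample */
        return rank[i + 1] <= rank[j + 1];
    if (t[i + 1] != t[j + 1])        /* i in B2: i+2 and j+2 are sample */
        return t[i + 1] < t[j + 1];
    return rank[i + 2] <= rank[j + 2];
}

/* Merge the sorted sample positions sc[0..nc) and the sorted nonsample
   positions sb[0..nb) into sa[0..nc+nb) with nc + nb comparisons at most. */
void merge_suffixes(const unsigned char *t, const int *rank,
                    const int *sc, int nc, const int *sb, int nb, int *sa)
{
    int a = 0, b = 0, k = 0;
    while (a < nc && b < nb)
        sa[k++] = sample_leq(t, rank, sc[a], sb[b]) ? sc[a++] : sb[b++];
    while (a < nc)
        sa[k++] = sc[a++];
    while (b < nb)
        sa[k++] = sb[b++];
}
```

Each comparison takes constant time, which is what makes the merge linear.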
19
Running time O(n)
• excluding the recursive call, everything can be done in linear time
• the recursion is on a string of length 2n/3
• thus the time is given by the recurrence T(n) = T(2n/3) + O(n)
• hence T(n) = O(n)
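The last step can be checked by unrolling the recurrence. The sketch below assumes the non-recursive work on an input of length n is at most cn for some constant c; the constant c is introduced here only for illustration.

```latex
\begin{align*}
T(n) &\le cn + T\!\left(\tfrac{2}{3}n\right)
      \le cn + c\,\tfrac{2}{3}n + T\!\left(\left(\tfrac{2}{3}\right)^{2} n\right)
      \le \dots \\
     &\le cn \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k}
      = \frac{cn}{1 - \tfrac{2}{3}} = 3cn = O(n).
\end{align*}
```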
21
LCP table
• Longest Common Prefix of successive elements of the suffix array:
• LCP[i] = length of the longest common prefix of suffixes $S_{SA[i]}$ and $S_{SA[i+1]}$
• Naive algorithm:
  • Enter the suffixes in a trie
  • For each pair of successive suffixes, find their lowest common ancestor (lca); its depth is the LCP
• Complexity = $O(n^2)$
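As a baseline for comparison, the LCP table can also be filled by comparing successive suffixes symbol by symbol. This is not the trie-based method listed above but a simpler stand-in with the same quadratic worst case; `naive_lcp` is a hypothetical name, positions are 0-based, and the text is assumed to be a NUL-terminated C string.

```c
/* lcp[i] = length of the longest common prefix of the suffixes that
   start at sa[i] and sa[i+1], for 0 <= i < len - 1. Every pair is
   compared from scratch, so the worst case is O(n^2). */
void naive_lcp(const char *t, const int *sa, int len, int *lcp)
{
    for (int i = 0; i + 1 < len; i++) {
        const char *a = t + sa[i];
        const char *b = t + sa[i + 1];
        int h = 0;
        while (a[h] != '\0' && a[h] == b[h])
            h++;
        lcp[i] = h;
    }
}
```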
22
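Kasai et al., CPM 2001
Key observation:
• Let LCP[q] = h > 1 with SA[q] = i and SA[q+1] = k, i.e.,
  $S_{SA[q]} = t_i t_{i+1} \dots t_{i+h-1} t_{i+h} \dots$
  $S_{SA[q+1]} = t_k t_{k+1} \dots t_{k+h-1} t_{k+h} \dots = t_i t_{i+1} \dots t_{i+h-1} t_{k+h} \dots$ with $t_{k+h} \ne t_{i+h}$
• Then $t_{i+1} \dots t_{i+h-1} = t_{k+1} \dots t_{k+h-1}$.
• Define p by SA[p] = i+1, so $S_{SA[p]} = t_{i+1} \dots t_{i+h-1} \dots$. Since $S_{i+1} < S_{k+1}$ and both begin with $t_{i+1} \dots t_{i+h-1}$, every suffix between them in the suffix array begins with it too; therefore $S_{SA[p+1]} = t_{i+1} \dots t_{i+h-1} \dots$
• i.e., LCP[p] ≥ h-1
• When computing LCP[p] we can skip the first h-1 symbols and start the comparisons at offset h-1.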
23
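The algorithm

/* T[0..n-1] is the text and T[n] an end marker that occurs nowhere else;
   SA[0..n-1] orders the suffixes S_0 .. S_{n-1}; ISA stands for SA^-1 */
for (i = 0; i < n; i++)              /* compute ISA = SA^-1 */
    ISA[SA[i]] = i;
h = 0;
for (p = 0; p < n; p++) {
    if (ISA[p] < n - 1) {            /* S_p has a successor in SA */
        r = SA[ISA[p] + 1];
        while (T[r + h] == T[p + h])
            h++;                     /* innermost statement */
        LCP[ISA[p]] = h;
        if (h > 0)
            h--;
    }
}

Complexity:
• h is decreased at most n times, and h ≤ n, so h can be increased at most 2n times
• i.e., the innermost statement is executed ≤ 2n times
• Total time = O(n)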