Approximate String Matching Using Compressed Suffix Arrays. Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249. Advisor: Prof. R. C. T. Lee. Speaker: C. W. Lu.


Page 1

Approximate String Matching Using Compressed Suffix Arrays

Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249

Advisor: Prof. R. C. T. Lee

Speaker: C. W. Lu

Page 2

• Let x and y be two strings. The edit distance d(x, y) is the minimum number of character insertions, deletions, and replacements needed to convert x into y.

• k-difference string matching problem:

– Given a text T of length n, a pattern P of length m, and an error bound k.

– Find all positions i of T such that there exists a suffix S of T(1, i) with d(S, P) ≦ k.
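As a point of reference for the definitions above, here is a minimal Python sketch of the classical dynamic-programming approach (not the index-based method of the paper). The function names `edit_distance` and `k_difference_positions` are illustrative choices, and the second function assumes the empty suffix is allowed, so that the pattern may start anywhere in T.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and replacements turning x into y."""
    prev = list(range(len(y) + 1))              # distances from "" to y[:j]
    for i, cx in enumerate(x, start=1):
        cur = [i]                               # distance from x[:i] to ""
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,                 # delete cx
                           cur[j - 1] + 1,              # insert cy
                           prev[j - 1] + (cx != cy)))   # replace, or match for free
        prev = cur
    return prev[-1]


def k_difference_positions(text: str, pattern: str, k: int) -> list[int]:
    """1-based positions i such that some suffix S of text[:i] has d(S, pattern) <= k."""
    # col[j] = min over suffixes S of text[:i] of d(S, pattern[:j])
    col = list(range(len(pattern) + 1))
    hits = []
    for i, ct in enumerate(text, start=1):
        new = [0]                               # the empty suffix matches the empty prefix
        for j, cp in enumerate(pattern, start=1):
            new.append(min(col[j] + 1,                  # ct is an extra text character
                           new[j - 1] + 1,              # cp is missing from the text
                           col[j - 1] + (ct != cp)))    # replace, or match for free
        col = new
        if col[-1] <= k:
            hits.append(i)
    return hits
```

This scan costs O(nm) time per query, which is the cost that an index over T is meant to avoid.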

Page 3

• The approach of this paper is as follows:

• Given a pattern P and an error bound k, we generate all modified patterns P’ that can be derived from P with at most k errors.

• Then we conduct an exact match of every such P’ against T.

Page 4

• Example:

T = abbaaa, P = aba and k = 1.

From P and k, we generate the following P’:

ba, aaba, baba, bba, aa, abba, aaa, ab, abaa, abab, abb, aba.
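The example can be reproduced with a brute-force sketch that applies every single insertion, deletion and replacement to P and then tests each candidate against T with a plain substring search. The helper name `one_edit_neighbors` is hypothetical, and the alphabet {a, b} is assumed from the example.

```python
def one_edit_neighbors(p: str, alphabet: str) -> set[str]:
    """All strings obtained from p by at most one insertion, deletion or replacement."""
    out = {p}
    for i in range(len(p) + 1):
        for c in alphabet:
            out.add(p[:i] + c + p[i:])              # insertion before position i
        if i < len(p):
            out.add(p[:i] + p[i + 1:])              # deletion of p[i]
            for c in alphabet:
                out.add(p[:i] + c + p[i + 1:])      # replacement of p[i] by c
    return out


candidates = one_edit_neighbors("aba", "ab")
# Exact matching of every candidate against T = abbaaa.
matches = sorted(q for q in candidates if q in "abbaaa")
print(sorted(candidates))
print(matches)
```

Using a set removes the duplicates that arise when different edits produce the same string, for example abba from inserting b at either of two positions.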

Page 5

• Then we conduct an exact matching of every P’ against T. Any success indicates that there is a substring S of T such that d(S, P) ≦ k.

• How can we generate all the P’ that we need?

• We use the following observation.

Page 6

[Figure: a substring S = S1S2 of T aligned with the pattern P = P1P2]

Let S be a substring of T with S = S1S2, and let P = P1P2.

If d(S1, P1) ≦ k and d(S2, P2) = 0, then d(S, P) ≦ k.

Page 7

Example:

T = A C A C A A A A A C A C C (positions 1 to 13)

P = A G A B C A (positions 1 to 6)

k = 2

Consider the substring S = T(6, 11) = AAAACA.

Let S1 = T(6, 9) = AAAA and S2 = T(10, 11) = CA, so that P1 = P(1, 4) = AGAB and P2 = P(5, 6) = CA.

d(S1, P1) = 2 ≦ k and d(S2, P2) = 0.

We have d(S, P) = 2 ≦ k.

Page 8

Example:

T = A C A C A A A A A C A C C (positions 1 to 13)

P = A G A B C A (positions 1 to 6)

k = 2

Consider the substring S = T(8, 11) = AACA.

Let S1 = T(8, 9) = AA and S2 = T(10, 11) = CA, so that P1 = P(1, 4) = AGAB and P2 = P(5, 6) = CA.

d(S1, P1) = 2 ≦ k and d(S2, P2) = 0.

We have d(S, P) = 2 ≦ k.
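The two examples can be checked numerically. The sketch below recomputes the distances with a standard dynamic-programming edit distance; the helper names `edit_distance` and `check_split` are illustrative and are not taken from the paper.

```python
def edit_distance(x: str, y: str) -> int:
    """Standard dynamic-programming edit distance (insert, delete, replace)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = cur
    return prev[-1]


T = "ACACAAAAACACC"
P = "AGABCA"
K = 2
P1, P2 = P[:4], P[4:]           # P1 = AGAB, P2 = CA


def check_split(first: int, cut: int, last: int) -> tuple[int, int]:
    """For S1 = T(first, cut) and S2 = T(cut+1, last), 1-based inclusive,
    return (d(S1, P1), d(S1 S2, P)) after checking that S2 equals P2."""
    s1, s2 = T[first - 1:cut], T[cut:last]
    assert s2 == P2             # the suffix part must match exactly
    return edit_distance(s1, P1), edit_distance(s1 + s2, P)


print(check_split(6, 9, 11))    # S = AAAACA
print(check_split(8, 9, 11))    # S = AACA
```

In both cases the distance of the whole substring does not exceed the distance of the edited prefix part, which is what the observation states.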

Page 9

• Based on the above observation, we can generate all edited patterns P’ by editing a prefix of P and keeping the remaining suffix untouched.

• Consider P=aba, k=1.

Page 10

• P=aba, k=1. Each edited pattern is shown with the number of errors k’ used so far.

P = aba (k’ = 0)

i = 1:
  ba (Deletion), k’ = 1
  aaba (Insertion), k’ = 1
  baba (Insertion), k’ = 1
  bba (Substitution), k’ = 1
  aba (No operation), k’ = 0

i = 2:
  aa (Deletion), k’ = 1
  aaba (Insertion), k’ = 1
  abba (Insertion), k’ = 1
  aaa (Substitution), k’ = 1
  aba (No operation), k’ = 0

i = 3:
  ab (Deletion), k’ = 1
  abaa (Insertion), k’ = 1
  abba (Insertion), k’ = 1
  abb (Substitution), k’ = 1
  aba (No operation), k’ = 0

i = 4:
  abaa (Insertion), k’ = 1
  abab (Insertion), k’ = 1

Page 11

• P=aba, k=2.

The first level of edits, from P = aba (k’ = 0) to the patterns with k’ = 1 at i = 1, 2, 3, 4, is the same as for k = 1 on Page 10. Because k’ = 1 is still below k = 2, every pattern with k’ = 1 can be edited once more at the positions to the right of its last edit.

Page 12

• P=aba, k=2. Editing continues from the pattern ba (k’ = 1), obtained by the deletion at i = 1.

ba (k’ = 1)

i = 2:
  a (Deletion), k’ = 2
  aba (Insertion), k’ = 2
  bba (Insertion), k’ = 2
  aa (Substitution), k’ = 2
  ba (No operation), k’ = 1

i = 3:
  b (Deletion), k’ = 2
  baa (Insertion), k’ = 2
  bba (Insertion), k’ = 2
  bb (Substitution), k’ = 2
  ba (No operation), k’ = 1

i = 4:
  baa (Insertion), k’ = 2
  bab (Insertion), k’ = 2

Page 13

For i = 1 to m+1:

• Split P at position i as P = PL PR, and write the edited pattern as P’ = PL’ PR’, where k’ = d(PL’, PL) ≦ k and d(PR’, PR) = 0.

• At position i, apply one of the following to obtain the next P’:

– Deletion, k’++

– Replacement, k’++

– Insertion, k’++

– No operation.

• Terminate if k’ > k.
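The left-to-right editing scheme can be sketched as a recursive enumeration. This is an illustrative reading of the slides rather than the authors' code: the name `edited_patterns` is hypothetical, positions are 0-based, and a replacement by the same character is skipped because it coincides with "no operation".

```python
def edited_patterns(p: str, k: int, alphabet: str) -> dict[str, int]:
    """Map every P' reachable with at most k edits to the fewest edits k' used to reach it."""
    best: dict[str, int] = {}

    def walk(left: str, i: int, used: int) -> None:
        # left = PL' (edited prefix), p[i:] = PR (untouched suffix), used = k'
        cand = left + p[i:]
        if used < best.get(cand, k + 1):
            best[cand] = used
        if used == k:                       # one more edit would give k' > k
            return
        for j in range(i, len(p) + 1):
            prefix = left + p[i:j]          # "no operation" on positions i..j-1
            if j < len(p):
                walk(prefix, j + 1, used + 1)               # deletion of p[j]
                for c in alphabet:
                    if c != p[j]:
                        walk(prefix + c, j + 1, used + 1)   # replacement of p[j] by c
            for c in alphabet:
                walk(prefix + c, j, used + 1)               # insertion of c before p[j]

    walk("", 0, 0)
    return best


level_one = edited_patterns("aba", 1, "ab")
level_two = edited_patterns("aba", 2, "ab")
print(sorted(level_one))
```

Because edits are applied strictly from left to right, every recursive call only has to remember the edited prefix, the current position and the error count.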

Page 14

• Our problem now becomes the following: given a pattern P, we produce a modified pattern P’, and we must determine whether P’ exactly matches some substring of T.

• For example, suppose P = aba. One of the modified patterns is ba, so we would like to find out whether ba exactly matches a substring of T.

Page 15

• This exact matching can be done using the suffix array and the inverse suffix array.

Page 16

Suffix Array

• Let

\[ T = t_0 t_1 \cdots t_{n-1} t_n \]

where t0, t1, …, tn-1 are symbols of an alphabet A and tn = $ is a special symbol that is not in A and is smaller than any symbol in A.

• The jth suffix of T is defined as T(j, n) = tj…tn and is denoted by Tj.

• The suffix array SA[0..n] of T is an array of the integers j, each representing the suffix Tj, sorted in the lexicographic order of the corresponding suffixes.

Page 17

Example:

i:  0 1 2 3 4 5 6 7 8 9
T:  G A C A G T T C G $

Suffixes of T:

{GACAGTTCG$, ACAGTTCG$, CAGTTCG$, AGTTCG$, GTTCG$, TTCG$, TCG$, CG$, G$, $}

Lexicographic order:

$, ACAGTTCG$, AGTTCG$, CAGTTCG$, CG$, G$, GACAGTTCG$, GTTCG$, TCG$, TTCG$

= T9, T1, T3, T2, T7, T8, T0, T4, T6, T5

i:      0 1 2 3 4 5 6 7 8 9
SA[i]:  9 1 3 2 7 8 0 4 6 5
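The example array can be reproduced with a few lines of Python. This is a naive construction by sorting the suffixes directly, chosen for clarity only; it is not one of the linear-time algorithms mentioned in the slides. It relies on the fact that '$' precedes the letters A to Z in ASCII order.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text, sorted by the suffixes themselves."""
    # '$' has a smaller character code than any letter, so plain string
    # comparison agrees with the ordering assumed in the slides.
    return sorted(range(len(text)), key=lambda j: text[j:])


sa = suffix_array("GACAGTTCG$")
for i, j in enumerate(sa):
    print(i, j, "GACAGTTCG$"[j:])
```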

Page 18

Inverse Suffix Array

• The inverse suffix array of T is denoted by SA^{-1}[i].

• SA^{-1}[i] equals the number of suffixes that are lexicographically smaller than Ti.

Page 19

Example:

i:  0 1 2 3 4 5 6 7 8 9
T:  G A C A G T T C G $

Suffixes in lexicographic order, with SA[i] and SA^{-1}[i]:

i   SA[i]   Suffix              SA^{-1}[i]
0   9       $ (T9)              6
1   1       ACAGTTCG$ (T1)      1
2   3       AGTTCG$ (T3)        3
3   2       CAGTTCG$ (T2)       2
4   7       CG$ (T7)            7
5   8       G$ (T8)             9
6   0       GACAGTTCG$ (T0)     8
7   4       GTTCG$ (T4)         4
8   6       TCG$ (T6)           5
9   5       TTCG$ (T5)          0

SA^{-1}[SA[x]] = x.

SA^{-1}[0] = 6 because there are 6 suffixes smaller than T0 = GACAGTTCG$.
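Since SA is a permutation of 0..n, its inverse can be filled in with one pass. The sketch below uses the SA values of the example as a literal; the function name `inverse_suffix_array` is illustrative.

```python
def inverse_suffix_array(sa: list[int]) -> list[int]:
    """inv[j] = rank of suffix T_j, i.e. the number of suffixes smaller than T_j."""
    inv = [0] * len(sa)
    for rank, j in enumerate(sa):
        inv[j] = rank
    return inv


SA = [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]     # suffix array of GACAGTTCG$
INV = inverse_suffix_array(SA)
print(INV)
```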

Page 20

• SA and SA^{-1} each occupy O(n log n) bits. Both data structures can be constructed in linear time [13, 15, 17].

Page 21

• In this paper, an interval [st..ed] is called the range of the suffix array of T corresponding to a string P if [st..ed] is the largest interval such that P is a prefix of every suffix Tj for j = SA[st], SA[st+1], …, SA[ed].

We write [st..ed] = range(T, P).

Page 22

Example:

i:  0 1 2 3 4 5 6 7 8 9
T:  G A C A G T T C G $

i   SA[i]   Suffix
0   9       $ (T9)
1   1       ACAGTTCG$ (T1)
2   3       AGTTCG$ (T3)
3   2       CAGTTCG$ (T2)
4   7       CG$ (T7)
5   8       G$ (T8)
6   0       GACAGTTCG$ (T0)
7   4       GTTCG$ (T4)
8   6       TCG$ (T6)
9   5       TTCG$ (T5)

P = G.

G is a prefix of T8, T0 and T4.

T8 = TSA[5], T0 = TSA[6], T4 = TSA[7].

st = 5, ed = 7,

range(T, P) = [5..7].
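A simple way to compute range(T, P) from scratch is a binary search over the suffixes truncated to |P| characters, which remain sorted. The sketch below materialises those truncated suffixes in a list for readability, so it costs O(nm) time per call and is only meant to make the definition concrete; the name `sa_range` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def sa_range(text: str, sa: list[int], p: str):
    """Return (st, ed) = range(T, P), or None when P does not occur in T."""
    # Truncating sorted suffixes to len(p) characters keeps them sorted.
    prefixes = [text[j:j + len(p)] for j in sa]
    st = bisect_left(prefixes, p)
    ed = bisect_right(prefixes, p) - 1
    return (st, ed) if st <= ed else None


TEXT = "GACAGTTCG$"
SA = [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]
print(sa_range(TEXT, SA, "G"))
print(sa_range(TEXT, SA, "GA"))
```

For the empty pattern the function returns the whole array, which corresponds to the interval [0..n] used as the starting point of the search.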

Page 23

Lemma 1 (Gusfield [12])

Given a text T together with its suffix array, assume [st..ed] = range(T, P). Then, for any character c, the interval [st’..ed’] = range(T, Pc) can be computed in O(log n) time.
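One plausible way to realise Lemma 1 is sketched below; the slide only cites Gusfield [12], so the details here are an assumption. Inside [st..ed] every suffix starts with P, so the characters at offset |P| are already in non-decreasing order, and two binary searches on that single character locate the sub-interval for Pc. The name `extend_right` and the explicit `depth` parameter (standing for |P|) are illustrative.

```python
def extend_right(text: str, sa: list[int], st: int, ed: int, depth: int, c: str):
    """Given (st, ed) = range(T, P) with |P| = depth, return range(T, Pc) or None."""
    def char_at(rank: int) -> str:
        pos = sa[rank] + depth
        # A suffix that ends right after P sorts before every longer one.
        return text[pos] if pos < len(text) else ""

    lo, hi = st, ed + 1
    while lo < hi:                      # first rank whose character is >= c
        mid = (lo + hi) // 2
        if char_at(mid) < c:
            lo = mid + 1
        else:
            hi = mid
    new_st, hi = lo, ed + 1
    while lo < hi:                      # first rank whose character is > c
        mid = (lo + hi) // 2
        if char_at(mid) <= c:
            lo = mid + 1
        else:
            hi = mid
    return (new_st, lo - 1) if new_st < lo else None


TEXT = "GACAGTTCG$"
SA = [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]
print(extend_right(TEXT, SA, 5, 7, 1, "A"))     # from range(T, G) to range(T, GA)
```

Each search inspects O(log n) ranks and compares one character per rank, which matches the bound stated in the lemma.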

Page 24

Lemma 2

Given the interval [st1..ed1] = range(T, P1) and the interval [st2..ed2] = range(T, P2), we can find the interval [st..ed] = range(T, P1P2) in O(log n) time using the suffix array and the inverse suffix array of T.

Page 25

Let [st1..ed1] = range(T, P1),

[st2..ed2] = range(T, P2),

[st..ed] = range(T, P1P2).

Every suffix that starts with P1P2 also starts with P1, so [st..ed] is a subinterval of [st1..ed1].

Page 26

Example:

i:  0 1 2 3 4 5 6 7 8 9
T:  G A C A G T T C G $

i   SA[i]   Suffix
0   9       $ (T9)
1   1       ACAGTTCG$ (T1)
2   3       AGTTCG$ (T3)
3   2       CAGTTCG$ (T2)
4   7       CG$ (T7)
5   8       G$ (T8)
6   0       GACAGTTCG$ (T0)
7   4       GTTCG$ (T4)
8   6       TCG$ (T6)
9   5       TTCG$ (T5)

P1 = G. P2 = A.

range(T, P1) = [5..7].

range(T, P1P2) must lie within [5..7].

How can we find the exact interval within [5..7]?

Page 27

• By the definition of the suffix array, the following suffixes are in increasing lexicographic order:

\[ T_{SA[st_1]},\ T_{SA[st_1+1]},\ \ldots,\ T_{SA[ed_1]} \]

• All of them start with P1, so after removing this common prefix the following suffixes are also in increasing lexicographic order:

\[ T_{SA[st_1]+|P_1|},\ T_{SA[st_1+1]+|P_1|},\ \ldots,\ T_{SA[ed_1]+|P_1|} \]

Page 28

Lexicographic order:

$ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).

T2 = CAGTTCG$

T2+1 = T3 = AGTTCG$

T2+1 is obtained by deleting the prefix of length 1 from T2.

In general, Ti+1 can be obtained by deleting the prefix of length 1 from Ti.

Page 29

Example:

i:  0 1 2 3 4 5 6 7 8 9
T:  G A C A G T T C G $

i   SA[i]   Suffix
0   9       $ (T9)
1   1       ACAGTTCG$ (T1)
2   3       AGTTCG$ (T3)
3   2       CAGTTCG$ (T2)
4   7       CG$ (T7)
5   8       G$ (T8)
6   0       GACAGTTCG$ (T0)
7   4       GTTCG$ (T4)
8   6       TCG$ (T6)
9   5       TTCG$ (T5)

P1 = G. P2 = A.

range(T, P1) = [5..7].

\[ T_{SA[st_1]},\ T_{SA[st_1+1]},\ \ldots,\ T_{SA[ed_1]} \]

T8 < T0 < T4

\[ T_{SA[st_1]+|P_1|},\ T_{SA[st_1+1]+|P_1|},\ \ldots,\ T_{SA[ed_1]+|P_1|} \]

T8+1, T0+1, T4+1, that is,

T9 < T1 < T5

Page 30

• The following suffixes are also in increasing lexicographic order:

\[ T_{SA[st_1]+|P_1|},\ T_{SA[st_1+1]+|P_1|},\ \ldots,\ T_{SA[ed_1]+|P_1|} \]

• Thus

\[ SA^{-1}[SA[st_1]+|P_1|] < SA^{-1}[SA[st_1+1]+|P_1|] < \cdots < SA^{-1}[SA[ed_1]+|P_1|] \]

• To find st and ed, we find the smallest st such that

\[ st_2 \le SA^{-1}[SA[st]+|P_1|] \le ed_2 \]

and the largest ed such that

\[ st_2 \le SA^{-1}[SA[ed]+|P_1|] \le ed_2 \]

Page 31

Example:

i:  0 1 2 3 4 5 6 7 8 9
T:  G A C A G A T C G $

i   SA[i]   Suffix              SA^{-1}[i]
0   9       $ (T9)              7
1   1       ACAGATCG$ (T1)      1
2   3       AGATCG$ (T3)        4
3   5       ATCG$ (T5)          2
4   2       CAGATCG$ (T2)       8
5   7       CG$ (T7)            3
6   8       G$ (T8)             9
7   0       GACAGATCG$ (T0)     5
8   4       GATCG$ (T4)         6
9   6       TCG$ (T6)           0

P1 = G. P2 = A.

range(T, P1) = [6..8], so 6 ≦ st and ed ≦ 8.

range(T, P2) = [1..3].

range(T, P1P2) = [st..ed].

\[ SA^{-1}[SA[6]+1] = SA^{-1}[9] = 0 < 1 \]

\[ SA^{-1}[SA[7]+1] = SA^{-1}[1] = 1, \quad 1 \le 1 \le 3 \]

\[ SA^{-1}[SA[8]+1] = SA^{-1}[5] = 3, \quad 1 \le 3 \le 3 \]

Therefore st = 7 and ed = 8.
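The search described in Lemma 2 can be sketched as two binary searches over [st1..ed1], using SA^{-1}[SA[x] + |P1|] as the sort key. The function names `build` and `concat_range` are illustrative, and the suffix array is built naively here; only the search itself follows the slides.

```python
def build(text: str) -> tuple[list[int], list[int]]:
    """Naive suffix array and inverse suffix array of text (text ends with '$')."""
    sa = sorted(range(len(text)), key=lambda j: text[j:])
    inv = [0] * len(sa)
    for rank, j in enumerate(sa):
        inv[j] = rank
    return sa, inv


def concat_range(sa, inv, r1, len1, r2):
    """range(T, P1P2) from r1 = range(T, P1), len1 = |P1| and r2 = range(T, P2)."""
    (st1, ed1), (st2, ed2) = r1, r2

    def rank_after(x: int) -> int:
        pos = sa[x] + len1                  # suffix left after removing P1
        return inv[pos] if pos < len(sa) else -1

    lo, hi = st1, ed1 + 1
    while lo < hi:                          # smallest x with rank_after(x) >= st2
        mid = (lo + hi) // 2
        if rank_after(mid) < st2:
            lo = mid + 1
        else:
            hi = mid
    st, hi = lo, ed1 + 1
    while lo < hi:                          # first x with rank_after(x) > ed2
        mid = (lo + hi) // 2
        if rank_after(mid) <= ed2:
            lo = mid + 1
        else:
            hi = mid
    return (st, lo - 1) if st < lo else None


SA, INV = build("GACAGATCG$")
result = concat_range(SA, INV, (6, 8), 1, (1, 3))   # P1 = G, P2 = A
print(result)
```

The binary searches are valid because the key is strictly increasing on [st1..ed1], as argued on the preceding pages.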

Page 32

• To find the interval of the first character of P:

We construct an array C such that, for any c in A, C[c] stores the total number of occurrences in T of all characters c’ in A with c’ ≦ c.

Let c = p1 be the first character of P. Then range(T, p1) = [C[c2]+1 … C[c]], where c2 is the character immediately before c in A (if c is the smallest character of A, the interval starts at 1).

Page 33

Example:

i:  0 1 2 3 4 5 6 7 8 9
T:  G A C A G T T C G $

i   SA[i]   Suffix
0   9       $ (T9)
1   1       ACAGTTCG$ (T1)
2   3       AGTTCG$ (T3)
3   2       CAGTTCG$ (T2)
4   7       CG$ (T7)
5   8       G$ (T8)
6   0       GACAGTTCG$ (T0)
7   4       GTTCG$ (T4)
8   6       TCG$ (T6)
9   5       TTCG$ (T5)

P = GACAGCA, so p1 = G.

C[A] = 2

C[C] = 4

C[G] = 7

C[T] = 9

range(T, p1) = [C[C]+1 … C[G]] = [5 … 7].
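The array C and the first-character interval can be computed as in the following sketch. The names `build_c` and `first_char_range` are illustrative; the code assumes, as in the example, that the alphabet excludes '$' and that '$' occupies index 0 of the suffix array.

```python
from collections import Counter


def build_c(text: str, alphabet: str) -> dict[str, int]:
    """C[c] = number of occurrences in text of alphabet characters c' <= c."""
    counts = Counter(text)
    total, c_arr = 0, {}
    for ch in sorted(alphabet):
        total += counts[ch]
        c_arr[ch] = total
    return c_arr


def first_char_range(c_arr: dict[str, int], alphabet: str, ch: str):
    """range(T, ch) as (st, ed), or None when ch does not occur in T."""
    order = sorted(alphabet)
    idx = order.index(ch)
    below = c_arr[order[idx - 1]] if idx > 0 else 0     # C of the previous character
    st, ed = below + 1, c_arr[ch]       # the +1 skips the '$' suffix at index 0
    return (st, ed) if st <= ed else None


C = build_c("GACAGTTCG$", "ACGT")
print(C)
print(first_char_range(C, "ACGT", "G"))
```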

Page 34

• Lemma 3

Given the suffix array and the inverse suffix array of T, assume [st..ed] = range(T, P). For any character c, if the array C is available in advance, we can find the interval [st’..ed’] = range(T, cP) in O(log n) time.

Page 35

I. Construct Fst[1..m+1] and Fed[1..m+1] such that [Fst[i]..Fed[i]] = range(T, P[i..m]).

II. Call kapproximate([0..n], 1, 0, ε, ε).

kapproximate([s’..e’], i, k’, PL’, Υ)
begin
  1. Given [Fst[i]..Fed[i]] = range(T, P[i..m]) and [s’..e’] = range(T, PL’), by Lemma 2 find [st..ed] = range(T, PL’P[i..m]).
  2. Report occurrences of P∗ = PL’P[i..m] in [st..ed] if the interval exists.
  3. If (k’ = k) return.
  4. For j := i to m+1
     (a) (when j ≦ m, deletion at j)
         Call kapproximate([s’..e’], j+1, k’+1, PL’, dΥ).
     (b) (when j ≦ m, replacement at j) for each c in A
         i. Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’c).
         ii. Call kapproximate([s’’..e’’], j+1, k’+1, PL’c, rΥ).
     (c) (insertion at j) for each c in A
         i. Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’c).
         ii. Call kapproximate([s’’..e’’], j, k’+1, PL’c, iΥ).
     (d) (when j ≦ m) Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’P[j]).
         s’ := s’’; e’ := e’’; PL’ := PL’P[j]; Υ := uΥ;
end
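The procedure can be turned into a small runnable sketch. The version below is a simplification under stated assumptions, not the authors' implementation: the suffix array is built naively, the table F is filled by repeated right extension rather than by Lemma 3, the edit transcript Υ is dropped, replacements by the same character are skipped, and indices are 0-based. It assumes k < m and that neither the text nor the alphabet contains '$'. The name `k_difference_starts` is hypothetical, and the function reports start positions of the matched P’ in T.

```python
def k_difference_starts(text: str, pattern: str, k: int, alphabet: str) -> set[int]:
    """0-based start positions in text of occurrences of some P' with at most k edits."""
    t = text + "$"                          # '$' sorts below letters in ASCII
    n, m = len(t), len(pattern)
    sa = sorted(range(n), key=lambda j: t[j:])
    inv = [0] * n
    for rank, j in enumerate(sa):
        inv[j] = rank
    whole = (0, n - 1)

    def narrow(rng, key, lo_val, hi_val):
        """Sub-interval of rng on which the non-decreasing key lies in [lo_val, hi_val]."""
        a, b = rng[0], rng[1] + 1
        while a < b:
            mid = (a + b) // 2
            if key(mid) < lo_val:
                a = mid + 1
            else:
                b = mid
        st, b = a, rng[1] + 1
        while a < b:
            mid = (a + b) // 2
            if key(mid) <= hi_val:
                a = mid + 1
            else:
                b = mid
        return (st, a - 1) if st < a else None

    def extend(rng, depth, c):              # Lemma 1: range(T, PL'c)
        return narrow(rng, lambda x: t[sa[x] + depth] if sa[x] + depth < n else "", c, c)

    def concat(rng, depth, rng2):           # Lemma 2: range(T, PL'P[i..m])
        return narrow(rng, lambda x: inv[sa[x] + depth] if sa[x] + depth < n else -1,
                      rng2[0], rng2[1])

    F = []                                  # F[i] = range(T, pattern[i:]) or None
    for i in range(m + 1):
        rng = whole
        for d, c in enumerate(pattern[i:]):
            rng = extend(rng, d, c)
            if rng is None:
                break
        F.append(rng)

    hits: set[int] = set()

    def walk(rng, depth, i, used):
        # rng = range(T, PL') with |PL'| = depth; pattern[i:] is the untouched suffix
        if F[i] is not None:
            full = concat(rng, depth, F[i])
            if full is not None:
                hits.update(sa[x] for x in range(full[0], full[1] + 1))
        if used == k:
            return
        for j in range(i, m + 1):
            if j < m:
                walk(rng, depth, j + 1, used + 1)           # deletion at j
            for c in alphabet:
                nxt = extend(rng, depth, c)
                if nxt is None:
                    continue                                # PL'c does not occur in T
                if j < m and c != pattern[j]:
                    walk(nxt, depth + 1, j + 1, used + 1)   # replacement at j
                walk(nxt, depth + 1, j, used + 1)           # insertion at j
            if j < m:
                rng = extend(rng, depth, pattern[j])        # no operation at j
                if rng is None:
                    return
                depth += 1

    walk(whole, 0, 0, 0)
    return hits


print(sorted(k_difference_starts("abbaaa", "aba", 1, "ab")))
```

Returning early when an interval becomes empty prunes every edited prefix that does not occur in T, which keeps the enumeration much smaller than the full set of edited patterns.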

Page 36

• After an O(n)-time preprocessing of the text T into an O(n log n)-bit data structure, the algorithm solves the k-difference problem in O(|A|^k m^k log n + output time) time.

Page 37

References

• [1] A. Amir, D. Keselman, G.M. Landau, M. Lewenstein, N. Lewenstein, M. Rodeh, Indexing and dictionary matching with one error, in: Proc. Sixth WADS, Lecture Notes in Computer Science, vol. 1663, Springer, Berlin, 1999, pp. 181–192.

• [2] A. Amir, M. Lewenstein, Ely. Porat, Faster algorithms for string matching with k mismatches, in: Proc. 11th Ann. ACM-SIAM Symp. on Discrete Algorithms, 2000, pp. 794–803.

• [3] R.A. Baeza-Yates, G. Navarro, A faster algorithm for approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM’96), pp. 1–23.

• [4] R.A. Baeza-Yates, G. Navarro, A practical index for text retrieval allowing errors, in: CLEI, vol. 1, November 1997, pp. 273–282.

• [5] R. Boyer, S. Moore, A fast string matching algorithm, CACM 20 (1977) 762–772.

• [6] A.L. Buchsbaum, M.T. Goodrich, J. Westbrook, Range searching over tree cross products, in: ESA 2000, pp. 120–131.

• [7] A. Cobbs, Fast approximate matching using suffix trees, in: Proc. Sixth Ann. Symp. on Combinatorial Pattern Matching (CPM’95), Lecture Notes in Computer Science, vol. 807, Springer, Berlin, 1995, pp. 41–54.

• [8] R. Cole, L.A. Gottlieb, M. Lewenstein, Dictionary matching and indexing with errors and don’t cares, in: Proc. 36th Ann. ACM Symp. on Theory of Computing, 2004, pp. 91–100.

• [9] P. Ferragina, G. Manzini, Opportunistic data structures with applications, in: Proc. 41st IEEE Symp. on Foundations of Computer Science (FOCS’00), 2000, pp. 390–398.

Page 38

• [10] G. Gonnet, A tutorial introduction to computational biochemistry using Darwin, Technical Report, Informatik E.T.H., Zurich, Switzerland, 1992.

• [11] R. Grossi, J.S. Vitter, Compressed suffix arrays and suffix trees with applications to text indexing and string matching, in: Proc. 32nd ACM Symp. on Theory of Computing, 2000, pp. 397–406.

• [12] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, 1997.

• [13] W.K. Hon, K. Sadakane, W.K. Sung, Breaking a time-and-space barrier in constructing full-text indices, in: Proc. IEEE Symp. on Foundations of Computer Science, 2003.

• [14] P. Jokinen, E. Ukkonen, Two algorithms for approximate string matching in static texts, in: Proc. MFCS’91, Lecture Notes in Computer Science, vol. 520, Springer, Berlin, 1991, pp. 240–248.

• [15] D.K. Kim, J.S. Sim, H. Park, K. Park, Linear-time construction of suffix arrays, in: CPM 2003, pp. 186–199.

• [16] D.E. Knuth, J. Morris, V. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 323–350.

• [17] P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays, in: CPM 2003, pp. 200–210.

• [18] G.M. Landau, U. Vishkin, Fast parallel and serial approximate string matching, J. Algorithms 10 (1989) 157–169.

• [19] U. Manber, G. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. Comput. 22 (5) (1993) 935–948.

Page 39

• [20] E.M. McCreight, A space economical suffix tree construction algorithm, J. ACM 23 (2) (1976) 262–272.

• [21] G. Navarro, A guided tour to approximate string matching, ACM Comput. Surveys 33 (1) (2001) 31–88.

• [22] G. Navarro, R.A. Baeza-Yates, A new indexing method for approximate string matching, in: Proc. 10th Ann. Symp. on Combinatorial Pattern Matching (CPM’99), pp. 163–185.

• [23] G. Navarro, R.A. Baeza-Yates, A hybrid indexing method for approximate string matching, J. Discrete Algorithms 1 (1) (2000) 205–239.

• [24] G. Navarro, R. Baeza-Yates, E. Sutinen, J. Tarhio, Indexing methods for approximate string matching, IEEE Data Eng. Bull. 24 (4) (2001) 19–27.

• [25] G. Navarro, E. Sutinen, J. Tanninen, J. Tarhio, Indexing text with approximate q-grams, in: Proc. 11th Ann. Symp. on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 1848, Springer, Berlin, 2000.

• [26] K. Sadakane, T. Shibuya, Indexing huge genome sequences for solving various problems, Genome Informatics 12 (2001) 175–183.

• [27] F. Shi, Fast approximate string matching with q-blocks sequences, in: Proc. Third South American Workshop on String Processing (WSP’96), Carleton University Press, 1996.

• [28] E. Sutinen, J. Tarhio, Filtration with q-samples in approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM’96), pp. 50–63.

• [29] E. Ukkonen, Approximate matching over suffix trees, in: Proc. Combinatorial Pattern Matching 1993, vol. 4, Springer, Berlin, June 1993, pp. 228–242.

• [30] R.A. Wagner, M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1974) 168–173.

Page 40

Thank you!