Advisor: Prof. R. C. T. Lee Speaker: C. W. Lu

1

Approximate String Matching Using Compressed Suffix Arrays

Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249


2

• Let x and y be two strings. The edit distance d(x, y) is the minimum number of character insertions, deletions, and replacements needed to convert string x to y.

• k-difference string matching problem:
– Given a text T of length n, a pattern P of length m, and an error bound k.
– Find all positions i of T such that there exists a suffix S of T(1, i) with d(S, P) ≦ k.
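The two definitions above can be checked directly with the classical dynamic-programming recurrence. The sketch below is only a baseline for small inputs, not the indexed method of the paper; the function names `edit_distance` and `k_difference_positions` are hypothetical.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and replacements turning x into y."""
    prev = list(range(len(y) + 1))              # distances from "" to prefixes of y
    for i, cx in enumerate(x, start=1):
        curr = [i]                              # distance from x[:i] to ""
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                 # delete cx
                            curr[j - 1] + 1,             # insert cy
                            prev[j - 1] + (cx != cy)))   # replace, or match for free
        prev = curr
    return prev[-1]


def k_difference_positions(text: str, pattern: str, k: int) -> list[int]:
    """1-based positions i such that some suffix S of text[:i] has d(S, P) <= k."""
    m = len(pattern)
    col = list(range(m + 1))                    # column for the empty text prefix
    hits = []
    for i, ct in enumerate(text, start=1):
        new = [0]                               # S may start anywhere in the text
        for j in range(1, m + 1):
            new.append(min(col[j] + 1,
                           new[j - 1] + 1,
                           col[j - 1] + (ct != pattern[j - 1])))
        col = new
        if col[m] <= k:
            hits.append(i)
    return hits


print(edit_distance("AAAA", "AGAB"))
print(k_difference_positions("abbaaa", "aba", 1))
```

This scan costs O(nm) time per query, which is the cost an index over T is meant to avoid.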

3

• The approach of this paper is as follows:

• Given a pattern P and an error bound k, we generate all possible strings P’ that can be derived from P with at most k errors.

• Then we conduct an exact match of all such P’s against T.

4

• Example:

T=abbaaa,

P=aba and k=1.

From P and k, we generate the following P’s (over the alphabet {a, b}):

ba, aaba, baba, bba, aa, abba, aaa, ab, abaa, abab, abb, aba.
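As a rough illustration of this example, the sketch below enumerates every string within edit distance k of P by brute force and then tests each one against T with a plain substring check. The alphabet {a, b} is an assumption read off T, and `neighbourhood` is a hypothetical helper name, not something defined in the paper.

```python
def neighbourhood(pattern: str, k: int, alphabet: str) -> set[str]:
    """All strings within edit distance k of pattern, built one edit at a time."""
    seen = {pattern}
    frontier = {pattern}
    for _ in range(k):
        nxt = set()
        for s in frontier:
            for i in range(len(s) + 1):
                for c in alphabet:
                    nxt.add(s[:i] + c + s[i:])            # insert c before position i
                if i < len(s):
                    nxt.add(s[:i] + s[i + 1:])            # delete s[i]
                    for c in alphabet:
                        nxt.add(s[:i] + c + s[i + 1:])    # replace s[i] by c
        frontier = nxt - seen
        seen |= nxt
    return seen


text = "abbaaa"
candidates = neighbourhood("aba", 1, "ab")
matches = sorted(p for p in candidates if p in text)      # exact matching against T
print(sorted(candidates))
print(matches)
```

The candidate set grows quickly with k and |A|, which is why the rest of the talk generates the candidates incrementally and matches them through the suffix array.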

5

• Then we conduct an exact matching of all P’s against T. Any success indicates that there is a substring S in T such that d(S, P) ≦ k.

• How can we generate all the P’s we want?

• We use the following observation.

6

Let S be a substring of T with S = S1S2, and let P = P1P2.

If d(S1, P1) ≦ k and Dist(S2, P2) = 0 (that is, S2 = P2), then d(S, P) ≦ k.

[Figure: S = S1S2 inside T, aligned with P = P1P2]

7

Example:

T = A C A C A A A A A C A C C (positions 1 to 13)

P = A G A B C A (positions 1 to 6)

k = 2

Consider the substring S = T(6, 11) = AAAACA.

Let S1 = T(6, 9) = AAAA and S2 = T(10, 11) = CA, with P1 = P(1, 4) = AGAB and P2 = P(5, 6) = CA.

Dist(S1, P1) = 2 ≦ k, and Dist(S2, P2) = 0.

We have Dist(S, P) = 2 ≦ k.

8

Example:

T = A C A C A A A A A C A C C (positions 1 to 13)

P = A G A B C A (positions 1 to 6)

k = 2

Consider the substring S = T(8, 11) = AACA.

Let S1 = T(8, 9) = AA and S2 = T(10, 11) = CA, with P1 = P(1, 4) = AGAB and P2 = P(5, 6) = CA.

Dist(S1, P1) = 2 ≦ k, and Dist(S2, P2) = 0.

We have Dist(S, P) = 2 ≦ k.

9

• Based upon the above observation, we can generate every edited pattern P’ by editing a prefix of P and keeping the remaining suffix untouched.

• Consider P=aba, k=1.

10

• P=aba, k=1.

P = aba

i = 1:
    ba    (Deletion)       k = 1
    aaba  (Insertion)      k = 1
    baba  (Insertion)      k = 1
    bba   (Substitution)   k = 1
    aba   (No operation)   k = 0

i = 2:
    aa    (Deletion)       k = 1
    aaba  (Insertion)      k = 1
    abba  (Insertion)      k = 1
    aaa   (Substitution)   k = 1
    aba   (No operation)   k = 0

i = 3:
    ab    (Deletion)       k = 1
    abaa  (Insertion)      k = 1
    abba  (Insertion)      k = 1
    abb   (Substitution)   k = 1
    aba   (No operation)   k = 0

i = 4:
    abaa  (Insertion)      k = 1
    abab  (Insertion)      k = 1

11

• P=aba, k=2.

P = aba

i = 1:
    ba    (Deletion)       k = 1
    aaba  (Insertion)      k = 1
    baba  (Insertion)      k = 1
    bba   (Substitution)   k = 1
    aba   (No operation)   k = 0

i = 2:
    aa    (Deletion)       k = 1
    aaba  (Insertion)      k = 1
    abba  (Insertion)      k = 1
    aaa   (Substitution)   k = 1
    aba   (No operation)   k = 0

i = 3:
    ab    (Deletion)       k = 1
    abaa  (Insertion)      k = 1
    abba  (Insertion)      k = 1
    abb   (Substitution)   k = 1
    aba   (No operation)   k = 0

i = 4:
    abaa  (Insertion)      k = 1
    abab  (Insertion)      k = 1

12

• P=aba, k=2.

Expanding ba (k = 1):

i = 2:
    a     (Deletion)       k = 2
    aba   (Insertion)      k = 2
    bba   (Insertion)      k = 2
    aa    (Substitution)   k = 2
    ba    (No operation)   k = 1

i = 3:
    b     (Deletion)       k = 2
    baa   (Insertion)      k = 2
    bba   (Insertion)      k = 2
    bb    (Substitution)   k = 2
    ba    (No operation)   k = 1

i = 4:
    baa   (Insertion)      k = 2
    bab   (Insertion)      k = 2

13

For i = 1 to m+1:

P = PL PR and P’ = PL’ PR’, both split at position i, with

k’ = Dist(PL’, PL) ≦ k.

Dist(PR’, PR) = 0

At position i, one of the following is applied:

– Deletion, k’++

– Replacement, k’++

– Insertion, k’++

– No operation.

Terminate if k’ > k.
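The left-to-right procedure above can be sketched as a small recursive generator. It keeps an edited prefix PL’ and always appends the untouched suffix of P; the name `edited_patterns`, the 0-based indexing, and skipping replacements that leave the character unchanged are choices made for this illustration only.

```python
def edited_patterns(pattern: str, k: int, alphabet: str) -> set[str]:
    """Strings PL' + PR obtained by editing a prefix of pattern with at most k edits."""
    m = len(pattern)
    out = set()

    def expand(left: str, i: int, used: int) -> None:
        # left is the edited prefix PL'; pattern[i:] is the untouched suffix PR.
        out.add(left + pattern[i:])
        if used == k:                                   # error budget exhausted
            return
        for j in range(i, m + 1):
            if j < m:
                expand(left, j + 1, used + 1)           # deletion of pattern[j]
                for c in alphabet:
                    if c != pattern[j]:
                        expand(left + c, j + 1, used + 1)   # replacement by c
            for c in alphabet:
                expand(left + c, j, used + 1)           # insertion of c before pattern[j]
            if j < m:
                left += pattern[j]                      # no operation: keep pattern[j]

    expand("", 0, 0)
    return out


print(sorted(edited_patterns("aba", 1, "ab")))
```

For P = aba and k = 1 this reproduces the strings listed in the i = 1 to i = 4 breakdown; with k = 2 it also reaches second-level strings such as a, bb and bab.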

14

• Our problem now becomes the following: Given a pattern P, we produce a modified pattern P’. Our job is to determine whether P’ exactly matches some substring of T or not.

• For example, suppose P=aba. We have ba as one of the modified patterns, so we would like to find out whether ba matches exactly with a substring of T.

15

• This exact matching can be done using the suffix array and the inverse suffix array.

16

Suffix Array

• Let T = t0 t1 … tn-1 tn, where t0, t1, …, tn-1 are symbols of an alphabet A and tn = $ is a special symbol that is not in A and is smaller than any symbol in A.

• The jth suffix of T is defined as T(j, n) = tj…tn and is denoted by Tj.

• The suffix array SA[0..n] of T is an array of the integers j, each representing the suffix Tj, sorted in the lexicographic order of the corresponding suffixes.

17

Example:

T = G A C A G T T C G $ (positions 0 to 9)

Suffixes of T:

{GACAGTTCG$, ACAGTTCG$, CAGTTCG$, AGTTCG$, GTTCG$, TTCG$, TCG$, CG$, G$, $}

Lexicographic order:

$, ACAGTTCG$, AGTTCG$, CAGTTCG$, CG$, G$, GACAGTTCG$, GTTCG$, TCG$, TTCG$

= T9, T1, T3, T2, T7, T8, T0, T4, T6, T5

i     : 0 1 2 3 4 5 6 7 8 9
SA[i] : 9 1 3 2 7 8 0 4 6 5
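The example can be reproduced with a naive construction that simply sorts the suffix start positions. This is only a sketch for small inputs (sorting whole suffixes is far from linear time); it relies on `$` comparing smaller than the letters A to Z in Python's string ordering, and `suffix_array` is a hypothetical name.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda j: text[j:])


T = "GACAGTTCG$"
SA = suffix_array(T)
for i, j in enumerate(SA):
    print(i, j, T[j:])          # rank, start position, suffix
```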

18

Inverse Suffix Array

• The inverse suffix array of T is denoted as SA-1[i].

• SA-1[i] equals the number of suffixes that are lexicographically smaller than Ti.

19

Example: T = G A C A G T T C G $

i   SA[i]   suffix T_{SA[i]}   SA-1[i]
0   9       $                  6
1   1       ACAGTTCG$          1
2   3       AGTTCG$            3
3   2       CAGTTCG$           2
4   7       CG$                7
5   8       G$                 9
6   0       GACAGTTCG$         8
7   4       GTTCG$             4
8   6       TCG$               5
9   5       TTCG$              0

SA-1[SA[x]] = x.

SA-1[0] = 6 because there are 6 suffixes smaller than T0 = GACAGTTCG$.
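Inverting the suffix array takes a single pass, as the following sketch suggests. The SA values are copied from the example, and `inverse_suffix_array` is a hypothetical name.

```python
def inverse_suffix_array(sa: list[int]) -> list[int]:
    """isa[j] = rank of suffix T_j, i.e. the number of suffixes smaller than T_j."""
    isa = [0] * len(sa)
    for rank, start in enumerate(sa):
        isa[start] = rank
    return isa


SA = [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]      # suffix array of GACAGTTCG$
ISA = inverse_suffix_array(SA)
print(ISA)
```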

20

• SA and SA-1 each occupy O(n log n) bits. Both data structures can be constructed in linear time [13, 15, 17].

21

• In this paper, an interval [st..ed] is called the range of the suffix array of T corresponding to a string P if [st..ed] is the largest interval such that P is a prefix of every suffix Tj for j = SA[st], SA[st+1], …, SA[ed].

We write [st..ed ] = range(T, P).

22

Example: T = G A C A G T T C G $

i   SA[i]   suffix T_{SA[i]}
0   9       $
1   1       ACAGTTCG$
2   3       AGTTCG$
3   2       CAGTTCG$
4   7       CG$
5   8       G$
6   0       GACAGTTCG$
7   4       GTTCG$
8   6       TCG$
9   5       TTCG$

P = G.

G is a prefix of T8, T0 and T4.

T8 = T_{SA[5]}, T0 = T_{SA[6]}, T4 = T_{SA[7]}

st = 5, ed = 7,

range(T, P) = [5..7].
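One simple way to obtain range(T, P) is a binary search over the sorted suffixes, comparing only their first |P| characters. The sketch below materialises those prefixes in a helper list purely for readability; `sa_range` is a hypothetical name, and the SA values are those of the example.

```python
from bisect import bisect_left, bisect_right


def sa_range(text: str, sa: list[int], pattern: str):
    """Return (st, ed) = range(T, P), or None if P does not occur in T."""
    m = len(pattern)
    # Truncating sorted suffixes to m characters keeps them in non-decreasing order.
    prefixes = [text[j:j + m] for j in sa]
    st = bisect_left(prefixes, pattern)
    ed = bisect_right(prefixes, pattern) - 1
    return (st, ed) if st <= ed else None


T = "GACAGTTCG$"
SA = [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]
print(sa_range(T, SA, "G"))
print(sa_range(T, SA, "GA"))
```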

23

Lemma 1 (Gusfield [12])

Given a text T together with its suffix array, assume [st..ed] = range(T, P). Then, for any character c, the interval [st’..ed’] = range(T, Pc) can be computed in O(log n) time.

24

Lemma 2

Given the interval [st1..ed1] = range(T , P1) and the interval [st2..ed2] = range(T , P2), we can find the interval [st..ed] = range(T , P1P2) in O(logn) time using the suffix array and the inverse suffix array of T.

25

Let [st1..ed1] = range(T , P1),

[st2..ed2] = range(T , P2),

[st..ed] = range(T , P1P2).

[st..ed] is a subinterval of [st1..ed1].

26

Example: T = G A C A G T T C G $

i     : 0 1 2 3 4 5 6 7 8 9
SA[i] : 9 1 3 2 7 8 0 4 6 5

Sorted suffixes: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5)

P1 = G. P2 = A.

range(T, P1) = [5..7].

range(T, P1P2) must be within [5..7].

How can we find the exact interval within [5..7]?

27

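• By the definition of the suffix array, the suffixes T_{SA[st1]}, T_{SA[st1+1]}, …, T_{SA[ed1]} are in increasing lexicographic order.

• The suffixes T_{SA[st1]+|P1|}, T_{SA[st1+1]+|P1|}, …, T_{SA[ed1]+|P1|} are also in increasing lexicographic order, since every suffix in [st1..ed1] begins with P1 and removing this common prefix preserves the order.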

28



T2 = CAGTTCG$

T2+1 = T3 = AGTTCG$

T2+1 is obtained by deleting the prefix of length 1 from T2.

In general, Ti+1 can be obtained by deleting the prefix of length 1 from Ti.

29

Example: T = G A C A G T T C G $

i     : 0 1 2 3 4 5 6 7 8 9
SA[i] : 9 1 3 2 7 8 0 4 6 5

Sorted suffixes: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5)

P1 = G. P2 = A.

range(T, P1) = [5..7].

T_{SA[st1]}, T_{SA[st1+1]}, …, T_{SA[ed1]}:

T8 < T0 < T4

Deleting the prefix P1 gives T8+1, T0+1, T4+1:

T9 < T1 < T5

30

• The suffixes T_{SA[st1]+|P1|}, T_{SA[st1+1]+|P1|}, …, T_{SA[ed1]+|P1|} are also in increasing lexicographic order.

• Thus SA-1[SA[st1]+|P1|] < SA-1[SA[st1+1]+|P1|] < … < SA-1[SA[ed1]+|P1|].

• To find st and ed, we find the smallest st such that st2 ≦ SA-1[SA[st]+|P1|] ≦ ed2 and the largest ed such that st2 ≦ SA-1[SA[ed]+|P1|] ≦ ed2.

31

Example: T = G A C A G A T C G $

i   SA[i]   suffix T_{SA[i]}   SA-1[i]
0   9       $                  7
1   1       ACAGATCG$          1
2   3       AGATCG$            4
3   5       ATCG$              2
4   2       CAGATCG$           8
5   7       CG$                3
6   8       G$                 9
7   0       GACAGATCG$         5
8   4       GATCG$             6
9   6       TCG$               0

P1 = G. P2 = A.

range(T, P1) = [6..8], so 6 ≦ st and ed ≦ 8.

range(T, P2) = [1..3].

range(T, P1P2) = [st..ed]:

SA-1[SA[6]+1] = SA-1[9] = 0 < 1, so index 6 is excluded.

SA-1[SA[7]+1] = SA-1[1] = 1, and 1 ≦ 1 ≦ 3.

SA-1[SA[8]+1] = SA-1[5] = 3, and 1 ≦ 3 ≦ 3.

st = 7 and ed = 8.
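The computation in this example can be sketched as two binary searches over [st1..ed1], using the fact that SA-1[SA[i]+|P1|] increases with i inside that interval. The function name `concat_range` and the handling of a suffix that ends right after P1 are assumptions of this sketch; the SA and SA-1 values are copied from the example.

```python
def concat_range(sa, isa, r1, len1, r2):
    """range(T, P1P2) from r1 = range(T, P1), len1 = |P1| and r2 = range(T, P2)."""
    (st1, ed1), (st2, ed2) = r1, r2
    n = len(sa)

    def rank_after(i):
        pos = sa[i] + len1                    # start of the suffix left after removing P1
        return isa[pos] if pos < n else -1    # -1 if nothing is left after P1

    lo, hi = st1, ed1 + 1
    while lo < hi:                            # smallest i with rank_after(i) >= st2
        mid = (lo + hi) // 2
        if rank_after(mid) < st2:
            lo = mid + 1
        else:
            hi = mid
    st, hi = lo, ed1 + 1
    while lo < hi:                            # smallest i with rank_after(i) > ed2
        mid = (lo + hi) // 2
        if rank_after(mid) <= ed2:
            lo = mid + 1
        else:
            hi = mid
    ed = lo - 1
    return (st, ed) if st <= ed else None


SA = [9, 1, 3, 5, 2, 7, 8, 0, 4, 6]           # suffix array of GACAGATCG$
ISA = [7, 1, 4, 2, 8, 3, 9, 5, 6, 0]          # inverse suffix array
print(concat_range(SA, ISA, (6, 8), 1, (1, 3)))   # P1 = G, P2 = A
```

Each search halves the interval, so the cost is O(log n) array lookups, in line with Lemma 2.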

32

• To find the interval of the first character of P:

We construct an array C such that, for any c in A, C[c] stores the total number of occurrences in T of all characters c’ in A with c’ ≦ c.

For the first character c = p1 of P, range(T, p1) = [C[c2]+1 … C[c]], where c2 is the character immediately before c in A (if c is the smallest character of A, the interval starts at 1).

33

Example: T = G A C A G T T C G $

i     : 0 1 2 3 4 5 6 7 8 9
SA[i] : 9 1 3 2 7 8 0 4 6 5

Sorted suffixes: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5)

P = GACAGCA

C[A] = 2

C[C] = 4

C[G] = 7

C[T] = 9

range(T, p1)

= [C[C]+1…C[G]]

= [5…7].
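The array C and the resulting interval for a single character can be sketched as follows. The alphabet "ACGT" and the helper names are assumptions for illustration; the terminator `$` is not counted, which is why the interval for the smallest character starts at index 1.

```python
def cumulative_counts(text: str, alphabet: str) -> dict[str, int]:
    """C[c] = number of characters c' <= c in text, for c and c' in the alphabet."""
    counts, total = {}, 0
    for c in sorted(alphabet):
        total += text.count(c)
        counts[c] = total
    return counts


def first_char_range(counts: dict[str, int], alphabet: str, c: str):
    """range(T, c) = [C[c2]+1 .. C[c]], with c2 the character just before c."""
    order = sorted(alphabet)
    pos = order.index(c)
    st = counts[order[pos - 1]] + 1 if pos > 0 else 1   # index 0 holds the suffix "$"
    ed = counts[c]
    return (st, ed) if st <= ed else None


T = "GACAGTTCG$"
C = cumulative_counts(T, "ACGT")
print(C)
print(first_char_range(C, "ACGT", "G"))
```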

34

• Lemma 3

Given the suffix array and the inverse suffix array of T, assume [st..ed] = range(T, P). For any character c, if the array C has been computed in advance, we can find the interval [st’..ed’] = range(T, cP) in O(log n) time.

35

I.  Construct Fst[1..m+1] and Fed[1..m+1] such that [Fst[i]..Fed[i]] = range(T, P[i..m]).
II. Call kapproximate([0..n], 1, 0, ε, ε).

kapproximate([s’..e’], i, k’, PL’, Υ)
begin
  1. Given [Fst[i]..Fed[i]] = range(T, P[i..m]) and [s’..e’] = range(T, PL’), by Lemma 2 find [st..ed] = range(T, PL’P[i..m]).
  2. Report occurrences of P∗ = PL’P[i..m] in [st..ed] if the interval exists.
  3. If (k’ = k) return.
  4. For j := i to m+1
     (a) (when j ≦ m, deletion at j)
         Call kapproximate([s’..e’], j+1, k’+1, PL’, dΥ).
     (b) (when j ≦ m, replacement at j) for each c in A
         i.  Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’c).
         ii. Call kapproximate([s’’..e’’], j+1, k’+1, PL’c, rΥ).
     (c) (insertion at j) for each c in A
         i.  Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’c).
         ii. Call kapproximate([s’’..e’’], j, k’+1, PL’c, iΥ).
     (d) (when j ≦ m) Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’P[j]).
         s’ := s’’; e’ := e’’; PL’ := PL’P[j]; Υ := uΥ;
end
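A compact Python sketch of this procedure is given below, under several simplifying assumptions: the suffix array is built naively rather than as a compressed suffix array, the edit script Υ is dropped, indices are 0-based, branches whose interval is empty are pruned, replacements that keep the same character are skipped, and k < m is assumed. All function names are hypothetical. It reports the 1-based end positions required by the k-difference problem.

```python
def build(text):
    sa = sorted(range(len(text)), key=lambda j: text[j:])    # naive suffix array
    isa = [0] * len(sa)
    for rank, j in enumerate(sa):
        isa[j] = rank
    return sa, isa


def narrow(st, ed, key, lo_val, hi_val):
    """Sub-interval of [st..ed] whose non-decreasing key values lie in [lo_val..hi_val]."""
    a, b = st, ed + 1
    while a < b:
        mid = (a + b) // 2
        if key(mid) < lo_val:
            a = mid + 1
        else:
            b = mid
    new_st, b = a, ed + 1
    while a < b:
        mid = (a + b) // 2
        if key(mid) <= hi_val:
            a = mid + 1
        else:
            b = mid
    return (new_st, a - 1) if new_st <= a - 1 else None


def k_difference(text, pattern, k, alphabet):
    t = text + "$"
    n, m = len(t), len(pattern)
    sa, isa = build(t)
    full = (0, n - 1)

    def add_char(rng, plen, c):                # Lemma 1: range(T, Pc) from range(T, P)
        return narrow(*rng, lambda r: t[sa[r] + plen] if sa[r] + plen < n else "", c, c)

    def concat(r1, len1, r2):                  # Lemma 2: range(T, P1P2)
        return narrow(*r1, lambda r: isa[sa[r] + len1] if sa[r] + len1 < n else -1, *r2)

    F = [None] * m + [full]                    # F[i] = range(T, pattern[i:])
    for i in range(m - 1, -1, -1):
        first = add_char(full, 0, pattern[i])
        F[i] = concat(first, 1, F[i + 1]) if first and F[i + 1] else None

    ends = set()

    def kapproximate(rng, i, used, left_len):
        hit = concat(rng, left_len, F[i]) if F[i] else None
        if hit:                                # report occurrences of PL' + pattern[i:]
            plen = left_len + (m - i)
            ends.update(sa[r] + plen for r in range(hit[0], hit[1] + 1))
        if used == k:
            return
        for j in range(i, m + 1):
            if j < m:
                kapproximate(rng, j + 1, used + 1, left_len)               # deletion
            for c in alphabet:
                nxt = add_char(rng, left_len, c)
                if nxt:
                    if j < m and c != pattern[j]:
                        kapproximate(nxt, j + 1, used + 1, left_len + 1)   # replacement
                    kapproximate(nxt, j, used + 1, left_len + 1)           # insertion
            if j == m:
                break
            rng = add_char(rng, left_len, pattern[j])                      # no operation
            if rng is None:
                return
            left_len += 1

    kapproximate(full, 0, 0, 0)
    return sorted(ends)


print(k_difference("abbaaa", "aba", 1, "ab"))
```

Here the arrays Fst and Fed are merged into one list F of interval pairs, built right to left by prepending one character at a time, which is the situation covered by Lemma 3.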

36

• After O(n)-time preprocessing of the text T into an O(n log n)-bit data structure, the algorithm solves the k-difference problem in O(|A|^k m^k log n + output time) time.

37

References

• [1] A. Amir, D. Keselman, G.M. Landau, M. Lewenstein, N. Lewenstein, M. Rodeh, Indexing and dictionary matching with one error, in: Proc. Sixth WADS, Lecture Notes in Computer Science, vol. 1663, Springer, Berlin, 1999, pp. 181–192.
• [2] A. Amir, M. Lewenstein, Ely. Porat, Faster algorithms for string matching with k mismatches, in: Proc. 11th Ann. ACM-SIAM Symp. on Discrete Algorithms, 2000, pp. 794–803.
• [3] R.A. Baeza-Yates, G. Navarro, A faster algorithm for approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM’96), pp. 1–23.
• [4] R.A. Baeza-Yates, G. Navarro, A practical index for text retrieval allowing errors, in: CLEI, vol. 1, November 1997, pp. 273–282.
• [5] R. Boyer, S. Moore, A fast string matching algorithm, CACM 20 (1977) 762–772.
• [6] A.L. Buchsbaum, M.T. Goodrich, J. Westbrook, Range searching over tree cross products, in: ESA 2000, pp. 120–131.
• [7] A. Cobbs, Fast approximate matching using suffix trees, in: Proc. Sixth Ann. Symp. on Combinatorial Pattern Matching (CPM’95), Lecture Notes in Computer Science, vol. 807, Springer, Berlin, 1995, pp. 41–54.
• [8] R. Cole, L.A. Gottlieb, M. Lewenstein, Dictionary matching and indexing with errors and don’t cares, in: Proc. 36th Ann. ACM Symp. on Theory of Computing, 2004, pp. 91–100.
• [9] P. Ferragina, G. Manzini, Opportunistic data structures with applications, in: Proc. 41st IEEE Symp. on Foundations of Computer Science (FOCS’00), 2000, pp. 390–398.

38

• [10] G. Gonnet, A tutorial introduction to computational biochemistry using Darwin, Technical Report, Informatik E.T.H., Zurich, Switzerland, 1992.
• [11] R. Grossi, J.S. Vitter, Compressed suffix arrays and suffix trees with applications to text indexing and string matching, in: Proc. 32nd ACM Symp. on Theory of Computing, 2000, pp. 397–406.
• [12] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, 1997.
• [13] W.K. Hon, K. Sadakane, W.K. Sung, Breaking a time-and-space barrier in constructing full-text indices, in: Proc. IEEE Symp. on Foundations of Computer Science, 2003.
• [14] P. Jokinen, E. Ukkonen, Two algorithms for approximate string matching in static texts, in: Proc. MFCS’91, Lecture Notes in Computer Science, vol. 520, Springer, Berlin, 1991, pp. 240–248.
• [15] D.K. Kim, J.S. Sim, H. Park, K. Park, Linear-time construction of suffix arrays, in: CPM 2003, pp. 186–199.
• [16] D.E. Knuth, J. Morris, V. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 323–350.
• [17] P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays, in: CPM 2003, pp. 200–210.
• [18] G.M. Landau, U. Vishkin, Fast parallel and serial approximate string matching, J. Algorithms 10 (1989) 157–169.
• [19] U. Manber, G. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. Comput. 22 (5) (1993) 935–948.

39

• [20] E.M. McCreight, A space economical suffix tree construction algorithm, J. ACM 23 (2) (1976) 262–272.
• [21] G. Navarro, A guided tour to approximate string matching, ACM Comput. Surveys 33 (1) (2001) 31–88.
• [22] G. Navarro, R.A. Baeza-Yates, A new indexing method for approximate string matching, in: Proc. 10th Ann. Symp. on Combinatorial Pattern Matching (CPM’99), pp. 163–185.
• [23] G. Navarro, R.A. Baeza-Yates, A hybrid indexing method for approximate string matching, J. Discrete Algorithms 1 (1) (2000) 205–239.
• [24] G. Navarro, R. Baeza-Yates, E. Sutinen, J. Tarhio, Indexing methods for approximate string matching, IEEE Data Eng. Bull. 24 (4) (2001) 19–27.
• [25] G. Navarro, E. Sutinen, J. Tanninen, J. Tarhio, Indexing text with approximate q-grams, in: Proc. 11th Ann. Symp. on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 1848, Springer, Berlin, 2000.
• [26] K. Sadakane, T. Shibuya, Indexing huge genome sequences for solving various problems, Genome Informatics 12 (2001) 175–183.
• [27] F. Shi, Fast approximate string matching with q-blocks sequences, in: Proc. Third South American Workshop on String Processing (WSP’96), Carleton University Press, 1996.
• [28] E. Sutinen, J. Tarhio, Filtration with q-samples in approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM’96), pp. 50–63.
• [29] E. Ukkonen, Approximate matching over suffix trees, in: Proc. Combinatorial Pattern Matching 1993, vol. 4, Springer, Berlin, June 1993, pp. 228–242.
• [30] R.A. Wagner, M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1974) 168–173.

40

Thank you!
