String Matching Algorithms Based upon the Uniqueness Property

Advisor: Prof. R. C. T. Lee

Speaker: C. W. Lu

C. W. Lu and R. C. T. Lee, 2007, String Matching Algorithms Based upon the Uniqueness Property, The 24th Workshop on Combinatorial Mathematics and Computation Theory, pp.385-392.

• String matching problem

– Given a text string T of length n and a pattern string P of length m.

– Find all occurrences of P in T.

Rule 1: The Suffix to Prefix Rule

• Suppose u is the longest suffix of the window that is also a prefix of P. We can then move P so that the prefix u of P is aligned with the suffix u of the window.

[Figure: (a) the window of T ending in u, with P below it; (b) P moved so that its prefix u lies under the suffix u of the window.]
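Rule 1 can be sketched in a few lines of Python. This is only an illustration: the function name `suffix_prefix_shift` is hypothetical, the window is assumed to have the same length as P, and only proper prefixes of P are tried so that the move is at least one step.

```python
def suffix_prefix_shift(window: str, pattern: str) -> int:
    """Return how far to move P so that its prefix u lines up with the
    longest suffix u of the window that is also a (proper) prefix of P."""
    m = len(pattern)
    # Try the longest candidate first; k < m keeps the move positive.
    for k in range(min(len(window), m) - 1, 0, -1):
        if window.endswith(pattern[:k]):
            return m - k
    # No non-empty suffix of the window is a prefix of P: skip the window.
    return m


# Made-up data: the suffix "ab" of the window is a prefix of P.
print(suffix_prefix_shift("xxxab", "abcab"))  # 3
```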

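The Uniqueness Property of a String

• For any substring V of P, if V occurs in P only once, V is a unique substring.

• When V matches some substring of T, we can move P so that a prefix u of P is aligned with a suffix u of V. No smaller move can succeed, because it would require a second occurrence of V in P to lie under the matched V in T.

[Figure: T with the matched substring V ending in u; P before the move, and P after the move with its prefix u under the suffix u of V.]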

Example

P = c a t a g t a g c c t (positions 0 to 10)

T = a c g c c g c g c c c g c g c t c a a a (positions 0 to 19)

Suppose we use the substring “cc” as the unique substring.

[Figure: P under T before and after the move.]
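The move for this example can be checked with a small sketch. It assumes the reading of the uniqueness property given above: after the unique substring V matches, P moves until its prefix lines up with the longest proper suffix of V. The helper names are hypothetical.

```python
def count_occurrences(text: str, sub: str) -> int:
    """Count occurrences of sub in text, overlapping ones included."""
    return sum(text.startswith(sub, i) for i in range(len(text) - len(sub) + 1))


def shift_after_unique_match(pattern: str, start: int, size: int) -> int:
    """Move for a matched unique substring V = pattern[start:start + size]."""
    v = pattern[start:start + size]
    if count_occurrences(pattern, v) != 1:
        raise ValueError("V must occur exactly once in P")
    end = start + size
    # Longest proper suffix u of V that is also a prefix of P.
    for k in range(size - 1, 0, -1):
        if pattern.startswith(v[size - k:]):
            return end - k
    # No such u: P can move completely past V.
    return end


P = "catagtagcct"
print(shift_after_unique_match(P, P.index("cc"), 2))  # 9: prefix "c" lands under the last c of "cc"
```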

Algorithm 1 - The Longest Prefix with Unique Suffix Matching Algorithm

• We further relax the uniqueness requirement: the substring does not have to be unique in the entire pattern P. A substring that is unique within a prefix of P suffices, because moving P to the right only brings earlier positions of P under the matched substring.

• Therefore, we only have to find the longest prefix of P whose suffix is unique within that prefix.

Example

P = CACTAGCCACTCTC

The substring TC occurs twice in P, but it is unique in the prefix CACTAGCCACTC.

T : CTAGCGTATGCCAGTCACGATCGAGCAGGCTAC…

P : CACTAGCCACTCTC

Move P 11 steps.

[Figure: P under T before and after the move.]

Example

P = CACTAGCCACTCTC

The substring G is also unique in the prefix CACTAG.

T : CTAGCGTATGCCAGTCACGATCGAGCAGGCTAC…

P : CACTAGCCACTCTC

Move P 6 steps.

[Figure: P under T before and after the move.]

In the above example, using the unique substring TC, we can move P 11 steps whenever TC matches TC in T; using the unique substring G, we can move P 6 steps whenever G matches G in T.

P = CACTAGCCACTCTC

Is the unique substring TC better than the unique substring G?

• Note that the more often the unique substring appears in T, the more efficient our algorithm is.

• In general, assuming an alphabet of size 4 with equally likely characters, the probability that TC in P matches the two aligned characters of T is 1/16, and the probability that G in P matches the aligned character of T is 1/4.

• Thus, the size of the unique substring is also important.

• If the substring TC in P matches TC in T once and moves P by 11 steps, the substring G in P may match G in T four times and move P by 6 steps each time. So we expect the substring G to be better than the substring TC in general.

P = CACTAGCCACTCTC

• We now define a ratio σ to determine which substring is better.

• Let Σ be the alphabet. Then

$$\sigma = \frac{\text{steps of moving } P}{|\Sigma|^{\text{size of substring}}}$$

• The larger σ is, the better the efficiency that can be achieved in the searching phase.
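As a quick check, the ratio can be evaluated for the two candidates from the earlier example (TC moves P 11 steps, G moves P 6 steps, |Σ| = 4). The sketch reads the ratio as the steps of moving P divided by |Σ| raised to the size of the substring; the function name `sigma` is hypothetical.

```python
from fractions import Fraction


def sigma(steps: int, alphabet_size: int, substring_size: int) -> Fraction:
    """Steps of moving P divided by |alphabet| ** (size of the substring)."""
    return Fraction(steps, alphabet_size ** substring_size)


print(sigma(11, 4, 2))  # TC: 11/16
print(sigma(6, 4, 1))   # G:  3/2
```

Under this reading G has the larger ratio, which agrees with the expectation that G is the better choice.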

Preprocessing Phase

P = CAGACGACCCCAACAGC

Σ = {A, C, G, T}, |Σ| = 4.

Find the longest prefix with a unique suffix of size one.

T = A C G C C G C G C C C G C G C T C A A A … (positions 0 to 19)

P = C A G A C G A C C C C A A C A G C (positions 0 to 16)

[Figure: P under T before and after the move.]

$$\sigma = \frac{\text{steps of moving}}{|\Sigma|^{\text{size of substring}}} = \frac{3}{4^{1}} = \frac{3}{4}$$

Preprocessing Phase

• We have found the unique substring with size 1, and we can use it to move P 3 steps, so σ = 3/4.

• Next, we try to find a unique substring with size 2 that gives a larger σ, that is, one that moves P more than 3*4 = 12 steps.

• Thus, we only consider the substrings of p12p13…p16.

T = A C G C C G C G C C C G C G C G C A A A … (positions 0 to 19)

P = C A G A C G A C C C C A A C A G C (positions 0 to 16)

[Figure: P under T before and after the move.]

$$\sigma = \frac{\text{steps of moving}}{|\Sigma|^{\text{size of substring}}} = \frac{16}{4^{2}} = 1$$
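The preprocessing above can be sketched as follows. This is an illustrative reading rather than the authors' code: for a given substring size, it scans prefixes of P from longest to shortest, stops at the first one whose suffix occurs only once in that prefix, and reports the prefix length together with the resulting move. All function names are hypothetical.

```python
def count_occurrences(text: str, sub: str) -> int:
    """Count occurrences of sub in text, overlapping ones included."""
    return sum(text.startswith(sub, i) for i in range(len(text) - len(sub) + 1))


def longest_prefix_with_unique_suffix(pattern: str, size: int):
    """Return (prefix_length, steps) for the longest prefix of P whose
    suffix of the given size occurs once in that prefix, or None."""
    for end in range(len(pattern), size - 1, -1):
        v = pattern[end - size:end]
        if count_occurrences(pattern[:end], v) != 1:
            continue
        # Longest proper suffix of V that is also a prefix of P.
        overlap = 0
        for k in range(size - 1, 0, -1):
            if pattern.startswith(v[size - k:]):
                overlap = k
                break
        return end, end - overlap
    return None


P = "CAGACGACCCCAACAGC"
print(longest_prefix_with_unique_suffix(P, 1))  # (3, 3):   prefix CAG, suffix G
print(longest_prefix_with_unique_suffix(P, 2))  # (17, 16): whole P, suffix GC
```

With |Σ| = 4, the two candidates give σ = 3/4 and σ = 16/16 = 1, so the size-2 substring GC would be preferred.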

Searching Phase

T = … C G C C G C G C C C G C G C G C A A A …

P = C A G A C G A C C C C A A C A G C (positions 0 to 16)

If the unique substring mismatches, move P one step.

[Figure: P under T before and after moving 1 step.]

Searching Phase

T = … C G C C G C G C C C G C G C G C A A A …

P = C A G A C G A C C C C A A C A G C (positions 0 to 16)

If the unique substring GC matches GC in T, move P 16 steps.

[Figure: P under T before and after moving 16 steps.]
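A minimal sketch of this searching phase is given below. It assumes that the whole pattern is verified whenever the unique substring matches, a step the slides do not spell out. The function name and parameters are hypothetical; `end`, `size` and `steps` are the values produced by the preprocessing phase (17, 2 and 16 for this pattern).

```python
def search_with_unique_substring(text, pattern, end, size, steps):
    """Report all start positions of pattern in text.

    The unique substring is pattern[end - size:end]; `steps` is the move
    used after it matches. A mismatch moves P by one step.
    """
    v = pattern[end - size:end]
    matches = []
    s = 0
    while s + len(pattern) <= len(text):
        if text.startswith(v, s + end - size):
            # Assumption: verify the whole window before moving on.
            if text.startswith(pattern, s):
                matches.append(s)
            s += steps
        else:
            s += 1
    return matches


P = "CAGACGACCCCAACAGC"
T = "ACGC" + P + "AA"  # made-up text containing P at position 4
print(search_with_unique_substring(T, P, end=17, size=2, steps=16))  # [4]
```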

• As discussed above, the size of the unique substring is important.

• In the following, we introduce another algorithm, which uses a unique substring of size one.

Algorithm 2 - Longest Substring with Unique Character Matching Algorithm

• Let x be any character of T in the window. For P to match T after a move, x in T must be aligned with the same character x in P, so we must find an x in P to the left of the position currently under x in T.

[Figure: (a) x in T with P below it; (b) P moved so that an x in P lies under x in T.]

• In the preprocessing phase, we find the longest substring p’ of P such that x in p’ occurs only once. That is, p’ = pi pi+1 … pj and pj occurs in p’ only once.

[Figure: (a) i = 1: p’ starts at the beginning of P and contains a single x; (b) i > 1: P contains another x to the left of p’.]

• If the unique character x matches x in T, we can move P |p’| steps.

[Figure: cases (a) i = 1 and (b) i > 1, each showing P under T before and after moving |p’| steps.]

Example

P = C A C T A G C C A C T C T C (positions 0 to 13)

In this example, we find the longest substring p4p5…p10 with a unique character p10.

If the character p10 matches T, we can move P 7 steps.
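The substring in this example can be found with one left-to-right pass that remembers where each character was last seen. The sketch below uses 0-based positions, as in the figure; the function name is hypothetical and a non-empty pattern is assumed.

```python
def longest_substring_with_unique_last_char(pattern: str):
    """Return (i, j), 0-based and inclusive, of the longest p' = p_i..p_j
    in which the last character p_j occurs only once."""
    last_seen = {}   # character -> most recent position
    best = (0, 0)
    for j, ch in enumerate(pattern):
        # p' may start right after the previous occurrence of ch.
        i = last_seen.get(ch, -1) + 1
        if j - i > best[1] - best[0]:
            best = (i, j)
        last_seen[ch] = j
    return best


i, j = longest_substring_with_unique_last_char("CACTAGCCACTCTC")
print(i, j, j - i + 1)  # 4 10 7
```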

Searching Phase

T = … C G C C T C G C T C G C G T G C T A A …

P = C A C T A G C C A C T C T C (positions 0 to 13)

If p10 mismatches, move P one step.

[Figure: P under T before and after moving 1 step.]

Searching Phase

T = … C G C C T C G C T C G C G T G C T A A …

P = C A C T A G C C A C T C T C (positions 0 to 13)

If p10 matches T, move P 7 steps.

[Figure: P under T before and after moving 7 steps.]
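The searching phase of Algorithm 2 might look like the sketch below. As an assumption not stated on the slides, the whole pattern is verified whenever the unique character matches. The function name is hypothetical; `j` is the position of the unique character and `steps` is |p’| (10 and 7 for this pattern).

```python
def search_with_unique_character(text, pattern, j, steps):
    """Report all start positions of pattern in text, testing p_j first."""
    matches = []
    s = 0
    while s + len(pattern) <= len(text):
        if text[s + j] == pattern[j]:
            # Assumption: verify the whole window before moving on.
            if text.startswith(pattern, s):
                matches.append(s)
            s += steps
        else:
            s += 1
    return matches


P = "CACTAGCCACTCTC"
T = "GG" + P + "TT"  # made-up text containing P at position 2
print(search_with_unique_character(T, P, j=10, steps=7))  # [2]
```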

Algorithm 3 - The Unique Pairwise Substring Algorithm

• The substring pipi+1…pj-1pj is called a unique pairwise substring if pipi+1…pj-1pj occurs in the prefix p1p2…pj-1pj of P exactly once, and no pkpk+1…pk+j-i exists in p1p2…pj-1 such that pk = pi and pk+j-i = pj.

Example

P = C A C T C A G C C A C T C G C (positions 0 to 14)

The substring TCG is a unique pairwise substring because no pkpk+1pk+2 exists in p1p2…p12 such that pk = p11 = T and pk+2 = p13 = G.

The substring CAC is not a unique pairwise substring because there exists a substring p2p3p4 in p1p2…p9 such that p2 = p8 = C and p4 = p10 = C.
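The two conditions of the definition can be tested directly. The sketch below uses the 0-based positions of the figure and treats the prefix as starting at the first character of P; the function name is hypothetical.

```python
def is_unique_pairwise(pattern: str, i: int, j: int) -> bool:
    """Check whether pattern[i..j] (0-based, inclusive) is a unique
    pairwise substring in the sense defined above."""
    gap = j - i
    sub = pattern[i:j + 1]
    # Condition 1: sub occurs exactly once in the prefix ending at p_j.
    occurrences = sum(pattern.startswith(sub, k) for k in range(i + 1))
    # Condition 2: no earlier pair of characters at the same distance
    # has the same two end characters.
    earlier_pair = any(
        pattern[k] == pattern[i] and pattern[k + gap] == pattern[j]
        for k in range(i)
    )
    return occurrences == 1 and not earlier_pair


P = "CACTCAGCCACTCGC"
print(is_unique_pairwise(P, 11, 13))  # True:  TCG
print(is_unique_pairwise(P, 8, 10))   # False: CAC, because p2 = C and p4 = C
```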

• Suppose pipi+1…pj-1pj is a unique pairwise substring.

• If pi and pj match T, we have two cases for moving P.

Case 1: There exists k such that pk = pj, where 0 ≤ k ≤ j-i-1.

We can move P j-k steps, taking the largest such k.

[Figure: x and y in T aligned with pi and pj of P; after the move, pk = y lies under y in T.]

Case 2: pk ≠ pj for every k, where 0 ≤ k ≤ j-i-1.

We can move P j+1 steps.

[Figure: x and y in T aligned with pi and pj of P; after the move, P starts just to the right of y in T.]
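The two cases translate into a short function. In this sketch, Case 1 uses the largest qualifying k so that no possible alignment is skipped; that choice, like the function name, is an assumption rather than something stated on the slides.

```python
def unique_pair_shift(pattern: str, i: int, j: int) -> int:
    """Steps to move P after p_i and p_j both match T (0-based positions)."""
    # Case 1: the rightmost k in 0..j-i-1 with p_k == p_j.
    for k in range(j - i - 1, -1, -1):
        if pattern[k] == pattern[j]:
            return j - k
    # Case 2: no such k, so P moves past the character matched by p_j.
    return j + 1


P = "CACTCAGCCACTCGC"
print(unique_pair_shift(P, 11, 13))  # 14 (Case 2)
print(unique_pair_shift(P, 7, 9))    # 8  (Case 1 with k = 1)
```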

Example

P = C A C T C A G C C A C T C G C (positions 0 to 14)

T = … C G C C T C G C T C G T G G G C T A A …

If we choose p11p12p13 as the unique pairwise substring, we can move P 14 steps when p11 and p13 match T.

[Figure: P under T before and after moving 14 steps.]

• There may be many unique pairwise substrings in the pattern.

• We select the rightmost one in the pattern.

Example

P = C A C T C A G C G A C T C G C (positions 0 to 14)

The substrings p5p6, p7p8p9 and p11p12p13 are all unique pairwise substrings.

We select p11p12p13 because it gives the largest move.

Example

P = C A C T C A G C C A C T C G C (positions 0 to 14)

T = … C G C C T C G C T C G T G G G C T A A …

If p11 or p13 mismatches, move P one step.

[Figure: P under T before and after moving 1 step.]

Example

P = C A C T C A G C C A C T C G C (positions 0 to 14)

T = … C G C C T C G C T C G T G G G C T A A …

If p11 and p13 match T, move P 14 steps.

[Figure: P under T before and after moving 14 steps.]
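Putting the two examples together, the searching phase of Algorithm 3 could be sketched as below. The full verification of the window when both characters match is an assumption, and the caller is trusted to pass a pair (i, j) that really is a unique pairwise substring together with its move `steps` (11, 13 and 14 here). All names are hypothetical.

```python
def search_with_unique_pair(text, pattern, i, j, steps):
    """Report all start positions of pattern in text, testing p_i and p_j first."""
    matches = []
    s = 0
    while s + len(pattern) <= len(text):
        if text[s + i] == pattern[i] and text[s + j] == pattern[j]:
            # Assumption: verify the whole window before moving on.
            if text.startswith(pattern, s):
                matches.append(s)
            s += steps
        else:
            # p_i or p_j mismatches: move P one step.
            s += 1
    return matches


P = "CACTCAGCCACTCGC"
T = "GT" + P + "A"  # made-up text containing P at position 2
print(search_with_unique_pair(T, P, i=11, j=13, steps=14))  # [2]
```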


Thanks for your attention.