1
String Matching Algorithms Based upon the Uniqueness Property
Advisor: Prof. R. C. T. Lee
Speaker: C. W. Lu
C. W. Lu and R. C. T. Lee, 2007, String Matching Algorithms Based upon the Uniqueness Property, The 24th Workshop on Combinatorial Mathematics and Computation Theory, pp.385-392.
2
• String matching problem
– Given a text string T of length n and a pattern string P of length m.
– Find all occurrences of P in T.
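As a baseline for the algorithms that follow, the problem statement can be sketched as a brute-force search. This is not one of the algorithms in the talk; the function name `find_all_naive` is an illustrative assumption, and the text is the lowercase T used in the first example of these slides.

```python
def find_all_naive(text: str, pattern: str) -> list[int]:
    """Return every 0-based position where pattern occurs in text (overlaps included)."""
    n, m = len(text), len(pattern)
    # Try each alignment of the pattern against the text in turn.
    return [pos for pos in range(n - m + 1) if text[pos:pos + m] == pattern]


print(find_all_naive("acgccgcgcccgcgctcaaa", "gcg"))
```

Every alignment is tested here, so the pattern always moves one step. The algorithms below try to move it further.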
3
Rule 1: The Suffix to Prefix Rule
• Suppose u is the longest suffix of a window which is also a prefix of P. We can then move P in such a way that the prefix u of P matches with the suffix u of the window.
[Figure: (a) the window of T ends with u; (b) P is moved so that its prefix u aligns with that suffix u.]
4
The Uniqueness Property of a String
• For any substring V of P, if V occurs in P only once, V is a unique substring.
• When V matches with some substring of T, we can move P in such a way that the prefix of P matches with the suffix of V.
[Figure: V matched in T; P is moved so that its prefix u aligns with the suffix u of V.]
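A minimal sketch of the uniqueness test, assuming 0-based Python strings. The helper name `is_unique_substring` is hypothetical. It compares the first and last occurrence rather than using `str.count`, because `str.count` ignores overlapping occurrences.

```python
def is_unique_substring(pattern: str, v: str) -> bool:
    """True if v occurs exactly once in pattern, counting overlapping occurrences."""
    first = pattern.find(v)
    # Unique means: v occurs, and its first and last occurrences coincide.
    return first != -1 and first == pattern.rfind(v)


print(is_unique_substring("catagtagcct", "cc"))  # the unique substring used in the example
print(is_unique_substring("catagtagcct", "ag"))  # occurs twice, so not unique
```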
5
Example
P = c a t a g t a g c c t
Suppose we use the substring “cc” as the unique substring.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
T a c g c c g c g c c c g c g c t c a a a
P c a t a g t a g c c t
0 1 2 3 4 5 6 7 8 9 10
6
Algorithm 1: The Longest Prefix with Unique Suffix Matching Algorithm
• We further relax the uniqueness property by noting that the substring does not have to be unique in the entire pattern P. A substring which is unique in a prefix of P suffices.
• Therefore, we only have to find the longest prefix of P whose suffix is unique in that prefix.
7
Example
P = CACTAGCCACTCTC
The substring TC occurs twice in P, but it is unique in the prefix CACTAGCCACTC.
T : CTAGCGTATGCCAGTCACGATCGAGCAGGCTAC…
P : CACTAGCCACTCTC
P : CACTAGCCACTCTC
Move P 11 steps.
8
Example
P = CACTAGCCACTCTC
The substring G is also unique in the prefix CACTAG.
Move P 6 steps.
T : CTAGCGTATGCCAGTCACGATCGAGCAGGCTAC…
P : CACTAGCCACTCTC
P : CACTAGCCACTCTC
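The two moves in these examples (11 steps for TC, 6 steps for G) can be reproduced with a small sketch. It assumes 0-based indices and combines the uniqueness test with Rule 1: the move is the prefix length minus the longest proper suffix of the unique substring that is also a prefix of P. The function name and the `None` return for non-unique substrings are illustrative assumptions, not taken from the slides.

```python
from typing import Optional


def shift_for_unique_suffix(pattern: str, end: int, size: int) -> Optional[int]:
    """Steps P may move when pattern[end-size+1 : end+1] matches the text.

    Returns None if that substring is not unique in the prefix pattern[:end+1].
    """
    start = end - size + 1
    if start < 0 or end >= len(pattern):
        return None
    u = pattern[start:end + 1]
    # u is a suffix of the prefix; it is unique there iff its first occurrence is at `start`.
    if pattern[:end + 1].find(u) != start:
        return None
    # Rule 1: longest proper suffix of u that is also a prefix of the pattern.
    border = 0
    for b in range(size - 1, 0, -1):
        if pattern[:b] == u[-b:]:
            border = b
            break
    return end + 1 - border


print(shift_for_unique_suffix("CACTAGCCACTCTC", 11, 2))  # TC in the prefix CACTAGCCACTC
print(shift_for_unique_suffix("CACTAGCCACTCTC", 5, 1))   # G in the prefix CACTAG
```

For TC the final C is also the first character of P, so the move is 12 - 1 = 11. For G nothing overlaps, so the move is the full prefix length 6.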
9
In the above example, using the unique substring TC, we could move P 11 steps if TC matches with TC in T; using the unique substring G, we could move P 6 steps if G matches with G in T.
P = CACTAGCCACTCTC
Is the unique substring TC better than the unique substring G?
10
• Note that our algorithm is efficient when the unique substring appears in T many times.
• In general, assuming an alphabet of size 4 with equally likely characters, the probability that TC in P matches at a given position of T is 1/16, and the probability that G in P matches at a given position of T is 1/4.
• Thus, the size of the unique substring is also important.
11
• While the substring TC in P is expected to match TC in T once and move P by 11 steps, the substring G in P may match G in T four times and move P by 6 steps each time. So, we expect that the substring G would be better than the substring TC in general.
P = CACTAGCCACTCTC
12
• We now define a ratio σ to determine which substring is better.
• Let Σ be the alphabet.
$$\sigma = \frac{\text{Steps of moving } P}{|\Sigma|^{\text{Size of substring}}}$$
• The larger σ is, the better efficiency can be achieved in the searching phase.
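A short sketch of the ratio, reading it as the number of steps divided by |Σ| raised to the substring size. That reading is consistent with the worked values 3/4 and 16/4² later in the slides. The function name `sigma` is an illustrative assumption.

```python
def sigma(steps: int, size: int, alphabet_size: int) -> float:
    """Ratio of the move length to |alphabet|**size (larger is better)."""
    return steps / alphabet_size ** size


# The two candidates from P = CACTAGCCACTCTC with |alphabet| = 4.
print(sigma(11, 2, 4))  # TC: 11 steps, size 2
print(sigma(6, 1, 4))   # G:   6 steps, size 1
```

Under this measure G scores higher than TC, which agrees with the expectation stated earlier that G would be the better choice in general.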
13
Preprocessing Phase
P = CAGACGACCCCAACAGC
Σ = {A, C, G, T}, |Σ| = 4.
Find the longest prefix with a unique suffix of size one.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
T A C G C C G C G C C C G C G C T C A A A …
P C A G A C G A C C C C A A C A G C
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
$$\sigma = \frac{\text{Steps of moving}}{|\Sigma|^{\text{Size of substring}}} = \frac{3}{4^{1}} = \frac{3}{4}$$
14
Preprocessing Phase
• We have found the unique substring of size 1, and we could use it to move P 3 steps.
• Next, we try to find a unique substring of size 2 which gives a larger σ. Since σ = 3/4 for size 1, such a substring must move P more than 3*4 = 12 steps.
• Thus, we only consider the substrings of $p_{12} p_{13} \ldots p_{16}$.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
T A C G C C G C G C C C G C G C G C A A A …
P C A G A C G A C C C C A A C A G C
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
P C A G A C G
0 1 2 3 4 5
$$\sigma = \frac{\text{Steps of moving}}{|\Sigma|^{\text{Size of substring}}} = \frac{16}{4^{2}} = 1$$
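The preprocessing walk-through above can be sketched as an exhaustive scan over substring sizes and end positions. This is an illustration under stated assumptions rather than the authors' procedure: it uses 0-based indices, does not apply the pruning to $p_{12} \ldots p_{16}$ described on the slide, and the names `best_unique_suffix` and `max_size` are hypothetical.

```python
def best_unique_suffix(pattern: str, alphabet_size: int, max_size: int = 3):
    """Return (sigma, size, end, steps) with the largest sigma, or None if nothing qualifies."""
    best = None
    for size in range(1, max_size + 1):
        for end in range(size - 1, len(pattern)):
            start = end - size + 1
            u = pattern[start:end + 1]
            # Skip substrings that already occur earlier in the prefix pattern[:end+1].
            if pattern[:end + 1].find(u) != start:
                continue
            # Longest proper suffix of u that is also a prefix of the pattern.
            border = next((b for b in range(size - 1, 0, -1)
                           if pattern[:b] == u[-b:]), 0)
            steps = end + 1 - border
            candidate = (steps / alphabet_size ** size, size, end, steps)
            if best is None or candidate[0] > best[0]:
                best = candidate
    return best


print(best_unique_suffix("CAGACGACCCCAACAGC", 4, max_size=1))
print(best_unique_suffix("CAGACGACCCCAACAGC", 4, max_size=2))
```

With `max_size=1` the scan settles on G at index 2 (3 steps, σ = 3/4). Allowing size 2 selects GC ending at index 16 (16 steps, σ = 1).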
15
Searching Phase
T … C G C C G C G C C C G C G C G C A A A …
P C A G A C G A C C C C A A C A G C
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
If the unique substring mismatches, move P one step.
16
Searching Phase
T … C G C C G C G C C C G C G C G C A A A …
P C A G A C G A C C C C A A C A G C
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
If the unique substring GC matches with GC in T, move P 16 steps.
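The searching phase can be sketched as follows. The slides only state the two moves, so the full-window comparison made before a long move is an assumption added here so that occurrences are actually reported. The parameters `end`, `size` and `steps` are assumed to come from the preprocessing phase, and the function name is hypothetical.

```python
def search_with_unique_suffix(text: str, pattern: str,
                              end: int, size: int, steps: int) -> list[int]:
    """Report all occurrences, testing the unique substring pattern[end-size+1:end+1] first."""
    start = end - size + 1
    key = pattern[start:end + 1]
    m = len(pattern)
    hits = []
    pos = 0
    while pos + m <= len(text):
        if text[pos + start:pos + end + 1] == key:
            # Unique substring matched: verify the whole window, then take the long move.
            if text[pos:pos + m] == pattern:
                hits.append(pos)
            pos += steps
        else:
            # Mismatch: no occurrence at this alignment, move one step.
            pos += 1
    return hits


p = "CAGACGACCCCAACAGC"
print(search_with_unique_suffix("TT" + p + "GG" + p, p, end=16, size=2, steps=16))
```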
17
• As we discussed above, the size of the unique substring is important.
• In the following, we will introduce another algorithm which uses a unique substring of size one.
18
Algorithm 2: Longest Substring with Unique Character Matching Algorithm
• In the window, let x be any character. In order to have any meaningful matching of P with T, we must find the same x in P located to the left of x in T.
[Figure: (a) x in the window of T; (b) P moved so that an x in P aligns with x in T.]
19
• In the preprocessing phase, we try to find the longest substring $p'$ in P such that x in $p'$ occurs only once. That is, $p' = p_i p_{i+1} \ldots p_j$ and $p_j$ occurs in $p'$ only once.
[Figure: $p'$ within P for (a) i = 1 and (b) i > 1.]
20
• If the unique character x matches with x in T, we can move P $|p'|$ steps.
[Figure: P moved $|p'|$ steps after x matches in T, for (a) i = 1 and (b) i > 1.]
21
Example
P C A C T A G C C A C T C T C
0 1 2 3 4 5 6 7 8 9 10 11 12 13
In this example, we would find the longest substring $p_4 p_5 \ldots p_{10}$ with a unique character $p_{10}$.
If the character $p_{10}$ matches with T, we can move P 7 steps.
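A minimal sketch of this preprocessing step, assuming 0-based indices as in the example and a non-empty pattern. For each position j it extends the substring back to just after the previous occurrence of the same character. The function name is an illustrative assumption.

```python
def longest_unique_char_substring(pattern: str) -> tuple[int, int]:
    """Return (i, j) maximizing j - i + 1 such that pattern[j] occurs once in pattern[i:j+1]."""
    last_seen: dict[str, int] = {}
    best = (0, 0)
    for j, ch in enumerate(pattern):
        # The substring may start right after the previous occurrence of ch.
        i = last_seen.get(ch, -1) + 1
        if j - i > best[1] - best[0]:
            best = (i, j)
        last_seen[ch] = j
    return best


i, j = longest_unique_char_substring("CACTAGCCACTCTC")
print(i, j, j - i + 1)
```

For the pattern in this example the result is the substring from index 4 to index 10, of length 7, matching the 7-step move stated above.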
22
Searching Phase
T … C G C C T C G C T C G C G T G C T A A …
P C A C T A G C C A C T C T C
0 1 2 3 4 5 6 7 8 9 10 11 12 13
If $p_{10}$ mismatches, move P one step.
23
Searching Phase
T … C G C C T C G C T C G C G T G C T A A …
P C A C T A G C C A C T C T C
0 1 2 3 4 5 6 7 8 9 10 11 12 13
If $p_{10}$ matches with T, move P 7 steps.
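The searching phase of Algorithm 2 can be sketched in the same way as for Algorithm 1. The verification of the whole window on a character match is an assumption added so that occurrences are reported. `j` and `length` are assumed to be the position of the unique character and $|p'|$ from preprocessing, and the function name is hypothetical.

```python
def search_with_unique_char(text: str, pattern: str, j: int, length: int) -> list[int]:
    """Report all occurrences, testing the single character pattern[j] first."""
    m = len(pattern)
    hits = []
    pos = 0
    while pos + m <= len(text):
        if text[pos + j] == pattern[j]:
            # Unique character matched: verify the window, then move |p'| steps.
            if text[pos:pos + m] == pattern:
                hits.append(pos)
            pos += length
        else:
            # Mismatch: move one step.
            pos += 1
    return hits


p = "CACTAGCCACTCTC"
t = "CGCCTCGCTCGCGTGCTAA" + p + "AG" + p
print(search_with_unique_char(t, p, j=10, length=7))
```

Moving 7 steps is safe because any shorter move would place another character of $p'$ under the matched T, and none of them equals $p_{10}$.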
24
Algorithm 3: The Unique Pairwise Substring Algorithm
• The substring $p_i p_{i+1} \ldots p_{j-1} p_j$ is called a unique pairwise substring if it occurs in the prefix $p_1 p_2 \ldots p_{j-1} p_j$ of P exactly once, and no $p_k p_{k+1} \ldots p_{k+j-i}$ exists in $p_1 p_2 \ldots p_{j-1}$ such that $p_k = p_i$ and $p_{k+j-i} = p_j$.
25
Example
P C A C T C A G C C A C T C G C
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
The substring TCG is a unique pairwise substring because no $p_k p_{k+1} p_{k+2}$ exists in $p_1 p_2 \ldots p_{12}$ such that $p_k = p_{11} = T$ and $p_{k+2} = p_{13} = G$.
The substring CAC is not a unique pairwise substring because there exists a substring $p_2 p_3 p_4$ in $p_1 p_2 \ldots p_9$ such that $p_2 = p_8 = C$ and $p_4 = p_{10} = C$.
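A small sketch of the pairwise uniqueness test, using the 0-based indices shown under P in the example. It checks only the end-character condition, since an earlier occurrence of the whole substring would also have the same two end characters. The function name is an illustrative assumption.

```python
def is_unique_pairwise(pattern: str, i: int, j: int) -> bool:
    """True if no earlier start k < i has pattern[k] == pattern[i] and pattern[k+j-i] == pattern[j]."""
    span = j - i
    return not any(pattern[k] == pattern[i] and pattern[k + span] == pattern[j]
                   for k in range(i))


p = "CACTCAGCCACTCGC"
print(is_unique_pairwise(p, 11, 13))  # TCG
print(is_unique_pairwise(p, 8, 10))   # CAC
```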
26
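• Suppose $p_i p_{i+1} \ldots p_{j-1} p_j$ is a unique pairwise substring.
• If $p_i$ and $p_j$ match with T, we have two cases to move P.
Case 1: There exists k such that $p_j = p_k$, where $0 \le k \le j-i-1$.
We can move P $j-k$ steps.
[Figure: $p_i = x$ and $p_j = y$ match x and y in T; P is moved so that $p_k = y$ aligns with y in T.]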
27
Case 2: $p_j \ne p_k$ for all k, where $0 \le k \le j-i-1$.
We can move P $j+1$ steps.
[Figure: $p_i = x$ and $p_j = y$ match x and y in T; P is moved entirely past y in T.]
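The two cases can be combined in one short sketch with 0-based indices. Taking the rightmost qualifying k in Case 1 is an assumption made here so that the move is the smallest one that could still produce a match; the slides do not say which k to take. The function name is hypothetical.

```python
def pairwise_move(pattern: str, i: int, j: int) -> int:
    """Steps P may move once pattern[i] and pattern[j] both match the text."""
    y = pattern[j]
    # Case 1: rightmost k in [0, j-i-1] with pattern[k] == y, move j - k steps.
    for k in range(j - i - 1, -1, -1):
        if pattern[k] == y:
            return j - k
    # Case 2: no such k, move j + 1 steps.
    return j + 1


print(pairwise_move("CACTCAGCCACTCGC", 11, 13))  # Case 2
print(pairwise_move("CACTCAGCGACTCGC", 7, 9))    # Case 1 with k = 1
```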
28
Example
P C A C T C A G C C A C T C G C
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
If we choose $p_{11} p_{12} p_{13}$ as the unique pairwise substring, we can move P 14 steps when $p_{11}$ and $p_{13}$ match with T.
T … C G C C T C G C T C G T G G G C T A A …
29
• There may be many unique pairwise substrings in the pattern.
• We select the rightmost one in the pattern.
Example
P C A C T C A G C G A C T C G C
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
The substrings $p_5 p_6$, $p_7 p_8 p_9$ and $p_{11} p_{12} p_{13}$ are all unique pairwise substrings.
We would select $p_{11} p_{12} p_{13}$ because it will have the largest move.
30
Example
T … C G C C T C G C T C G T G G G C T A A …
P C A C T C A G C C A C T C G C
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
If $p_{11}$ or $p_{13}$ mismatches, move P one step.
31
Example
T … C G C C T C G C T C G T G G G C T A A …
P C A C T C A G C C A C T C G C
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
If $p_{11}$ and $p_{13}$ match with T, move P 14 steps.
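The searching phase of Algorithm 3 can be sketched as below. As in the earlier sketches, the full-window check before the long move is an assumption added so that matches are reported. `i`, `j` and `steps` are assumed to come from preprocessing (here 11, 13 and 14, as in the example), and the function name is hypothetical.

```python
def search_with_unique_pair(text: str, pattern: str,
                            i: int, j: int, steps: int) -> list[int]:
    """Report all occurrences, testing only pattern[i] and pattern[j] first."""
    m = len(pattern)
    hits = []
    pos = 0
    while pos + m <= len(text):
        if text[pos + i] == pattern[i] and text[pos + j] == pattern[j]:
            # Both end characters matched: verify the window, then take the long move.
            if text[pos:pos + m] == pattern:
                hits.append(pos)
            pos += steps
        else:
            # Either character mismatched: move one step.
            pos += 1
    return hits


p = "CACTCAGCCACTCGC"
t = "CGCCTCGCTCGTGGGCTAA" + p + "TG" + p
print(search_with_unique_pair(t, p, i=11, j=13, steps=14))
```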
34
Thanks for your attention.