String Matching
Algorithm: Design & Analysis
[19]
In the last class…
Optimal Binary Search Tree
Separating Sequence of Words
Dynamic Programming Algorithms
String Matching
Simple String Matching
KMP Flowchart Construction
Jump at Fail
KMP Scan
String Matching: Problem Description
Search the text T, a string of characters of length n,
for the pattern P, a string of characters of length m (usually, m<<n).
The result: if T contains P as a substring, return the index at which the substring starts in T; otherwise, fail.
Straightforward Solution
T:  t1 … ti … ti+k-2  ti+k-1 … ti+m-1 … tn
P:       p1 … pk-1    pk     … pm
First matched character: p1 against ti. Matched window: p1 … pk-1, expanding to the right. Next comparison: pk against ti+k-1.
Note: If pk fails to match ti+k-1, backtracking occurs: a new cycle of character matching starts from ti+1. In the worst case, backtracking occurs nearly n times and each cycle makes nearly m comparisons, so the running time is in Θ(mn).
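A minimal Python sketch of the straightforward solution, counting character comparisons so that the cost of backtracking is visible. The name simple_match and the returned (index, comparisons) pair are assumptions of this sketch, not part of the slides; the index is 1-based to agree with t1 … tn.

```python
def simple_match(text, pattern):
    """Return (index, comparisons); index is 1-based, or -1 on failure."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):      # i is the 0-based start of the current cycle
        k = 0
        while k < m:
            comparisons += 1
            if text[i + k] != pattern[k]:
                break               # backtrack: the next cycle starts at i + 1
            k += 1
        if k == m:                  # the whole pattern matched in this cycle
            return i + 1, comparisons
    return -1, comparisons


print(simple_match("AAAAAAAB", "AAAB"))
```

On T = "AAAAAAAB" and P = "AAAB" every cycle re-examines the same A's: 5 cycles of 4 comparisons each, 20 in total.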
Disadvantages of Backtracking
More comparisons are needed.
Up to m-1 of the most recently matched characters have to be readily available for re-examination (consider texts that are too long to be loaded in their entirety).
An Intuitive Finite Automaton for Matching a Given Pattern
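[Figure: automaton for pattern "AABC", alphabet = {A, B, C}. Nodes 1 (start node), 2, 3, 4 and * (stop node, matched!). Forward edges: 1 -A-> 2, 2 -A-> 3, 3 -B-> 4, 4 -C-> *. Other edges: 1 -B,C-> 1, 2 -B,C-> 1, 3 -A-> 3, 3 -C-> 1, 4 -A-> 2, 4 -B-> 1.]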
Advantage: each character in the text is checked only once.
Difficulty: construction of the automaton: too many edges (for a large alphabet) to be defined and stored.
Why no backtracking?
Because the current node memorizes the prefix of P matched so far.
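A small Python sketch of such an automaton, built by brute force from the pattern. Everything here (the names build_automaton and automaton_match, states numbered 0 … m by the count of matched characters, so state s corresponds to node s+1 of the slide and state m to the stop node) is an assumption of this example. It stores one edge per state and alphabet symbol, which is exactly the storage difficulty noted above, and it assumes every text character belongs to the alphabet.

```python
def build_automaton(pattern, alphabet):
    """delta[state][ch] is the next state; state = number of pattern chars matched."""
    m = len(pattern)
    delta = []
    for state in range(m):
        row = {}
        for ch in alphabet:
            seen = pattern[:state] + ch
            # longest prefix of the pattern that is a suffix of what has been seen
            nxt = len(seen)
            while nxt > 0 and not seen.endswith(pattern[:nxt]):
                nxt -= 1
            row[ch] = nxt
        delta.append(row)
    return delta


def automaton_match(text, pattern, alphabet):
    """Return the 1-based index where a copy of pattern begins, or -1."""
    delta = build_automaton(pattern, alphabet)
    state = 0
    for j, ch in enumerate(text):       # each text character is read exactly once
        state = delta[state][ch]
        if state == len(pattern):
            return j - len(pattern) + 2
    return -1


print(build_automaton("AABC", "ABC"))
print(automaton_match("CAAABCA", "AABC", "ABC"))
```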
The Knuth-Morris-Pratt Flowchart
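[Figure: KMP flowchart for P = "ABABCB": a "get next text char." node, then cells 1 to 6 labeled A, B, A, B, C, B, and the stop node *; each cell has a success link and a failure link.]
An example: T = "ACABAABABA", P = "ABABCB"
KMP cell number      1  2  1  0  1  2  3  4  2  1  2  3  4  5  3  4
Text index scanned   1  2  2  2  3  4  5  6  6  6  7  8  9 10 10 11
Text character       A  C  C  C  A  B  A  A  A  A  B  A  B  A  A  -
Success or failure   s  f  f  g  s  s  s  f  f  s  s  s  s  f  s  F
Legend: s = success, f = failure, g = get next char. (cell 0), F = the scan fails (end of text reached).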
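The example trace can be reproduced mechanically. The Python sketch below is an illustration only: the names fail_links and kmp_trace are assumptions, cell 0 stands for the "get next text char." node, and the fail links are computed with the standard KMP recurrence. It records the (cell, text index) pair at every step, both 1-based.

```python
def fail_links(p):
    """fail[k] for k = 1..m (index 0 unused), with fail[1] = 0."""
    m = len(p)
    fail = [0] * (m + 1)
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s >= 1 and p[s - 1] != p[k - 2]:   # compare p_s with p_(k-1)
            s = fail[s]
        fail[k] = s + 1
    return fail


def kmp_trace(p, t):
    """List of (cell, text index) pairs visited while scanning t for p."""
    fail = fail_links(p)
    m, n = len(p), len(t)
    j, k = 1, 1
    steps = []
    while True:
        steps.append((k, j))
        if j > n or k > m:            # text exhausted, or stop node reached
            break
        if k == 0:                    # cell 0: get the next text character
            j += 1
            k = 1
        elif t[j - 1] == p[k - 1]:    # success link
            j += 1
            k += 1
        else:                         # failure link
            k = fail[k]
    return steps


for cell, index in kmp_trace("ABABCB", "ACABAABABA"):
    print(cell, index)
```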
P:     ABABABCB
T: ... ABABABx …
Matched frame: the first six characters, ABABAB. To be compared next: x against C.
If x is not C:
P:       ABABABCB
T: ... ABABABx …
The matched frame moves to the right by 2 chars (the new matched frame is ABAB), which is equal to moving the pointer for P backward.
P:         ABABABCB
T: ... ABABABABCB …
Moving by 4 chars may result in an error: here it skips the copy of P that begins 2 chars to the right.
Sliding the Matched Frame
When a mismatch occurs:
P:            p1 …… pk-1 pk …… ……
T:  t1 …… ti …… tj-1 tj ……
Matched frame: p1 … pk-1 against ti … tj-1. Mismatch: pk against tj.
The matched frame slides, with its breadth changed as well:
P (old):      p1 …… pk-r+1 …… pk-1
P (new):              p1 …… pr-1 pr ……
T:  t1 …… ti …… tj-r+1 …… tj-1 tj ……
New matched frame: p1 … pr-1 against tj-r+1 … tj-1, as large as possible. Next comparison: pr against tj.
Fail Links
Out of each node of the KMP flowchart is a fail link, leading to node r, where r is the largest non-negative integer satisfying r<k such that p1,…,pr-1 matches pk-r+1,…,pk-1 (stored in fail[k]).
Note: r is independent of T.
[Figure: two copies of P, the lower one shifted right by k-r so that pr lies under pk; the pointer for P moves backward from k to r, while the pointer for T only moves forward.]
Which means:
When the comparison fails at node k, the next comparison is the same text character (the one that failed against pk) vs. pr.
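The definition can be checked directly, before any clever algorithm is introduced. The Python sketch below (the name fail_by_definition is an assumption of this example) tries every candidate r from the largest down and keeps the first one for which p1 … pr-1 equals pk-r+1 … pk-1. It is deliberately slow and is meant only to pin down what fail[k] means.

```python
def fail_by_definition(p):
    """fail[k] for k = 1..m straight from the definition (index 0 unused)."""
    m = len(p)
    fail = [0] * (m + 1)                  # fail[1] = 0
    for k in range(2, m + 1):
        for r in range(k - 1, 0, -1):     # try the largest r first
            # p_1..p_(r-1) against p_(k-r+1)..p_(k-1), written as 0-based slices
            if p[:r - 1] == p[k - r:k - 1]:
                fail[k] = r
                break
    return fail


print(fail_by_definition("ABABABCB")[1:])
```

For r = 1 both slices are empty, so every node k ≥ 2 has a fail link to node 1 at least.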
Computing the Fail Links
Thinking recursively, let fail[k-1]=s:
P (shifted):        p1 …… ps-1  ps    ps+1 …… ……
P:          p1 …… pk-s …… pk-2  pk-1  pk …… pm
Matched: p1 … ps-1 against pk-s … pk-2. To be compared: ps against pk-1.
Case 1: ps=pk-1
fail[k]=s+1
Case 2: ps≠pk-1
P (shifted again):      p1 … pfail[s]-1  pfail[s]
P (shifted):        p1 …… ps-1           ps    ps+1 ……
P:          p1 … …… pk-2                 pk-1  pk …… pm
To be compared, thinking recursively: pfail[s] against pk-1.
Recursion on Node fail[s]
Thinking recursively, at the beginning, s=fail[k-1]:
Case 2: ps≠pk-1
P (shifted again):      p1 … pfail[s]-1  pfail[s]
P (shifted):        p1 …… ps-1           ps    ps+1 ……
P:          p1 … …… pk-2                 pk-1  pk …… pm
ps is replaced by pfail[s], that is, s assumes the new value fail[s].
Then, proceeding with the new s, that is:
If case 1 applies (ps=pk-1): fail[k]=s+1, or
If case 2 applies (ps≠pk-1): take another new s.
If s reaches 0, then fail[k]=0+1=1.
Computing Fail Links: an Example
Constructing the KMP flowchart for P = "ABABABCB"
Assuming that fail[1] to fail[6] have been computed:
[Figure: KMP flowchart for P; node 0 is "get next text char.", nodes 1 to 8 are labeled A, B, A, B, A, B, C, B, and node 9 is the stop node *.]
fail[7]: ∵ fail[6]=4 and p6=p4, ∴ fail[7]=fail[6]+1=5 (case 1)
fail[8]: fail[7]=5, but p7≠p5, so let s=fail[5]=3; but p7≠p3, so, going further back, let s=fail[3]=1. Still p7≠p1. Further, let s=fail[1]=0, so fail[8]=0+1=1 (case 2)
Constructing KMP Flowchart
Input: P, a string of characters; m, the length of P
Output: fail, the array of failure links, filled (P and fail are indexed from 1)
void kmpSetup(char[] P, int m, int[] fail)
{
    int k, s;
    fail[1] = 0;
    for (k = 2; k <= m; k++)
    {
        s = fail[k-1];
        while (s >= 1)
        {
            if (P[s] == P[k-1])   // case 1: ps = pk-1
                break;
            s = fail[s];          // case 2: follow the fail link of node s
        }
        fail[k] = s + 1;
    }
}
The for loop executes m-1 times, and the while loop executes at most m times per pass, since fail[s] is always less than s.
So, the complexity is roughly O(m^2).
Number of Character Comparisons
Successful comparisons: at most one for each k, totaling at most m-1.
Unsuccessful comparisons: each is always followed by a decrease of s. Since
  s is initialized as 0,
  s increases by one each time (fail[k]=s+1, then s=fail[k-1] in the next pass of the for loop; these 2 lines combine to increase s by 1, done m-2 times), and
  s is never negative,
the number of decreases cannot be larger than the number of increases.
Total: at most 2m-3 character comparisons.
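The counting argument can be observed on the running example. The following Python sketch mirrors kmpSetup with an added comparison counter; the counter and the name kmp_setup are assumptions of this example, and the pattern is a 0-based Python string, so ps is written p[s - 1].

```python
def kmp_setup(p):
    """Return (fail, comparisons); fail is 1-based with index 0 unused."""
    m = len(p)
    fail = [0] * (m + 1)                  # fail[1] = 0
    comparisons = 0
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s >= 1:
            comparisons += 1
            if p[s - 1] == p[k - 2]:      # case 1: p_s = p_(k-1)
                break
            s = fail[s]                   # case 2: retry with a smaller s
        fail[k] = s + 1
    return fail, comparisons


fail, count = kmp_setup("ABABABCB")
print(fail[1:], count, 2 * len("ABABABCB") - 3)
```

For P = "ABABABCB" this makes 8 character comparisons, within the bound 2m-3 = 13 and far below m^2.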
KMP Scan: the Algorithm
Input: P and T, the pattern and text; m, the length of P; fail, the array of failure links for P.
Output: index in T where a copy of P begins, or -1 if no match
int kmpScan(char[] P, char[] T, int m, int[] fail)
{
    int match, j, k;   // j indexes T, and k indexes P
    match = -1;
    j = 1; k = 1;
    while (endText(T, j) == false)
    {
        if (k > m) break;
        if (k == 0) { j++; k = 1; }
        else if (T[j] == P[k]) { j++; k++; }   // one character matched
        else k = fail[k];                      // following the failure link
    }
    if (k > m) match = j - m;   // tested after the loop, so a copy of P ending at the last text character is also found
    return match;
}
Each time a new cycle of the while loop begins, p1,…,pk-1 are matched.
The body of the while loop is executed at most 2n times. Why?
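A runnable Python sketch of the scan, for experimenting with the loop count. The name kmp_scan, the inlined fail-link setup, and the returned iteration counter are assumptions of this example; the match test is placed after the loop so that a copy of P ending at the very last text character is still reported.

```python
def kmp_scan(p, t):
    """Return (match, iterations); match is the 1-based start in t, or -1."""
    m, n = len(p), len(t)
    fail = [0] * (m + 1)                 # 1-based fail links, fail[1] = 0
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s >= 1 and p[s - 1] != p[k - 2]:
            s = fail[s]
        fail[k] = s + 1

    j, k = 1, 1                          # j indexes t, k indexes p (both 1-based)
    iterations = 0
    while j <= n and k <= m:
        iterations += 1
        if k == 0:                       # get the next text character
            j += 1
            k = 1
        elif t[j - 1] == p[k - 1]:       # one character matched
            j += 1
            k += 1
        else:                            # follow the failure link; j stays put
            k = fail[k]
    match = j - m if k > m else -1
    return match, iterations


print(kmp_scan("ABABCB", "ACABAABABA"))
print(kmp_scan("ABABABCB", "xxABABABABCBxx"))
```

On the earlier example (n = 10) the loop body runs 15 times. A hint for the question above: j never decreases, and k can only drop along fail links as often as it was previously increased together with j.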
Skipping Characters in String Matching
T: If you wish to understand others you must …
P: must
Checking the characters in P, in reverse order
[Figure: successive positions of P = "must" under T; after each mismatch P jumps forward, skipping text characters.]
The copy of P begins at t38. Matching is achieved in 18 comparisons.
Distance of Jumping Forward
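With the knowledge of P, the distance the pointer of T jumps forward is determined by the text character itself, independent of its location in T.
P (old):  p1 … A … A … pm
P (new):      p1 … A … A … ps … pm
T:        t1 …… tj=A …… …… tn
Current j points at tj = A. The rightmost 'A' in P, which is pk, is lined up with tj, and new j points at the text character under the new position of pm.
charJump['A'] = m-k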
Computing the Jump: Algorithm
Input: Pattern string P; m, the length of P; alphabet size alpha=|Σ|
Output: Array charJump, indexed 0,…,alpha-1, storing the jumping offsets for each char in the alphabet.
void computeJumps(char[] P, int m, int alpha, int[] charJump)
{
    char ch; int k;
    for (ch = 0; ch < alpha; ch++)
        charJump[ch] = m;          // for all chars not in P, jump by m
    for (k = 1; k <= m; k++)
        charJump[P[k]] = m - k;
}
The increasing order of k ensures that, for symbols duplicated in P, the jump is computed according to the rightmost occurrence.
Running time: Θ(|Σ|+m)
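A Python sketch of the jump table together with a scan that uses it alone. Only computeJumps comes from the slides; the scan loop is an assumption of this example (a common simplified formulation that checks P right to left and, on a mismatch, advances j by max(charJump, m-k+1) so that P always moves forward), as are the names compute_jumps and simple_bm_scan and the use of a dictionary with default m instead of an array over the alphabet.

```python
def compute_jumps(p):
    """charJump for the characters of p; any other character jumps by m."""
    m = len(p)
    char_jump = {}
    for k in range(1, m + 1):            # increasing k: the rightmost occurrence wins
        char_jump[p[k - 1]] = m - k
    return char_jump


def simple_bm_scan(p, t):
    """Return (match, comparisons); match is the 1-based start in t, or -1."""
    m, n = len(p), len(t)
    char_jump = compute_jumps(p)
    comparisons = 0
    j, k = m, m                          # 1-based pointers into t and p
    while j <= n:
        if k < 1:                        # all of p matched; the copy starts at j + 1
            return j + 1, comparisons
        comparisons += 1
        if t[j - 1] == p[k - 1]:
            j -= 1
            k -= 1
        else:
            j += max(char_jump.get(t[j - 1], m), m - k + 1)
            k = m
    return -1, comparisons


print(compute_jumps("must"))
print(simple_bm_scan("must", "If you wish to understand others you must"))
```

On the slide's example this scan finds the copy of P at t38 after 18 comparisons.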
Partially Matched Substring
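P:     b a t s a n d c a t s
T: ……                d a t s ……
Matched suffix: "ats". Current j points at the text character 'd', which fails against 'c'.
charJump['d']=4: new j is 4 chars ahead, so P moves only 1 char.
Remembering the matched suffix, we can get a better jump:
P:                   b a t s a n d c a t s
T: ……                d a t s ……
P moves 7 chars, lining up its earlier "ats" with the matched text.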
Forward to Match the Suffix
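P:  p1 …… pk pk+1 …… pm
T:  t1 …… tj tj+1 …… …… tn
Matched suffix: pk+1 … pm. Mismatch: pk against tj.
A substring the same as the matched suffix occurs in P:
P (new):  p1 …… pr pr+1 …… pr+m-k …… pm
P (old):  p1 …… pk pk+1 …… pm
T:        t1 …… tj tj+1 …… …… tn
P slides forward by slide[k], lining up pr+1 … pr+m-k with the matched text; the pointer moves from old j to new j, a distance of matchJump[k].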
Partial Match for the Suffix
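P:  p1 …… pk pk+1 …… pm
T:  t1 …… tj tj+1 …… …… tn
Matched suffix: pk+1 … pm. Mismatch: pk against tj.
No entire substring the same as the matched suffix occurs in P:
P (new):  p1 …… pq …… pm
P (old):  p1 …… pk pk+1 …… pm
T:        t1 …… tj tj+1 …… …… tn
P slides forward by slide[k], lining up the prefix p1 … pq (which may be empty) with the end of the matched text; the pointer moves from old j to new j, a distance of matchJump[k].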
matchjump and slide
P (new):  p1 …… pr pr+1 …… pr+m-k …… pm
P (old):  p1 …… pk pk+1 …… pm
T:        t1 …… tj tj+1 …… …… tn
slide[k] is the distance between the old and new positions of P; matchJump[k] is the distance from old j to new j.
• slide[k]: the distance P slides forward after a mismatch at pk, with m-k chars matched to the right.
• matchjump[k]: the distance that j, the pointer of T, jumps, that is: matchjump[k]=slide[k]+m-k.
• Let r (r<k) be the largest index such that pr+1 starts a substring matching the whole matched suffix of P, and pr≠pk (for r=0 there is no pr, so this condition holds trivially); then slide[k]=k-r.
• If no such r is found, the longest prefix of P, of length q, that matches the end of the matched suffix of P will be lined up. Then slide[k]=m-q.
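These definitions translate almost word for word into a brute-force Python sketch. The name match_jumps and the quadratic search are assumptions of this example, meant only to make the definitions of r and q concrete; it is not an efficient way to build the table.

```python
def match_jumps(p):
    """Return (slide, jump), both 1-based with index 0 unused."""
    m = len(p)
    slide = [0] * (m + 1)
    jump = [0] * (m + 1)
    for k in range(1, m + 1):
        suffix = p[k:]                    # the matched suffix p_(k+1)..p_m
        length = m - k
        found = None
        for r in range(k - 1, -1, -1):    # largest r < k first
            same = p[r:r + length] == suffix
            differs = r == 0 or p[r - 1] != p[k - 1]   # p_r != p_k, trivially true for r = 0
            if same and differs:
                found = r
                break
        if found is not None:
            slide[k] = k - found
        else:
            # longest prefix of p (length q) matching the end of the matched suffix
            q = max(q for q in range(length + 1) if p[:q] == p[m - q:])
            slide[k] = m - q
        jump[k] = slide[k] + length
    return slide, jump


slide, jump = match_jumps("wowwow")
print(slide[1:], jump[1:])
```

For P = "batsandcats" and k = 8 (matched suffix "ats") this gives slide[8] = 7.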
Computing matchJump: Example
P = "wowwow" (m = 6). Direction of computing: from k = m down to k = 1.
k = 6: the matched suffix is empty.
T:  …  .  .  .  .  .  tj ……
P:     w  o  w  w  o  w
P:        w  o  w  w  o  w
slide[6]=1, (m-k)=0, so matchJump[6]=1
k = 5: the matched suffix is 1 char, "w".
T:  …  .  .  .  .  tj w  ……
P:     w  o  w  w  o  w
P:           w  o  w  w  o  w
p3≠pk, so slide[5]=5-3=2, (m-k)=1, and matchJump[5]=3
Computing matchJump: Example
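P = "wowwow" (m = 6). Direction of computing: from k = m down to k = 1.
k = 4: the matched suffix is 2 chars, "ow".
T:  …  .  .  .  tj o  w  ……
P:     w  o  w  w  o  w
P:                    w  o  w  w  o  w
The "ow" at p2 p3 is not lined up, because the char before it, p1, equals pk. No r is found, but there is a prefix of length 1, so slide[4]=m-1=5, (m-k)=2, and matchJump[4]=7
k = 3: the matched suffix is 3 chars, "wow".
T:  …  .  .  tj w  o  w  ……
P:     w  o  w  w  o  w
P:              w  o  w  w  o  w
slide[3]=3-0=3, (m-k)=3, so matchJump[3]=6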
Computing matchJump: Example
P = "wowwow" (m = 6). Direction of computing: from k = m down to k = 1.
k = 2: the matched suffix is 4 chars, "wwow".
T:  …  .  tj w  w  o  w  ……
P:     w  o  w  w  o  w
P:              w  o  w  w  o  w
No r is found, but there is a prefix of length 3, so slide[2]=m-3=3, (m-k)=4, and matchJump[2]=7
k = 1: the matched suffix is 5 chars, "owwow".
T:  …  tj o  w  w  o  w  ……
P:     w  o  w  w  o  w
P:              w  o  w  w  o  w
No r is found, but there is a prefix of length 3, so slide[1]=m-3=3, (m-k)=5, and matchJump[1]=8
The Boyer-Moore Algorithm
void computeMatchjumps(char[] P, int m, int[] matchjump)
{
    int k, r, s, low, shift;
    int[] sufx = new int[m+1];
    for (k = 1; k <= m; k++)
        matchjump[k] = m + 1;
    sufx[m] = m + 1;
    // Computing slide[k]
    for (k = m-1; k >= 0; k--)
    {
        s = sufx[k+1];
        while (s <= m)
        {
            if (P[k+1] == P[s])
                break;
            matchjump[s] = min(matchjump[s], s - (k+1));
            s = sufx[s];          // try the next shorter candidate suffix
        }
        sufx[k] = s - 1;
    }
    // computing prefix length is necessary;
    // change slide value to matchjump by addition;
}
sufx[k]=x means a substring starting from pk+1 matches the suffix starting from px+1
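A complete Python sketch of this computation follows. The first loop mirrors the slide; the two final phases are only described by comments on the slide, so their code here is an assumption of this example (a standard way to finish: cap each slide by the prefix-based shift, then add m-k), as is the name compute_match_jumps.

```python
def compute_match_jumps(p):
    """matchJump[k] for k = 1..m (index 0 unused)."""
    m = len(p)
    P = " " + p                          # pad so that P[k] is the slides' p_k
    match_jump = [m + 1] * (m + 1)       # impossibly large slide to start with
    sufx = [0] * (m + 1)
    sufx[m] = m + 1
    # Phase 1: slides caused by an earlier occurrence of the matched suffix.
    for k in range(m - 1, -1, -1):
        s = sufx[k + 1]
        while s <= m:
            if P[k + 1] == P[s]:
                break
            match_jump[s] = min(match_jump[s], s - (k + 1))
            s = sufx[s]
        sufx[k] = s - 1
    # Phase 2 (assumed): slides that line up a prefix of P with the end of the suffix.
    low, shift = 1, sufx[0]
    while shift <= m:
        for k in range(low, shift + 1):
            match_jump[k] = min(match_jump[k], shift)
        low, shift = shift + 1, sufx[shift]
    # Phase 3 (assumed): turn slide[k] into matchJump[k] by adding m - k.
    for k in range(1, m + 1):
        match_jump[k] += m - k
    return match_jump


print(compute_match_jumps("wowwow")[1:])
```

For P = "wowwow" the result agrees with the worked example: matchJump[1..6] = 8, 7, 6, 7, 3, 1.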
Home Assignment
pp. 508-: 11.4, 11.8, 11.9, 11.13, 11.18