String Matching Algorithm : Design & Analysis [19]

String Matching

Algorithm : Design & Analysis

[19]

In the last class…

Optimal Binary Search Tree Separating Sequence of Word Dynamic Programming Algorithms

String Matching

Simple String Matching KMP Flowchart Construction Jump at Fail KMP Scan

String Matching: Problem Description

Search the text T, a string of characters of length n

For the pattern P, a string of characters of length m (usually, m<<n)

The result If T contains P as a substring, returning the index

starting the substring in T Otherwise: fail

Straightforward Solution

t1 … ti … ti+k-2 ti+k-1 … ti+m-1 … tn

p1 … pk-1 pk … pm

T :

…

P :

?

First matchedcharacter

Matched windowExpanding to right

Next comparison

Note: If it fails to match pk to ti+k-1, then backtracking occurs, a cycle of new matching of characters starts from ti+1.In the worst case, nearly n backtracking occurs and there are nearly m-1 comparisons in one cycle, so (mn)

Note: If it fails to match pk to ti+k-1, then backtracking occurs, a cycle of new matching of characters starts from ti+1.In the worst case, nearly n backtracking occurs and there are nearly m-1 comparisons in one cycle, so (mn)

Disadvantages of Backtracking

More comparisons are needed Up to m-1 most recently matched characters

have to be readily available for re-examination. (Considering those text which are too long to be loaded in entirety)

An Intuitive Finite Automaton for Matching a Given Pattern

1 2 3 4 *

start node stop nodematched!

B,C

CB,C

A

BA

A A B C

Automaton for pattern “AABC”

Alphabet={A,B,C}

Advantage: each character in the text is checked only once

Difficulty: Construction of the automaton – too many edges(for a large alphabet) to defined and stored

Advantage: each character in the text is checked only once

Difficulty: Construction of the automaton – too many edges(for a large alphabet) to defined and stored

Why no backtracking?

Memorize the prefix.Why no backtracking?

Memorize the prefix.

The Knuth-Morris-Pratt Flowchart

Get next text char. AA BBB C *

1

2

3 4 5 6

An example: T=“ACABAABABA”, P=“ABABCB”

KMP cell number 1 2 1 0 1 2 3 4 2 1 2 3 4 5 3 4

Text being scanned 1 2 2 2 3 4 5 6 6 6 7 8 9 10 10 11

A C C C A B A A A A B A B A A -

Success or Failure s f f C s s s f f s s s s f s F

get next char.

Success

Failure

P: ABABABCB

T: ... ABABAB x …

Matched Frame

matched frame

to be compared next

If x is not C

P: ABAB ABCB

T: ... ABABAB x …

The matched frame move to right for 2 chars, which is equal to moving the pointers backward.

P: ABABABCB

T: ... ABABABABCB …

Moving for 4 chars may result in error.

Matched frame slides, with its breadth changed as well:

p1 …… pr-1 pr ……

p1 …… pk-r+1 …… pk-1

t1 …… ti …… pj-r+1 …… tj-1 tj ……

Sliding the Matched Frame

When dismatching occurs:

p1 …… pk-1 pk …… ……

t1 …… ti …… tj-1 tj ……

Matched frame Dismatching

New matched frame Next comparison

As large as possible.

As large as possible.

Fail Links

Out of each node of KMP flowchart is a fail link, leading to node r, where r is the largest non-negative interger satisfying r<k and p1,…,pr-1 matches pk-r+1,…,pk-1. (stored in fail[k])

Note: r is independent of T.

k

r

k-rP

P

pointer for P backward

pointer for T forward

Which means:

When fail at node k, next comparison is pk vs. pr

Which means:

When fail at node k, next comparison is pk vs. pr

Computing the Fail Links

Thinking recursively, let fail[k-1]=s:

p1 …… ps-1 ps ps+1 …… ……

p1 …… pk-r+1 …… pk-2 pk-1 pk …… pm

To be compared

Matched

Case 1

ps=pk-1

fail[k]=s+1

Case 2: pspk-1

p1… pfail[s]-1 pfail[s]

p1 …… ps-1 ps ps+1 ……

p1 … pk-r+1 …… pk-2 pk-1 pk …… pm

To be compared and thinking recursively

Recursion on Node fail[s]

Thinking recursively, at the beginning, s=fail[k-1]:

Case 2: pspk-1

p1… pfail[s]-1 pfail[s]

p1 …… ps-1 ps ps+1 ……

p1 … pk-r+1 …… pk-2 pk-1 pk …… pm

ps is replaced by pfail[s], that is, new value assumed for s

Then, proceeding on new s, that is:

If case 1 applys (ps=pk-1): fail[k]=s+1, or

If case 2 applys (pspk-1): another new s

Computing Fail Links: an Example

Constructing the KMP flowchart for P = “ABABABCB”

Assuming that fail[1] to fail[6] has been computed

Get next text char. A AB B A B C B *

0 3 4 5 6 7 8 91 2

fail[7]: ∵fail[6]=4, and p6=p4, ∴fail[7]=fail[6]+1=5 (case 1)

fail[8]: fail[7]=5, but p7p5, so, let s=fail[5]=3, but p7p3, keeping back, let s=fail[3]=1. Still p7p1. Further, let s=fail[1]=0, so, fail[8]=0+1=1.(case 2)

Constructing KMP Flowchart

Input: P, a string of characters; m, the length of POutput: fail, the array of failure links, filled

void kmpSetup (char [] P, int m, int [] fail) int k, s;

fail[1]=0; for (k=2; km; k++) s=fail[k-1]; while (s1) if (ps= = pk-1) break; s=fail[s]; fail[k]=s+1;

For loop executes m-1 times, and while loop executes at most m times since fail[s] is always less than s.

So, the complexity is roughly O(m2)

For loop executes m-1 times, and while loop executes at most m times since fail[s] is always less than s.

So, the complexity is roughly O(m2)

Number of Character ComparisonsSuccess comparison: at most once for a specified k, totaling at most m-1

Success comparison: at most once for a specified k, totaling at most m-1

Unsuccess comparison:Always followed by decreasing of s. Since: s is initialed as 0, s increases by one each time s is never negativeSo, the counting of decreasing can not be larger than that of increasing

Unsuccess comparison:Always followed by decreasing of s. Since: s is initialed as 0, s increases by one each time s is never negativeSo, the counting of decreasing can not be larger than that of increasing

fail[1]=0; for (k=2; km; k++) s=fail[k-1]; while (s1) if (ps= = pk-1)

break; s=fail[s]; fail[k]=s+1;

These 2 lines combine to increase s by 1, done m-2 times

2m-3

Input: P and T, the pattern and text; m, the length of P; fail: the array of failure links for P.Output: index in T where a copy of P begins, or -1 if no matchint kmpScan(char[ ] P, char[ ] T, int m, int[ ] fail) int match, j,k; //j indexes T, and k indexes P match=-1; j=1; k=1; while (endText(T,j)=false) if (k>m) match=j-m; break; if (k= =0) j++; k=1;

else if ( tj= =pk) j++; k++; //one character matched else k=fail[k]; //following the failure link

return match

KMP Scan: the Algorithm

Each time a new cycle begins, p1,…pk-1 matched

Each time a new cycle begins, p1,…pk-1 matched

Executed at most 2n times, why?

Skipping Charactersin String Matching

If you wish to understand others you must …

mustmust

mustmust

Checking the characters in P, in reverse order

mustmust

mustmust

mustmust

must

must

The copy of the P begins at t38. Matching is achieved in 18 comparisons

The copy of the P begins at t38. Matching is achieved in 18 comparisons

Distance of Jumping Forward

With the knowledge of P, the distance of jumping forward for the pointer of T is determined by the character itself, independent of the location in T.

p1 … A … A … pm

p1 … A … A … ps … pm

t1 …… tj=A …… …… tn

current j new jRightmost ‘A’

charJump[‘A’] = m-k

=pk

Computing the Jump: AlgorithmInput: Pattern string P; m, the length of P; alphabet size alpha=||

Output: Array charJump, indexed 0,…, alpha-1, storing the jumping offsets for each char in alphabet.

Input: Pattern string P; m, the length of P; alphabet size alpha=||

Output: Array charJump, indexed 0,…, alpha-1, storing the jumping offsets for each char in alphabet.

void computeJumps(char[ ] P, int m, int alpha, int[ ] charJump

char ch; int k;

for (ch=0; ch<alpha; ch++) charJump[ch]=m; //For all char no in P, jump by m

for (k=1; km; k++) charJump[pk]=m-k; The increasing order of k ensure that

for duplicating symbols in P, the jump is computed according to the rightmost

(||+m)

Partially Matched Substring

P: b a t s a n d c a t s

T: …… d a t s ……

matched suffix

Current jcharJump[‘d’]=4

New jMove only 1 charRemember the

matched suffix, we can get a better jump

P: b a t s a n d c a t s

T: …… d a t s ……New jMove 7 chars

Forward to Match the Suffix

p1 …… pk pk+1 …… pm

t1 …… tj tj+1 …… …… tn

……Matched suffix

Dismatch

Substring same as the matched suffix occurs in Pp1 …… pr pr+1 …… pr+m-k …… pm

p1 …… pk pk+1 …… pm

t1 …… tj tj+1 …… …… tn

Old j New j

slide[k]

matchJump[k]

Partial Match for the Suffix

p1 …… pk pk+1 …… pm

t1 …… tj tj+1 …… …… tn

……Matched suffix

Dismatch

No entire substring same as the matched suffix occurs in Pp1 …… pq …… pm

p1 …… pk pk+1 …… pm

t1 …… tj tj+1 …… …… tn

Old j New j

slide[k]

matchJump[k]

May be empty

matchjump and slidep1 …… pr pr+1 …… pr+m-k …… pm

p1 …… pk pk+1 …… pm

t1 …… tj tj+1 …… …… tn

Old j New j

slide[k]

matchJump[k]

• slide[k]: the distance P slides forward after dismatch at pk, with m-k chars matched to the right• matchjump[k]: the distance j, the pointer of P, jumps, that is:

matchjump[k]=slide[k]+m-k• Let r(r<k) be the largest index, such that pr+1 starts a largest substring matching the matched suffix of P, and prpk, then slide[k]=k-r• If the r not found, the longest prefix of P, of length q, matching the matched suffix of P will be lined up. Then slide[k]=m-q.

Computing matchJump: Example

P = “ w o w w o w ”P = “ w o w w o w ”

matchJump[6]=1

Direction of computing

w o w w o w

t1 …… tj ……

Matched is empty

w o w w o w

matchJump[5]=3 w o w w o w

t1 …… tj w …… Matched is 1

w o w w o w

Slide[6]=1(m-k)=0

pk

pkSlide[5]=5-3=2(m-k)=1


P = “ w o w w o w ”P = “ w o w w o w ”

matchJump[4]=7


w o w w o w

t1 …… tj o w ……

Matched is 2

w o w w o w


t1 …… tj w o w …… Matched is 3

w o w w o w

Not lined up

=pkNo found, buta prefix of length 1, so, Slide[4] = m-1=5

pk

Slide[3]=3-0=3(m-k)=3


P = “ w o w w o w ”P = “ w o w w o w ”

matchJump[2]=7


w o w w o w

t1 …… tj w w o w ……

Matched is 4

w o w w o w


t1 …… tj o w w o w …… Matched is 5

w o w w o w

No found, buta prefix of length 3, so, Slide[2] = m-3=3

No found, buta prefix of length 3, so, Slide[1] = m-3=3

The Boyer-Moore Algorithm

Void computeMatchjumps(char[] P, int m, int[] matchjump) int k, r, s, low, shift; int sufx=new int[m+1]

for (k=1; km; k++) matchjump[k]=m+1; sufx[m]=m+1; for (k=m-1; k0; k--) s=sufix[k+1] while (sm) if (pk+1==ps) break; matchjump[s] = min (matchjump[s], s-(k+1)); s = sufx[k]; sufx[k]=s-1;

Sufx[k]=x means a substring starting from pk+1 matches suffix starting from px+1

Computing slide[k]

// computing prefix length is necessary;// change slide value to matchjump by addition;

Home Assignment

pp.508- 11.4 11.8 11.9 11.13 11.18

Documents

String Matching Algorithm : Design & Analysis [19]