11 String Matching

String Matching

String Matching

String Matching Algorithms

Finding Patterns in a given Text

Datastructures: Tries, Suffix-Tries, Suffix Arrays

Algorithms:

Naive ApproachBoyer-MooreRabin-KarpKnuth-Morris-Pratt (KMP)

Literature: Dan Gusfield, Algorithms on strings, trees, and

sequences

CLRS (Cormen,. . .), Introduction to Algorithms

String Matching

Naive Approach

Naive Approach

n = text.size();

m = pattern.size();

for s = 0 to n - m {

if (pattern[1 .. m] = text[s+1 .. s+m]) add_result(s);

}

For T = an, P = am and m = n/2 the worst case occurs, yieldinga running time of Θ(n2).

String Matching

Rabin-Karp

Rabin-Karp

n=text.lenght();

m=pattern.length();

hpattern = hash(pattern)

htext = hash(text[0..m-1])

for s = 0 to n - m {

if (htext == hpattern)

if (pattern[1 .. m] = text[s .. s+m-1])

add_result(s);

htext = hash(s+1,s+m)

}

String Matching

Properties of Rabin-Karp

Properties of Rabin-Karp-Algorithm

Worst case running time (as for the naive approach) isO((n − m + 1)m).

On average good, i.e. O(n + m).

String Matching

Boyer-Moore

Compare right → left.

possible that some text chars are never compared

Good explanation in Dan Gusfield, Algorithms on strings trees

and sequences

Bad char shifts

String Matching

Boyer-Moore, strong good suffix rule

(strong) good suffix rule

T: prstabstubabvqxrst

*

P: qcabdabdab

String Matching




*

P: qcabdabdab

String Matching




*

P: qcabdabdab

String Matching




*

P: qcabdabdab

P: qcabdabdab

String Matching




*

P: qcabdabdab

String Matching




*

P: qcabdabdab

P: qcabdabdab

String Matching

Properties of Boyer-Moore

Properties of Boyer-Moore-Algorithm

Worst case if pattern is not in the text O(n).

Best case O(n/m) running time.

In practice one of the best known algorithms for stringmatching.

details see e.g.http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_s

String Matching

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

Properties of KMP

Properties of KMP-Algorithm

Worst case running time is O(n).

In practice most of the time slower than Boyer Moor buteasier to code.

no details here

extension: Aho-Corasick for matching multiple strings in onepass

String Matching

Tries

Tries

data structure for a set of strings

each node corresponds to a prefix of some string

each edge corresponds to a character

example stolen from wikipedia: to, tea, ten, i, in, and inn

it

eo n

nna

t i

in

inn

te

tea ten

to

3 12 9

7 5

11

String Matching

Suffix-Trees/Tries/Arrays

Suffix-Tries/Trees

preprocessing the text not the pattern

tree containing every suffix of a text (size?)

Fast searching for any substring

trie→tree: one edge for paths without branches

there are linear time algorithm for suffix trees (clearly linearsize)

Suffix Arrays

array of length |S | listing the suffixes of S in ascending order

(simple) search in m log n time

simple implementation in O(n2 log n) and O(n) space oftensufficient

String Matching

Documents

11 String Matching