Upload
tieppv
View
117
Download
1
Embed Size (px)
Citation preview
String Matching
String Matching
String Matching Algorithms
Finding Patterns in a given Text
Datastructures: Tries, Suffix-Tries, Suffix Arrays
Algorithms:
Naive ApproachBoyer-MooreRabin-KarpKnuth-Morris-Pratt (KMP)
Literature: Dan Gusfield, Algorithms on strings, trees, and
sequences
CLRS (Cormen,. . .), Introduction to Algorithms
String Matching
Naive Approach
Naive Approach
n = text.size();
m = pattern.size();
for s = 0 to n - m {
if (pattern[1 .. m] = text[s+1 .. s+m]) add_result(s);
}
For T = an, P = am and m = n/2 the worst case occurs, yieldinga running time of Θ(n2).
String Matching
Rabin-Karp
Rabin-Karp
n=text.lenght();
m=pattern.length();
hpattern = hash(pattern)
htext = hash(text[0..m-1])
for s = 0 to n - m {
if (htext == hpattern)
if (pattern[1 .. m] = text[s .. s+m-1])
add_result(s);
htext = hash(s+1,s+m)
}
String Matching
Properties of Rabin-Karp
Properties of Rabin-Karp-Algorithm
Worst case running time (as for the naive approach) isO((n − m + 1)m).
On average good, i.e. O(n + m).
String Matching
Boyer-Moore
Compare right → left.
possible that some text chars are never compared
Good explanation in Dan Gusfield, Algorithms on strings trees
and sequences
Bad char shifts
String Matching
Boyer-Moore, strong good suffix rule
(strong) good suffix rule
T: prstabstubabvqxrst
*
P: qcabdabdab
String Matching
Boyer-Moore, strong good suffix rule
(strong) good suffix rule
T: prstabstubabvqxrst
*
P: qcabdabdab
String Matching
Boyer-Moore, strong good suffix rule
(strong) good suffix rule
T: prstabstubabvqxrst
*
P: qcabdabdab
String Matching
Boyer-Moore, strong good suffix rule
(strong) good suffix rule
T: prstabstubabvqxrst
*
P: qcabdabdab
P: qcabdabdab
String Matching
Boyer-Moore, strong good suffix rule
(strong) good suffix rule
T: prstabstubabvqxrst
*
P: qcabdabdab
String Matching
Boyer-Moore, strong good suffix rule
(strong) good suffix rule
T: prstabstubabvqxrst
*
P: qcabdabdab
P: qcabdabdab
String Matching
Properties of Boyer-Moore
Properties of Boyer-Moore-Algorithm
Worst case if pattern is not in the text O(n).
Best case O(n/m) running time.
In practice one of the best known algorithms for stringmatching.
details see e.g.http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_s
String Matching
Properties of KMP
Properties of KMP-Algorithm
Worst case running time is O(n).
In practice most of the time slower than Boyer Moor buteasier to code.
no details here
extension: Aho-Corasick for matching multiple strings in onepass
String Matching
Tries
Tries
data structure for a set of strings
each node corresponds to a prefix of some string
each edge corresponds to a character
example stolen from wikipedia: to, tea, ten, i, in, and inn
it
eo n
nna
t i
in
inn
te
tea ten
to
3 12 9
7 5
11
String Matching
Suffix-Trees/Tries/Arrays
Suffix-Tries/Trees
preprocessing the text not the pattern
tree containing every suffix of a text (size?)
Fast searching for any substring
trie→tree: one edge for paths without branches
there are linear time algorithm for suffix trees (clearly linearsize)
Suffix Arrays
array of length |S | listing the suffixes of S in ascending order
(simple) search in m log n time
simple implementation in O(n2 log n) and O(n) space oftensufficient
String Matching