Pattern Matching 1
Pattern Matching
[Figure: brute-force comparison of the pattern "abacab" against the text "abacaab", showing the order in which characters are compared]
Pattern Matching 2
Strings

A string is a sequence of characters.

Examples of strings:
- Java program
- HTML document
- DNA sequence
- Digitized image

An alphabet Σ is the set of possible characters for a family of strings.

Examples of alphabets:
- ASCII
- Unicode
- {0, 1}
- {A, C, G, T}

Let P be a string of size m.
- A substring P[i .. j] of P is the subsequence of P consisting of the characters with ranks between i and j.
- A prefix of P is a substring of the type P[0 .. i].
- A suffix of P is a substring of the type P[i .. m - 1].

Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P.

Applications:
- Text editors
- Search engines
- Biological research
Pattern Matching 3
Brute-Force Algorithm

The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
- a match is found, or
- all placements of the pattern have been tried.

Brute-force pattern matching runs in time O(nm).

Example of worst case:
- T = aaa … ah
- P = aaah
- may occur in images and DNA sequences
- unlikely in English text
Algorithm BruteForceMatch(T, P)
    Input: text T of size n and pattern P of size m
    Output: starting index of a substring of T equal to P, or -1 if no such substring exists
    for i ← 0 to n - m
        { test shift i of the pattern }
        j ← 0
        while j < m ∧ T[i + j] = P[j]
            j ← j + 1
        if j = m
            return i  { match at i }
        else
            break while loop  { mismatch }
    return -1  { no match anywhere }
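The pseudocode above translates almost directly into Python; here is a minimal sketch (the function name is mine, not from the slides):

```python
def brute_force_match(text, pattern):
    """Return the starting index of the first occurrence of
    pattern in text, or -1 if there is none (O(nm) worst case)."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):        # test every shift i
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                    # all m characters matched
            return i
    return -1
```

For example, `brute_force_match("abacaabadcabacabaabb", "abacab")` returns 10, the shift at which the pattern first fits.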
Pattern Matching 4
Boyer-Moore Heuristics

The Boyer-Moore pattern matching algorithm is based on two heuristics.

Looking-glass heuristic: Compare P with a subsequence of T moving backwards.

Character-jump heuristic: When a mismatch occurs at T[i] = c:
- If P contains c, shift P to align the last occurrence of c in P with T[i].
- Else, shift P to align P[0] with T[i + 1].
Example

[Figure: the pattern "rithm" sliding along the text "a pattern matching algorithm"; the backward comparisons and character jumps are numbered 1 through 11, ending at the match inside "algorithm"]
Pattern Matching 5
Last-Occurrence Function

Boyer-Moore's algorithm preprocesses the pattern P and the alphabet Σ to build the last-occurrence function L, mapping Σ to integers, where L(c) is defined as the largest index i such that P[i] = c, or -1 if no such index exists.

Example: Σ = {a, b, c, d}, P = abacab

  c    | a | b | c | d
  L(c) | 4 | 5 | 3 | -1

The last-occurrence function can be represented by an array indexed by the numeric codes of the characters.

The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of Σ.
Pattern Matching 6
[Figure, Case 1 (j < 1 + l): the last occurrence of the mismatched character a in P is at index l at or past j, so aligning it would move P backwards; P simply shifts past the mismatch and i advances by m - j]
The Boyer-Moore Algorithm

Algorithm BoyerMooreMatch(T, P, Σ)
    L ← lastOccurrenceFunction(P, Σ)
    i ← m - 1
    j ← m - 1
    repeat
        if T[i] = P[j]
            if j = 0
                return i  { match at i }
            else
                i ← i - 1
                j ← j - 1
        else
            { character-jump }
            l ← L[T[i]]
            i ← i + m - min(j, 1 + l)
            j ← m - 1
    until i > n - 1
    return -1  { no match }
[Figure, Case 2 (1 + l ≤ j): the last occurrence of a in P is at index l before the mismatch position j; P shifts to align P[l] with T[i], and i advances by m - (1 + l)]
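A Python sketch of the character-jump version, combining the last-occurrence function from the previous slide with the matching loop above (function names are illustrative):

```python
def last_occurrence(pattern, alphabet):
    """Map each character of the alphabet to its last index in
    pattern, or -1 if it does not occur (O(m + s) time)."""
    L = {c: -1 for c in alphabet}
    for i, c in enumerate(pattern):
        L[c] = i
    return L

def boyer_moore_match(text, pattern, alphabet):
    """Boyer-Moore matching with the looking-glass and
    character-jump heuristics; returns the first match index or -1."""
    n, m = len(text), len(pattern)
    L = last_occurrence(pattern, alphabet)
    i = j = m - 1
    while i <= n - 1:
        if text[i] == pattern[j]:
            if j == 0:
                return i              # match at i
            i -= 1                    # looking-glass: move backwards
            j -= 1
        else:                         # character jump
            l = L[text[i]]
            i += m - min(j, 1 + l)
            j = m - 1
    return -1
```

On the example of the next slide, `boyer_moore_match("abacaabadcabacabaabb", "abacab", "abcd")` finds the match at index 10 after 13 character comparisons.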
Pattern Matching 7
Example

[Figure: Boyer-Moore run of P = "abacab" on T = "abacaabadcabacabaabb"; the comparisons are numbered 1 through 13, ending at the match at index 10]
Pattern Matching 8
Analysis

Boyer-Moore's algorithm runs in time O(nm + s).

Example of worst case:
- T = aaa … a
- P = baaa

The worst case may occur in images and DNA sequences but is unlikely in English text.

Boyer-Moore's algorithm is significantly faster than the brute-force algorithm on English text.
[Figure: worst-case run of P = "baaaaa" on T = "aaaaaaaaa"; every shift costs m comparisons, 24 comparisons in total]
Pattern Matching 9
The KMP Algorithm - Motivation

Knuth-Morris-Pratt's algorithm compares the pattern to the text left-to-right, but shifts the pattern more intelligently than the brute-force algorithm. When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?

Answer: the length of the largest prefix of P[0..j] that is a suffix of P[1..j].
[Figure: after a mismatch at text character x against P = "abaaba", the already-matched prefix "abaab" need not be re-compared; the pattern is shifted so that comparison resumes just after the reused prefix "ab"]
Pattern Matching 10
KMP Failure Function

Knuth-Morris-Pratt's algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.

The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j].

Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i] we set j ← F(j - 1).

  j    | 0 | 1 | 2 | 3 | 4 | 5
  P[j] | a | b | a | a | b | a
  F(j) | 0 | 0 | 1 | 1 | 2 | 3
[Figure: after a mismatch at text character x against P = "abaaba" at pattern index j, the pattern shifts so that index F(j - 1) lines up under the mismatched position]
Pattern Matching 11
The KMP Algorithm

The failure function can be represented by an array and can be computed in O(m) time.

At each iteration of the while-loop, either
- i increases by one, or
- the shift amount i - j increases by at least one (observe that F(j - 1) < j).

Hence, there are no more than 2n iterations of the while-loop.

Thus, KMP's algorithm runs in optimal time O(m + n).
Algorithm KMPMatch(T, P)
    F ← failureFunction(P)
    i ← 0
    j ← 0
    while i < n
        if T[i] = P[j]
            if j = m - 1
                return i - j  { match }
            else
                i ← i + 1
                j ← j + 1
        else
            if j > 0
                j ← F[j - 1]
            else
                i ← i + 1
    return -1  { no match }
Pattern Matching 12
Computing the Failure Function

The failure function can be represented by an array and can be computed in O(m) time.

The construction is similar to the KMP algorithm itself.

At each iteration of the while-loop, either
- i increases by one, or
- the shift amount i - j increases by at least one (observe that F(j - 1) < j).

Hence, there are no more than 2m iterations of the while-loop.
Algorithm failureFunction(P)
    F[0] ← 0
    i ← 1
    j ← 0
    while i < m
        if P[i] = P[j]
            { we have matched j + 1 chars }
            F[i] ← j + 1
            i ← i + 1
            j ← j + 1
        else if j > 0 then
            { use failure function to shift P }
            j ← F[j - 1]
        else
            F[i] ← 0  { no match }
            i ← i + 1
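Putting the two pieces of pseudocode together, here is a Python sketch of KMP (names are illustrative):

```python
def failure_function(pattern):
    """F[i] = size of the largest prefix of pattern[0..i] that is
    also a suffix of pattern[1..i]; computed in O(m) time."""
    m = len(pattern)
    F = [0] * m
    i, j = 1, 0
    while i < m:
        if pattern[i] == pattern[j]:
            F[i] = j + 1        # we have matched j + 1 chars
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]        # use failure function to shift P
        else:
            F[i] = 0            # no match
            i += 1
    return F

def kmp_match(text, pattern):
    """Return the index of the first occurrence of pattern in text,
    or -1; runs in O(n + m) time."""
    n, m = len(text), len(pattern)
    F = failure_function(pattern)
    i = j = 0
    while i < n:
        if text[i] == pattern[j]:
            if j == m - 1:
                return i - j    # match starts at i - j
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]
        else:
            i += 1
    return -1
```

Note that i never moves backwards: the failure function does all the shifting.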
Pattern Matching 13
Example

[Figure: KMP run of P = "abacab" on the text "a b a c a a b a c a b a c a b a a b b", with the character comparisons numbered 1 through 19 up to the match]
  j    | 0 | 1 | 2 | 3 | 4 | 5
  P[j] | a | b | a | c | a | b
  F(j) | 0 | 0 | 1 | 0 | 1 | 2
Pattern Matching 14
Binary Failure Function

For your assignment, you are to do the binary failure function. Since there are only two possible characters, when you fail at a character, you know what you were looking at when you failed. Thus, you store the maximum number of characters that match the previous characters of the pattern AND the opposite of the current character.
Binary failure function:
  P    | a a b b b b a a a b a a b a b
  F(j) | 0 0 2 1 1 1 0 0 3 2 4 0 2 4 2

Regular failure function:
  P    | a a b b b b a a a b a a b a b
  F(j) | 0 1 0 0 0 0 1 2 2 3 1 2 3 1 0
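As a quadratic-time sketch of the definition above (direct and unoptimized, intended only to check values against the table; the function name is mine): on a mismatch at index j, the text must have held pattern[0..j-1] followed by the opposite of pattern[j], and we want the longest pattern prefix matching a suffix of that known text.

```python
def binary_failure(pattern):
    """Binary failure function for a pattern over {a, b}: BF[j] is
    the length of the longest prefix of pattern that is a suffix of
    pattern[0..j-1] + opposite(pattern[j]).  O(m^2) sketch."""
    opposite = {'a': 'b', 'b': 'a'}
    m = len(pattern)
    BF = [0] * m
    for j in range(m):
        seen = pattern[:j] + opposite[pattern[j]]  # what the text held
        for k in range(min(j + 1, m), 0, -1):      # try longest prefix first
            if seen.endswith(pattern[:k]):
                BF[j] = k
                break
    return BF
```

Running this on the pattern above reproduces the binary-failure row of the table.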
Pattern Matching 15
Tries: Basic Ideas
- Preprocess a fixed text rather than the pattern
- Store strings in trees, one character per node
- Used in search engines, dictionaries, prefix matching
- Fixed alphabet with a canonical ordering
- Use a special character as a word terminator
Pattern Matching 16
Tries are great if
- You are word matching (you know where each word begins)
- The text is large, immutable, and searched often
- Web crawlers (for example) can afford to preprocess text ahead of time, knowing that MANY people will want to search the contents of all web pages
Pattern Matching 17
Facts
- Prefixes of length i stop at level i
- # leaves = # strings (words in the text)
- It is a multi-way tree, used similarly to the way we use a binary search tree
- Tree height = length of the longest word
- Tree size is O(combined length of all words)
- Insertion and search work as in multi-way ordered trees, O(word length)
- It does word matching, not substring matching
- Could use 27-ary trees instead
- Exclude stop words from the trie, as we won't search for them
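The insertion and search operations above can be sketched in Python using a dict per node, with '$' as an assumed word-terminator character (class and method names are mine):

```python
class Trie:
    """Word trie: one character per edge; '$' marks end of word."""
    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})  # descend, creating nodes
        node['$'] = True                    # word terminator

    def search(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return '$' in node                  # whole word, not just a prefix

t = Trie()
for w in ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]:
    t.insert(w)
```

Both operations touch one node per character, so each costs O(word length), as claimed above.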
Pattern Matching 18
Trie Example
Pattern Matching 19
Compressed Tries

When a node has only one child, it is a waste of space, so store substrings at the nodes. Then the tree size is O(s), where s is the number of words.
Pattern Matching 20
Compressed Tries with Indexes

Storing index ranges into the text, instead of the substrings themselves, avoids variable-length strings at the nodes.
Pattern Matching 21
Suffix Tries
- A tree of all suffixes of a string
- Used for substrings, not just full words
- Used in pattern matching: a substring is the prefix of a suffix (all words come from the same string)
- Changes a linear search for the beginning of the pattern into a tree search
Pattern Matching 22
Suffix Tries are efficient
- In space: O(n) rather than O(n^2), because each character only needs to appear once
- In time: O(dn) to construct and O(dm) to use, where d is the size of the alphabet
Pattern Matching 23
Search Engines
- An inverted index (file) has words as keys and occurrence lists (web pages) as values (access by content)
- Also called a concordance; omit stop words
- Can use a trie effectively
- Multiple keywords return the intersection of the occurrence lists
- Can store occurrences as sequences in a fixed order and merge them to compute intersections
- Ranking the results is the major challenge
Pattern Matching 24
Text Compression and Similarity

A. Text Compression
1. Text characters are encoded as binary integers; different encodings may result in more or fewer bits to represent the original text.
   a. Compression is achieved by using variable-size, rather than fixed-size, encoding (e.g. ASCII or Unicode).
   b. Compression is valuable for reduced-bandwidth communication and storage space minimization.
Pattern Matching 25
2. Huffman encoding
   a. Shorter encodings for more frequently occurring characters.
   b. Prefix code: one code can't be a prefix of another.
   c. Most useful when character frequencies differ widely.

The encoding may change from text to text, or may be defined for a class of texts, like Morse code.
Pattern Matching 26
The Huffman algorithm uses binary trees:
- Start with an individual tree for each character, storing the character and its frequency at the root.
- Iteratively merge the two trees with the smallest frequencies at the root, writing the sum of the children's frequencies at each new internal node.

This is a greedy algorithm.
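The merging step can be sketched with a min-heap of trees; here leaves are characters and internal nodes are (left, right) pairs (a minimal sketch, not a full encoder):

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Build Huffman codes from a {char: frequency} map by repeatedly
    merging the two lowest-frequency trees (greedy, via a min-heap)."""
    tiebreak = count()   # avoids comparing tree nodes of equal frequency
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two smallest frequencies
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a character
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes
```

Because every character sits at a leaf, no code is a prefix of another, and the most frequent character gets the shortest code.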
Pattern Matching 27
The complexity is O(n + d log d), where the text of n characters has d distinct characters:
- n is to process the text, calculating frequencies
- d log d is the cost of heaping the frequency trees, then iteratively removing two, merging, and inserting one
Pattern Matching 28
Text Similarity

Detect similarity to focus on, or ignore, slight differences:
a. DNA analysis
b. Web crawlers omit duplicate pages, distinguish between similar ones
c. Updated files, archiving, delta files, and editing distance
Pattern Matching 29
Longest Common Subsequence

One measure of similarity is the length of the longest common subsequence between two texts. This is NOT a contiguous substring, so it loses a great deal of structure. I doubt that it is an effective metric for similarity, unless the subsequence is a substantial part of the whole text.
Pattern Matching 30
LCS algorithm uses the dynamic programming approach
How do we write LCS in terms of other LCS problems? The parameters for the smaller problems being composed to solve a larger problem are the lengths of a prefix of X and a prefix of Y.
Pattern Matching 31
Find the recursion. Let L(i,j) be the length of the LCS between the two prefixes X(0..i) and Y(0..j).

Suppose we know L(i, j), L(i+1, j) and L(i, j+1) and want to know L(i+1, j+1):
a. If X[i+1] = Y[j+1], then it is L(i, j) + 1.
b. If X[i+1] != Y[j+1], then it is max(L(i, j+1), L(i+1, j)).
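The recursion fills a table row by row, and the actual subsequence can be recovered by walking backward through it; a Python sketch (names are mine):

```python
def lcs_table(X, Y):
    """L[i][j] = length of the LCS of the first i characters of X
    and the first j characters of Y; O(nm) time and space."""
    n, m = len(X), len(Y)
    L = [[0] * (m + 1) for _ in range(n + 1)]  # zeros along the borders
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if X[i - 1] == Y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i][j - 1], L[i - 1][j])
    return L

def lcs(X, Y):
    """Recover one LCS by working backward through the table,
    noting points at which the two characters are equal."""
    L = lcs_table(X, Y)
    i, j, out = len(X), len(Y), []
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            out.append(X[i - 1])
            i -= 1
            j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

On the table example of the next slide, X = "aedfhh" and Y = "abcdghthms", the LCS is "adhh", of length 4.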
Pattern Matching 32
* a b c d g h t h m s
* 0 0 0 0 0 0 0 0 0 0 0
a 0 1 1 1 1 1 1 1 1 1 1
e 0 1 1 1 1 1 1 1 1 1 1
d 0 1 1 1 2 2 2 2 2 2 2
f 0 1 1 1 2 2 2 2 2 2 2
h 0 1 1 1 2 2 3 3 3 3 3
h 0 1 1 1 2 2 3 3 4 4 4
Pattern Matching 33
* i d o n o t l i k e
* 0 0 0 0 0 0 0 0 0 0 0
n 0
o 0
t 0
i 0
c 0
e 0
Pattern Matching 34
This algorithm initializes the array or table for L by putting 0's along the borders, then is a simple nested loop filling in values row by row. Thus it runs in O(nm).

While the algorithm only tells the length of the LCS, the actual string can easily be found by working backward through the table (and strings), noting points at which the two characters are equal.
Pattern Matching 35
This material is not in your text (except as exercises).
Sequence Comparisons

Problems in molecular biology involve finding the minimum number of edit steps required to change one string into another.

Three types of edit steps: insert, delete, replace.

Example: abbc → babb
- abbc → bbc → bbb → babb (3 steps)
- abbc → babbc → babb (2 steps)

We are trying to minimize the number of steps.
Pattern Matching 36
Idea: look at making just one position right. Find all the ways you could use, count how long each would take, and recursively figure the total cost.

This is an orderly way of limiting the exponential number of combinations to think about.

For ease in coding, we make the last character right (rather than any other).
Pattern Matching 37
There are four possibilities (pick the cheapest). Here C(n,m) is the cost of changing the first n characters of str1 to the first m characters of str2.

1. If we delete a_n, we still need to change A(0..n-1) to B(0..m). The cost is C(n,m) = C(n-1,m) + 1.
2. If we insert a new value at the end of A(n) to match b_m, we would still have to change A(n) to B(m-1). The cost is C(n,m) = C(n,m-1) + 1.
3. If we replace a_n with b_m, we still have to change A(n-1) to B(m-1). The cost is C(n,m) = C(n-1,m-1) + 1.
4. If we match a_n with b_m, we still have to change A(n-1) to B(m-1). The cost is C(n,m) = C(n-1,m-1).
Pattern Matching 38
We have turned one problem into three problems, each just slightly smaller.

That is a bad situation, unless we can reuse results: Dynamic Programming.

We store the results of C(i,j) for i = 1..n and j = 1..m.

If we need to reconstruct how we would achieve the change, we store both the cost and an indication of which set of subproblems was used.
Pattern Matching 39
We keep M(i,j), which indicates which of the four decisions leads to the best result.

Complexity: O(mn), but it needs O(mn) space as well.
Consider changing "do" to "redo", and changing "mane" to "mean":
Pattern Matching 40
Changing "do" to "redo"

Assume: a match is free; the other operations each cost 1.
* r e d o
* I-0 I-1 I-2 I-3 I-4
d D-1 R-1 R-2 M-2 I-3
o D-2 R-2 R-2 R-3 M-2
Pattern Matching 41
Changing “mane” to “mean”
* m e a n
* I-0 I-1 I-2 I-3 I-4
m D-1 M-0 I-1 I-2 I-3
a D-2 D-1 R-1 M-1 I-2
n D-3 D-2 R-2 D-2 M-1
e D-4 D-3 M-2 D-3 D-2
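The two tables above can be reproduced by a DP sketch that records both the cost C(i,j) and the decision M(i,j), with I/D/R/M for insert, delete, replace, match (a minimal sketch; the border cells are pure inserts or deletes):

```python
def edit_distance(a, b):
    """C[i][j] = cost of changing the first i chars of a into the
    first j chars of b (match free, insert/delete/replace cost 1);
    M[i][j] records which decision was taken ('I', 'D', 'R', 'M')."""
    n, m = len(a), len(b)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    M = [[''] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        C[i][0], M[i][0] = i, 'D'            # delete everything
    for j in range(1, m + 1):
        C[0][j], M[0][j] = j, 'I'            # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:         # match is free
                C[i][j], M[i][j] = C[i - 1][j - 1], 'M'
            else:                            # cheapest of the three edits
                C[i][j], M[i][j] = min(
                    (C[i - 1][j - 1] + 1, 'R'),
                    (C[i - 1][j] + 1, 'D'),
                    (C[i][j - 1] + 1, 'I'))
    return C[n][m], M
```

Ties between equal-cost decisions are broken arbitrarily here (by the letter), so the M entries may differ from the tables where several choices cost the same.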
Pattern Matching 42
Longest Increasing Subsequence of a single list

Find the longest increasing subsequence in a sequence of distinct integers.

Idea 1. Given a sequence of size less than m, we can find its longest increasing subsequence. (Recursion)

The problem is that we don't know how to increase the length when the next item arrives:
- Case 1: It either can be added to the longest subsequence or not.
- Case 2: It is possible that it can be added to a non-selected subsequence (creating a sequence of equal length, but having a smaller ending point).
- Case 3: It can be added to a non-selected subsequence creating a sequence of smaller length, but successors make it a good choice.

Example: 5 1 10 2 20 30 40 4 5 6 7 8 9 10 11
Pattern Matching 43
Idea 2. Given a sequence of size < m, we know how to find all the longest increasing subsequences.

Hard. There are many, and we would need them for all lengths.
Pattern Matching 44
Idea 3. Given a sequence of size < m, we can find the longest subsequence with the smallest ending point.

We might have to create a smaller subsequence before we create a longer one.
Pattern Matching 45
Idea 4. Given a sequence of size < m, we can find the best increasing sequence (BIS) for every length k < m - 1.

For each new item in the sequence: when we add it to the sequence of length 3, will it be better than the current sequence of length 4?
Pattern Matching 46
For s = 1 to n (or recursively the other way)
    For k = s downto 1, until the correct spot is found
        If BIS(k) > A_s and BIS(k-1) < A_s
            BIS(k) ← A_s
Pattern Matching 47
Actually, we don't need the sequential search, as we can do a binary search.

Sequence: 5 1 10 2 12 8 15 18 45 6 7 3 8 9

  Length | BIS
    1    |  1
    2    |  2
    3    |  3
    4    |  7
    5    |  8
    6    |  9

To output the sequence would be difficult, as you don't know where the sequence is. You would have to reconstruct it.
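With the binary search, the whole algorithm is a short O(n log n) sketch in Python (the function name is mine):

```python
from bisect import bisect_left

def lis_length(seq):
    """Best-increasing-sequence table: bis[k-1] holds the smallest
    possible ending value of an increasing subsequence of length k.
    Binary search replaces the sequential scan, giving O(n log n)."""
    bis = []
    for x in seq:
        k = bisect_left(bis, x)     # first entry >= x
        if k == len(bis):
            bis.append(x)           # x extends the longest sequence
        else:
            bis[k] = x              # better (smaller) ending point
    return len(bis), bis
```

On the sequence above it ends with bis = [1, 2, 3, 7, 8, 9], so the longest increasing subsequence has length 6; as noted, recovering the subsequence itself would require extra bookkeeping.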
Pattern Matching 48
Try: 8 1 4 2 9 10 3 5 14 11 12 7

  Length | End Pos | 1st Replacement | 2nd Replacement
    1    |    8    |        1        |
    2    |    4    |        2        |
    3    |    9    |        3        |
    4    |   10    |        5        |
    5    |   14    |       11        |        7
    6    |   12    |                 |
Pattern Matching 49
Probabilistic Algorithms
Suppose we wanted to find a number that is greater than the median (the number for which half are bigger).

- We could sort them, O(n log n), and then select one.
- We could find the biggest, but stop looking half way through: O(n/2).

We cannot guarantee finding one in the upper half in fewer than n/2 comparisons.

What if you just wanted good odds? Pick two numbers and keep the larger one. What is the probability it is in the lower half?
Pattern Matching 50
There are four possibilities:
- both are lower
- the first is lower, the other higher
- the first is higher, the other lower
- both are higher

We will be right 75% of the time! We only lose if both are in the lower half.
Pattern Matching 51
If we select k elements and pick the biggest, the probability of being correct is 1 - 1/2^k. Good odds, and controlled odds.

This is termed a Monte Carlo algorithm. It may give the wrong result, with very small probability.

Another type of probabilistic algorithm is one that never gives a wrong result, but its running time is not guaranteed.

That type is termed a Las Vegas algorithm, as you are guaranteed success if you try long enough.
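The Monte Carlo idea above fits in a few lines of Python (a sketch; the function name is mine):

```python
import random

def probably_above_median(values, k=10):
    """Monte Carlo: return the max of k random picks. This exceeds
    the median unless all k picks land in the lower half, which
    happens with probability at most 1/2^k."""
    return max(random.choice(values) for _ in range(k))
```

With k = 10 the answer is wrong less than one time in a thousand, and the work done is O(k) regardless of n.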
Pattern Matching 52
A coloring Problem: Las Vegas Style
Let S be a set with n elements. (n only affects the complexity, not the algorithm.)

Let S1, S2, ..., Sk be a collection of distinct (in some way different) subsets of S, each containing exactly r elements, such that k ≤ 2^(r-2). (We use this fact to bound the time.)

GOAL: Color each element of S with one of two colors (red or blue) such that each subset Si contains at least one red and one blue element.
Pattern Matching 53
Idea

Try coloring them randomly and then just check to see if you happen to win. Checking is fast: you can quit checking each subset when you see one of each color, and you can quit checking the collection when any single-color subset is found.

What is the probability that all items in a set are red? 1/2^r, as each color is assigned with equal probability and there are r items in the set.
Pattern Matching 54
What is the probability that any one of the collection is all red?

At most k/2^r. Since we are looking for the OR of a set of events, we add the probabilities. k is bounded by 2^(r-2), so k · 1/2^r ≤ 1/4.

The probability of all blue or all red in a single set is at most one half (double the probability of all red).

If our random coloring fails, we simply try again until success.

Our expected number of attempts is at most 2.
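A Las Vegas sketch of the scheme in Python (names are mine; the early-exit checks described above are folded into `all`, which stops at the first single-color subset):

```python
import random

def two_color(S, subsets):
    """Las Vegas: randomly 2-color S until every subset contains
    both colors. With r-element subsets and k <= 2^(r-2) of them,
    each attempt succeeds with probability >= 1/2, so the expected
    number of attempts is at most 2."""
    while True:
        color = {x: random.choice(("red", "blue")) for x in S}
        if all(len({color[x] for x in sub}) == 2 for sub in subsets):
            return color
```

The answer is always correct; only the number of attempts is random.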
Pattern Matching 55
Finding a Majority

Let E be a sequence of integers x1, x2, x3, ..., xn. The multiplicity of x in E is the number of times x appears in E. A number z is a majority in E if its multiplicity is greater than n/2.

Problem: given a sequence of numbers, find the majority in the sequence or determine that none exists.

NOTE: we don't want to merely find who has the most votes, but determine who has more than half of the votes.
Pattern Matching 56
For example, suppose there is an election. Candidates are represented as integers. Votes are represented as a list of candidate numbers.
We are assuming no limit on the number of possible candidates.
Pattern Matching 57
Ideas
1. Sort the list: O(n log n).
2. If we had a balanced tree of candidate names, complexity would be n log c (where c is the number of candidates). Note: if we don't know how many candidates there are, we can't give them indices.
3. See if the median (via the kth-largest algorithm) occurs more than n/2 times: O(n).
4. Take a small sample. Find its majority, then count how many times it occurs in the whole list. This is a probabilistic approach (right?).
5. Make one pass, discarding elements that won't affect the majority.
Pattern Matching 58
Note: if x_i ≠ x_j and we remove both of them, then the majority in the old list is still a majority in the new list.

If x_i is a majority, there are m x_i's out of n, where m > n/2. If we remove two unequal elements, m - 1 > (n - 2)/2.

The converse is not true. If there is no majority, removing two may make something a majority in the smaller list: 1, 2, 4, 5, 5.
Pattern Matching 59
Thus, our algorithm will find a possible majority.

Algorithm: find two unequal elements. Delete them. Find the majority in the smaller list. Then see if it is a majority in the original list.

How do we remove elements? It is easy: we scan the list in order, looking for a pair to eliminate. Let i be the current position. All the items before x_i which have not been eliminated have the same value. So all you really need to keep is the current candidate value C and the number of times, Occurs, it occurs (and has not been deleted).
Pattern Matching 60
For example:

  List:      1 4 6 3 4 4 4 2 9 0 2 4 1 4 2 2 3 2 4 2
  Occurs:    X X 1 X 1 2 3 2 1 X 1 X 1 X 1 2 1 2 1 2
  Candidate: 1 6 4 4 4 4 4 ? 2 ? 1 ? 2 2 2 2 2 2
2 is a candidate, but is not a majority in the whole list.
Complexity: n - 1 compares to find a candidate, and n - 1 compares to test whether it is a majority.

So why do this over the other ways? It is simple to code. It is no different in terms of complexity, but it is interesting to think about.
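The one-pass pair-elimination scan and the verification pass can be sketched together in Python (the function name is mine):

```python
def find_majority(seq):
    """Pair-elimination majority finding: keep a candidate and how
    often it occurs un-eliminated; a differing element cancels one
    occurrence. The surviving candidate is then verified against
    the whole list, for about 2(n - 1) comparisons total."""
    candidate, occurs = None, 0
    for x in seq:
        if occurs == 0:
            candidate, occurs = x, 1   # start a new candidate
        elif x == candidate:
            occurs += 1
        else:
            occurs -= 1                # eliminate the pair (x, candidate)
    if candidate is not None and seq.count(candidate) > len(seq) // 2:
        return candidate
    return None                        # no majority exists
```

On the example list above, the scan ends with candidate 2, but the verification pass finds only 6 occurrences out of 20, so there is no majority.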