String Edit Distance Matching Problem With Moves

String Edit Distance MatchingProblem With Moves

Graham Cormode S. MuthukrishnaNovember 2001

Pattern Matching

Text T of length n Pattern P of length m

Our goal: find “good” matches of P in T asmeasured by some edit distance function d(_,_)

For every i=1,2…n find: D[i] =

)],:[d(Minj

Pattern Matching example

T = “assasin” P = “ssi”

-D[1]=2 assasin Or assasin

_ssi ss i

-D[3]=1 assa sin s_si

-naively it will take O(mn^3)

The main idea

d(X,Y) = smallest number of operations to turn X into Y.

The operations are insertion, deletion, replacement of a character and moves of a substring.

The idea is to approximate D[i] up to a factor of O(LognLog*n)

General algorithm

Embed the string distance into L1 vectordistance, up to a O(LognLog*n) factor :

compute the vector with a single pass over the string.

Find the vector representation of O(n) substrings.

Do all this in O(nLogn) time. (a deterministicalgorithm with an aproximate result)

Edit Sensitive Parsing (ESP)for the embedding

We want to parse the string so that edit operations will have a limit effect on the parsing (an edit operation on char i changes the parsing only of the “neighborhood” of i )

For example:

“abcbbabcbdfgj” “xyzbbabcbdfgj”

“a bc bb abc b dfgj”“xy z bb abc b dfgj”

Edit Sensitive Parsing (ESP)for the embedding

In practice, find landmarks in the strings, based only on their locality and parse by them.

Local maxima are good landmarks, “abcegiklmrtabc” but may be far apart in large alphabets, so

we will reduce the alphabet.

Edit Sensitive Parsing (ESP)choosing the landmarks

We use a technique called Alphabet reduction:

Text: c a b a g e f a c e dBinary:010 000 001 000 110 100 101 000 010 100

Label: 00 2 1 0 3 2 1 0 3 2 3

Label(A[i]) = 2l + bit(l,A[i])l = location of first bit that is different between A[i] and A[i-1]

The value of the l bit in A[i]

Edit Sensitive Parsing (ESP)Alphabet reduction (cont.)

- In each iteration the alphabet is reduced from Σ to 2Log|Σ|.- After Log*|Σ| iterations we get |Σ| < 7.- Then we reduce from 6 to 3, ensuring no adjacent

pairs are identical (start from 6 then 5 then 3):

Final iteration: 2 1 0 3 2 1 0 3 2 3

After reduction: 2 1 0 1 2 1 0 1 2 0

Edit Sensitive Parsing (ESP)Alphabet reduction (cont.)

Properties of final labels:1) Final alphabet is {0,1,2}.2) No adjacent pair is identical.3) Takes Log*|Σ| iterations.4) Each label depends on O(Log*|Σ|) characters

to its left.

Edit Sensitive Parsing (ESP)So how do we choose the landmarks?

For repeats, parse in a regular way: aaaaaaa -> (aaa)(aa)(aa)

For varying substrings, use alphabet reduction then mark landmarks as follows:

Consider the final labels, Mark any character that is a local maxima (greater than left & right):

Text: c a b a g e f a c e dFinal: 010 2 1 0 1 2 1 0 1 2 0

Then mark any local minima if not adjacent to a marked char.

Clearly, distance between marked labels is 2 or 3.

Edit Sensitive Parsing (ESP)What did we achieve so far?

By now the whole string has been arranged in pairs and triples.

The important outcome is that 2 strings with small edit distance will be parsed to a “very close” arrangement. (the parsing of each character depends on a Log*n neighborhood)

Edit Sensitive Parsing (ESP)constructing the ESP tree

We will now re-label each pair or triple – can be done by building a dictionary (Karp-Miller-Rosenberg) or by hashing (Karp-Rabin 1987).

Hash(w[0…m-1]) =(integer value of w[0…m-1]) mod q

For some large q

Edit Sensitive Parsing (ESP)constructing the ESP tree

In O(nLogn) construction time we get a 2-3 tree:

How do we represent a 2-3 tree as a vector?

The vector is the frequency of occurrence of each (level,label) pair:

(0,a) (0,b) (0,c) (0,d) (0,e) (0,f) (0,g) (0,_)

8 7 1 4 6 1 4 5 (1,2) (1,3) (1,6) (1,7) (1,8) (1,10) (1,12) (1,14) (1,16) (1,20) (1,21)

2 1 1 1 1 1 2 1 3 1 2(2,5) (2,7) (2,10) (2,13) (2,17) (2,20) (3,3) (3,15) (3,23) (4,10)

1 1 1 2 1 1 1 1 1 1

Proof of correctness

Theorem:

1/2d(X,Y) V(X) – V(Y) 1

O(lognlog*n)d(X,Y)

Upper bound proof:

V(X) – V(Y) 1 O(lognlog*n)d(X,Y)

Insert/change/delete a character: at most log*n + c nodes change per tree level.

Move a substring: the only changes are at the fringes, that is 4(log*n +c) nodes change per tree level.

since there are logn levels in the tree: Conclusion: each operation changes V by

O(lognlog*n) and there are d(X,Y) changes.

Lower bound proof:

d(X,Y) 2 V(X) – V(Y) 1

The Idea is to transform X into Y using at most

2 V(X) – V(Y) 1 operations:

We want to keep hold of large pieces of the stringthat are common to both X and Y.So we will go through and protect enough pieces

ofX that will be needed in Y.

Lower bound proof: (cont.)

We avoid doing any changes in the protected pieces.

At the first level of the tree we add or remove characters as needed. (if a character appears in Y and not in X, we add it to the end of X).

So we get V0(X)-V0(Y) 1 = 0. We proceed inductively up on the tree, then to

make any node in level i, we need to move at most 2 nodes from level i-1. (we know that on level i-1 we have enough nodes i.e: Vi-1(X)-Vi-

1(Y) 1 = 0)

Lower bound proof: example

A AB CB

C AB BA

LMark protected pieces:

Counter=0

A AB CB

C AB BA

LRemove and add characters as needed:

Counter=0Counter=1

(deletion)

Counter=2

(insertion)

A AB CB

Move to level 2, move nodes in level 1 as needed:

Counter=2

(insertion)

Counter=3

(1 move)

A AB CB

Move to level 3, move nodes in level 2 as needed:

This node will not move

Counter=5

(2 moves)

Counter=3

(1 move)

A AB CB

Counter=5

(2 moves)

That’s it!!

Lower bound proof: (example with d(X,Y)=1)

Did we achieve what we wanted? Y X

2 V(Y)-V(X) 1 = 2(#red nodes + # green nodes) = 18

And we surely created X from Y in less then 18 moves

DA AB CB

JC AB BA

Application to string matching

To find D[i] , we need to compare P against all possible substrings of T. we can reduce this to O(n):

d(T[1,m],P) d(T[1,m], T[1,r]) + d(T[1,r], P) = |r–m| + d(T[1,r], P) 2d(T[1,r], P)

So we only need to consider O(n) substrings of length mAnd we get a 2-approximation of the optimal

matching !!

)( 2nO

Since we need at least |r–m| operations to make T[1,r] the same length as P.

Application to string matching – Final algorithm

Create ESP trees for T and P. Find V(T[1:m])-V(P) 1

Iteratively compute D[i] V(T[i:i+m-1])-V(P) 1

Overall we get O(nLogn) time cost for the wholealgorithm, and compute every D[i] up to a

factorof O(lognlog*n)

Application to string matching – Final algorithm example:

String Edit Distance Matching Problem With Moves

Documents

String Matching

25 String Matching

11 String Matching

Introduction to String matching

Luca Tabarelli - Febbraio 20021 Algoritmi di String Matching String MatchingIl problema dello String Matching Forza BrutaAlgoritmo semplice di Forza Bruta

String Matching dengan Regular Expressioninformatika.stei.itb.ac.id/~rinaldi.munir/Stmik/2017-2018/String-Matching-dengan-Regex...String Matching dengan Regular Expression Masayu Leylia

String matching algorithm

String Matching Algorithms

Optimal Packed String Matching - DTU Research Database · be eciently implemented. The same specialized packed string matching instruction could also be used in other string matching

A parallel ‘String Matching Engine’ for use in high speed network … · and Non-deterministic Finite Automata. 2.3 String matching algorithms There are many string matching algorithms

17458 String Matching

3. Approximate String Matching

Aho-Corasick String Matching An Efficient String Matching

String Matching Problem.ppt

Semi-Numerical String Matching

06. string matching

Aho-Corasick String Matching

String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Approximate String Matching

15 String Matching