String Edit Distance Matching Problem With Moves

Preview:

DESCRIPTION

String Edit Distance Matching Problem With Moves. Graham Cormode S. Muthukrishna November 2001. Pattern Matching. Text T of length n Pattern P of length m Our goal : find “good” matches of P in T as measured by some edit distance function d(_,_) For every i= 1,2… n find: D[i] = back. - PowerPoint PPT Presentation

Citation preview

1

String Edit Distance MatchingProblem With Moves

Graham Cormode S. MuthukrishnaNovember 2001

2

Pattern Matching

Text T of length n Pattern P of length m

Our goal: find “good” matches of P in T asmeasured by some edit distance function d(_,_)

For every i=1,2…n find: D[i] =

back

)],:[d(Minj

PjiT

3

Pattern Matching example

T = “assasin” P = “ssi”

-D[1]=2 assasin Or assasin

_ssi ss i

-D[3]=1 assa sin s_si

-naively it will take O(mn^3)

4

The main idea

d(X,Y) = smallest number of operations to turn X into Y.

The operations are insertion, deletion, replacement of a character and moves of a substring.

The idea is to approximate D[i] up to a factor of O(LognLog*n)

5

General algorithm

Embed the string distance into L1 vectordistance, up to a O(LognLog*n) factor :

compute the vector with a single pass over the string.

Find the vector representation of O(n) substrings.

Do all this in O(nLogn) time. (a deterministicalgorithm with an aproximate result)

6

Edit Sensitive Parsing (ESP)for the embedding

We want to parse the string so that edit operations will have a limit effect on the parsing (an edit operation on char i changes the parsing only of the “neighborhood” of i )

For example:

“abcbbabcbdfgj” “xyzbbabcbdfgj”

“a bc bb abc b dfgj”“xy z bb abc b dfgj”

7

Edit Sensitive Parsing (ESP)for the embedding

In practice, find landmarks in the strings, based only on their locality and parse by them.

Local maxima are good landmarks, “abcegiklmrtabc” but may be far apart in large alphabets, so

we will reduce the alphabet.

8

Edit Sensitive Parsing (ESP)choosing the landmarks

We use a technique called Alphabet reduction:

Text: c a b a g e f a c e dBinary:010 000 001 000 110 100 101 000 010 100

011

Label: 00 2 1 0 3 2 1 0 3 2 3

Label(A[i]) = 2l + bit(l,A[i])l = location of first bit that is different between A[i] and A[i-1]

The value of the l bit in A[i]

9

Edit Sensitive Parsing (ESP)Alphabet reduction (cont.)

- In each iteration the alphabet is reduced from Σ to 2Log|Σ|.- After Log*|Σ| iterations we get |Σ| < 7.- Then we reduce from 6 to 3, ensuring no adjacent

pairs are identical (start from 6 then 5 then 3):

Final iteration: 2 1 0 3 2 1 0 3 2 3

After reduction: 2 1 0 1 2 1 0 1 2 0

10

Edit Sensitive Parsing (ESP)Alphabet reduction (cont.)

Properties of final labels:1) Final alphabet is {0,1,2}.2) No adjacent pair is identical.3) Takes Log*|Σ| iterations.4) Each label depends on O(Log*|Σ|) characters

to its left.

11

Edit Sensitive Parsing (ESP)So how do we choose the landmarks?

For repeats, parse in a regular way: aaaaaaa -> (aaa)(aa)(aa)

For varying substrings, use alphabet reduction then mark landmarks as follows:

12

Edit Sensitive Parsing (ESP)So how do we choose the landmarks?

Consider the final labels, Mark any character that is a local maxima (greater than left & right):

Text: c a b a g e f a c e dFinal: 010 2 1 0 1 2 1 0 1 2 0

13

Edit Sensitive Parsing (ESP)So how do we choose the landmarks?

Consider the final labels, Mark any character that is a local maxima (greater than left & right):

Text: c a b a g e f a c e dFinal: 010 2 1 0 1 2 1 0 1 2 0

Then mark any local minima if not adjacent to a marked char.

14

Edit Sensitive Parsing (ESP)So how do we choose the landmarks?

Consider the final labels, Mark any character that is a local maxima (greater than left & right):

Text: c a b a g e f a c e dFinal: 010 2 1 0 1 2 1 0 1 2 0

Then mark any local minima if not adjacent to a marked char.

Clearly, distance between marked labels is 2 or 3.

15

Edit Sensitive Parsing (ESP)What did we achieve so far?

By now the whole string has been arranged in pairs and triples.

The important outcome is that 2 strings with small edit distance will be parsed to a “very close” arrangement. (the parsing of each character depends on a Log*n neighborhood)

16

Edit Sensitive Parsing (ESP)constructing the ESP tree

We will now re-label each pair or triple – can be done by building a dictionary (Karp-Miller-Rosenberg) or by hashing (Karp-Rabin 1987).

Hash(w[0…m-1]) =(integer value of w[0…m-1]) mod q

For some large q

17

Edit Sensitive Parsing (ESP)constructing the ESP tree

In O(nLogn) construction time we get a 2-3 tree:

18

How do we represent a 2-3 tree as a vector?

The vector is the frequency of occurrence of each (level,label) pair:

(0,a) (0,b) (0,c) (0,d) (0,e) (0,f) (0,g) (0,_)

8 7 1 4 6 1 4 5 (1,2) (1,3) (1,6) (1,7) (1,8) (1,10) (1,12) (1,14) (1,16) (1,20) (1,21)

2 1 1 1 1 1 2 1 3 1 2(2,5) (2,7) (2,10) (2,13) (2,17) (2,20) (3,3) (3,15) (3,23) (4,10)

1 1 1 2 1 1 1 1 1 1

19

Proof of correctness

Theorem:

1/2d(X,Y) V(X) – V(Y) 1

O(lognlog*n)d(X,Y)

20

Upper bound proof:

V(X) – V(Y) 1 O(lognlog*n)d(X,Y)

Insert/change/delete a character: at most log*n + c nodes change per tree level.

Move a substring: the only changes are at the fringes, that is 4(log*n +c) nodes change per tree level.

since there are logn levels in the tree: Conclusion: each operation changes V by

O(lognlog*n) and there are d(X,Y) changes.

21

Lower bound proof:

d(X,Y) 2 V(X) – V(Y) 1

The Idea is to transform X into Y using at most

2 V(X) – V(Y) 1 operations:

We want to keep hold of large pieces of the stringthat are common to both X and Y.So we will go through and protect enough pieces

ofX that will be needed in Y.

22

Lower bound proof: (cont.)

We avoid doing any changes in the protected pieces.

At the first level of the tree we add or remove characters as needed. (if a character appears in Y and not in X, we add it to the end of X).

So we get V0(X)-V0(Y) 1 = 0. We proceed inductively up on the tree, then to

make any node in level i, we need to move at most 2 nodes from level i-1. (we know that on level i-1 we have enough nodes i.e: Vi-1(X)-Vi-

1(Y) 1 = 0)

23

Lower bound proof: example

Y:

X:

D

A AB CB

E

BA

F

CB

E

G H

I

J

C AB BA

F

CB

E

CB

E

K M

LMark protected pieces:

Counter=0

24

Lower bound proof: example

Y:

X:

D

A AB CB

E

BA

F

CB

E

G H

I

J

C AB BA

F

CB

E

CB

E

K M

LRemove and add characters as needed:

Counter=0Counter=1

(deletion)

A

Counter=2

(insertion)

25

Lower bound proof: example

Y:

X:

D

A AB CB

E

BA

F

CB

E

G H

I

J

AB BA

F

CB

E

CB

E

K M

L

A

Move to level 2, move nodes in level 1 as needed:

Counter=2

(insertion)

Counter=3

(1 move)

26

Lower bound proof: example

Y:

X:

D

A AB CB

E

BA

F

CB

E

G H

I

D

AB BA

F

CB

E

CB

E

K M

L

A

Move to level 3, move nodes in level 2 as needed:

This node will not move

Counter=5

(2 moves)

Counter=3

(1 move)

27

Lower bound proof: example

Y:

X:

D

A AB CB

E

BA

F

CB

E

G H

I

D

AB CB

E

BA

F

CB

E

G H

I

A

Counter=5

(2 moves)

That’s it!!

28

Lower bound proof: (example with d(X,Y)=1)

Did we achieve what we wanted? Y X

2 V(Y)-V(X) 1 = 2(#red nodes + # green nodes) = 18

And we surely created X from Y in less then 18 moves

DA AB CB

EBAF

CBE

G H

I

JC AB BA

FCBE

CBE

K ML

29

Application to string matching

To find D[i] , we need to compare P against all possible substrings of T. we can reduce this to O(n):

d(T[1,m],P) d(T[1,m], T[1,r]) + d(T[1,r], P) = |r–m| + d(T[1,r], P) 2d(T[1,r], P)

So we only need to consider O(n) substrings of length mAnd we get a 2-approximation of the optimal

matching !!

)( 2nO

Since we need at least |r–m| operations to make T[1,r] the same length as P.

30

Application to string matching – Final algorithm

Create ESP trees for T and P. Find V(T[1:m])-V(P) 1

Iteratively compute D[i] V(T[i:i+m-1])-V(P) 1

Overall we get O(nLogn) time cost for the wholealgorithm, and compute every D[i] up to a

factorof O(lognlog*n)

31

Application to string matching – Final algorithm example:

32