Edit Distance
Γιώργος Πιερράκος, ADS – NTUA
4 June 2007
Overview
What is edit distance? Why do we care?
A few interesting properties
Some algorithms from the past for finding exact edit distance
More (complicated) algorithms from today for approximate edit distance
Embedding edit distance into l1
What is that?
Suppose we have two strings x,y
e.g. x = kitten
y = sitting
and we want to transform x into y.
We use edit operations: 1. insertions
2. deletions
3. substitutions
What is that?
A closer look:
k i t t e n
s i t t i n g
1st step: kitten → sitten (substitution)
2nd step: sitten → sittin (substitution)
3rd step: sittin → sitting (insertion)
What is that?
Can we do better?
Answer here is no (obviously)
What about:
x = darladidirladada
y = marmelladara
Tough…
Why do we care?
A lot of applications depend on the similarity of two strings
Computational Biology:
…ATGCATACGATCGATT…
…TGCAATGGCTTAGCTA…
Animal species from the same family are likely to have more similar DNA
What about evolutionary biology?
Why do we care?
Searching keywords on the web: usually by “mtallica” we mean “metallica”
Definitions
We care about bit-strings only: strings in {0,1}^n, Σ = {0,1}
for i ≤ j < n we denote the substring of x from position i to j by x[i..j]; by xi we denote the i-th bit of x
Associate operations with string positions:
deleting xi ↔ i
substituting xi ↔ i
inserting y ↔ position of y after the insertion
Alignment τ of x,y is a sequence of edit operations transforming x into y
Definitions
Length of an alignment is the number of its edit operations
Optimal alignment is one that uses a minimum number of edit operations
Edit Distance of two strings x, y is the length of their optimal alignment: ED(x,y)
e.g. ED(kitten, sitting) = 3
Hamming Distance of two equal-length strings x, y is the number of positions at which the corresponding symbols differ (xi ≠ yi)
e.g. HD(kitten, sittin) = 2
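The two definitions above are easy to compute directly; as a quick sanity check, here is a minimal Hamming distance function (a sketch; the function name is ours) reproducing the slide's example:

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length strings differ."""
    assert len(x) == len(y), "Hamming distance is defined for equal-length strings"
    return sum(a != b for a, b in zip(x, y))

print(hamming_distance("kitten", "sittin"))  # k≠s and e≠i, so 2
```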
Some interesting properties
1. Triangle Inequality: for any three strings x,y,z of arbitrary lengths:
ED(x,y) ≤ ED(x,z) + ED(z,y)
2. Splitting Inequality: let x, y be strings of lengths n and m respectively. For all i,j:
ED(x,y) ≤ ED(x[1..i],y[1..j])+ED(x[i+1..n],y[j+1..m])
Some interesting properties
3. Some simple upper and lower bounds: let x, y be strings of lengths n, m (n ≤ m). Then:
ED(x,y) ≤ m
ED(x,y) ≥ m - n
ED(x,y) = 0 iff x = y
if m = n, ED(x,y) ≤ HD(x,y)
ED(x,y) ≥ number of characters (not counting duplicates) found in x but not in y
Some interesting properties
4. Induced alignment property:
insτ(i..j) = number of insertions in interval [i..j]
delτ(i..j) = number of deletions in interval [i..j]
subτ(i..j) = number of substitutions in interval [i..j]
Caution: i, j refer to the initial positions of the string (see the string as an indexed array)
shτ(i..j) = insτ(i..j) - delτ(i..j)
Intuitively shτ(i..j) is the induced shift on x[i..j]
Define shτ(i) = shτ(1..i) and shτ(0) = 0
edτ(i..j) is the induced alignment of τ, i.e. the subsequence of edit operations in τ that belong in [i..j]
Some interesting properties
Consider the strings: x = savvato, y = eviva
edτ(4..5)=1 because we insert “i” in pos. 4
           τ1        τ2         τ3
x          savvato   sav-vato   sav-vato
y          eviva--   -eviva--   e-viva--
edτ(4..5)  1         1          1
Some interesting properties
Induced alignment property: For any alignment τ of x and y, for all i ≤ j,
edτ(i..j) ≥ ED (x[i..j], y[i+shτ(i-1)..j+ shτ(j)])
                                    τ1    τ2    τ3
shτ(3), shτ(5)                      0,0   -1,0  -1,0
x[4..5]                             va    va    va
y[4+shτ(3)..5+shτ(5)]               va    iva   iva
edτ(4..5)                           1     1     1
ED(x[4..5], y[4+shτ(3)..5+shτ(5)])  0     1     1
A Dynamic Programming Algorithm
Levenshtein ~1965 (Levenshtein distance)
induced alignment property → principle of optimality (not exactly)
use dynamic programming to solve the problem (quite similar to the subset sum problem)
Key ideas:
input: strings s, t with lengths n, m respectively
use an (n+1)x(m+1) matrix d
the invariant maintained throughout the algorithm is that we can transform the initial segment s[1..i] into t[1..j] using a minimum of d[i,j] operations
use a traceback algorithm to find the optimal alignment
A Dynamic Programming Algorithm
int LevenshteinDistance(char s[1..n], char t[1..m])
  int d[0..n, 0..m]
  for i from 0 to n: d[i, 0] := i
  for j from 1 to m: d[0, j] := j
  for i from 1 to n
    for j from 1 to m
      if s[i] = t[j] then cost := 0 else cost := 1
      d[i, j] := minimum(
          d[i-1, j] + 1,      // deletion
          d[i, j-1] + 1,      // insertion
          d[i-1, j-1] + cost  // substitution
      )
  return d[n, m]
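The pseudocode above translates directly into runnable Python (a sketch; names follow the slides):

```python
def levenshtein_distance(s, t):
    """Exact edit distance via the classic dynamic program."""
    n, m = len(s), len(t)
    # d[i][j] = minimum number of edits turning s[:i] into t[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # delete all of s[:i]
    for j in range(1, m + 1):
        d[0][j] = j          # insert all of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[n][m]

print(levenshtein_distance("kitten", "sitting"))  # 3
```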
A Dynamic Programming Algorithm
in the traceback, branches correspond to different optimal alignments
(in the slide's example matrix, the length of the optimal alignment is 5)
the algorithm runs in O(nm) time (O(n²) for strings of comparable length) and it can be improved so as to use O(n) space
A Dynamic Programming Algorithm
the above algorithm returns the exact value of the edit distance of two strings
there exists a (minor) improvement of the algorithm, so that it runs in time O(n²/log n), by Masek and Paterson (1980): only of theoretical interest
O(n²) is too much! We want linear time.
Even developing subquadratic-time algorithms for approximating edit distance within a modest factor has proved quite challenging
Approximating edit distance efficiently
Suppose we know that our strings are either very similar or very dissimilar, namely that their edit distance is either at most k, or greater than l(n,k). Then we can develop a gap algorithm that decides which of the two holds.
We are going to present three linear algorithms (Bar-Yossef, Jayram et al., FOCS 2004):
The first one has l = O((kn)^(2/3)) and works for all strings
The second one has l = O(k²) and works only for strings that are non-repetitive
The third one is an n^(3/7)-approximation algorithm, which improves to n^(1/3) if the strings are non-repetitive
Approximating edit distance efficiently
Why do we care about an efficient gap algorithm?
1. Such algorithms yield approximation algorithms that are as efficient, with the approximation ratio directly correlated with the gap
2. They are useful on their own: there exist problems where two strings can be either too similar or too dissimilar
Approximating edit distance efficiently
Our model for the first two algorithms: the sketching model
a two-party public-coin simultaneous-messages communication complexity protocol
persons: Alice, Bob and the Referee
goal: to jointly compute f: A×B → C, when Alice has input a ∈ A and Bob has input b ∈ B
Alice uses her input a and shared random coins to compute a sketch sA(a) and then sends it to the referee. Bob does the same and sends a sketch sB(b)
the referee uses the sketches and the shared coins to compute the value of the function f(a,b), with a small (constant) error probability
main measure of complexity: the sketches' size
usually desirable that Alice and Bob are efficient too
Approximating edit distance efficiently
Why the sketching model? Because there already exist sketching algorithms for the Hamming Distance
Idea 1: map the Edit Distance into some form of Hamming Distance and then use the known results for the Hamming Distance
Idea 2: if two strings share a lot of identical substrings in nearby positions, they cannot differ too much
Gap algorithm for arbitrary strings
Algorithm for arbitrary strings Technique:
1. Map each string to the multiset of all its (overlapping) substrings. Annotate each substring with an “encoding” of its position in the original string.
2. Take the characteristic vectors of the two multisets and run a gap algorithm for their Hamming Distance.
3. Map the results for HD into results for ED
Gap algorithm for arbitrary strings
1st step:
two inputs: the size of the input n and the gap parameter k
define a suitable substring length B(n,k) and a suitable window parameter D(n,k)
map x → Tx and y → Ty. These multisets consist of pairs (γ,i), where γ is a length-B substring and i is the "encoding" of the substring's position
Gap algorithm for arbitrary strings
Encoding method: round down the starting position of the substring to the nearest multiple of D(n,k)
Example: let x = savvato, B = 2 and D = 3. Then
Tx = {(sa,0),(av,0),(vv,1),(va,1),(at,1),(to,2)}
formally:
Tx = {(x[i..i+B-1], i div D) | i = 1,…,n-B+1}
with B = n^(2/3)/(2k^(1/3)) and D = n/B
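The 1st step can be sketched in Python; this reproduces the slide's savvato example (the helper name `substring_multiset` is ours):

```python
from collections import Counter

def substring_multiset(x, B, D):
    """Multiset of all length-B substrings of x, each annotated with its
    start position rounded down to a multiple of D (encoded as i div D)."""
    # positions are 1-based in the slides; i runs over 1..n-B+1
    return Counter((x[i - 1:i - 1 + B], i // D)
                   for i in range(1, len(x) - B + 2))

Tx = substring_multiset("savvato", B=2, D=3)
print(sorted(Tx))  # [('at', 1), ('av', 0), ('sa', 0), ('to', 2), ('va', 1), ('vv', 1)]
```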
Gap algorithm for arbitrary strings
2nd step: take the characteristic vectors u, v of the multisets Tx and Ty respectively. To do that, impose a (lexicographical) order on the elements of Tx ∪ Ty
Example: if Tx = {(γ1, i1), (γ2, i2), (γ1, i3)} and Ty = {(γ2, i1), (γ1, i3), (γ4, i5)} then
Tx ∪ Ty = {(γ1, i1), (γ2, i2), (γ1, i3), (γ2, i1), (γ4, i5)}
order: (γ1, i1), (γ1, i3), (γ2, i1), (γ2, i2), (γ4, i5)
Then u = 11010 and v = 01101
But we do not want to sort things (it costs…). So instead of the union, consider the set of all possible pairs (γj, ij)
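For the set version used in the example, the 2nd step can be sketched as follows (the symbolic γ's and i's are encoded as strings so that Python's lexicographic tuple order matches the slide's ordering):

```python
def characteristic_vectors(Tx, Ty):
    """Characteristic 0/1 vectors of two sets over their sorted union."""
    universe = sorted(Tx | Ty)   # lexicographic order on the union
    u = "".join("1" if e in Tx else "0" for e in universe)
    v = "".join("1" if e in Ty else "0" for e in universe)
    return u, v

# The slide's example:
Tx = {("g1", "i1"), ("g2", "i2"), ("g1", "i3")}
Ty = {("g2", "i1"), ("g1", "i3"), ("g4", "i5")}
u, v = characteristic_vectors(Tx, Ty)
print(u, v)  # 11010 01101
hd = sum(a != b for a, b in zip(u, v))
print(hd)    # Hamming distance between the vectors: 4
```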
Gap algorithm for arbitrary strings
a pair (γ,i) ∈ Tx ∩ Ty is indicative of substrings of x and y that match (i.e. they are identical in terms of contents and appear at nearby positions in x and y) and corresponds to a j such that uj = vj
a pair (γ,i) ∈ (Tx \ Ty) ∪ (Ty \ Tx) is indicative of substrings that do not match and corresponds to a j such that uj ≠ vj
We now have to estimate the Hamming Distance between u and v
Gap algorithm for arbitrary strings
Hamming Distance can be approximated using constant-size sketches as shown by Kushilevitz, Ostrovsky, Rabani (2000)
Problem: the Hamming Distance instance our problem is reduced to is exponentially long (set of all (γj , ij )). As a result the time to produce the sketches is prohibitive (Alice and Bob are very slow…)
Solution: a new, improved sketching method which
1. produces constant-size sketches
2. runs in time proportional to the HD of the instances
3. solves the k vs (1+ε)k gap HD problem for any ε > 0
Notice that HD(u,v) ≤ O(n): no more than 2n distinct substrings
Gap algorithm for arbitrary strings
3rd step: we tune the sketching algorithm for HD to determine whether HD(u,v) ≤ 4kB or HD(u,v) ≥ 8kB, with constant probability of error. The referee, upon receiving the sketches from Alice and Bob, decides that ED(x,y) ≤ k if he finds HD(u,v) ≤ 4kB, and that ED(x,y) ≥ 13(kn)^(2/3) if he finds HD(u,v) ≥ 8kB.
The algorithm's correctness follows from:
Lemma 1: if ED(x,y) ≤ k then HD(u,v) ≤ 4kB
Lemma 2: if ED(x,y) ≥ 13(kn)^(2/3) then HD(u,v) ≥ 8kB
Gap algorithm for arbitrary strings
set map(string x)
  set T := empty
  for i = 0 to n div D
    for j = 1 to B
      T ← (x[i*D+j .. i*D+j+B-1], i)  // annotate each substring with its encoded position
  return T

int algorithm_1(string x, string y, int k)
  int B = n^(2/3)/(2k^(1/3))
  int D = n/B
  set Tx = map(x)
  set Ty = map(y)
  (u,v) = characteristic_vectors(Tx, Ty)
  if HD(u,v) ≤ 4kB then return (ED(x,y) ≤ k)
  if HD(u,v) ≥ 8kB then return (ED(x,y) ≥ 13(kn)^(2/3))
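For intuition, the whole pipeline can be sketched in plain Python, replacing the constant-size HD sketches with an exact Hamming distance computation; this is a sketch of the reduction only, not of the communication protocol, and the rounding of B and D is our assumption (thresholds follow the slides):

```python
from collections import Counter

def gap_algorithm_1(x, y, k):
    """k vs ~(kn)^(2/3) gap test via position-annotated substring multisets."""
    n = len(x)
    B = max(1, round(n ** (2 / 3) / (2 * k ** (1 / 3))))
    D = max(1, n // B)
    def to_multiset(s):
        return Counter((s[i:i + B], i // D) for i in range(len(s) - B + 1))
    Tx, Ty = to_multiset(x), to_multiset(y)
    # HD(u,v) counts the annotated substrings on which the multisets disagree
    hd = sum(((Tx - Ty) + (Ty - Tx)).values())
    if hd <= 4 * k * B:
        return "ED(x,y) is small (<= k)"
    if hd >= 8 * k * B:
        return "ED(x,y) is large"
    return None  # the gap promise did not hold

print(gap_algorithm_1("savvato" * 10, "savvato" * 10, k=2))  # ED(x,y) is small (<= k)
```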
Gap algorithm for arbitrary strings
Remarks:
In the above algorithm Alice and Bob run the procedure map, and the referee runs the procedure HD, which is the sketching algorithm for Hamming Distance
Notice that Alice and Bob do not use their random coins. Only the referee uses randomness and has bounded error probability
The algorithm solves the k vs Ω((kn)^(2/3)) gap problem for all k ≤ n^(1/2), in quasi-linear time, using sketches of size O(1)
The idea of independently encoding the position of each substring, though simple, has problems. Consider: x = paramana, y = mana, with B = 2, D = 3
Gap algorithm for nonrepetitive strings
We saw that “encoding” the substrings’ position independently, by choosing an appropriate integer D, can lead to problems: in fact we fail to identify many matches even in the presence of just one edit operation (consider the case of the strings x and 1x)
We overcome this handicap by resorting to a method where the “encodings” are correlated.
Gap algorithm for nonrepetitive strings
Idea: we scan the input from left to right, trying to find "anchor" substrings, i.e. identical substrings that occur in x and y at very nearby positions. All we need to change is the "encoding" of the position: we now map each substring to the region between successive "anchors"
Example: x = ma … ro ... as, y = *ma****ro***as**
the substrings starting in a region between successive anchors have the same encoding
Gap algorithm for nonrepetitive strings
Why does this idea work? Remember that we are dealing with a gap algorithm. Hence, if the input strings are very similar, a sufficiently short substring chosen from a sufficiently long window is unlikely to contain edit operations, and thus has to match a corresponding substring of y in the same window
And how do Alice and Bob choose identical substrings? Alice and Bob cannot communicate with each other, so they pick the "anchors" randomly. In fact, they use some (shared) random permutations (remember the shared coins)
Gap algorithm for nonrepetitive strings
Isn’t this too good to be true?Yes, it is: in order for the random permutations to ensure that “anchors” are detected with high probability we must demand that the input strings are non-repetitive.
A string is called (t,l)-non-repetitive if, for any window of size l, the l substrings of length t that start inside the window are distinct
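The non-repetitiveness condition is easy to test directly (a sketch; the exact window convention in the paper may differ slightly from this one):

```python
def is_non_repetitive(x, t, l):
    """Check the (t, l)-non-repetitiveness property: in every window of l
    consecutive start positions, the l length-t substrings are distinct."""
    num_starts = len(x) - t + 1
    for w in range(num_starts - l + 1):
        window = [x[i:i + t] for i in range(w, w + l)]
        if len(set(window)) < l:
            return False
    return True

print(is_non_repetitive("abcdefgh", t=2, l=4))  # True: all 2-grams are distinct
print(is_non_repetitive("abababab", t=2, l=4))  # False: "ab" repeats in every window
```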
Gap algorithm for nonrepetitive strings
set map(string x, randomness r, int t)  // Alice & Bob now use their coins r
  set T := empty
  int c := 1
  int i := 1
  table of anchors a
  table of regions reg
  while c ≤ n  // pick a sliding window of length W and place it at c
    for all length-t substrings starting in the interval [c+W..c+2W-1]
      produce a Karp-Rabin fingerprint using r
    pick a random permutation of the fingerprints using r
    a[i] := the substring with the minimal fingerprint under the permutation
    i := i+1
    c := the first character after the anchor
  for j = 1 to i
    reg[j] := the substring starting after the last char of a[j-1], up to the last char of a[j]
    T ← (reg[j], j)
  return T
Gap algorithm for nonrepetitive strings
int algorithm_2(string x, string y, int k, int t)  // t is the non-repetition parameter
  randomness r
  set Tx = map(x,r,t)
  set Ty = map(y,r,t)
  (u,v) = characteristic_vectors(Tx, Ty)
  if HD(u,v) ≤ 3k then return (ED(x,y) ≤ k)
  if HD(u,v) ≥ 6k then return (ED(x,y) ≥ Ω((tk)²))
Gap algorithm for nonrepetitive strings
The algorithm's correctness follows from:
Lemma 1: if ED(x,y) ≤ k then Pr[HD(u,v) ≤ 3k] ≥ 5/6
Lemma 2: if HD(u,v) ≤ 6k then ED(x,y) ≤ O((tk)²)
For any 1 ≤ t < n, the algorithm solves the k vs Ω((tk)²) gap problem for all k ≤ (n/t)^(1/2), for all (t,tk)-non-repetitive strings, in polynomial time, using sketches of size O(1).
Approximation algorithm
Key idea: use of graphs Given two strings x,y the edit graph GE is a
representation of ED(x,y), by means of a directed graph.
The vertices correspond to the edit distances
of x[1..i] and y[1..j] for all i,j ≤n.
An edge between vertices corresponds to a single edit operation transforming one substring to the other.
Approximation algorithm
We define the graph G(B) as a (lossy) compression of GE: each vertex corresponds to a pair (i,s), where i=jB, for j=0..n/B and s=-k..k. The bigger parameter B is, the lossier the compression.
Each vertex is closely related with the edit distance of x[1..i] and y[1..i+s] (s denotes the amount by which we shift y with respect to x)
We have two types of edges:
a. a-type edges from (i,s') to (i,s), where |s'-s| = 1
b. b-type edges from (i-B,s) to (i,s), with weight w(e)
w(a-type edges) = 1; w(b-type edges) depends on the approximation factor c
Approximation algorithm
In GE the weight of the shortest source-sink path corresponds to the optimal alignment of x, y.
Theorem 1: Given two strings x, y and their corresponding graph G(B), let the shortest path P from (0,0) to (n,0) have weight T. Then ED(x,y) ≤ T and T ≤ (2c+2)k, if ED(x,y) ≤k, where c is the approximation factor affiliated with b-type edge weights.
Approximation algorithm
Only problem: finding the shortest path. Dijkstra is too slow (i.e. not linear).
If we could figure out the weights of all b-type edges for a given i simultaneously, perhaps we could solve the problem.
But computing b-type edge weights is the same as finding the approximate edit distances between x[i+1..i+B] and every length-B substring of y[i+1-k..i+B+k]
Approximation algorithm
c(p,t)-edit pattern matching problem: given a pattern string P of length p and a text string T of length t ≥ p, produce numbers d1, d2, …, d(t-p+1) such that
di/c ≤ ED(P, T[i..i+p-1]) ≤ di for all i
Theorem 2: suppose there is an algorithm that can solve the c(p,t)-edit pattern matching problem in time TIME(p,t). Let x, y be two strings and G(B) their corresponding graph. Set p = B, t = B+2k and c = c(p,t). Then the shortest path in G(B) can be used to solve the k vs (2c+2)k edit distance gap problem in time O((k+TIME(p,t))·n/B) (i.e. the shortest path can be efficiently computed)
Approximation algorithm
graph make_graph(string x, string y, int k, int B)  // bigger B → faster algorithm, bigger gap
  vertices V = empty
  edges E = empty
  for j = 0 to n div B  // vertices
    i = j*B
    for s = -k to k
      V ← (i,s)
  for j = 0 to n div B  // a-type edges
    i = j*B
    for s = -k to k
      EA ← ((i,s),(i,s+1),1)  // (source, sink, weight)
      EA ← ((i,s),(i,s-1),1)
  for j = 1 to n div B  // b-type edges
    i = j*B
    for s = -k to k
      EB ← ((i-B,s),(i,s),w)
  return (V, EA ∪ EB)
Approximation algorithm
(d1, d2, …, d(t-p+1)) epm_algorithm(string P, string T)
  // length(T) ≥ length(P)
  // returns approximate ED(P,S) for all S: length(P)-substrings of T

int fix_weights(string x, string y, int B, int k, graph G, int n)  // why n as input…?
  int d[-k..k]  // table of b-type edge weights
  d = epm_algorithm(x[n-B+1..n], y[n-B+1-k..n+k])
  if n-B = 0 return d
  for s = -k to k
    EB ← ((n-B,s),(n,s),d[s])
  G' = update(G, EB)
  return compose(d, fix_weights(x,y,B,k,G',n-B))  // what is compose…?
Let T(i,s) denote the shortest path from (0,0) to (i,s)
Notice that fix_weights is recursive: in each call, T(i,s) is simultaneously (fast!) computed for all s. The idea is to compute the shortest path T(n,0) by recursively computing the shortest paths T(i,s). T(i,s) uses the result T(i-B,s).
No more than n/B recursive calls of fix_weights are needed
Approximation algorithm
int shortest_path(string x, string y, int k, int B)
graph G = make_graph(x,y,k,B)
int T = fix_weights(x,y,B,k,G,n)
return T
Correctness of the gap algorithm (follows from Theorem 1):
1. If ED ≤ k then T ≤ (2c+2)k
2. If ED ≥ (2c+2)k then T ≥ ED ≥ (2c+2)k
Approximation algorithm
Some results which use algorithms for the edit pattern matching problem:
1. quasi-linear time algorithm for k vs O(k²)
2. quasi-linear time algorithm for k vs O(k^(7/4))
3. quasi-linear time algorithm for k vs O(k^(3/2)) for (k, O(k^(1/2)))-non-repetitive strings
4. 2 & 3 imply approximation algorithms with factors n^(3/7) and n^(1/3) respectively
Embedding edit distance into l1
Question: can {0,1}^d endowed with edit distance be embedded into l1 (R^d endowed with the L1 metric) with low distortion?
Known results up to 2005:
1. Edit distance cannot be embedded into l1 with distortion less than 3/2
2. Edit distance can be trivially embedded into l1 with distortion d
3. Edit distance can be embedded into l1 with distortion 2^(O(√(log d · log log d))) (Ostrovsky, Rabani, STOC 2005)
Why do we care?
1. Because embeddings allow us to use efficient algorithms of one metric space on instances of another metric space
2. Because embeddings help us learn more about the structure of a metric space
Embedding edit distance into l1
Let x be a string. Then for any integer s we denote by shifts(x,s) the set
{x[1..|x|-s+1], x[2..|x|-s+2], …, x[s..|x|]}
which consists of the s substrings of x of length |x|-s+1 (the "shifts" of a window over x).
Theorem: there exists a universal constant c > 0 such that for every integer d > 0 there exists an embedding f: ({0,1}^d, ED) → l1 with distortion at most 2^(c·√(log d · log log d))
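The shifts(x,s) set transcribes directly into Python (a sketch following the slide's definition; returned as a list rather than a set to preserve order):

```python
def shifts(x, s):
    """The s substrings x[i .. i + |x| - s] for i = 1..s (1-based),
    i.e. the set shifts(x, s) from the slides."""
    n = len(x)
    return [x[i:i + n - s + 1] for i in range(s)]

print(shifts("abcde", 2))  # ['abcd', 'bcde']
```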
Embedding edit distance into l1
Key idea of the embedding: for sufficiently small d the distortion 2^(c·√(log d · log log d)) is indeed that small. So it suffices to break down the string into substrings of approximately the same length (±1), the blocks, and recursively embed into l1 some metric spaces of lower dimension. The spaces embedded must be chosen in such a way that the concatenation of their scaled embeddings results in the embedding of the original string.
Embedding edit distance into l1
Choosing the metric in the recursive step:
Let x1, x2, …, xn be the blocks. For all i consider the sets shifts(xi, s), where s ranges over the non-negative powers of log d that are below the block length.
Now define the distance between the sets shifts(xi, s) and shifts(xj, s) to be the minimum-cost perfect matching, where the cost of an edge between two elements is their edit distance.
This defines a metric that satisfies the above requirements. Ideally we want to embed it into l1
Embedding edit distance into l1
However a good embedding for the metric shifts(xi ,s) is too strong an inductive hypothesis.
Therefore we inductively embed the strings in shifts(xi ,s) into l1 and redefine the edge costs for the minimum cost perfect matching, to be the l1 distances of the embedded strings.
This “inductive” embedding is not one of low distortion, but it also does not need to be one. It just needs to satisfy a weaker property (omitted here) which guarantees the upper bound on the total distortion.
The End…