UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12.2.4: k-difference


Chapter 12.2.4: k-difference Inexact Matching

Lecturer: Dr. Rose

Slides by: Dr. Rose

February 21, 2002, year of the palindrome

Last night at 2 minutes past 8pm it was: 20:02, 20/02/2002




Overview

• k-difference inexact matching

– Concepts:

• d-path

• Farthest-reaching d-path in a diagonal

– O(km) time and space solution

• Primer selection problem

– Formulations:

• Exact matching primer

• Inexact matching primer

• k-difference primer

– O(km) time solution to k-difference primer problem




Overview

• Exclusion methods: fast expected time O(m)

– Partition approaches:

• BYP algorithm

– Aho-Corasick exact matching algorithm

» Keyword trees

– Back to Aho-Corasick exact matching algorithm

» Algorithm for computing failure links

• Back to BYP algorithm




K-difference Inexact Matching

• Like k-mismatch problem: allows mismatches

• Harder than k-mismatch:

– allows spaces

– End spaces in T are not counted

– |P| and |T| can be vastly different, so we can't focus on a 2k+1 band centered around the main diagonal.





Defn:

– Diagonals above the main diagonal (diagonal 0) are numbered 1 through m. Diagonal i starts in cell (0,i).

– Diagonals below the main diagonal are numbered -1 through -n. Diagonal -i starts in cell (i,0).

– Row 0 is initialized to be all zeros.

• Recall T can have free end spaces

• Setting row 0 to be zeros allows the left end of T to start after a gap without any cost.
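The effect of the all-zero row 0 can be seen in the plain O(nm) dynamic programming table that the diagonal method later speeds up. The following is only a baseline sketch, not the O(km) method of this chapter; the function name `k_difference_ends` and the 1-based end positions are assumptions made for illustration.

```python
def k_difference_ends(pattern, text, k):
    """Return the 1-based end positions j in text such that some substring
    of text ending at j matches pattern with at most k differences."""
    n, m = len(pattern), len(text)
    prev = [0] * (m + 1)  # row 0 is all zeros: a match may start anywhere in T for free
    for i in range(1, n + 1):
        cur = [i] + [0] * m  # column 0: i characters of P aligned against nothing
        for j in range(1, m + 1):
            mismatch = 0 if pattern[i - 1] == text[j - 1] else 1
            cur[j] = min(prev[j - 1] + mismatch,  # diagonal edge (match or mismatch)
                         prev[j] + 1,             # vertical edge (space in T)
                         cur[j - 1] + 1)          # horizontal edge (space in P)
        prev = cur
    # Row n holds, for each column, the best number of differences ending there.
    return [j for j in range(1, m + 1) if prev[j] <= k]


print(k_difference_ends("abc", "xxabcxx", 1))
```

Every column of row n whose value is at most k marks the end of a k-difference occurrence; nothing is charged for the part of T before the occurrence begins.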





Defn: a d-path is a path that starts in row 0 and specifies exactly d mismatches & spaces.

Defn: a d-path is farthest-reaching in diagonal i if it ends in diagonal i and the index of its ending column c is ≥ the ending column of any other d-path ending in diagonal i.

You can visualize this as a d-path that ends farthest in diagonal i.





Approach:

• Iterate: (1 ≤ d ≤ k)

– find the farthest-reaching d-path for each diagonal i, (-n ≤ i ≤ m)

• The farthest-reaching d-path for diagonal i is found from the farthest-reaching (d-1)-paths on diagonals i-1, i and i+1.

• Observation: any d-path reaching row n corresponds to a d-difference occurrence of P in T.





Observation: a farthest reaching 0-path in diagonal i is the longest match of T[i..m] and P[1..n].

Q: Why is this true?

A: A 0-path is an exact match, so it never deviates from the diagonal that it starts on.

Using suffix trees:

Build the suffix tree in linear time (linear in m).

Retrieve farthest-reaching 0-paths in constant time per path.





Q: How do we find the farthest-reaching d-path on diagonal i for d > 0?

A: The d-path for diagonal i depends on the previously found (d-1)-paths on diagonals i-1, i and i+1.

The 3 cases are:

1. Path R1, the farthest-reaching (d-1)-path on diagonal i+1, followed by a vertical edge to diagonal i.





Since R1 is a (d-1)-path on diagonal i+1, extending it by a vertical edge (adding a space in T) to diagonal i makes it a d-path on diagonal i.

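[Figure: diagonals i+1, i, and i-1, with path R1 on diagonal i+1 stepping down a vertical edge onto diagonal i.]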





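The 2nd case is:

2. Path R2, the farthest-reaching (d-1)-path on diagonal i-1, followed by a horizontal edge to diagonal i.

Again extending a (d-1)-path into a d-path on diagonal i.

[Figure: diagonals i+1, i, and i-1, with path R2 on diagonal i-1 stepping along a horizontal edge onto diagonal i.]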





3. Path R3, the farthest-reaching (d-1)-path on diagonal i, followed by a diagonal edge corresponding to a mismatch.

Again extending a (d-1)-path into a d-path on diagonal i.

[Figure: diagonals i+1, i, and i-1, with path R3 on diagonal i continuing along a diagonal (mismatch) edge.]





• Each of R1, R2, and R3 is initially a farthest-reaching (d-1)-path on diagonal i+1, i-1, and i, respectively.

• Each is extended by a space or a mismatch resulting in a d-path on diagonal i.

• Each is subsequently extended along diagonal i.

• The farthest-reaching d-path on diagonal i must be one of these.




k-differences Algorithm

d = 0

/* Calculate farthest-reaching 0-paths on diagonals 0 through m */
For i = 0 to m {
    Find the longest common extension between P[1..n] and T[i..m]
}

/* Calculate d-paths by extending (d-1)-paths R1, R2, and R3 */
For d = 1 to k {
    For i = -n to m {
        Extend the (d-1)-paths R1, R2, R3 on diagonals i+1, i-1, i to diagonal i.
        One of these is the farthest-reaching d-path on diagonal i.
    }
    A path reaching row n defines an inexact match of P in T containing at most k differences.
    The column in row n indicates the end character in T.
}
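The loop structure above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the lecture's implementation: the longest common extension is computed by a naive character scan instead of a suffix tree, so the running time is not O(km). The names `lce` and `k_difference_ends_diagonal`, the 0-based string indexing, and the clamping at the table borders are all choices made for this example.

```python
def lce(p, i, t, j):
    """Length of the longest common extension of p[i:] and t[j:] (naive scan)."""
    length = 0
    while (i + length < len(p) and j + length < len(t)
           and p[i + length] == t[j + length]):
        length += 1
    return length


def k_difference_ends_diagonal(pattern, text, k):
    """1-based end columns in text of occurrences of pattern with <= k differences."""
    n, m = len(pattern), len(text)
    unreached = -(n + m + 2)   # sentinel: no path ends on this diagonal yet
    off = n                    # diagonal i is stored at index i + off
    ends = set()

    # d = 0: farthest-reaching 0-paths on diagonals 0..m (cell (0, i) starts diagonal i).
    prev = [unreached] * (n + m + 1)
    for i in range(0, m + 1):
        prev[i + off] = lce(pattern, 0, text, i)   # row reached on diagonal i
        if prev[i + off] == n:
            ends.add(n + i)

    for d in range(1, k + 1):
        cur = [unreached] * (n + m + 1)
        for i in range(-n, m + 1):
            r1 = prev[i + 1 + off] + 1 if i + 1 <= m else unreached  # R1: vertical edge
            r2 = prev[i - 1 + off] if i - 1 >= -n else unreached     # R2: horizontal edge
            r3 = prev[i + off] + 1                                   # R3: mismatch edge
            row = min(max(r1, r2, r3), n, m - i)   # stay inside the table
            if row < max(0, -i):                   # diagonal i not reachable with d differences
                continue
            row += lce(pattern, row, text, row + i)  # slide along diagonal i for free
            cur[i + off] = row
            if row == n:                           # reached row n: occurrence ends at column n + i
                ends.add(n + i)
        prev = cur

    ends.discard(0)   # ignore the empty substring before the first character of T
    return sorted(ends)


print(k_difference_ends_diagonal("abc", "xxabcxx", 1))
```

Only one value per diagonal is kept for each d, which is where the O(km) space bound comes from; replacing `lce` with a constant-time suffix-tree query is what would give the O(km) time bound.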





Space analysis:

– For each d and i, we need to store the location where the farthest-reaching d-path ends.

• d ranges from 0 to k.

• There are (n+m) diagonals.

⇒ O(k(n+m)) = O(km) space is required, since n ≤ m.





Time analysis:

– Constant time to retrieve the 3 (d-1)-paths for a particular d and i.

⇒ O(km) for this aspect (like k-differences alignment)

– Corresponding O(km) extensions of paths along the diagonal.

• Each path extension is a maximal identical substring in P & T, i.e., a longest common extension computation.

• Using a suffix tree entails only constant time.

• Creating the suffix tree entails linear processing of the strings: O(n+m)

⇒ altogether O(n+m+km) = O(km)




Primer (Probe) Selection Problem

Problem: start with two strings α and β (detailed description on pages 178-179).

• Exact matching version: for each j > j0, find the shortest substring γ of α starting at j such that γ is not a substring of β.

– Can be solved in O(|α|+|β|) time

– Not too bad.

• Inexact matching version: Given parameter p, for each j > j0, find the shortest substring γ of α starting at j that has edit distance at least |γ|/p from any substring in β.





• Inexact matching version: Given parameter p, for each j > j0, find the shortest substring γ of α starting at j that has edit distance at least |γ|/p from any substring in β.

• Q: How much work is this?

• …find the shortest prefix γ of α[j..n] with edit distance at least |γ|/p from any substring in β.

• The naïve approach appears daunting.

• Let’s look at a less intimidating formulation!





• Change |γ|/p to k ⇒ this converts the inexact matching problem to a k-differences problem. This works out since, in practice, |γ|/p must fall in a small range for fixed p.

• k-difference primer problem: Given parameter k, for each j > j0, find the shortest substring of α starting at j that has edit distance at least k from any substring in β.





Approach:

For each position j in α:

Find the shortest prefix of α[j..n] with edit distance ≥ k from every substring in β.

Q: How does this compare with the k-differences inexact matching problem?

A: It is the opposite problem.

Find matches with at most k differences,

versus

Reject matches of prefixes of α[j..n] with substrings of β with fewer than k differences.





Solution:

– Use the k-differences algorithm.

– Use α[j..n] in the place of P.

– Use β in the place of T.

– Compute the farthest-reaching d-path, d = k, in each diagonal.

– d-paths, d < k, reaching row n mean no solution at j.

– Q: Why?

– A: A d-path, d < k, reaching row n indicates that α[j..n] matches a substring of β with fewer than k differences.





Solution:

– Only if no farthest-reaching (k-1)-path reaches row n can there be a primer at position j.

– In particular, if no farthest-reaching (k-1)-path reaches row r < n, then α[j..r] is a primer if r is the smallest row with this property.

– Repeat this approach for every potential starting position j in α.

• Analysis: if |α| = n and |β| = m, then the algorithm takes time O(knm).
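The k-difference primer condition can be checked directly with a row-by-row table, which may help when testing a faster implementation. This is a brute-force sketch that costs O(nm) per start position rather than the O(km) of the diagonal method; the function name `shortest_k_difference_primer` and the 0-based start index `j` are assumptions for illustration.

```python
def shortest_k_difference_primer(alpha, beta, j, k):
    """Shortest prefix of alpha[j:] whose edit distance to every substring
    of beta is at least k, or None if no such prefix exists (j is 0-based)."""
    m = len(beta)
    row = [0] * (m + 1)   # row 0: the empty prefix matches anywhere in beta for free
    for r in range(1, len(alpha) - j + 1):
        ch = alpha[j + r - 1]
        new = [r] + [0] * m   # column 0: prefix of length r against the empty substring
        for c in range(1, m + 1):
            new[c] = min(row[c - 1] + (ch != beta[c - 1]),  # match or mismatch
                         row[c] + 1,                         # space in beta
                         new[c - 1] + 1)                     # space in the prefix
        row = new
        # min(row) is the smallest edit distance between the length-r prefix
        # and any substring of beta; it never decreases as r grows.
        if min(row) >= k:
            return alpha[j:j + r]
    return None


print(shortest_k_difference_primer("acgt", "aaaa", 0, 2))
```

Because the row minimum is non-decreasing in r, the first row whose minimum reaches k gives the shortest primer at that position.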




Exclusion Methods

Q: Can we improve on the Θ(km) time we have seen for k-mismatch and k-difference?

A: On average, yes. (Are we quibbling?)

We adopt an algorithm whose expected running time is below Θ(km);

the worst case may not be better than Θ(km).




Exclusion Methods

Partition Idea: exclude much of T from the search

Preliminaries:

Let σ = |Σ|, where Σ is the alphabet used in P and T.

Let n = | P |, and m = | T |.

Defn. an approximate occurrence of P is an occurrence with at most k mismatches or differences.

General Partition algorithm: three phases

1. Partition phase

2. Search Phase

3. Check Phase




Exclusion Methods

1. Partition phase

• Partition either T or P into r-length regions (depends on the particular algorithm)

2. Search Phase

• Use exact matching to search T for r-length intervals

• These are potential targets for approximate occurrences of P.

• Eliminate as many intervals as possible.

3. Check Phase

• Use approximate matching to check for an approximate occurrence of P around each interval surviving from the search phase.




BYP Method

BYP method has O(m) expected running time.

Partition P into r-length regions, r = ⌊n/(k+1)⌋

Q: How many r-length regions of P are there?

A: k+1; there may be an additional short region.

Suppose there is a match of P & T with at most k differences.

Q: What can we deduce about the corresponding r-length regions?

A: There must be at least one r-length interval that exactly matches, because k differences can touch at most k of the k+1 regions.
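The pigeonhole argument can be tried out on a small case. The sketch below uses an invented helper name, `partition_regions`, and an invented example string with two substitutions; it only illustrates that at least one region survives intact.

```python
def partition_regions(pattern, k):
    """Split pattern into its first k+1 regions of length r = floor(n / (k+1)).
    Any leftover short region at the end is ignored."""
    r = len(pattern) // (k + 1)
    if r == 0:
        raise ValueError("pattern is too short for k+1 non-empty regions")
    return [pattern[s * r:(s + 1) * r] for s in range(k + 1)]


regions = partition_regions("abcdefgh", 2)   # k = 2, so r = 8 // 3 = 2
approx_occurrence = "abXdefYh"               # "abcdefgh" with 2 substitutions
survivors = [reg for reg in regions if reg in approx_occurrence]
print(regions, survivors)
```

With k = 2 there are three regions, and the two substitutions can spoil at most two of them, so the list of surviving regions is never empty.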




BYP Method

BYP Algorithm:

1. Let 𝒫 be the set of the first k+1 substrings of P's partitioning.

2. Build a keyword tree for the set of patterns 𝒫.

3. Use Aho-Corasick to find I, the set of starting locations in T where a pattern in 𝒫 occurs exactly.

4. …..

Oops! We haven't talked about keyword trees or Aho-Corasick. Sooooo let's do that now.




Keyword Trees (section 3.4)

Defn. The keyword tree for set P is a rooted directed tree K satisfying:

1. Each edge is labeled with one character

2. Any two edges out of the same node have distinct labels.

3. Every pattern Pi in P maps to some node v of K s.t. the path from the root to v spells out Pi

4. Every leaf in K is mapped by some pattern in P.




Keyword Trees

Example: From textbook P = {potato, poetry, pottery, science, school}

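[Figure: keyword tree for P = {potato, poetry, pottery, science, school}. Each edge carries one character, and the nodes where patterns end are numbered 1 through 5.]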





Observation: there is an isomorphic mapping between distinct prefixes of patterns in P and nodes in K.

1. Every node corresponds to a prefix of a pattern in P.

2. Conversely, every prefix of a pattern maps to a node in K.

[Figure: the keyword tree for P = {potato, poetry, pottery, science, school}, repeated from the previous slide.]





• If n is the total length of all patterns in P, then we can construct K in O(n) time, assuming a fixed-size alphabet Σ.

• Let Ki denote the partial keyword tree that encodes patterns P1,.. Pi of P.





• Consider partial keyword tree K1

– comprised of a single path of |P1| edges out of root r.

– Each edge is labeled with one character of P1

– Reading from the root to the leaf spells out P1

– The leaf is labeled 1

[Figure: K1 for P1 = potato, a single path p-o-t-a-t-o from the root to the leaf labeled 1.]





Creating K2 from K1:

1. Find the longest path from the root of K1 that matches a prefix of P2.

2. This path ends by either

a) exhausting the characters of P2, or

b) ending at some existing node v in K1 where no extending match is possible.

In case 2a) label the node where the path ends 2.

In case 2b) create a new path out of v, labeled by the remaining characters of P2.





Example: P1 is potato

a) P2 is pot

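b) P2 is potty

[Figure: Case a) the path p-o-t-a-t-o ending at leaf 1, with the existing node at the end of p-o-t labeled 2. Case b) the same path with a new branch t-y leaving the node at the end of p-o-t and ending at a new leaf labeled 2.]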




Use of keyword trees for matching

• Finding occurrences of patterns in P that occur starting at position l in T:

– Starting at the root r in K, follow the unique path that matches a substring of T that starts at l.

– Numbered nodes along this path indicate matched patterns in P that start at position l.

– This takes time proportional to min(n, m)

– Traversing K for each position l in T gives O(nm)

– This can be improved!




Keyword Tree Speedup

Observation: Our naïve keyword tree is like the naïve approach to string comparison.

Every time we increment l, we start all over at the root of K ⇒ O(nm)

Recall: KMP avoided O(nm) by shifting to get a speedup.

Q: Is there an analogous operation we can perform in K?

A: Of course, why else would I ask a rhetorical question?





First, we assume that no pattern Pi is a proper substring of another pattern Pj, for all combinations Pi, Pj in P.

Next, each node v in K is labeled with the string formed by concatenating the letters from the root to v.

Defn. Let L(v) denote the label of node v.

Defn. Let lp(v) denote the length of the longest proper suffix of string L(v) that is a prefix of some pattern in P.





Example: L(v) = potat, lp(v) = 2; the suffix "at" is a prefix of P4.

[Figure: keyword tree with branches spelling potato (1), poetry (2), pottery (3), and attach (4); v is the node labeled potat.]
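The definition of lp(v) can be evaluated directly from the node label. The sketch below is a brute-force check of the definition, not the linear-time computation; the function name `lp` and the pattern list (read off the figure, with P4 taken to be attach) are assumptions.

```python
def lp(label, patterns):
    """Length of the longest proper suffix of label that is a prefix of some pattern."""
    for length in range(len(label) - 1, 0, -1):   # longest proper suffix first
        suffix = label[-length:]
        if any(p.startswith(suffix) for p in patterns):
            return length
    return 0


patterns = ["potato", "poetry", "pottery", "attach"]
print(lp("potat", patterns))
```

For L(v) = potat the suffixes otat and tat are not prefixes of any pattern, while at is a prefix of attach, so the result is 2, as in the example.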





Note: if α is the lp(v)-length suffix of L(v), then there is a unique node in K labeled α.

Example: "at" is the lp(v)-length suffix of L(v), and w is the unique node labeled "at".

[Figure: the same keyword tree, with v the node labeled potat and w the node labeled at.]





Defn: For node v of K let nv be the unique node in K labeled with the suffix of L(v) of length lp(v). When lp(v) = 0 then nv is the root of K.

Defn: The ordered pair (v,nv) is called a failure link.

Example:

[Figure: the same keyword tree, illustrating failure links.]




Aho-Corasick (section 3.4.6)

Algorithm AC search

l = 1;

c = 1;

w = root of K;

Repeat {

While there is an edge (w,w´) labeled character T(c) {

if w´ is numbered by pattern i then

report that Pi occurs in T starting at position l;

w= w´ and c = c + 1;

}

l = c - lp(w) and w = nw;

} Until c > m;

Note: if no edge out of the root matches T(c), increment c and go around the repeat loop again.




Aho-Corasick

Example: T = hotpotattach

[Figure: keyword tree with branches spelling potato (1), poetry (2), pottery (3), and attach (4).]

When l = 4 there is a match of potat, but the next position fails.

At this point c = 9. The failure link points to the node labeled at and lp(v) = 2, so l = c - lp(v) = 9 - 2 = 7.




Computing nv in Linear Time

• Note: if v is the root r or 1 character away from r, then nv = r.

• Imagine nv has been computed for every node that is k or fewer edges from r.

• How can we compute nv for v, a node k+1 edges from r?




Computing nv in Linear Time

• We are looking for nv and L(nv).

• Let v´ be the parent of v in K and x the character on the edge connecting them.

• nv´ is known since v´ is k edges from r.

• Clearly, L(nv) must be a suffix of L(nv´) followed by x.

– First check if there is an edge (nv´,w´) with label x.

– If so, then nv = w´.

– Otherwise, L(nv) is a proper suffix of L(nv´) followed by x.

• Examine the node that the failure link of nv´ points to for an outgoing edge labeled x.

• If no joy, keep following failure links, finally setting nv = r if we reach the root without finding an edge labeled x.
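The level-by-level computation of failure links, followed by the search that uses them, can be sketched as below. This is a minimal sketch under the slides' assumption that no pattern is a proper substring of another, so it keeps no extra output links. The names `build_automaton` and `ac_search`, the array-based node table, and the pattern set (read off the example figure) are assumptions for illustration.

```python
from collections import deque


def build_automaton(patterns):
    """Keyword tree plus failure links. Node 0 is the root r.
    depth[fail[v]] equals lp(v)."""
    children, number, depth = [{}], [None], [0]
    for idx, pat in enumerate(patterns, start=1):
        v = 0
        for ch in pat:
            if ch not in children[v]:
                children.append({})
                number.append(None)
                depth.append(depth[v] + 1)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        number[v] = idx

    fail = [0] * len(children)             # root and depth-1 nodes fail to the root
    queue = deque(children[0].values())    # breadth-first: parents before children
    while queue:
        parent = queue.popleft()
        for x, v in children[parent].items():
            w = fail[parent]
            while w != 0 and x not in children[w]:   # walk up the failure links
                w = fail[w]
            fail[v] = children[w].get(x, 0)          # edge labeled x found, else the root
            queue.append(v)
    return children, number, depth, fail


def ac_search(patterns, text):
    """Return (l, i) pairs: pattern i occurs in text starting at 1-based position l."""
    children, number, depth, fail = build_automaton(patterns)
    found, w = [], 0
    for c, ch in enumerate(text, start=1):
        while w != 0 and ch not in children[w]:
            w = fail[w]                    # mismatch: follow failure links, never back up in T
        w = children[w].get(ch, 0)
        if number[w] is not None:
            found.append((c - depth[w] + 1, number[w]))
    return found


print(ac_search(["potato", "poetry", "pottery", "attach"], "hotpotattach"))
```

On T = hotpotattach the search walks potat, fails on the ninth character, jumps along the failure link to the node labeled at, and then completes attach starting at position 7, without rereading any character of T.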




BYP Method

BYP method has O(m) expected running time.

Partition P into r-length regions, r = ⌊n/(k+1)⌋

Q: How many r-length regions of P are there?

A: k+1; there may be an additional short region.

Suppose there is a match of P & T with at most k differences.

Q: What can we deduce about the corresponding r-length regions?

A: There must be at least one r-length interval that exactly matches, because k differences can touch at most k of the k+1 regions.




BYP Method

BYP Algorithm:

1. Let 𝒫 be the set of the first k+1 substrings of P's partitioning.

2. Build a keyword tree for the set of patterns 𝒫.

3. Use Aho-Corasick to find I, the set of starting locations in T where a pattern in 𝒫 occurs exactly.

4. For each i ∈ I use approximate matching to locate end points of approximate occurrences of P in T[i-n-k..i+n+k]
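The four steps can be put together in a short sketch. Two substitutions are made here for brevity and are assumptions of this example, not part of BYP as described: Python's `str.find` stands in for the Aho-Corasick search of step 3, and the plain O(nm) table stands in for the approximate matcher of step 4. The function names are invented.

```python
def k_difference_ends(pattern, text, k):
    """1-based end positions in text of occurrences of pattern with <= k differences."""
    n, m = len(pattern), len(text)
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(prev[j - 1] + (pattern[i - 1] != text[j - 1]),
                         prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return [j for j in range(1, m + 1) if prev[j] <= k]


def byp_search(pattern, text, k):
    """Partition, search, check. Returns 1-based end positions of approximate occurrences."""
    n, m = len(pattern), len(text)
    r = n // (k + 1)
    if r == 0:
        raise ValueError("pattern is too short for k+1 non-empty regions")
    regions = {pattern[s * r:(s + 1) * r] for s in range(k + 1)}   # step 1

    starts = set()                        # steps 2-3: exact occurrences of any region
    for region in regions:
        pos = text.find(region)           # stand-in for Aho-Corasick
        while pos != -1:
            starts.add(pos)
            pos = text.find(region, pos + 1)

    ends = set()                          # step 4: check only around surviving positions
    for i in starts:
        lo, hi = max(0, i - n - k), min(m, i + n + k)
        for e in k_difference_ends(pattern, text[lo:hi], k):
            ends.add(lo + e)              # convert window position back to a position in T
    return sorted(ends)


print(byp_search("abcdefgh", "zzzzabXdefYhzzzz", 2))
```

When few positions of T contain an exact copy of a region, the check phase touches only a small part of T, which is the source of the fast expected running time.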
