UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12.2.4: k- difference Inexact Matching Lecturer: Dr. Rose Slides by: Dr. Rose February 21, 2002, year of the palindrome Last night at 2 minutes past 8pm it was:


UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology

Bioinformatics Algorithms and Data Structures

Chapter 12.2.4: k-difference Inexact Matching

Lecturer: Dr. Rose
Slides by: Dr. Rose

February 21, 2002, year of the palindrome
Last night at 2 minutes past 8pm it was: 20:02, 20/02/2002


Overview

• k-difference inexact matching
  – Concepts:
    • d-path
    • Farthest-reaching d-path in a diagonal
  – O(km) time and space solution

• Primer selection problem
  – Formulations:
    • Exact matching primer
    • Inexact matching primer
    • k-difference primer
  – O(km) time solution to k-difference primer problem


Overview

• Exclusion methods: fast expected time O(m)
  – Partition approaches:
    • BYP algorithm
      – Aho-Corasick exact matching algorithm
        » Keyword trees
      – Back to Aho-Corasick exact matching algorithm
        » Algorithm for computing failure links
    • Back to BYP algorithm


K-difference Inexact Matching

• Like the k-mismatch problem: allows mismatches
• Harder than k-mismatch:
  – allows spaces
  – End spaces in T are not counted
  – |P| and |T| can be vastly different, so we can’t focus on a 2k+1 band centered on the main diagonal.


K-difference Inexact Matching

Defn:
– Diagonals above the main diagonal are numbered 1 through m. Diagonal i starts in cell (0,i).
– Diagonals below the main diagonal are numbered -1 through -n. Diagonal -i starts in cell (i,0).
– Row 0 is initialized to be all zeros.
  • Recall T can have free end spaces.
  • Setting row 0 to zeros lets the alignment skip any prefix of T at no cost.
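As a reference point for the row 0 initialization, here is a minimal sketch of the plain O(nm) dynamic program. It is a baseline, not the diagonal method these slides build up to. The function name `dp_k_difference_ends` is made up for illustration, and positions in T are reported 1-based.

```python
def dp_k_difference_ends(P, T, k):
    """Return the 1-based end positions j in T where P matches with <= k differences."""
    n, m = len(P), len(T)
    # Row 0 is all zeros: an occurrence of P may start anywhere in T for free.
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        # Column 0: the first i characters of P against an empty piece of T.
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            mismatch = 0 if P[i - 1] == T[j - 1] else 1
            cur[j] = min(prev[j - 1] + mismatch,  # diagonal edge (match or mismatch)
                         prev[j] + 1,             # vertical edge (space in T)
                         cur[j - 1] + 1)          # horizontal edge (space in P)
        prev = cur
    # Row n holds, for each column j, the fewest differences of an occurrence ending at j.
    return [j for j in range(1, m + 1) if prev[j] <= k]
```

For P = abc and T = xxabcxx this reports end position 5 for k = 0 and end positions 4, 5, 6 for k = 1.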


K-difference Inexact Matching

Defn: a d-path is a path that starts in row 0 and specifies exactly d mismatches and spaces.

Defn: a d-path is farthest-reaching in diagonal i if it ends in diagonal i and the index of its ending column c is ≥ the ending column of any other d-path ending in diagonal i.

You can visualize this as a d-path that ends farthest in diagonal i.


K-difference Inexact Matching

Approach:
• Iterate for 1 ≤ d ≤ k:
  – find the farthest-reaching d-path for each diagonal i, -n ≤ i ≤ m
• The farthest-reaching d-path for diagonal i is found from the farthest-reaching (d-1)-paths on diagonals i-1, i and i+1.
• Observation: any d-path reaching row n corresponds to a d-difference occurrence of P in T.


K-difference Inexact Matching

Observation: a farthest reaching 0-path in diagonal i is the longest match of T[i..m] and P[1..n].

Q: Why is this true?

A: A 0-path means an exact match, so there is no deviation from the diagonal that you start on.

Using suffix trees:
Build the suffix tree in linear time (linear in m).

Retrieve farthest-reaching 0-paths in constant time per path.
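To make "longest common extension" concrete, here is a small sketch that computes it by direct character comparison. This is a stand-in for the suffix tree lookup: it costs time proportional to the length of the match rather than constant time. The name `lce` and the 0-based indices are choices made for this example.

```python
def lce(P, T, r, c):
    """Length of the longest common extension of P[r:] and T[c:] (0-based offsets)."""
    length = 0
    # Walk forward while both strings still have characters and they agree.
    while (r + length < len(P) and c + length < len(T)
           and P[r + length] == T[c + length]):
        length += 1
    return length
```

A farthest-reaching 0-path that starts comparing P from its first character against T from offset c ends after `lce(P, T, 0, c)` matched characters.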


K-difference Inexact Matching

Q: How do we find the farthest-reaching d-path on diagonal i for d > 0?

A: The d-path for diagonal i depends on the previously found (d-1)-paths on diagonals i-1, i and i+1.

The 3 cases are:
1. Path R1, the farthest-reaching (d-1)-path on diagonal i+1, followed by a vertical edge to diagonal i.


K-difference Inexact Matching

Since R1 is a (d-1)-path on diagonal i+1, extending it by a vertical edge (adding a space in T) to diagonal i makes it a d-path on diagonal i.

[Figure: diagonals i+1, i, and i-1, with path R1 on diagonal i+1 dropping by a vertical edge onto diagonal i.]


K-difference Inexact Matching

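The 2nd case is:
2. Path R2, the farthest-reaching (d-1)-path on diagonal i-1, followed by a horizontal edge to diagonal i.

Again extending a (d-1)-path into a d-path on diagonal i.

[Figure: diagonals i+1, i, and i-1, with path R2 on diagonal i-1 stepping by a horizontal edge onto diagonal i.]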


K-difference Inexact Matching

3. Path R3, the farthest-reaching (d-1)-path on diagonal i, followed by a diagonal edge corresponding to a mismatch.

Again extending a (d-1)-path into a d-path on diagonal i.

[Figure: diagonals i+1, i, and i-1, with path R3 on diagonal i extended by a diagonal (mismatch) edge.]


K-difference Inexact Matching

• Each of R1, R2, and R3 is initially a farthest-reaching (d-1)-path on diagonal i+1, i-1, and i, respectively.
• Each is extended by a space or a mismatch, resulting in a d-path on diagonal i.
• Each is subsequently extended along diagonal i.
• The farthest-reaching d-path on diagonal i must be one of these.


k-differences Algorithm

d = 0
/* Calculate farthest-reaching 0-paths on diagonals 0 through m */
For i = 0 to m {
    Find the longest common extension between P[1..n] and T[i..m]
}

/* Calculate d-paths by extending (d-1)-paths R1, R2, and R3 */
For d = 1 to k {
    For i = -n to m {
        Extend (d-1)-paths R1, R2, R3 on diagonals i-1, i, i+1 to diagonal i.
        One of these is the farthest-reaching d-path on diagonal i.
    }
    A path reaching row n defines an inexact match of P in T containing at most k differences.
    The column in row n indicates the end character in T.
}
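The pseudocode above can be sketched in Python as follows. Several things here are assumptions of this sketch rather than part of the slides: the names `lce` and `k_difference_ends`, the 0-based string indexing, and the use of direct character comparison for longest common extensions (so the O(km) bound does not hold for this version). For simplicity it also tracks the farthest row reachable with at most d differences and clamps paths at the table boundary, which is a common implementation convenience rather than the "exactly d" definition.

```python
def lce(P, T, r, c):
    """Longest common extension of P[r:] and T[c:] by direct comparison."""
    length = 0
    while (r + length < len(P) and c + length < len(T)
           and P[r + length] == T[c + length]):
        length += 1
    return length


def k_difference_ends(P, T, k):
    """1-based end columns in T of occurrences of P with at most k differences."""
    n, m = len(P), len(T)
    ends = set()

    # d = 0: diagonal i starts in cell (0, i) and can only slide along matches.
    # far[i] is the farthest row reached on diagonal i (column = row + i).
    far = {i: lce(P, T, 0, i) for i in range(m + 1)}
    ends.update(row + i for i, row in far.items() if row == n)

    for d in range(1, k + 1):
        nxt = {}
        for i in range(-n, m + 1):
            starts = []
            if i + 1 in far:
                starts.append(far[i + 1] + 1)  # R1: vertical edge from diagonal i+1
            if i - 1 in far:
                starts.append(far[i - 1])      # R2: horizontal edge from diagonal i-1
            if i in far:
                starts.append(far[i] + 1)      # R3: mismatch edge along diagonal i
            if not starts:
                continue                       # diagonal i is not reachable yet
            # Stay inside the table: row <= n and column = row + i <= m.
            row = min(max(starts), n, m - i)
            # Extend along diagonal i through matching characters.
            row += lce(P, T, row, row + i)
            nxt[i] = row
            if row == n:
                ends.add(row + i)              # reached row n: occurrence ends here
        far = nxt
    return sorted(c for c in ends if c >= 1)
```

For P = abc and T = xxabcxx, `k_difference_ends` gives [5] for k = 0 and [4, 5, 6] for k = 1.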


K-difference Inexact Matching

Space analysis:
– For each d and i, we need to store the ending location of the farthest-reaching d-path.
  • d ranges from 0 to k.
  • There are (n+m) diagonals.
– So O(k(n+m)) = O(km) space is required, taking n ≤ m.


K-difference Inexact Matching

Time analysis:
– Constant time to retrieve the 3 (d-1)-paths for a particular d and i, so O(km) for this aspect (like k-differences alignment).
– Corresponding O(km) extensions of paths along the diagonals.
  • Each path extension is a maximal identical substring in P & T, i.e., a longest common extension computation.
  • Using a suffix tree, each extension takes only constant time.
  • Creating the suffix tree entails linear processing of the strings, O(n+m).
– Altogether: O(n+m+km) = O(km).


Primer (Probe) Selection Problem

Problem: start with two strings α and β (detailed description on pages 178-179).

• Exact matching version: for each j > j0, find the shortest substring γ of α starting at j s.t. γ does not occur in β.

– Can be solved in O(|α|+|β|)

– Not too bad.

• Inexact matching version: Given parameter p, for each j > j0, find the shortest substring γ starting at j that has edit distance at least |γ|/p from any substring in β.


Primer (Probe) Selection Problem

• Inexact matching version: Given parameter p, for each j > j0, find the shortest substring γ starting at j that has edit distance at least |γ|/p from any substring in β.

• Q: How much work is this?

• …find the shortest prefix γ of α with edit distance at least |γ|/p from any substring in β.

• The naïve approach appears daunting.

• Let’s look at a less intimidating formulation!


Primer (Probe) Selection Problem

• Change |γ|/p to k: this converts the inexact matching problem to a k-differences problem. This works out since, in practice, |γ|/p must fall in a small range for fixed p.

• k-difference primer problem: Given parameter k, for each j > j0, find the shortest substring γ starting at j that has edit distance at least k from any substring in β.


Primer (Probe) Selection Problem

Approach:
For each position j in α:

Find the shortest prefix of α[j..n] with edit distance ≥ k from every substring in β.

Q: How does this compare with the k-differences inexact matching problem?

A: It is the opposite problem.
Find matches with at most k differences,

versus

Reject matches of prefixes of α[j..n] with substrings of β with fewer than k differences.


Primer (Probe) Selection Problem

Solution:
– Use the k-differences algorithm.
– Use α[j..n] in the place of P.
– Use β in the place of T.
– Compute the farthest-reaching d-path, d = k, in each diagonal.
– d-paths, d < k, reaching row n mean no solution at j.
– Q: Why?
– A: a d-path, d < k, reaching row n indicates that α[j..n] matches a substring of β with fewer than k differences.


Primer (Probe) Selection Problem

Solution:
– Only if no farthest-reaching (k-1)-path reaches row n can there be a primer at position j.
– In particular, if no farthest-reaching (k-1)-path reaches row r < n, then α[j..r] is a primer if r is the smallest row with this property.
– Repeat this approach for every potential starting position j in α.

• Analysis: if |α| = n and |β| = m, then the algorithm takes time O(knm).
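To see what the k-difference primer problem asks for at a single start position, here is a naive sketch. It is not the farthest-reaching-path solution described above: it fills ordinary dynamic programming rows, so it costs O(nm) per start position. The name `shortest_primer_at` and the 0-based start index j are assumptions of the example.

```python
def shortest_primer_at(alpha, beta, j, k):
    """Shortest prefix of alpha[j:] whose edit distance to every substring of
    beta is at least k, or None if there is no such prefix. j is 0-based."""
    m = len(beta)
    # Row 0 is all zeros: a match may start anywhere in beta for free.
    prev = [0] * (m + 1)
    for r in range(1, len(alpha) - j + 1):
        ch = alpha[j + r - 1]
        cur = [r] + [0] * m
        for c in range(1, m + 1):
            cur[c] = min(prev[c - 1] + (ch != beta[c - 1]),  # match or mismatch
                         prev[c] + 1,                        # space in beta
                         cur[c - 1] + 1)                     # space in alpha
        # min(cur) is the smallest edit distance between the length-r prefix
        # and any substring of beta (free start via row 0, free end via min).
        if min(cur) >= k:
            return alpha[j:j + r]
        prev = cur
    return None
```

With alpha = abxyz, beta = abab, and k = 2, the shortest primer starting at j = 0 is abxy: the prefix abx is still within one difference of the substring ab.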


Exclusion Methods

Q: Can we improve on the Θ(km) time we have seen for k-mismatch and k-difference?

A: On average, yes. (Are we quibbling?)

We adopt an algorithm whose expected running time is < Θ(km);

the worst case may not be better than Θ(km).


Exclusion Methods

Partition Idea: exclude much of T from the search

Preliminaries:
Let σ = |Σ|, where Σ is the alphabet used in P and T.

Let n = | P |, and m = | T |.

Defn. an approximate occurrence of P is an occurrence with at most k mismatches or differences.

General Partition algorithm: three phases
1. Partition phase
2. Search phase
3. Check phase


Exclusion Methods

1. Partition phase
   • Partition either T or P into r-length regions (depends on the particular algorithm).

2. Search phase
   • Use exact matching to search T for r-length intervals.
   • These are potential targets for approximate occurrences of P.
   • Eliminate as many intervals as possible.

3. Check phase
   • Use approximate matching to check for an approximate occurrence of P around each interval surviving from the search phase.


BYP Method

BYP method has O(m) expected running time.

Partition P into r-length regions, r = ⌊n/(k+1)⌋.

Q: How many r-length regions of P are there?

A: k+1; there may be an additional short region.

Suppose there is a match of P & T with at most k differences.

Q: What can we deduce about the corresponding r-length regions?

A: There must be at least one r-length region of P that matches exactly, since k differences can touch at most k of the k+1 regions.
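The partition step is easy to sketch. The function name `partition_pattern` is hypothetical, and regions are returned with their 0-based start offsets in P; any short leftover region at the end is dropped, as in the slide.

```python
def partition_pattern(P, k):
    """Split P into its first k+1 consecutive regions of length r = floor(n/(k+1)).

    Returns a list of (0-based start offset, region) pairs.
    """
    r = len(P) // (k + 1)
    if r == 0:
        raise ValueError("P is too short to form k+1 non-empty regions")
    # Exactly k+1 regions; leftover characters after the last region are ignored.
    return [(s, P[s:s + r]) for s in range(0, (k + 1) * r, r)]
```

For a pattern of length 10 and k = 2, r = 3, so there are three regions of length 3 and one leftover character.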


BYP Method

BYP Algorithm:

1. Let 𝒫 be the set of the first k+1 substrings of P’s partitioning.

2. Build a keyword tree for the set of patterns 𝒫.

3. Use Aho-Corasick to find I, the set of starting locations in T where a pattern in 𝒫 occurs exactly.

4. …..

Oops! We haven’t talked about keyword trees or Aho-Corasick. Sooooo let’s do that now.


Keyword Trees (section 3.4)

Defn. The keyword tree for set P is a rooted directed tree K satisfying:

1. Each edge is labeled with one character

2. Any two edges out of the same node have distinct labels.

3. Every pattern Pi in P maps to some node v of K s.t. the path from the root to v spells out Pi

4. Every leaf in K is mapped by some pattern in P.


Keyword Trees

Example: From textbook P = {potato, poetry, pottery, science, school}

[Figure: keyword tree for P = {potato, poetry, pottery, science, school}; the nodes where patterns end are numbered 1 through 5.]


Keyword Trees (section 3.4)

Observation: there is an isomorphic mapping between distinct prefixes of patterns in P and nodes in K.

1. Every node corresponds to a prefix of a pattern in P.

2. Conversely, every prefix of a pattern maps to a node in K.

[Figure: the same keyword tree for P = {potato, poetry, pottery, science, school}.]


Keyword Trees (section 3.4)

• If n is the total length of all patterns in P, then we can construct K in O(n) time, assuming a fixed-size alphabet Σ.

• Let Ki denote the partial keyword tree that encodes patterns P1,.. Pi of P.


Keyword Trees (section 3.4)

• Consider partial keyword tree K1

– comprised of a single path of |P1| edges out of root r.

– Each edge is labeled with one character of P1

– Reading from the root to the leaf spells out P1

– The leaf is labeled 1

[Figure: K1 for P1 = potato, a single path p-o-t-a-t-o from the root to leaf 1.]


Keyword Trees (section 3.4)

Creating K2 from K1:

1. Find the longest path from the root of K1 that matches a prefix of P2.

2. This path ends by
a) either exhausting the characters of P2, or
b) ending at some existing node v in K1 where no extending match is possible.

In case 2a) label the node where the path ends 2.

In case 2b) create a new path out of v, labeled by the remaining characters of P2, and label its leaf 2.
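The insertion procedure above can be sketched with a small node class. The names `KeywordNode`, `insert_pattern`, and `build_keyword_tree` are invented for this example, and using a dictionary for the outgoing edges is one possible representation.

```python
class KeywordNode:
    def __init__(self):
        self.children = {}   # edge label (one character) -> child node
        self.number = None   # pattern number if a pattern ends at this node


def insert_pattern(root, pattern, number):
    """Add one pattern to the tree, reusing the longest existing matching path."""
    node = root
    for ch in pattern:
        if ch not in node.children:
            # Case 2b: no extending match, so start a new path here.
            node.children[ch] = KeywordNode()
        node = node.children[ch]
    # Case 2a falls out naturally: the pattern ended at an existing node.
    node.number = number


def build_keyword_tree(patterns):
    """Build K by inserting P1, P2, ... in order, numbering them from 1."""
    root = KeywordNode()
    for number, pattern in enumerate(patterns, start=1):
        insert_pattern(root, pattern, number)
    return root
```

Each character of each pattern is handled once, which matches the O(n) construction time for a fixed alphabet.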


Keyword Trees (section 3.4)

Example: P1 is potato

a) P2 is pot

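b) P2 is potty

[Figure: Case a) the node reached by p-o-t on the path for potato is labeled 2. Case b) a new path t-y branches off at that node and ends in a leaf labeled 2.]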


Keyword Trees (section 3.4)

Use of keyword trees for matching
• Finding occurrences of patterns in P that start at position l in T:
  – Starting at the root r in K, follow the unique path that matches a substring of T that starts at l.
  – Numbered nodes along this path indicate matched patterns in P that start at position l.
  – This takes time proportional to min(n, m).
  – Traversing K for each position l in T gives O(nm).
  – This can be improved!
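A sketch of this naive search, restarting at the root for every position l, might look like the following. The tree is stored as nested dictionaries here purely to keep the example short; the key `"#"` holding the pattern number and the name `naive_keyword_search` are assumptions of the sketch.

```python
def naive_keyword_search(patterns, T):
    """Report (l, i) pairs: pattern i occurs in T starting at 1-based position l."""
    # Build the keyword tree as nested dicts; "#" stores the pattern number.
    root = {}
    for number, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node["#"] = number

    hits = []
    for l in range(len(T)):
        node, c = root, l
        # Follow the unique path spelled by T starting at l, as far as it goes.
        while c < len(T) and T[c] in node:
            node = node[T[c]]
            c += 1
            if "#" in node:
                hits.append((l + 1, node["#"]))
    return hits
```

The `"#"` key assumes that character never appears in the patterns or in T. Every start position walks down from the root again, which is the O(nm) behavior the next slides set out to avoid.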


Keyword Tree Speedup

Observation: Our naïve keyword tree is like the naïve approach to string comparison.

Every time we increment l, we start all over at the root of K, giving O(nm).

Recall: KMP avoided O(nm) by shifting to get a speedup.

Q: Is there an analogous operation we can perform in K?

A: Of course, why else would I ask a rhetorical question?


Keyword Tree Speedup

First, we assume that Pi is not a proper substring of Pj for any pair Pi, Pj in P.

Next, each node v in K is labeled with the string formed by concatenating the letters from the root to v.

Defn. Let L(v) denote the label of node v.

Defn. Let lp(v) denote the length of the longest proper suffix of string L(v) that is a prefix of some pattern in P.


Keyword Tree Speedup

Example: L(v) = potat, lp(v) = 2, the suffix at is the prefix of P4.

[Figure: keyword tree for the patterns potato, poetry, pottery, and attach (P4), with node v, labeled potat, marked.]


Keyword Tree Speedup

Note: if α is the lp(v)-length suffix of L(v), then there is a unique node labeled α.

Example: at is the lp(v)-length suffix of L(v), w is the unique node labeled at.

[Figure: the same keyword tree, with node v (labeled potat) and node w (labeled at) marked.]


Keyword Tree Speedup

Defn: For node v of K let nv be the unique node in K labeled with the suffix of L(v) of length lp(v). When lp(v) = 0 then nv is the root of K.

Defn: The ordered pair (v,nv) is called a failure link.

Example:

[Figure: the same keyword tree, illustrating failure links.]


Aho-Corasick (section 3.4.6)

Algorithm AC search

l = 1;
c = 1;
w = root of K;
Repeat {
    While there is an edge (w,w´) labeled character T(c) {
        if w´ is numbered by pattern i then
            report that Pi occurs in T starting at position l;
        w = w´ and c = c + 1;
    }
    l = c - lp(w) and w = nw;   /* l uses lp of the node where the match failed */
} Until c > m;

Note: if no edge out of the root matches T(c), increment c and run the repeat loop again.


Aho-Corasick

Example: T = hotpotattach

[Figure: the same keyword tree for potato, poetry, pottery, and attach.]

When l = 4 there is a match of potat, but the next position fails.

At this point c = 9. The failure link points to the node labeled at, and lp(v) = 2.

l = c - lp(v) = 9 - 2 = 7


Computing nv in Linear Time

• Note: if v is the root r or 1 character away from r, then nv = r.

• Imagine nv has been computed for every node that is k or fewer edges from r.

• How can we compute nv for v, a node k+1 edges from r?


Computing nv in Linear Time

• We are looking for nv and L(nv).

• Let v´ be the parent of v in K and x the character on the edge connecting them.

• nv´ is known since v´ is k edges from r.

• Clearly, L(nv) must be a suffix of L(nv´) followed by x.

– First check if there is an edge (nv´,w´) with label x.

– If so, then nv = w´.

– Otherwise, L(nv) is a proper suffix of L(nv´) followed by x.

• Follow the failure link out of nv´ and examine the node it reaches for an outgoing edge labeled x.

• If no joy, keep repeating, finally setting nv = r if we reach the root without finding an edge labeled x.
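Putting the keyword tree, the failure links, and the search together, a compact sketch could look like this. It follows the level-by-level computation described above using a breadth-first traversal. The names `ACNode`, `build_automaton`, and `ac_search` are invented here, and, like the slides, the search assumes no pattern is a proper substring of another.

```python
from collections import deque


class ACNode:
    def __init__(self, depth=0):
        self.children = {}   # edge label -> child node
        self.number = None   # pattern number if a pattern ends here
        self.fail = None     # failure link target, the node n_v
        self.depth = depth   # number of edges from the root, i.e. |L(v)|


def build_automaton(patterns):
    root = ACNode()
    for number, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            if ch not in node.children:
                node.children[ch] = ACNode(node.depth + 1)
            node = node.children[ch]
        node.number = number

    # The root and nodes one edge from the root fail to the root.
    root.fail = root
    queue = deque()
    for child in root.children.values():
        child.fail = root
        queue.append(child)

    # Breadth-first: a parent's failure link is known before its children's.
    while queue:
        parent = queue.popleft()
        for x, v in parent.children.items():
            w = parent.fail
            while w is not root and x not in w.children:
                w = w.fail                  # keep following failure links
            v.fail = w.children[x] if x in w.children else root
            queue.append(v)
    return root


def ac_search(patterns, T):
    """Report (l, i): pattern i occurs in T starting at 1-based position l."""
    root = build_automaton(patterns)
    hits, w = [], root
    for c, ch in enumerate(T, start=1):
        while w is not root and ch not in w.children:
            w = w.fail                      # mismatch: jump instead of restarting
        if ch in w.children:
            w = w.children[ch]
        if w.number is not None:
            hits.append((c - w.depth + 1, w.number))
    return hits
```

On T = hotpotattach with the patterns potato, poetry, pottery, attach, the node labeled potat has a failure link to the node labeled at (depth 2), and the search reports attach starting at position 7.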


BYP Method

BYP method has O(m) expected running time.

Partition P into r-length regions, r = ⌊n/(k+1)⌋.

Q: How many r-length regions of P are there?

A: k+1; there may be an additional short region.

Suppose there is a match of P & T with at most k differences.

Q: What can we deduce about the corresponding r-length regions?

A: There must be at least one r-length region of P that matches exactly, since k differences can touch at most k of the k+1 regions.


BYP Method

BYP Algorithm:

1. Let 𝒫 be the set of the first k+1 substrings of P’s partitioning.

2. Build a keyword tree for the set of patterns 𝒫.

3. Use Aho-Corasick to find I, the set of starting locations in T where a pattern in 𝒫 occurs exactly.

4. For each i ∈ I, use approximate matching to locate end points of approximate occurrences of P in T[i-n-k..i+n+k].
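The four steps can be strung together in a short sketch. Two simplifications are assumptions of this example and not part of BYP as stated: the exact search in step 3 uses Python's built-in `str.find` in place of a keyword tree with Aho-Corasick, and the check in step 4 uses the plain O(nm) dynamic program on each window. The names `dp_ends` and `byp_search` are hypothetical.

```python
def dp_ends(P, T, k):
    """1-based end positions in T of occurrences of P with <= k differences."""
    m = len(T)
    prev = [0] * (m + 1)                      # row 0: free start in T
    for i in range(1, len(P) + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(prev[j - 1] + (P[i - 1] != T[j - 1]),
                         prev[j] + 1,
                         cur[j - 1] + 1)
        prev = cur
    return [j for j in range(1, m + 1) if prev[j] <= k]


def byp_search(P, T, k):
    n, m = len(P), len(T)
    r = n // (k + 1)
    if r == 0:
        raise ValueError("P is too short to form k+1 non-empty regions")
    # Steps 1-2: the first k+1 regions of P form the pattern set.
    regions = {P[s:s + r] for s in range(0, (k + 1) * r, r)}

    # Step 3 (stand-in for Aho-Corasick): all exact region occurrences in T.
    starts = set()
    for region in regions:
        pos = T.find(region)
        while pos != -1:
            starts.add(pos)                   # 0-based start of an exact hit
            pos = T.find(region, pos + 1)

    # Step 4: approximate matching only inside a window around each hit.
    ends = set()
    for i in starts:
        lo, hi = max(0, i - n - k), min(m, i + n + k)
        for j in dp_ends(P, T[lo:hi], k):
            ends.add(lo + j)                  # convert back to a position in T
    return sorted(ends)
```

Parts of T with no exact region hit nearby are never passed to the approximate matcher, which is where the expected-time savings come from.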