www.bioalgorithms.info COMP3456 – adapted from textbook slides: Combinatorial Pattern Matching

Page 1: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Combinatorial Pattern Matching

Page 2: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Outline

• Week 08:
  • Quiz 2
  • Hash Tables
  • Repeat Finding
  • Exact Pattern Matching
  • Keyword Trees
  • Suffix Trees

• Week 09:
  • Heuristic Similarity Search Algorithms
  • Approximate String Matching
  • Filtration
  • Comparing a Sequence Against a Database
  • Algorithm behind BLAST
  • Statistics behind BLAST
  • PatternHunter and BLAT

Page 3: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Quiz 2

• closed book
• marked out of 10
• worth 5% of final grade
• 40 minutes

Page 4: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

timing

• begin

• 20 minutes to go

• 10 minutes to go

• 5 minutes

• STOP

Page 5: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Genomic Repeats

• Example of repeats:

• ATGGTCTAGGTCCTAGTGGTC

• Motivation to find them:

• Genomic rearrangements are often associated with repeats

• To trace evolutionary secrets

• Many tumors are characterized by an explosion of repeats

Page 6: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Genomic Repeats

• The problem is often made more difficult by mutation:

• ATGGTCTAGGACCTAGTGTTC

Page 7: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

l-mer Repeats

• Long repeats are difficult to find
• Short repeats are easy to find (e.g., with hashing)
• A simple approach to finding long repeats:

• Find exact repeats of short l-mers (l is usually 10–13)

• Use l-mer repeats to potentially extend into longer, maximal repeats

Page 8: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

l-mer Repeats (cont’d)

• There are typically many locations where an l-mer is repeated:

GCTTACAGATTCAGTCTTACAGATGGT

• The 4-mer TTAC starts at locations 3 and 17

Page 9: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Extending l-mer Repeats

GCTTACAGATTCAGTCTTACAGATGGT

• Extend these 4-mer matches:

GCTTACAGATTCAGTCTTACAGATGGT

• Maximal repeat: CTTACAGAT (the TTAC matches extend one letter to the left and four letters to the right)

Page 10: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Maximal Repeats

• To find maximal repeats in this way, we need ALL start locations of all l-mers in the genome

• Hashing lets us find repeats quickly in this manner

Page 11: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Hashing: What is it?

• What does hashing do?

• For each piece of data, it generates an integer (ideally a different integer for different data)

• We store data in an array at the unique integer index generated from the data

• Hashing is a very efficient way to store and retrieve data (often stated as O(1))

Page 12: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Hashing: Definitions

• Hash table: array used in hashing

• Records: data stored in a hash table

• Keys: identify sets of records

• Hash function: uses a key to generate an index to insert at in hash table

• Collision: when more than one record is mapped to the same index in the hash table

Page 13: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Hashing: Example

• Where do the animals eat?

• Records: each animal

• Keys: where each animal eats

Page 14: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Hashing DNA sequences

• Each l-mer can be translated into a binary string (A, T, C, G can be represented as 00, 01, 10, 11)

• After assigning a unique integer per l-mer it is easy to get all start locations of each l-mer in a genome
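
The two-bit encoding described above can be sketched in a few lines of Python. This is only an illustration: the function names `lmer_to_int` and `lmer_positions` are made up for this example, and a Python dict stands in for the hash table.

```python
# Two-bit code from the slide: A=00, T=01, C=10, G=11.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}

def lmer_to_int(lmer):
    """Pack an l-mer into one integer, two bits per nucleotide."""
    value = 0
    for base in lmer:
        value = (value << 2) | CODE[base]
    return value

def lmer_positions(genome, l):
    """Map each l-mer's integer code to the list of its 1-based start positions."""
    table = {}
    for start in range(len(genome) - l + 1):
        key = lmer_to_int(genome[start:start + l])
        table.setdefault(key, []).append(start + 1)
    return table

genome = "GCTTACAGATTCAGTCTTACAGATGGT"
print(lmer_positions(genome, 4)[lmer_to_int("TTAC")])  # start locations of TTAC
```

For a fixed l, the integer code is unique per l-mer, so no two different l-mers share a table entry in this sketch.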

Page 15: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Hashing: Maximal Repeats

• To find repeats in a genome:
  • For all l-mers in the genome, note the start position and the sequence
  • Generate a hash table index for each unique l-mer sequence
  • In each index of the hash table, store all genome start locations of the l-mer which generated that index
  • Extend l-mer repeats to maximal repeats

Page 16: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Hashing: Collisions

• Dealing with collisions:

• “Chain” all start locations of l-mers (linked list)

Page 17: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Hashing: Summary

• When finding genomic repeats from l-mers:

• Generate a hash table index for each l-mer sequence

• In each index, store all genome start locations of the l-mer which generated that index

• Extend l-mer repeats to maximal repeats
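
The three steps in this summary could look roughly as follows in Python. It is a minimal sketch, not the course's reference implementation: a `defaultdict` plays the role of the hash table with chaining, positions are 0-based, and the names `lmer_index`, `extend_pair` and `maximal_repeats` are invented here.

```python
from collections import defaultdict

def lmer_index(genome, l):
    """Hash every l-mer to the list of its 0-based start positions."""
    index = defaultdict(list)
    for start in range(len(genome) - l + 1):
        index[genome[start:start + l]].append(start)
    return index

def extend_pair(genome, i, j, l):
    """Extend an exact l-mer match at i < j to the left and right while letters agree."""
    left = 0
    while i - left - 1 >= 0 and genome[i - left - 1] == genome[j - left - 1]:
        left += 1
    right = l
    while j + right < len(genome) and genome[i + right] == genome[j + right]:
        right += 1
    return i - left, j - left, left + right  # both starts and the repeat length

def maximal_repeats(genome, l):
    """Collect the maximal extension of every pair of equal l-mers."""
    found = set()
    for positions in lmer_index(genome, l).values():
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                found.add(extend_pair(genome, positions[a], positions[b], l))
    return found

genome = "GCTTACAGATTCAGTCTTACAGATGGT"
for first, second, length in maximal_repeats(genome, 4):
    print(first + 1, second + 1, genome[first:first + length])
```

On the example genome used earlier, extending in both directions also picks up the C that precedes both copies of TTAC.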

Page 18: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Pattern Matching

• What if, instead of finding repeats in a genome, we want to find all sequences in a database that contain a given pattern?

• This leads us to a different problem, the Pattern Matching Problem

Page 19: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Pattern Matching Problem

• Goal: Find all occurrences of a pattern in a text

• Input: Pattern p = p1…pn and text t = t1…tm

• Output: All positions 1 ≤ i ≤ (m – n + 1) such that the n-letter substring of t starting at i matches p

• Motivation: Searching database for a known pattern

Page 20: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Exact Pattern Matching: A Brute-Force Algorithm

PatternMatching(p, t)
1 n ← length of pattern p
2 m ← length of text t
3 for i ← 1 to (m – n + 1)
4     if ti…ti+n-1 = p
5         output i
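
A direct Python transcription of this pseudocode might look like the sketch below. The function name `pattern_matching` is chosen here for illustration, and positions are reported 1-based to match the problem statement.

```python
def pattern_matching(p, t):
    """Brute-force exact matching: return all 1-based positions where p occurs in t."""
    n, m = len(p), len(t)
    # Compare the n-letter window starting at each position with the pattern.
    return [i + 1 for i in range(m - n + 1) if t[i:i + n] == p]

print(pattern_matching("GCAT", "CGCATC"))
```

The slice comparison hides the inner loop of up to n letter comparisons, which is where the O(nm) worst case comes from.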

Page 21: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Exact Pattern Matching: An Example

• PatternMatching algorithm for:

• Pattern GCAT

• Text CGCATC

(Figure: the pattern GCAT is slid along the text CGCATC one position at a time; the only match is at position 2.)

Page 22: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Exact Pattern Matching: Running Time

• PatternMatching runtime: O(nm)

• Probability-wise, it’s more like O(m)

• Rarely will there be close to n comparisons in line 4

• But a better solution is to use suffix trees

• Can solve problem in O(m) time

• Conceptually related to keyword trees (next...)

Page 23: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Keyword Trees: Example

• Keyword tree:

• Apple

Page 24: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Keyword Trees: Example (cont’d)

• Keyword tree:

• Apple

• Apropos

Page 25: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Keyword Trees: Example (cont’d)

• Keyword tree:

• Apple

• Apropos

• Banana

Page 26: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Keyword Trees: Example (cont’d)

• Keyword tree:

• Apple

• Apropos

• Banana

• Bandana

Page 27: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Keyword Trees: Example (cont’d)

• Keyword tree:

• Apple

• Apropos

• Banana

• Bandana

• Orange

Page 28: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Keyword Trees: Properties

• Stores a set of keywords in a rooted labeled tree

• Each edge labeled with a letter from an alphabet

• Any two edges coming out of the same vertex have distinct labels

• Every keyword stored can be spelled on a path from root to some leaf

Page 29: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Keyword Trees: Threading (cont’d)

• Thread “appeal”

• appeal

Page 33: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Keyword Trees: Threading (cont’d)

• Thread “apple”

• apple

Page 38: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Multiple Pattern Matching Problem

• Goal: Given a set of patterns and a text, find all occurrences of any of the patterns in the text

• Input: k patterns p1,…,pk, and text t = t1…tm

• Output: Positions 1 ≤ i ≤ m where substring of t starting at i matches pj for 1 ≤ j ≤ k

• Motivation: Searching database for known multiple patterns

Page 39: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Multiple Pattern Matching: Straightforward Approach

• Can solve as k “Pattern Matching Problems”

• Runtime: O(kmn), using the PatternMatching algorithm k times

• m - length of the text

• n - average length of the pattern

Page 40: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Multiple Pattern Matching: Keyword Tree Approach

• Or, we could use keyword trees:

• Build keyword tree in O(N) time; N is total length of all patterns

• With naive threading: O(N + nm)

• Aho-Corasick algorithm: O(N + m)

Page 41: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Keyword Trees: Threading

• To match patterns in a text using a keyword tree:

• Build keyword tree of patterns

• “Thread” the text through the keyword tree

Page 42: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Keyword Trees: Threading (cont’d)

• Threading is “complete” when we reach a leaf in the keyword tree

• When threading is “complete,” we’ve found a pattern in the text
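
Building a keyword tree and threading a text through it can be sketched with nested dicts. This is the naive threading variant (restart from the root at every text position), not Aho-Corasick. The names `build_keyword_tree` and `thread_all` are illustrative, and the end-of-keyword marker "$" is an assumption of this sketch (it must not occur in the patterns or the text).

```python
def build_keyword_tree(patterns):
    """Keyword tree as nested dicts: one edge (dict key) per letter."""
    root = {}
    for pattern in patterns:
        node = root
        for letter in pattern:
            node = node.setdefault(letter, {})
        node["$"] = pattern  # marks the end of a complete keyword
    return root

def thread_all(tree, text):
    """Naive threading: try to thread the text from every start position."""
    hits = []
    for start in range(len(text)):
        node, pos = tree, start
        while pos < len(text) and text[pos] in node:
            node = node[text[pos]]
            pos += 1
            if "$" in node:  # threading is complete: a keyword was found
                hits.append((start + 1, node["$"]))
    return hits

tree = build_keyword_tree(["apple", "apropos", "banana", "bandana", "orange"])
print(thread_all(tree, "pineapple"))
print(thread_all(tree, "appeal"))
```

Using an explicit end marker instead of relying on leaves means the sketch also copes with one keyword being a prefix of another.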

Page 43: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Suffix Trees=Collapsed Keyword Trees

• Similar to keyword trees, except edges that form paths are collapsed

• Each edge is labeled with a substring of a text

• All internal vertices have at least two outgoing edges

• Leaves labeled by the index of the pattern.

Page 44: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Suffix Tree of a Text

• The suffix tree of a text is constructed from all its suffixes

ATCATG TCATG CATG ATG TG G

Keyword Tree

Suffix Tree

Page 45: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Suffix Tree of a Text

• The suffix tree of a text is constructed from all its suffixes

ATCATG TCATG CATG ATG TG G

Keyword Tree

Suffix Tree

How much time does it take?

Page 46: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Suffix Tree of a Text

• The suffix tree of a text is constructed from all its suffixes

ATCATG TCATG CATG ATG TG G

quadratic Keyword Tree

Suffix Tree

Time is linear in the total size of all suffixes, i.e., it is quadratic in the length of the text

Page 47: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Suffix Trees: Advantages

• The suffix tree of a text is constructed from all its suffixes

• Suffix trees build faster than keyword trees

ATCATG TCATG CATG ATG TG G

quadratic Keyword Tree

Suffix Tree

linear (Weiner suffix tree algorithm)

Page 48: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Use of Suffix Trees

• Suffix trees hold all suffixes of a text
  • e.g., ATCGC: ATCGC, TCGC, CGC, GC, C
  • Builds in O(m) time for text of length m

• To find any pattern of length n in a text:
  • Build suffix tree for text
  • Thread the pattern through the suffix tree
  • Can find pattern in text in O(n) time!

• O(n + m) time for “Pattern Matching Problem”
  • Build suffix tree and lookup pattern

Page 49: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Pattern Matching with Suffix Trees

SuffixTreePatternMatching(p, t)
1 Build suffix tree for text t
2 Thread pattern p through suffix tree
3 if threading is complete
4     output positions of all p-matching leaves in the tree
5 else
6     output “Pattern does not appear in text”
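
The threading idea can be tried out with a much simpler structure than a true suffix tree. The sketch below builds the uncollapsed keyword tree of all suffixes (quadratic time and space, so it is not the linear-time Weiner construction) and records, at every vertex, the suffix start positions passing through it. The class `Node` and the function names are assumptions made for this example.

```python
class Node:
    """One vertex of the (uncollapsed) keyword tree of suffixes."""
    def __init__(self):
        self.children = {}  # letter -> Node
        self.starts = []    # 1-based starts of the suffixes passing through

def build_suffix_trie(text):
    """Insert every suffix of the text; quadratic in the length of the text."""
    root = Node()
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.children.setdefault(letter, Node())
            node.starts.append(start + 1)
    return root

def suffix_trie_pattern_matching(p, trie):
    """Thread p from the root; return all 1-based match positions (empty if none)."""
    node = trie
    for letter in p:
        if letter not in node.children:
            return []  # threading failed: pattern does not appear in text
        node = node.children[letter]
    return node.starts

trie = build_suffix_trie("ATCATG")
print(suffix_trie_pattern_matching("AT", trie))
```

Once the structure is built, each lookup touches only n vertices, which mirrors the O(n) threading step of the pseudocode.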

Page 50: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Suffix Trees: Example

Page 51: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Suffix arrays

• A related structure is the suffix array.
• It can also be constructed in linear time and was developed to save space.
• The space requirement of a suffix array is significantly less than that for a suffix tree.
• The suffix array also forms the basis for the Burrows-Wheeler Transform (BWT), which is behind the operation of the read-alignment program Bowtie.
• Do look it up: it’s cool, even if it’s BTSOTC.
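
For the curious, a suffix array can be sketched in a few lines. The construction below simply sorts the suffixes, which is far from linear time but shows what the structure contains; lookup is a pair of binary searches over the sorted suffixes. The names `build_suffix_array` and `find_occurrences` are invented for this illustration.

```python
def build_suffix_array(text):
    """0-based start positions of all suffixes, in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(pattern, text, sa):
    """Binary-search the block of suffixes that start with the pattern."""
    n = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose n-letter prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + n] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose n-letter prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + n] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(i + 1 for i in sa[first:lo])  # 1-based positions

text = "ATCATG"
sa = build_suffix_array(text)
print(sa, find_occurrences("AT", text, sa))
```

The array stores one integer per text position, which is the space saving relative to a tree of vertices and edges.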

Page 52: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Multiple Pattern Matching: Summary

• Keyword and suffix trees are used to find patterns in a text

• Keyword trees:

• Build keyword tree of patterns, and thread text through it

• Suffix trees:

• Build suffix tree of text, and thread patterns through it

Page 53: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Approximate vs. Exact Pattern Matching

• So far all we’ve seen are exact pattern matching algorithms

• Usually, because of mutations, it makes much more biological sense to find approximate pattern matches

• Biologists often use fast heuristic approaches (rather than local alignment) to find approximate matches

Page 54: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Heuristic Similarity Searches

• Genomes are huge: Smith-Waterman quadratic alignment algorithms are too slow

• Alignment of two sequences usually has short identical or highly similar fragments

• Many heuristic methods (e.g., FASTA) are based on the same idea of filtration:

• Find short exact matches, and use them as seeds for potential match extension

• “Filter” out positions with no extendable matches

Page 55: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Dot Matrices

• Dot matrices show similarities between two sequences

• FASTA makes an implicit dot matrix from short exact matches, and tries to find long diagonals (allowing for some mismatches)

Page 56: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Dot Matrices (cont’d)

• Identify diagonals above a threshold length

• Diagonals in the dot matrix indicate exact substring matching

Page 57: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Diagonals in Dot Matrices

• Extend diagonals and try to link them together, allowing for minimal mismatches/indels

• Linking diagonals reveals approximate matches over longer substrings

Page 58: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Approximate Pattern Matching Problem

• Goal: Find all approximate occurrences of a pattern in a text

• Input: A pattern p = p1…pn, text t = t1…tm, and k, the maximum number of mismatches

• Output: All positions 1 ≤ i ≤ (m – n + 1) such that ti…ti+n-1 and p1…pn have at most k mismatches (i.e., Hamming distance between ti…ti+n-1 and p ≤ k)

Page 59: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Approximate Pattern Matching: A Brute-Force Algorithm

ApproximatePatternMatching(p, t, k)
1 n ← length of pattern p
2 m ← length of text t
3 for i ← 1 to m – n + 1
4     dist ← 0
5     for j ← 1 to n
6         if ti+j-1 != pj
7             dist ← dist + 1
8     if dist ≤ k
9         output i
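
In Python, the same brute-force idea could be written as the short sketch below. The function name is illustrative, positions are 1-based as in the problem statement, and a window is reported when its Hamming distance to the pattern is at most k.

```python
def approximate_pattern_matching(p, t, k):
    """Return all 1-based positions where p matches t with at most k mismatches."""
    n, m = len(p), len(t)
    hits = []
    for i in range(m - n + 1):
        # Hamming distance between the pattern and the window starting at i.
        dist = sum(1 for j in range(n) if t[i + j] != p[j])
        if dist <= k:
            hits.append(i + 1)
    return hits

print(approximate_pattern_matching("ATG", "ATCATG", 1))
```

Setting k = 0 reduces this to exact pattern matching.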

Page 60: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Approximate Pattern Matching: Running Time

• That algorithm runs in O(nm).
• Landau-Vishkin algorithm: O(km) for a text of length m
• We can generalize the “Approximate Pattern Matching Problem” into a “Query Matching Problem”:
  • We want to match substrings in a query to substrings in a text with at most k mismatches
  • Motivation: we want to see similarities to some gene, but we may not know which parts of the gene to look for

Page 61: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Query Matching Problem

• Goal: Find all substrings of the query that approximately match the text

• Input: Query q = q1…qw, text t = t1…tm, n (length of matching substrings), k (maximum number of mismatches)

• Output: All pairs of positions (i, j) such that the n-letter substring of q starting at i approximately matches the n-letter substring of t starting at j, with at most k mismatches

Page 62: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Approximate Pattern Matching vs Query Matching

Page 63: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Query Matching: Main Idea

• Approximately matching strings share some perfectly matching substrings.

• Instead of searching for approximately matching strings (difficult) search for perfectly matching substrings (easy).

Page 64: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Filtration in Query Matching

• We want all n-matches between a query and a text with up to k mismatches

• “Filter” out positions we know do not match between text and query

• Potential match detection: find all matches of l-tuples in query and text for some small l

• Potential match verification: Verify each potential match by extending it to the left and right, until (k+1) mismatches are found

Page 65: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Filtration: Match Detection

• If x1…xn and y1…yn match with at most k mismatches, they must share an l-tuple that is perfectly matched, with l = ⌊n/(k+1)⌋ *

• Break string of length n into k+1 parts, each of length ⌊n/(k+1)⌋
• k mismatches can affect at most k of these k+1 parts

• At least one of these k+1 parts is perfectly matched (eh?)

* ⌊x⌋ is defined as the largest integer no bigger than x, so ⌊1.2⌋ = 1

Page 66: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Filtration: Match Detection (cont’d)

• Suppose k = 3. We would then have l = ⌊n/(k+1)⌋ = ⌊n/4⌋:

• There are at most k mismatches in n, so at the very least there must be one out of the k+1 l-tuples without a mismatch (try putting k mismatches in the array such that no two of them are further than l apart)

Parts: 1…l | l+1…2l | 2l+1…3l | 3l+1…n (numbered 1, 2, k, k+1)

Page 67: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Filtration: Match Verification

• For each l -match we find, try to extend the match further to see if it is substantial

(Figure: a query and a text sharing a perfect match of length l.)

Extend perfect match of length l until we find an approximate match of length n with k mismatches
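
Detection and verification can be combined into a small filtration sketch. Some simplifications are assumptions of this example rather than part of the slides: seeds are found with a dict of l-tuples from the text, and verification checks the Hamming distance of every n-window on the seed's diagonal that contains the seed, instead of extending letter by letter. The names `hamming` and `query_matching` are made up here.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def query_matching(q, t, n, k):
    """All 1-based (i, j) with at most k mismatches between q[i..i+n-1] and t[j..j+n-1]."""
    l = n // (k + 1)
    if l == 0:
        raise ValueError("n must be at least k + 1 for filtration to apply")
    # Potential match detection: index every l-tuple of the text.
    index = defaultdict(list)
    for j in range(len(t) - l + 1):
        index[t[j:j + l]].append(j)
    matches = set()
    for i in range(len(q) - l + 1):
        for j in index.get(q[i:i + l], ()):
            # Potential match verification: windows on this diagonal containing the seed.
            for shift in range(n - l + 1):
                qi, tj = i - shift, j - shift
                if qi < 0 or tj < 0 or qi + n > len(q) or tj + n > len(t):
                    continue
                if hamming(q[qi:qi + n], t[tj:tj + n]) <= k:
                    matches.add((qi + 1, tj + 1))
    return sorted(matches)

print(query_matching("ACGTACGT", "TTACGAACGTT", 6, 1))
```

Because at least one of the k+1 parts of any valid match is a perfect l-tuple match, every valid pair is reached from some seed, so the filter loses nothing compared with checking all pairs of windows.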

Page 68: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Filtration: Example

                 k = 0   k = 1   k = 2   k = 3   k = 4   k = 5
l-tuple length   n       n/2     n/3     n/4     n/5     n/6

As k grows: shorter perfect matches required, performance decreases

Page 69: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Local alignment is too slow…

• Quadratic local alignment is too slow while looking for similarities between long strings (e.g., the entire GenBank database)

Page 70: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Local alignment is too slow…

• Quadratic local alignment is too slow while looking for similarities between long strings (e.g., the entire GenBank database)

• But:
  • Guaranteed to find the optimal local alignment
  • Sets the standard for sensitivity

Page 71: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Local alignment is too slow…

• Quadratic local alignment is too slow while looking for similarities between long strings (e.g. the entire GenBank database)

• Basic Local Alignment Search Tool (BLAST)
  • Altschul, S., Gish, W., Miller, W., Myers, E. & Lipman, D.J., Journal of Mol. Biol., 1990
  • Search sequence databases for local alignments to a query

Page 72: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

BLAST

• Great improvement in speed, with a modest decrease in sensitivity

• Minimizes search space instead of exploring entire search space between two sequences

• Finds short exact matches (“seeds”), only explores locally around these “hits”

Page 73: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

What Similarity Reveals

• BLASTing a new gene

• Evolutionary relationship

• Similarity between protein function

• BLASTing a genome

• Potential genes

Page 74: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

BLAST algorithm

• Keyword search of all words of length w from the query of length n in database of length m with score above threshold

• w = 11 for DNA queries, w = 3 for proteins

• Local alignment extension for each found keyword

• Extend result until longest match above threshold is achieved

• Running time O(nm)

Page 75: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

BLAST algorithm (cont’d)

Query:  22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
           +++DN +G +   IR L    G+K I+ L+ E+ RG++K
Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263

Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD

Keyword and neighbourhood words, with scores:

GVK 18
GAK 16
GIK 16
GGK 14
GLK 13
GNK 12
GRK 11
GEK 11
GDK 11

Neighbourhood score threshold (T = 13)

High-scoring Pair (HSP) extension

Page 76: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Original BLAST

• Dictionary

• All words of length w

• Alignment

• Ungapped extensions until score falls below some statistical threshold

• Output

• All local alignments with score > threshold

Page 77: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Original BLAST: Example

(Figure: dot matrix. Letters along one axis: A C G A A G T A A G G T C C A G T; letters along the other axis, in extraction order: C T G A T C C T G G A T T G C G A.)

• w = 4

• Exact keyword match of GGTC

• Extend diagonals with mismatches until score is under 50%

• Output result:
GTAAGGTCC
GTTAGGTCC

From lectures by Serafim Batzoglou (Stanford)
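
An ungapped seed extension can be sketched as follows. Several things here are assumptions of the sketch, not the slide's exact rule: scoring is +1 per match and -1 per mismatch, extension in each direction stops once the running score drops `xdrop` below the best score seen (rather than the 50% criterion), and the second sequence is the side-axis letters of the figure read in reverse extraction order, chosen because that reading contains the output GTTAGGTCC.

```python
def ungapped_extend(query, subject, qi, sj, w, match=1, mismatch=-1, xdrop=2):
    """Extend the exact seed query[qi:qi+w] == subject[sj:sj+w] without gaps.

    Returns (query start, subject start, length, score), all 0-based.
    """
    def walk(q, s, step):
        # Walk along the diagonal, remembering the best-scoring extension length.
        score = best = best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            score += match if query[q] == subject[s] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= xdrop:
                break
            q += step
            s += step
        return best, best_len

    right_score, right_len = walk(qi + w, sj + w, +1)
    left_score, left_len = walk(qi - 1, sj - 1, -1)
    total = w * match + left_score + right_score
    return qi - left_len, sj - left_len, w + left_len + right_len, total

query = "ACGAAGTAAGGTCCAGT"
subject = "AGCGTTAGGTCCTAGTC"  # assumed reading of the second axis
q0, s0, length, score = ungapped_extend(query, subject, 9, 7, 4)
print(query[q0:q0 + length])
print(subject[s0:s0 + length])
```

With these assumed parameters the seed GGTC grows into the pair GTAAGGTCC / GTTAGGTCC, the same segments shown as the slide's output.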

Page 78: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Gapped BLAST: Example

• Original BLAST exact keyword search, THEN:

• Extend with gaps around ends of exact match until score < threshold

• Output result:
GTAAGGTCCAGT
GTTAGGTC-AGT

(Figure: dot matrix. Letters along one axis: A C G A A G T A A G G T C C A G T; letters along the other axis, in extraction order: C T G A T C C T G G A T T G C G A.)

From lectures by Serafim Batzoglou (Stanford)

Page 79: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Incarnations of BLAST

• blastn: Nucleotide-nucleotide

• blastp: Protein-protein

• blastx: Translated query vs. protein database

• tblastn: Protein query vs. translated database

• tblastx: Translated query vs. translated database (6 frames each)

Page 80: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Incarnations of BLAST (cont’d)

• PSI-BLAST
  • Find members of a protein family or build a custom position-specific score matrix
• Megablast:
  • Search longer sequences with fewer differences
• WU-BLAST: (Wash U BLAST)
  • Optimized, added features

Page 81: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Assessing sequence similarity

• We need to know how strong an alignment can be expected from chance alone

• “Chance” relates to comparison of sequences that are generated randomly based upon a certain sequence model

• Sequence models may take into account:
  • G+C content
  • Poly-A tails
  • “Junk” DNA
  • Codon bias
  • Etc.

Page 82: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

BLAST: Segment Score

• BLAST uses scoring matrices (δ) to improve on efficiency of match detection
  • Some proteins may have very different amino acid sequences, but are still similar

• For any two l-mers x1…xl and y1…yl:
  • Segment pair: pair of l-mers, one from each sequence
  • Segment score: $\sum_{i=1}^{l} \delta(x_i, y_i)$
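
The segment score is just a sum of per-position scores, as the sketch below shows. The scoring function `toy_delta` (+1 for a match, -1 for a mismatch) is a hypothetical stand-in for a real scoring matrix, and both function names are invented for this example.

```python
def segment_score(x, y, delta):
    """Sum of delta(x_i, y_i) over the positions of two equal-length l-mers."""
    if len(x) != len(y):
        raise ValueError("a segment pair needs two l-mers of the same length")
    return sum(delta(a, b) for a, b in zip(x, y))

def toy_delta(a, b):
    """Hypothetical scoring function: +1 for identical letters, -1 otherwise."""
    return 1 if a == b else -1

print(segment_score("GVK", "GAK", toy_delta))
```

Passing the scoring function as a parameter keeps the sum independent of whichever matrix is actually used.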

Page 83: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

BLAST: Locally Maximal Segment Pairs

• A segment pair is maximal if it has the best score over all segment pairs

• A segment pair is locally maximal if its score can’t be improved by extending or shortening

• Statistically significant locally maximal segment pairs are of biological interest

• BLAST finds all locally maximal segment pairs with scores above some threshold
  • A sufficiently high threshold will filter out some statistically insignificant matches

Page 84: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

BLAST: Statistics

• Threshold: Altschul-Dembo-Karlin statistics
  • Identifies smallest segment score that is unlikely to happen by chance
• # matches above θ has mean $E(\theta) = K m n e^{-\lambda\theta}$; K is a constant, m and n are the lengths of the two compared sequences

• Parameter λ is positive root of

$$\sum_{x, y \in A} p_x p_y e^{\lambda\,\delta(x, y)} = 1$$

where px and py are frequencies of amino acids x and y, and A is the twenty letter amino acid alphabet

Page 85: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

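P-values

• The probability of finding b HSPs with a score ≥ S is given by $e^{-E} E^{b} / b!$, where E is the expected number of HSPs with a score ≥ S

• For b = 0, that chance is just $e^{-E}$

• Thus the probability of finding at least one HSP with a score ≥ S is $P = 1 - e^{-E}$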
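
These quantities are easy to compute once E is known. The sketch below assumes the usual Poisson model for the number of HSPs and the mean K·m·n·e^(−λS) for E; the parameter values in the usage line (K = 0.1, λ = 0.3, S = 40) are made up purely for illustration, as are the function names.

```python
import math

def expected_hits(m, n, score, K, lam):
    """Expected number of HSPs with score >= `score`: K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)

def prob_exactly(b, E):
    """Poisson probability of finding exactly b HSPs when E are expected."""
    return math.exp(-E) * E ** b / math.factorial(b)

def prob_at_least_one(E):
    """P-value: probability of finding at least one HSP."""
    return 1.0 - math.exp(-E)

E = expected_hits(m=1000, n=2000, score=40, K=0.1, lam=0.3)  # hypothetical values
print(E, prob_at_least_one(E))
```

For small E the P-value is close to E itself, which is why very small E-values and P-values are often quoted interchangeably.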

Page 86: COMP3456 – adapted from textbook slides Combinatorial Pattern Matching

Sample BLAST output

                                                                    Score    E
Sequences producing significant alignments:                         (bits) Value

gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio] >gi|147757...   171  3e-44
gi|18858331|ref|NP_571096.1| ba2 globin; SI:dZ118J2.3 [Danio rer...   170  7e-44
gi|37606100|emb|CAE48992.1| SI:bY187G17.6 (novel beta globin) [D...   170  7e-44
gi|31419195|gb|AAH53176.1| Ba1 protein [Danio rerio]                  168  3e-43

ALIGNMENTS
>gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio]
Length = 148

Score = 171 bits (434), Expect = 3e-44
Identities = 76/148 (51%), Positives = 106/148 (71%), Gaps = 1/148 (0%)

Query: 1   MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK 60
           MV  T  E++A+  LWGK+N+DE+G +AL R L+VYPWTQR+F +FG+LS+P A+MGNPK
Sbjct: 1   MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWTQRYFATFGNLSSPAAIMGNPK 60

Query: 61  VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG 120
           V AHG+ V+G     + ++DN+K T+A LS +H +KLHVDP+NFRLL + +    A  FG
Sbjct: 61  VAAHGRTVMGGLERAIKNMDNVKNTYAALSVMHSEKLHVDPDNFRLLADCITVCAAMKFG 120

Query: 121 KE-FTPPVQAAYQKVVAGVANALAHKYH 147
           +  F   VQ A+QK +A V +AL  +YH
Sbjct: 121 QAGFNADVQEAWQKFLAVVVSALCRQYH 148

• BLAST of human beta globin protein against zebrafish

Sample BLAST output (cont’d)

                                                                    Score     E
Sequences producing significant alignments:                         (bits)  Value

gi|19849266|gb|AF487523.1| Homo sapiens gamma A hemoglobin (HBG1...   289   1e-75
gi|183868|gb|M11427.1|HUMHBG3E Human gamma-globin mRNA, 3' end        289   1e-75
gi|44887617|gb|AY534688.1| Homo sapiens A-gamma globin (HBG1) ge...   280   1e-72
gi|31726|emb|V00512.1|HSGGL1 Human messenger RNA for gamma-globin     260   1e-66
gi|38683401|ref|NR_001589.1| Homo sapiens hemoglobin, beta pseud...   151   7e-34
gi|18462073|gb|AF339400.1| Homo sapiens haplotype PB26 beta-glob...   149   3e-33

ALIGNMENTS
>gi|28380636|ref|NG_000007.3| Homo sapiens beta globin region (HBB@) on chromosome 11
Length = 81706

Score = 149 bits (75), Expect = 3e-33
Identities = 183/219 (83%)
Strand = Plus / Plus

Query: 267   ttgggagatgccacaaagcacctggatgatctcaagggcacctttgcccagctgagtgaa 326
             || ||| | ||    | || |  |||||| ||||| |||||||||||    ||||||||
Sbjct: 54409 ttcggaaaagctgttatgctcacggatgacctcaaaggcacctttgctacactgagtgac 54468

Query: 327   ctgcactgtgacaagctgcatgtggatcctgagaacttc 365
             ||||||||| |||||||||| ||||| ||||||||||||
Sbjct: 54469 ctgcactgtaacaagctgcacgtggaccctgagaacttc 54507

• BLAST of human beta globin DNA against human DNA

Timeline

• 1970: Needleman-Wunsch global alignment algorithm
• 1981: Smith-Waterman local alignment algorithm
• 1985: FASTA
• 1990: BLAST (basic local alignment search tool)
• 2000s: BLAST has become too slow in “genome vs. genome” comparisons; new, faster algorithms evolve!
  • PatternHunter
  • BLAT

PatternHunter: faster and even more sensitive

• BLAST: matches short consecutive sequences (consecutive seed)
  • Length = k
  • Example (k = 11):

    11111111111

    Each 1 represents a “match”

• PatternHunter: matches short non-consecutive sequences (spaced seed)
  • Increases sensitivity by locating homologies that would otherwise be missed
  • Example (a spaced seed of length 18 with 11 “matches”):

    111010010100110111

    Each 0 represents a “don’t care”, so there can be a match or a mismatch
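
The difference between the two seed shapes can be sketched in a few lines. The code below is only an illustration: it assumes the two sequences are already lined up without gaps, and the function name `seed_hits` and the toy sequences are made up for the example. A seed hits at an offset when every position marked 1 matches; positions marked 0 are ignored.

```python
def seed_hits(a, b, seed):
    """Offsets at which every '1' position of the seed matches between a and b.

    Assumes a gapless, same-offset comparison of the two sequences.
    """
    care = [i for i, c in enumerate(seed) if c == "1"]
    n = min(len(a), len(b))
    return [s for s in range(n - len(seed) + 1)
            if all(a[s + i] == b[s + i] for i in care)]

CONSECUTIVE_SEED = "1" * 11             # BLAST-style seed, k = 11
SPACED_SEED = "111010010100110111"      # the spaced seed shown on the slide

# Toy pair (an assumption): mismatches only at offsets 3 and 10,
# both of which are "don't care" positions of the spaced seed.
a = "ACGTACGTACGTACGTAC"
b = "ACGAACGTACCTACGTAC"

print(seed_hits(a, b, CONSECUTIVE_SEED))   # no window of 11 straight matches
print(seed_hits(a, b, SPACED_SEED))        # the spaced seed hits at offset 0
```

Both seeds require 11 matching positions, but the spaced seed tolerates mismatches at its 0 positions.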

Spaced seeds

Example of a hit using a spaced seed:

How does this result in better sensitivity?

Why is PH better?

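• BLAST: redundant hits
  A consecutive seed shifted by one position still overlaps itself in all but one position, so a single homology results in > 1 hit and creates clusters of redundant hits

• PatternHunter: few redundant hits
  A spaced seed shifted by one position overlaps itself in far fewer “match” positions, so this results in very few redundant hits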

Why is PH better?

BLAST may also miss a hit

GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT
|| ||||||||| |||||| | ||||||   ||||||
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT

In this example, despite a clear homology, the longest run of consecutive matches has length 9. BLAST uses a seed of length 11, so it does not recognize this as a hit!

Resolving this would require reducing the seed length to 9, which would have a damaging effect on speed
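
The claim about this pair of sequences can be checked with a short sketch. The helper names are assumptions made for the example; the code simply measures the longest run of identical positions in the gapless alignment and compares it with the seed length.

```python
def longest_match_run(a, b):
    """Length of the longest run of identical positions in a gapless alignment."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best

def consecutive_seed_hits(a, b, k):
    """True if a consecutive seed of length k would hit somewhere."""
    return longest_match_run(a, b) >= k

s1 = "GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT"
s2 = "GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT"

print(longest_match_run(s1, s2))
print(consecutive_seed_hits(s1, s2, 11), consecutive_seed_hits(s1, s2, 9))
```

A seed of length 9 would find this homology, but shorter seeds also produce many more random hits, which is the speed cost mentioned above.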

Why is PH better?

• Higher hit probability

• Lower expected number of random hits

Use of Multiple Seeds

Basic Searching Algorithm

1. Select a group of spaced seed models

2. For each hit of each model, conduct extension to find a homology.
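
The two steps above can be sketched as follows. This is a simplified illustration, not PatternHunter's actual procedure: it assumes a gapless comparison at the same offset, the extension step merely grows the hit window while the flanking characters match (real tools use scored extension), and all names and the toy sequences are hypothetical.

```python
def find_hits(a, b, seed):
    """Offsets where every '1' position of the seed matches (gapless)."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    n = min(len(a), len(b))
    return [s for s in range(n - len(seed) + 1)
            if all(a[s + i] == b[s + i] for i in care)]

def extend(a, b, start, length):
    """Grow the hit window [start, start + length) while flanking characters match."""
    n = min(len(a), len(b))
    left, right = start, start + length
    while left > 0 and a[left - 1] == b[left - 1]:
        left -= 1
    while right < n and a[right] == b[right]:
        right += 1
    return left, right          # half-open interval

def search(a, b, seeds):
    """Step 1: try each seed model. Step 2: extend every hit."""
    regions = set()
    for seed in seeds:
        for s in find_hits(a, b, seed):
            regions.add(extend(a, b, s, len(seed)))
    return sorted(regions)

a = "CCACGTACGTAA"
b = "GGACGAACGTTT"
print(search(a, b, ["1111"]))
print(search(a, b, ["1111", "11011"]))
```

In this toy case the second seed model recovers a longer region that the first one alone does not report, which is the motivation for using several models together.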

Another method: BLAT

• BLAT (BLAST-Like Alignment Tool)

• Same idea as BLAST - locate short sequence hits and extend

BLAT vs. BLAST: Differences

• BLAT builds an index of the database and scans linearly through the query sequence,

whereas

• BLAST builds an index of the query sequence and then scans linearly through the database

• BLAT's index is stored in RAM, which is memory intensive but results in faster searches

BLAT: Fast cDNA Alignments

Steps:
1. Break cDNA into 500 base chunks.
2. Use an index to find regions in genome similar to each chunk of cDNA.
3. Do a detailed alignment between genomic regions and cDNA chunk.
4. Use dynamic programming to stitch together detailed alignments of chunks into detailed alignment of whole.

BLAT: Indexing

• An index is built that contains the positions of each k-mer in the genome

• Each k-mer in the query sequence is compared to each k-mer in the index

• A list of ‘hits’ is generated - positions in cDNA and in genome that match for k bases

Indexing: An Example

Here is an example with k = 3:

Genome: cacaattatcacgaccgc
3-mers (non-overlapping): cac aat tat cac gac cgc
Index (multiple instances map to a single index entry):
  aat 3
  cac 0,9
  cgc 15
  gac 12
  tat 6

cDNA (query sequence): aattctcac
3-mers (overlapping): aat att ttc tct ctc tca cac
Position in query:     0   1   2   3   4   5   6

Hits (3-mer, then position in query, position in genome):
  aat 0,3
  cac 6,0
  cac 6,9
Clump: cacAATtatCACgaccgc
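
The example can be reproduced with a small sketch. The function names are assumptions, and a plain Python dictionary stands in for BLAT's in-memory index. The genome is indexed with non-overlapping k-mers, and the query is scanned with overlapping k-mers.

```python
from collections import defaultdict

def build_index(genome, k):
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index

def find_hits(query, index, k):
    """Look up every overlapping k-mer of the query in the index."""
    hits = []
    for qpos in range(len(query) - k + 1):
        kmer = query[qpos:qpos + k]
        for gpos in index.get(kmer, []):
            hits.append((kmer, qpos, gpos))   # (k-mer, query pos, genome pos)
    return hits

genome = "cacaattatcacgaccgc"
index = build_index(genome, 3)
hits = find_hits("aattctcac", index, 3)
for kmer, qpos, gpos in hits:
    print(kmer, qpos, gpos)
```

Indexing only non-overlapping k-mers keeps the index roughly k times smaller, while scanning the query with overlapping k-mers ensures a match is not missed because of its phase.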

However…

• BLAT was designed to find sequences of 95% and greater similarity of length >40; may miss more divergent or shorter sequence alignments

PatternHunter and BLAT vs. BLAST

• PatternHunter is 5-100 times faster than Blastn, depending on data size, at the same sensitivity

• BLAT is several times faster than BLAST, but best results are limited to closely related sequences

Resources

• tandem.bu.edu/classes/2004/papers/pathunter_grp_prsnt.ppt
• http://www.jax.org/courses/archives/2004/gsa04_king_presentation.pdf
• http://www.genomeblat.com/genomeblat/blatRapShow.pps