Dynamic Programming and Biological Sequence Comparison
Part I
\course\eleg667-01-f\Topic-2a.ppt 2
Topic II – Biological Sequence Alignment and Database Search
Part I (Topic-2a): Dynamic programming and Sequence comparison
Part II (Topic-2b): Heuristic and Database Search (e.g. FAST, BLAST) sequence alignment
Part III (Topic-2c): Multiple sequence alignment
Outline
Concept of alignment
Two algorithm design techniques
Dynamic Programming: Examples
Applying DP to Sequence Comparison
The database search problem
Heuristic algorithms for database search
Alignment
An alignment places one sequence above the other so that:
The two sequences have the same length (after possible insertion of spaces into either or both of them)
No space in one sequence is aligned with a space in the other
Spaces may be inserted at the beginning or end of the sequences
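The alignment conditions above can be checked mechanically. The following is a minimal sketch, assuming that a space is written as "-" and that the two padded rows are passed together with the original sequences; the function name is_valid_alignment is hypothetical.

```python
def is_valid_alignment(row_x: str, row_y: str, x: str, y: str, space: str = "-") -> bool:
    """Check whether (row_x, row_y) is an alignment of the sequences x and y."""
    # Condition 1: after inserting spaces, both rows have the same length.
    if len(row_x) != len(row_y):
        return False
    # Condition 2: no column pairs a space with a space.
    if any(a == space and b == space for a, b in zip(row_x, row_y)):
        return False
    # Removing the spaces must give back the original sequences.
    return row_x.replace(space, "") == x and row_y.replace(space, "") == y


print(is_valid_alignment("ATA-AGT", "ATGCAGT", "ATAAGT", "ATGCAGT"))
```

Spaces at the beginning or end of a row need no special handling here, since they are allowed like any other space.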
Biological Sequence Alignment and Database Search
1. We have two sequences over the same alphabet, both about the same length (tens of thousands of characters), and the sequences are almost equal. The differences are rare, say, one per hundred characters. We want to find the places where the differences occur.
2. We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there is a prefix of one which is similar to a suffix of the other.
3. We have the same problem as in (2), but now we have several hundred sequences that must be compared (each one against all). In addition, we know that the great majority of sequence pairs are unrelated, that is, they will not have the required degree of similarity.
4. We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there are two substrings, one from each sequence, that are similar.
5. We have the same problem as in (4), but instead of two sequences we have one sequence that must be compared to thousands of others.
Two Related Algorithm Design Techniques
Breaking Problems Down:
Divide and Conquer: Starting with the complete instance of a problem, divide it into smaller subinstances, solve each of them recursively and combine the partial solutions into a solution to the original problem.
Dynamic Programming: Starting with the smallest subinstances of a problem, solve and combine them until the complete instance of the original problem is solved.
Divide and Conquer – Example 1
Quick Sort
9 1 25 4 15 becomes 4 1 | 9 | 25 15
4 1 becomes 1 4
25 15 becomes 15 25
Result: 1 4 9 15 25
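The quick sort example can be sketched in a few lines. This is an illustrative version only, assuming the first element is used as the pivot (as 9 is in the example) and building new lists instead of partitioning in place, so the intermediate order of the sublists may differ from the slide.

```python
def quick_sort(values):
    """Sort a list by divide and conquer (first element as pivot)."""
    if len(values) <= 1:
        return list(values)          # smallest subinstance: already sorted
    pivot = values[0]
    smaller = [v for v in values[1:] if v < pivot]    # divide
    larger = [v for v in values[1:] if v >= pivot]
    # Solve both subinstances recursively, then combine.
    return quick_sort(smaller) + [pivot] + quick_sort(larger)


print(quick_sort([9, 1, 25, 4, 15]))
```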
Divide and Conquer – Example 2
The Fibonacci numbers
F1 = 1, F2 = 1
Fn = Fn-1 + Fn-2
1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …
Fib(n) { if (n <= 2) return 1; else return Fib(n-1) + Fib(n-2); }
Divide and Conquer – Example 2
F1 = 1, F2 = 1
Fn = Fn-1 + Fn-2
[Figure: recursion tree for F(7). F(7) calls F(6) and F(5), and every call expands down to the leaves F(2) and F(1); subtrees such as F(5), F(4) and F(3) are recomputed several times.]
n    1  2  3  4  5  6  7   8   9   10  11  …
Fn   1  1  2  3  5  8  13  21  34  55  89  …
Fn / Fn-1 ≈ 1.6, so Fn ≈ 1.6^n for n >> 1
T(n) ≈ #internal nodes = #leaves - 1, but #leaves = Fn
T(n) = O(1.6^n): exponential time!
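The size of the recursion tree can be measured directly. The sketch below, with the hypothetical name fib_calls, returns both Fn and the number of calls the naive recursion makes, assuming the base cases F1 = F2 = 1 are the leaves.

```python
def fib_calls(n):
    """Return (F_n, number of calls made by the naive recursion)."""
    if n <= 2:
        return 1, 1                  # a leaf of the recursion tree
    a, calls_a = fib_calls(n - 1)
    b, calls_b = fib_calls(n - 2)
    # One internal node plus everything below it.
    return a + b, calls_a + calls_b + 1


for n in (7, 10, 20):
    value, calls = fib_calls(n)
    print(n, value, calls)
```

For F(7) this gives 13 leaves and 12 internal nodes, 25 calls in total, in line with #internal nodes = #leaves - 1.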
How to Compute Fib Function Using Dynamic Programming Method?
Dynamic Programming–Example 1
Fib(n) {
    if (n <= 2) return 1;
    int tab[n + 1];    // entries tab[1..n] are used
    tab[1] = 1; tab[2] = 1;
    for (j = 3; j <= n; j++)
        tab[j] = tab[j-1] + tab[j-2];
    return tab[n];
}
Start by solving the smallest problems
Use the partial solutions to solve bigger and bigger problems
Extra memory to store intermediate values
tab: 1 1 2 3 5 8 13 21 34 55 89 …
Linear time! T(n) = O(n)
Space-Time Tradeoff
Sequence Comparison
Molecular sequence data are at the heart of Computational Biology
    DNA sequences
    RNA sequences
    Protein sequences
We can think of these sequences as strings of letters
    DNA: alphabet of 4 letters (A, T, C, G)
    RNA: alphabet of 4 letters (A, U, C, G)
    Protein: alphabet of 20 letters
code  full name
A     alanine
C     cysteine
D     aspartate
E     glutamate
F     phenylalanine
G     glycine
H     histidine
I     isoleucine
K     lysine
L     leucine
M     methionine
N     asparagine
P     proline
Q     glutamine
R     arginine
S     serine
T     threonine
V     valine
W     tryptophan
Y     tyrosine
Sequence Comparison – (Cont.)
Why compare sequences?
Find similar genes/proteins
    This allows us to predict function & structure
Locate common subsequences in genes/proteins
    Identify common recurrent patterns
Locate sequences that might overlap
    Help in sequence assembly
Sequence Comparison – (Cont.)
To compare the sequences we need to quantify the similarity
matches = 1, mismatches = 0
Sequence X = A T A A G T
Sequence Y = A T G C A G T
Score        1 1 0 0 0 0 0
Total = 2
Sequence Comparison – (Cont.)
Taking positions of the letters into account
matches = 1, mismatches = 0
Sequence X =   A T A A G T
Sequence Y = A T G C A G T
Score        0 0 0 0 1 1 1
Total = 3
Sequence Comparison – (Cont.)
How to take possible mutations into account?
matches = 1, mismatches = 0, gap = -1
Sequence X = A T A - A G T
Sequence Y = A T G C A G T
Score        1 1 0 -1 1 1 1
Total = 4
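The column-by-column scoring used in the three comparisons above can be written as a small function. This is a sketch under the slides' scoring (match = 1, mismatch = 0, gap = -1); the name alignment_score and the use of "-" for a space are assumptions.

```python
def alignment_score(row_x, row_y, match=1, mismatch=0, gap=-1):
    """Sum the column scores of two equally long alignment rows."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total += gap             # a letter paired with a space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(alignment_score("ATA-AGT", "ATGCAGT"))          # gap inside X
print(alignment_score("-ATAAGT", "ATGCAGT", gap=0))   # X shifted by one, no gap penalty
```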
Applying DP to Sequence Comparison
Sequence X = GA
Sequence Y = AG
[Figure: tree enumerating every alignment of X and Y one column at a time. Each level adds one column to the partial alignments (for example G over -, - over A, G over A), and each node is labelled with the score of its partial alignment.]
Scores by level:
-1 -1
-2 -2 0 -2 -2
-3 0 -3 -3 -1 -1 -3 -3 0 -3
-4 -1 -4 -2 -4 -2 0 -2 -4 -2 -4 -1 -4
Choose the best score at each step, i.e. max(-2, 0, -2), max(-3, 0, -1), max(-1, 0, -3), max(-1, 0, -1)
Total score = 0
T(n,n) = O(k^n): exponential time!
Applying DP to Sequence Comparison
Sequence X = GA
Sequence Y = AG
[Figure: the same enumeration tree as on the previous slide, now collapsed into a matrix that keeps only the best score for each pair of prefixes.]
        G    A
    0  -1   -2
A  -1   0    0
G  -2   0    0
T(n,n) = O(n^2): polynomial time!
Questions
Question: when the DP comparison ends, how many distinct paths have been explored in total for this example?
Answer: Let us count. Total = 13
        G    A
    0  -1   -2
A  -1   0    0
G  -2   0    0
Number the cells of the matrix:
1 2 4
3 5 7
6 8 9
Question: how many paths lead from 1 to 9?
[Figure: tree of all paths from cell 1 to cell 9. Cell 1 branches to cells 2, 3 and 5, and every branch ends at cell 9.]
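The count of 13 can be reproduced with a short recursion. The sketch below assumes that a path may step right, down, or diagonally from one cell to the next, which matches the three terms of the DP recurrence; count_paths is a hypothetical name.

```python
from functools import lru_cache


def count_paths(m, n):
    """Count paths from the top-left cell (0, 0) to cell (m, n)."""

    @lru_cache(maxsize=None)
    def paths(i, j):
        if i == 0 or j == 0:
            return 1                 # only one way along the border
        # Arrive from above, from the diagonal, or from the left.
        return paths(i - 1, j) + paths(i - 1, j - 1) + paths(i, j - 1)

    return paths(m, n)


print(count_paths(2, 2))
```

The number of paths grows quickly with the sequence lengths, while the matrix itself has only (m + 1)(n + 1) cells.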
DP algorithm for Sequence Comparison
m = length(X)
n = length(Y)
int S[0..m, 0..n]
for i = 0 to m do S[i,0] = i · g
for j = 0 to n do S[0,j] = j · g
for i = 1 to m do
    for j = 1 to n do
        S[i,j] = max( S[i-1,j] + g,
                      S[i-1,j-1] + sb[X[i],Y[j]],
                      S[i,j-1] + g )
return S[m,n]
sb - Substitution Matrix
     A  T  C  G
A    1  0  0  0
T    0  1  0  0
C    0  0  1  0
G    0  0  0  1
Start by solving the smallest problems
Use the partial solutions to solve bigger and bigger problems
Extra memory to store intermediate values
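A runnable version of the pseudocode might look as follows. It is a sketch, assuming the identity substitution matrix is expressed through the parameters match and mismatch and that g is the gap score; the function name global_similarity is not from the slides.

```python
def global_similarity(x, y, match=1, mismatch=0, gap=-1):
    """Fill the similarity matrix S for global alignment of x and y."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap            # prefix of x against the empty prefix
    for j in range(1, n + 1):
        S[0][j] = j * gap            # empty prefix against a prefix of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j] + gap,      # space in y
                          S[i - 1][j - 1] + sub,  # letter against letter
                          S[i][j - 1] + gap)      # space in x
    return S


S = global_similarity("ATAAGT", "ATGCAGT")
print(S[-1][-1])
```

For X = GA and Y = AG the same function reproduces the 3 x 3 matrix shown earlier, with a final score of 0.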
The Substitution Matrix
For DNA we usually use identity matrices:
     A  T  C  G
A    1  0  0  0
T    0  1  0  0
C    0  0  1  0
G    0  0  0  1
For proteins more sensitive matrices, derived empirically, are used:
    A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   Z
A   2   0  -2   0   0  -4   1  -1  -1  -1  -2  -1   0   1   0  -2   1   1   0  -6  -3   0
B   0   2  -4   3   2  -5   0   1  -2   1  -3  -2   2  -1   1  -1   0   0  -2  -5  -3   2
C  -2  -4  12  -5  -5  -4  -3  -3  -2  -5  -6  -5  -4  -3  -5  -4   0  -2  -2  -8   0  -5
D   0   3  -5   4   3  -6   1   1  -2   0  -4  -3   2  -1   2  -1   0   0  -2  -7  -4   3
E   0   2  -5   3   4  -5   0   1  -2   0  -3  -2   1  -1   2  -1   0   0  -2  -7  -4   3
F  -4  -5  -4  -6  -5   9  -5  -2   1  -5   2   0  -4  -5  -5  -4  -3  -3  -1   0   7  -5
G   1   0  -3   1   0  -5   5  -2  -3  -2  -4  -3   0  -1  -1  -3   1   0  -1  -7  -5  -1
H  -1   1  -3   1   1  -2  -2   6  -2   0  -2  -2   2   0   3   2  -1  -1  -2  -3   0   2
I  -1  -2  -2  -2  -2   1  -3  -2   5  -2   2   2  -2  -2  -2  -2  -1   0   4  -5  -1  -2
K  -1   1  -5   0   0  -5  -2   0  -2   5  -3   0   1  -1   1   3   0   0  -2  -3  -4   0
L  -2  -3  -6  -4  -3   2  -4  -2   2  -3   6   4  -3  -3  -2  -3  -3  -2   2  -2  -1  -3
M  -1  -2  -5  -3  -2   0  -3  -2   2   0   4   6  -2  -2  -1   0  -2  -1   2  -4  -2  -2
N   0   2  -4   2   1  -4   0   2  -2   1  -3  -2   2  -1   1   0   1   0  -2  -4  -2   1
P   1  -1  -3  -1  -1  -5  -1   0  -2  -1  -3  -2  -1   6   0   0   1   0  -1  -6  -5   0
Q   0   1  -5   2   2  -5  -1   3  -2   1  -2  -1   1   0   4   1  -1  -1  -2  -5  -4   3
R  -2  -1  -4  -1  -1  -4  -3   2  -2   3  -3   0   0   0   1   6   0  -1  -2   2  -4   0
S   1   0   0   0   0  -3   1  -1  -1   0  -3  -2   1   1  -1   0   2   1  -1  -2  -3   0
T   1   0  -2   0   0  -3   0  -1   0   0  -2  -1   0   0  -1  -1   1   3   0  -5  -3  -1
V   0  -2  -2  -2  -2  -1  -1  -2   4  -2   2   2  -2  -1  -2  -2  -1   0   4  -6  -2  -2
W  -6  -5  -8  -7  -7   0  -7  -3  -5  -3  -2  -4  -4  -6  -5   2  -2  -5  -6  17   0  -6
Y  -3  -3   0  -4  -4   7  -5   0  -1  -4  -1  -2  -2  -5  -4  -4  -3  -3  -2   0  10  -4
Z   0   2  -5   3   3  -5  -1   2  -2   0  -3  -2   1   0   3   0   0  -1  -2  -6  -4   3
Sequence Comparison revisited
Sequence X = A T A A G T (rows), Sequence Y = A T G C A G T (columns)
matches = 1, mismatches = 0, g = -1
Similarity Matrix
        A   T   G   C   A   G   T
    0  -1  -2  -3  -4  -5  -6  -7
A  -1   1   0  -1  -2  -3  -4  -5
T  -2   0   2   1   0  -1  -2  -3
A  -3  -1   1   2   1   1   0  -1
A  -4  -2   0   1   2   2   1   0
G  -5  -3  -1   1   1   2   3   2
T  -6  -4  -2   0   1   1   2   4
S[i,j] = max( S[i-1,j] + g, S[i-1,j-1] + sb, S[i,j-1] + g )
Entries of the first row, left to right:
 1 = max( -1 + (-1),  0 + (+1), -1 + (-1) )
 0 = max( -2 + (-1), -1 + ( 0 ),  1 + (-1) )
-1 = max( -3 + (-1), -2 + ( 0 ),  0 + (-1) )
-2 = max( -4 + (-1), -3 + ( 0 ), -1 + (-1) )
-3 = max( -5 + (-1), -4 + (+1), -2 + (-1) )
-4 = max( -6 + (-1), -5 + ( 0 ), -3 + (-1) )
-5 = max( -7 + (-1), -6 + ( 0 ), -4 + (-1) )
What To Do Next?
Answer: Finding alignments
But, How?
Finding the Alignment(s)
Similarity Matrix
        A   T   G   C   A   G   T
    0  -1  -2  -3  -4  -5  -6  -7
A  -1   1   0  -1  -2  -3  -4  -5
T  -2   0   2   1   0  -1  -2  -3
A  -3  -1   1   2   1   1   0  -1
A  -4  -2   0   1   2   2   1   0
G  -5  -3  -1   1   1   2   3   2
T  -6  -4  -2   0   1   1   2   4
Trace back from the bottom-right entry, at each step following the term that produced the maximum:
4 = max( 2 + (-1), 3 + (+1), 2 + (-1) )      T over T
3 = max( 1 + (-1), 2 + (+1), 2 + (-1) )      G T over G T
2 = max( 1 + (-1), 1 + (+1), 2 + (-1) )      A G T over A G T
1 = max( 0 + (-1), 1 + ( 0 ), 2 + (-1) )     tie: C A G T over A A G T, or C A G T over - A G T
1 = max( -1 + (-1), 0 + ( 0 ), 2 + (-1) )    G C A G T over - A A G T
2 = max( 1 + (-1), 2 + ( 0 ), 1 + (-1) )     G C A G T over A - A G T
2 = max( 0 + (-1), 1 + (+1), 0 + (-1) )      T G C A G T over T - A A G T, or T G C A G T over T A - A G T
1 = max( -1 + (-1), 0 + (+1), -1 + (-1) )    full alignments
Global Alignments
A T G C A G T
A T A - A G T

A T G C A G T
A T - A A G T
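The traceback can be sketched as a recursive walk over the filled matrix that follows every term attaining the maximum, so that ties produce several alignments. This is an illustration under the slides' scoring (match = 1, mismatch = 0, g = -1); optimal_alignments is a hypothetical name, and enumerating all ties can become expensive for long sequences.

```python
def optimal_alignments(x, y, match=1, mismatch=0, gap=-1):
    """Return (best score, list of all optimal global alignments)."""
    m, n = len(x), len(y)

    def sub(i, j):
        return match if x[i - 1] == y[j - 1] else mismatch

    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j] + gap,
                          S[i - 1][j - 1] + sub(i, j),
                          S[i][j - 1] + gap)

    results = []

    def walk(i, j, row_x, row_y):
        if i == 0 and j == 0:
            results.append((row_x, row_y))
            return
        # Follow every predecessor that explains the value of S[i][j].
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(i, j):
            walk(i - 1, j - 1, x[i - 1] + row_x, y[j - 1] + row_y)
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            walk(i - 1, j, x[i - 1] + row_x, "-" + row_y)
        if j > 0 and S[i][j] == S[i][j - 1] + gap:
            walk(i, j - 1, "-" + row_x, y[j - 1] + row_y)

    walk(m, n, "", "")
    return S[m][n], results


score, alignments = optimal_alignments("ATAAGT", "ATGCAGT")
print(score)
for row_x, row_y in alignments:
    print(row_y)
    print(row_x)
    print()
```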
How to Break a Tie?
Should one report all?
Or, report only one?
Advantage of DP Alignment Algorithms
Build up the solution by determining all similarities between arbitrary prefixes of the two sequences
Start with the shorter prefixes and use previously computed results to solve for longer prefixes
The Complexity of the DP Alignment Algorithm?
Find an optimal alignment: O(m + n)
Construction of the similarity matrix: O(m • n)
Global versus Local Alignments
A global alignment attempts to match all of one sequence against all of another
LGPSTKQFGKGSSSRIWDN
|      ||||   |  |
LNQIERSFGKGAIMRLGDA
A local alignment attempts to match subsequences of the two sequences
-------FGKG--------
       ||||
-------FGKG--------
How to Compute Local Alignment?
Applying DP to Local Alignment
Similarity Matrix Computation:
a[i,j] = max( a[i,j-1] + g,
              a[i-1,j-1] + sb(i,j),
              a[i-1,j] + g,
              0 )
a[i,0] = 0 ; for i = 0…m
a[0,j] = 0 ; for j = 0…n
If the best alignment up to some point has a negative score, it is better to start a new one rather than extend the old one.
Don't penalize gaps on the left and right ends!
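The local recurrence differs from the global one only in the zero border and the extra 0 inside the max. A minimal sketch, assuming match = 1, mismatch = 0 and g = -1 as default parameters and using the hypothetical name local_similarity:

```python
def local_similarity(x, y, match=1, mismatch=0, gap=-1):
    """Fill the similarity matrix a for local alignment of x and y."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay at 0: gaps at the ends are not penalized.
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + gap,
                          a[i - 1][j - 1] + sub,
                          a[i - 1][j] + gap,
                          0)         # restart instead of going negative
    return a


a = local_similarity("ATAAGT", "ATGCAGT")
print(max(max(row) for row in a))
```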
Criteria of Finding a Local Alignment
Find the entries with maximum values in the similarity matrix
For each such entry, construct a local alignment
See next example
We may also be interested in near-optimal alignments
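The two criteria above can be sketched as: fill the local similarity matrix, collect every entry that holds the maximum, and trace back from each until an entry with value 0 is reached. The code below is an illustration only; it reports one alignment per maximal entry (preferring the diagonal on ties) and uses hypothetical names and default scores of +1, -1 and g = -2.

```python
def local_alignments(x, y, match=1, mismatch=-1, gap=-2):
    """Return (best score, one local alignment per maximal matrix entry)."""
    m, n = len(x), len(y)

    def sub(i, j):
        return match if x[i - 1] == y[j - 1] else mismatch

    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = max(a[i][j - 1] + gap, a[i - 1][j - 1] + sub(i, j),
                          a[i - 1][j] + gap, 0)

    def trace(i, j):
        row_x, row_y = "", ""
        while a[i][j] > 0:           # stop where the alignment started
            if a[i][j] == a[i - 1][j - 1] + sub(i, j):
                row_x, row_y = x[i - 1] + row_x, y[j - 1] + row_y
                i, j = i - 1, j - 1
            elif a[i][j] == a[i - 1][j] + gap:
                row_x, row_y = x[i - 1] + row_x, "-" + row_y
                i -= 1
            else:
                row_x, row_y = "-" + row_x, y[j - 1] + row_y
                j -= 1
        return row_x, row_y

    best = max(max(row) for row in a)
    if best == 0:
        return 0, []                 # no positive-scoring local alignment
    found = [trace(i, j)
             for i in range(1, m + 1)
             for j in range(1, n + 1)
             if a[i][j] == best]
    return best, found


best, found = local_alignments("GATAGG", "TGATGGAGGT")
print(best)
for row_x, row_y in found:
    print(row_y)
    print(row_x)
    print()
```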
        A   T   G   C   A   G   T
    0   0   0   0   0   0   0   0
A   0   1   0   0   0   1   0   0
T   0   0   2   1   0   0   1   1
A   0   1   1   2   1   1   0   1
A   0   1   1   1   2   2   1   0
G   0   0   1   2   1   2   3   2
T   0   0   1   1   2   1   2   4
Similarity Matrix
Similarity Matrix Computation:
a[i,j] = max( a[i,j-1] + g,
              a[i-1,j-1] + sb(i,j),
              a[i-1,j] + g,
              0 )
A T G C A G T
A T - A A G T

A T G C A G T
A T A - A G T
A T G CA A G T
Applying DP to Local Alignment
Local Alignment using DP
Substitution matrix sb:
     A   T   C   G
A    1  -1  -1  -1
T   -1   1  -1  -1
C   -1  -1   1  -1
G   -1  -1  -1   1
g = -2
a[i,j] = max( a[i,j-1] + g,
              a[i-1,j-1] + sb(i,j),
              a[i-1,j] + g,
              0 )
Similarity Matrix
        T   G   A   T   G   G   A   G   G   T
    0   0   0   0   0   0   0   0   0   0   0
G   0   0   1   0   0   1   1   0   1   1   0
A   0   0   0   2   0   0   0   2   0   0   0
T   0   1   0   0   3   1   0   0   1   0   1
A   0   0   0   1   1   2   0   1   0   0   0
G   0   0   1   0   0   2   3   1   2   1   0
G   0   0   1   0   0   1   3   2   2   3   1
First two entries of the first row:
0 = max( 0 + (-2), 0 + (-1), 0 + (-2), 0 )
1 = max( 0 + (-2), 0 + (+1), 0 + (-2), 0 )
Local alignments with the maximum score 3:
T G A T - G G A G G T
  G A T A G G

T G A T G G A G G T
  G A T A G

T G A T G G A G G T
  G A T

T G A T G G A G G T
            A G G
How to Break a Tie?
Should one report all?
Or, report only one?
Extension to the Basic DP Method
Improving space complexity
Introduce general gap functions
    That is, a run of consecutive spaces is more probable than the same number of isolated spaces
    Affine gap functions: w(k) = h + gk, where k is the number of consecutive spaces
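A small numeric illustration of the affine gap function follows. The values h = -3 and g = -1 are assumptions chosen only for this example, and w(k) is treated as a negative score that is added to the alignment score, consistent with g = -1 in the earlier slides.

```python
def affine_gap(k, h=-3, g=-1):
    """Score of a run of k consecutive spaces: w(k) = h + g*k."""
    if k <= 0:
        return 0                     # no gap, nothing to charge
    return h + g * k


one_long_gap = affine_gap(3)             # a single run of three spaces
three_short_gaps = 3 * affine_gap(1)     # three separate single spaces
print(one_long_gap, three_short_gaps)
```

Under these assumed values one run of three spaces costs less than three isolated spaces, because h is charged once per run rather than once per space.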