Dynamic Programming and Biological Sequence Comparison


Part I

\course\eleg667-01-f\Topic-2a.ppt

Topic II – Biological Sequence Alignment and Database Search

Part I (Topic-2a): Dynamic programming and Sequence comparison

Part II (Topic-2b): Heuristic sequence alignment and database search (e.g. FASTA, BLAST)

Part III (Topic-2c): Multiple sequence alignment


Outline

Concept of alignment

Two algorithm design techniques

Dynamic Programming: Examples

Applying DP to Sequence Comparison

The database search problem

Heuristic algorithms for database search


Alignment

The two sequences will have the same length (after possible insertion of spaces in either or both of them)

No space in one sequence can be aligned with a space in the other

Spaces can be inserted at the beginning or end of the sequences
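
The three rules above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `is_valid_alignment` and the use of `-` to denote a space are assumptions made for illustration.

```python
def is_valid_alignment(top: str, bottom: str, x: str, y: str, gap: str = "-") -> bool:
    """Check the alignment rules for two gapped rows built from x and y."""
    # Rule 1: after inserting spaces, both rows have the same length.
    if len(top) != len(bottom):
        return False
    # Rule 2: no column may pair a space with a space.
    if any(a == gap and b == gap for a, b in zip(top, bottom)):
        return False
    # Rule 3 needs no check: spaces may sit anywhere, including both ends.
    # Removing the spaces must give back the original sequences.
    return top.replace(gap, "") == x and bottom.replace(gap, "") == y


print(is_valid_alignment("ATA-AGT", "ATGCAGT", "ATAAGT", "ATGCAGT"))  # True
print(is_valid_alignment("ATAAGT", "ATGCAGT", "ATAAGT", "ATGCAGT"))   # False
```

The second call fails only because the rows differ in length; padding the shorter row with a leading or trailing space would make it valid.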


Biological Sequence Alignment and Database Search

1. We have two sequences over the same alphabet, both about the same length (tens of thousands of characters), and the sequences are almost equal. The average frequency of the differences is low, say, one per hundred characters. We want to find the places where the differences occur.

2. We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there is a prefix of one which is similar to a suffix of the other.


3. We have the same problem as in (2), but now we have several hundred sequences that must be compared (each one against all). In addition, we know that the great majority of sequence pairs are unrelated, that is, they will not have the required degree of similarity.

4. We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there are two substrings, one from each sequence, that are similar.

5. We have the same problem as in (4), but instead of two sequences we have one sequence that must be compared to thousands of others.



Two Related Algorithm Design Techniques

Breaking Problems Down:

Divide and Conquer: Starting with the complete instance of a problem, divide it into smaller subinstances, solve each of them recursively and combine the partial solutions into a solution to the original problem.

Dynamic Programming: Starting with the smallest subinstances of a problem, solve and combine them until the complete instance of the original problem is solved.


Divide and Conquer – Example 1

Quick Sort

9 1 25 4 15 becomes 4 1 9 25 15

4 1 becomes 1 4

25 15 becomes 15 25

Result: 1 4 9 15 25
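
The divide, solve and combine steps of the Quick Sort example can be sketched as follows. This is an illustrative version that builds new lists rather than partitioning in place, so the intermediate order of the elements differs from the slide; the name `quick_sort` and the choice of the first element as pivot are assumptions.

```python
def quick_sort(items):
    """Divide and conquer: split around a pivot, sort each part, combine."""
    if len(items) <= 1:              # smallest subinstances are already sorted
        return list(items)
    pivot, rest = items[0], items[1:]
    smaller = [v for v in rest if v < pivot]    # divide
    larger = [v for v in rest if v >= pivot]
    # solve each part recursively, then combine the partial solutions
    return quick_sort(smaller) + [pivot] + quick_sort(larger)


print(quick_sort([9, 1, 25, 4, 15]))  # [1, 4, 9, 15, 25]
```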


Divide and Conquer – Example 2

The Fibonacci numbers

Fib(n) {
  if (n <= 2) return 1;
  else return Fib(n-1) + Fib(n-2);
}

$F_1 = 1$, $F_2 = 1$

$F_n = F_{n-1} + F_{n-2}$

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …


Divide and Conquer – Example 2

$F_1 = 1$, $F_2 = 1$

$F_n = F_{n-1} + F_{n-2}$

[Figure: recursion tree for F(7). F(7) = F(6) + F(5), and every node F(k) with k > 2 splits into F(k-1) + F(k-2) down to the leaves F(2) and F(1), so subtrees such as F(5), F(4) and F(3) are computed repeatedly.]

n     1  2  3  4  5  6  7   8   9   10  11  …
F_n   1  1  2  3  5  8  13  21  34  55  89  …

$F_n / F_{n-1} \approx 1.6$, so $F_n \approx 1.6^n$ for $n \gg 1$

$T(n) \approx$ #internal_nodes = #leaves - 1, but #leaves = $F_n$

$T(n) = O(1.6^n)$: Exponential Time!
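
The growth of the recursion tree can be observed directly by counting calls. The sketch below is illustrative only; `fib_naive` and the one-element list used as a call counter are assumptions, not part of the slides.

```python
def fib_naive(n, calls):
    """Divide-and-conquer Fibonacci; calls[0] counts every invocation."""
    calls[0] += 1
    if n <= 2:                      # leaves of the tree: F(1) = F(2) = 1
        return 1
    return fib_naive(n - 1, calls) + fib_naive(n - 2, calls)


for n in (7, 10, 20):
    calls = [0]
    value = fib_naive(n, calls)
    # nodes = leaves + internal nodes = F(n) + (F(n) - 1)
    print(n, value, calls[0], calls[0] == 2 * value - 1)
```

For n = 7 the function returns 13 after 25 calls, which matches 13 leaves plus 12 internal nodes.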


How to Compute the Fib Function Using the Dynamic Programming Method?


Dynamic Programming – Example 1

Fib(n) {                          // assumes n >= 2
  int tab[n+1];                   // entries 1..n are used
  tab[1] = 1; tab[2] = 1;
  for (j = 3; j <= n; j++)
    tab[j] = tab[j-1] + tab[j-2];
  return tab[n];
}

Start by solving the smallest problems

Use the partial solutions to solve bigger and bigger problems

Extra memory to store intermediate values

tab: 1 1 2 3 5 8 13 21 34 55 89 …

Linear Time! $T(n) = O(n)$

Space-Time Tradeoff
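
A runnable version of the table-based method is sketched below, together with a variant that keeps only the last two values. The second function goes slightly beyond the slide: it is a common refinement showing that the extra memory can be reduced to a constant. Both function names are assumptions, and both assume n >= 1.

```python
def fib_table(n):
    """Bottom-up Fibonacci with a table of intermediate values."""
    tab = [0] * (max(n, 2) + 1)     # entries 1..n are used
    tab[1] = tab[2] = 1             # smallest problems first
    for j in range(3, n + 1):
        tab[j] = tab[j - 1] + tab[j - 2]
    return tab[n]


def fib_two_values(n):
    """Same recurrence, keeping only the two most recent values."""
    prev, curr = 1, 1               # F(1), F(2)
    for _ in range(3, n + 1):
        prev, curr = curr, prev + curr
    return curr


print([fib_table(n) for n in range(1, 12)])
print(fib_two_values(11))
```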


Sequence Comparison

Molecular sequence data are at the heart of Computational Biology

DNA sequences
RNA sequences
Protein sequences

We can think of these sequences as strings of letters
DNA: alphabet of 4 letters (A, T, C, G)
RNA: alphabet of 4 letters (A, U, C, G)
Protein: alphabet of 20 letters

code  full name
A     alanine
C     cysteine
D     aspartate
E     glutamate
F     phenylalanine
G     glycine
H     histidine
I     isoleucine
K     lysine
L     leucine
M     methionine
N     asparagine
P     proline
Q     glutamine
R     arginine
S     serine
T     threonine
V     valine
W     tryptophan
Y     tyrosine


Sequence Comparison – (Cont.)

Why compare sequences?

Find similar genes/proteins: allows us to predict function and structure

Locate common subsequences in genes/proteins: identify common recurrent patterns

Locate sequences that might overlap: helps in sequence assembly


Sequence X = A T A A G T

Sequence Y = A T G C A G T

To compare the sequences we need to quantify the similarity

matches = 1, mismatches = 0

X       A   T   A   A   G   T
Y       A   T   G   C   A   G   T
Score   1   1   0   0   0   0   0

Total = 2

Taking positions of the letters into account

X           A   T   A   A   G   T
Y       A   T   G   C   A   G   T
Score   0   0   0   0   1   1   1

Total = 3

How to take possible mutations into account?

matches = 1, mismatches = 0, gap = -1

X       A   T   A   -   A   G   T
Y       A   T   G   C   A   G   T
Score   1   1   0  -1   1   1   1

Total = 4
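
The column-by-column scoring used in the three examples can be written as a short function. This is a sketch under stated assumptions: `score_alignment` is a hypothetical name, `-` marks a space, and the first two examples are reproduced by padding the shorter row with a space that costs 0, since those slides do not yet penalize gaps.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Sum the column scores of two equally long gapped rows."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap            # one symbol aligned with a space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("ATAAGT-", "ATGCAGT", gap=0))   # 2
print(score_alignment("-ATAAGT", "ATGCAGT", gap=0))   # 3
print(score_alignment("ATA-AGT", "ATGCAGT"))          # 4
```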



Applying DP to Sequence Comparison

Sequence X = GA
Sequence Y = AG

[Figure: tree that builds every alignment of GA with AG column by column, with the score of each partial alignment. The 13 complete alignments score -4, -1, -4, -2, -4, -2, 0, -2, -4, -2, -4, -1, -4.]

Choose the best score at each step, i.e. max(-2, 0, -2), max(-3, 0, -1), max(-1, 0, -3), max(-1, 0, -1); total score = 0

$T(n,n) = O(k^n)$

Exponential Time!


Applying DP to Sequence Comparison

Sequence X = GA
Sequence Y = AG

[Figure: the same tree of alignments, with the best score of each subproblem stored in one cell of the matrix below.]

         G    A
    0   -1   -2
A  -1    0    0
G  -2    0    0

$T(n,n) = O(n^2)$

Polynomial Time!


Questions

Question: when the DP comparison ends, how many possible distinct paths have been explored in total for this example?

Answer: Let us count. Total = 13

         G    A
    0   -1   -2
A  -1    0    0
G  -2    0    0

Number the cells of the matrix:

1 2 4
3 5 7
6 8 9

Question: from 1 to 9, how many paths?

[Figure: tree of all paths from cell 1 to cell 9, branching from 1 to 2, 3 and 5 and continuing until cell 9 is reached; the tree has 13 leaves, one per path.]
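
The count of 13 can be reproduced with a small dynamic program of its own: the number of paths into a cell is the sum of the numbers of paths into the cell above it, the cell to its left and the cell diagonally above-left. The sketch below assumes exactly those three moves; `count_paths` is a hypothetical name.

```python
def count_paths(rows, cols):
    """Count paths from the top-left to the bottom-right cell using
    right, down and diagonal moves."""
    # Only one path reaches any cell in the first row or first column.
    paths = [[1] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            paths[i][j] = paths[i - 1][j] + paths[i][j - 1] + paths[i - 1][j - 1]
    return paths[-1][-1]


print(count_paths(3, 3))  # 13
```

Each path corresponds to one alignment of GA with AG, which is why the exhaustive tree has 13 complete alignments while the matrix has only 9 cells.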


DP algorithm for Sequence Comparison

int S[0..m, 0..n]

m = length(X)
n = length(Y)
for i = 0 to m do
    S[i,0] = i · g
for j = 0 to n do
    S[0,j] = j · g
for i = 1 to m do
    for j = 1 to n do
        S[i,j] = max( S[i-1,j] + g,
                      S[i-1,j-1] + sb[X[i],Y[j]],
                      S[i,j-1] + g )
return S[m,n]

sb[X[i],Y[j]]: Substitution Matrix entry for the i-th letter of X and the j-th letter of Y

    A  T  C  G
A   1  0  0  0
T   0  1  0  0
C   0  0  1  0
G   0  0  0  1

Start by solving the smallest problems

Extra memory to store intermediate values

Use the partial solutions to solve bigger and bigger problems
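
A direct Python transcription of the pseudocode is sketched below. It assumes the identity substitution matrix (match = 1, mismatch = 0) and g = -1 as defaults; the function name `similarity_matrix` and the keyword parameters are illustrative choices, not part of the slides.

```python
def similarity_matrix(x, y, match=1, mismatch=0, gap=-1):
    """Fill the (m+1) x (n+1) global-alignment matrix S for x and y."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        S[i][0] = i * gap           # prefix of x against the empty string
    for j in range(n + 1):
        S[0][j] = j * gap           # empty string against a prefix of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sb = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j] + gap,        # x[i] against a space
                          S[i - 1][j - 1] + sb,     # x[i] against y[j]
                          S[i][j - 1] + gap)        # y[j] against a space
    return S


S = similarity_matrix("GA", "AG")
for row in S:
    print(row)
print(S[-1][-1])  # 0
```

Python strings are indexed from 0, so `x[i - 1]` plays the role of X[i] in the pseudocode.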


The Substitution Matrix

For DNA we usually use identity matrices

    A  T  C  G
A   1  0  0  0
T   0  1  0  0
C   0  0  1  0
G   0  0  0  1

For proteins more sensitive matrices, derived empirically, are used

   A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y  Z
A  2  0 -2  0  0 -4  1 -1 -1 -1 -2 -1  0  1  0 -2  1  1  0 -6 -3  0
B  0  2 -4  3  2 -5  0  1 -2  1 -3 -2  2 -1  1 -1  0  0 -2 -5 -3  2
C -2 -4 12 -5 -5 -4 -3 -3 -2 -5 -6 -5 -4 -3 -5 -4  0 -2 -2 -8  0 -5
D  0  3 -5  4  3 -6  1  1 -2  0 -4 -3  2 -1  2 -1  0  0 -2 -7 -4  3
E  0  2 -5  3  4 -5  0  1 -2  0 -3 -2  1 -1  2 -1  0  0 -2 -7 -4  3
F -4 -5 -4 -6 -5  9 -5 -2  1 -5  2  0 -4 -5 -5 -4 -3 -3 -1  0  7 -5
G  1  0 -3  1  0 -5  5 -2 -3 -2 -4 -3  0 -1 -1 -3  1  0 -1 -7 -5 -1
H -1  1 -3  1  1 -2 -2  6 -2  0 -2 -2  2  0  3  2 -1 -1 -2 -3  0  2
I -1 -2 -2 -2 -2  1 -3 -2  5 -2  2  2 -2 -2 -2 -2 -1  0  4 -5 -1 -2
K -1  1 -5  0  0 -5 -2  0 -2  5 -3  0  1 -1  1  3  0  0 -2 -3 -4  0
L -2 -3 -6 -4 -3  2 -4 -2  2 -3  6  4 -3 -3 -2 -3 -3 -2  2 -2 -1 -3
M -1 -2 -5 -3 -2  0 -3 -2  2  0  4  6 -2 -2 -1  0 -2 -1  2 -4 -2 -2
N  0  2 -4  2  1 -4  0  2 -2  1 -3 -2  2 -1  1  0  1  0 -2 -4 -2  1
P  1 -1 -3 -1 -1 -5 -1  0 -2 -1 -3 -2 -1  6  0  0  1  0 -1 -6 -5  0
Q  0  1 -5  2  2 -5 -1  3 -2  1 -2 -1  1  0  4  1 -1 -1 -2 -5 -4  3
R -2 -1 -4 -1 -1 -4 -3  2 -2  3 -3  0  0  0  1  6  0 -1 -2  2 -4  0
S  1  0  0  0  0 -3  1 -1 -1  0 -3 -2  1  1 -1  0  2  1 -1 -2 -3  0
T  1  0 -2  0  0 -3  0 -1  0  0 -2 -1  0  0 -1 -1  1  3  0 -5 -3 -1
V  0 -2 -2 -2 -2 -1 -1 -2  4 -2  2  2 -2 -1 -2 -2 -1  0  4 -6 -2 -2
W -6 -5 -8 -7 -7  0 -7 -3 -5 -3 -2 -4 -4 -6 -5  2 -2 -5 -6 17  0 -6
Y -3 -3  0 -4 -4  7 -5  0 -1 -4 -1 -2 -2 -5 -4 -4 -3 -3 -2  0 10 -4
Z  0  2 -5  3  3 -5 -1  2 -2  0 -3 -2  1  0  3  0  0 -1 -2 -6 -4  3


Sequence Comparison revisited

Similarity Matrix (X = A T A A G T down the rows, Y = A T G C A G T across the columns; match = 1, mismatch = 0, g = -1)

          A    T    G    C    A    G    T
     0   -1   -2   -3   -4   -5   -6   -7
A   -1    1    0   -1   -2   -3   -4   -5
T   -2    0    2    1    0   -1   -2   -3
A   -3   -1    1    2    1    1    0   -1
A   -4   -2    0    1    2    2    1    0
G   -5   -3   -1    1    1    2    3    2
T   -6   -4   -2    0    1    1    2    4

S[i,j] = max( S[i-1,j] + g, S[i-1,j-1] + sb[X[i],Y[j]], S[i,j-1] + g )

Filling the first row, cell by cell:

S[1,1] = max( -1 + (-1),  0 + (+1), -1 + (-1) ) =  1
S[1,2] = max( -2 + (-1), -1 + ( 0),  1 + (-1) ) =  0
S[1,3] = max( -3 + (-1), -2 + ( 0),  0 + (-1) ) = -1
S[1,4] = max( -4 + (-1), -3 + ( 0), -1 + (-1) ) = -2
S[1,5] = max( -5 + (-1), -4 + (+1), -2 + (-1) ) = -3
S[1,6] = max( -6 + (-1), -5 + ( 0), -3 + (-1) ) = -4
S[1,7] = max( -7 + (-1), -6 + ( 0), -4 + (-1) ) = -5


What To Do Next?

Answer: Finding alignments

But, How?


Finding the Alignment(s)

Similarity Matrix

          A    T    G    C    A    G    T
     0   -1   -2   -3   -4   -5   -6   -7
A   -1    1    0   -1   -2   -3   -4   -5
T   -2    0    2    1    0   -1   -2   -3
A   -3   -1    1    2    1    1    0   -1
A   -4   -2    0    1    2    2    1    0
G   -5   -3   -1    1    1    2    3    2
T   -6   -4   -2    0    1    1    2    4

Trace back from S[6,7] = 4, following at each cell the term that gives the maximum:

4 = max( 2 + (-1), 3 + (+1), 2 + (-1) )
    T
    T

3 = max( 1 + (-1), 2 + (+1), 2 + (-1) )
    G T
    G T

2 = max( 1 + (-1), 1 + (+1), 2 + (-1) )
    A G T
    A G T

1 = max( 0 + (-1), 1 + ( 0), 2 + (-1) )     (tie: two ways back)
    C A G T        C A G T
    A A G T        - A G T

First branch:
1 = max( -1 + (-1), 0 + ( 0), 2 + (-1) )
    G C A G T
    - A A G T

Second branch:
2 = max( 1 + (-1), 2 + ( 0), 1 + (-1) )
    G C A G T
    A - A G T

Both branches:
2 = max( 0 + (-1), 1 + (+1), 0 + (-1) )
    T G C A G T        T G C A G T
    T - A A G T        T A - A G T

1 = max( -1 + (-1), 0 + (+1), -1 + (-1) )
    A T G C A G T      A T G C A G T
    A T - A A G T      A T A - A G T

Global Alignments
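
The traceback can be sketched in Python as a recursive walk from S[m,n] back to S[0,0] that follows every term achieving the maximum, so that ties produce all optimal alignments. The function name `global_alignments` and the default scores (match = 1, mismatch = 0, g = -1) are assumptions made for illustration; the matrix is recomputed inside the function so that the block is self-contained.

```python
def global_alignments(x, y, match=1, mismatch=0, gap=-1):
    """Return the optimal score and all optimal global alignments as
    (row for y, row for x) pairs."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        S[i][0] = i * gap
    for j in range(n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sb = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j] + gap, S[i - 1][j - 1] + sb, S[i][j - 1] + gap)

    found = []

    def walk(i, j, top, bottom):
        if i == 0 and j == 0:
            found.append((top, bottom))
            return
        if i > 0 and j > 0:
            sb = match if x[i - 1] == y[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sb:       # diagonal move
                walk(i - 1, j - 1, y[j - 1] + top, x[i - 1] + bottom)
        if i > 0 and S[i][j] == S[i - 1][j] + gap:    # space in y's row
            walk(i - 1, j, "-" + top, x[i - 1] + bottom)
        if j > 0 and S[i][j] == S[i][j - 1] + gap:    # space in x's row
            walk(i, j - 1, y[j - 1] + top, "-" + bottom)

    walk(m, n, "", "")
    return S[m][n], found


score, alignments = global_alignments("ATAAGT", "ATGCAGT")
print(score)
for top, bottom in alignments:
    print(top)
    print(bottom)
    print()
```

Reporting only one alignment would amount to following a single branch at each tie instead of all of them.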


How to Break a Tie?

Should one report all?

Or, report only one?


Advantage of DP Alignment Algorithms

Build up the solution by determining all similarities between arbitrary prefixes of the two sequences

Start with the shorter prefixes and use previously computed results to solve for longer prefixes


The Complexity of the DP Alignment Algorithm?

Construction of the similarity matrix:

$O(m \cdot n)$

Finding an optimal alignment (traceback through the matrix):

$O(m + n)$


Global versus Local Alignments

A global alignment attempts to match all of one sequence against all of another

LGPSTKQFGKGSSSRIWDN
|      ||||   |  |
LNQIERSFGKGAIMRLGDA

A local alignment attempts to match subsequences of the two sequences

-------FGKG--------
       ||||
-------FGKG--------


How to Compute Local Alignment?


Applying DP to Local Alignment

Similarity Matrix Computation:

\[
a[i,j] = \max
\begin{cases}
a[i,j-1] + g \\
a[i-1,j-1] + sb(i,j) \\
a[i-1,j] + g \\
0
\end{cases}
\]

a[i,0] = 0, for i = 0…m

a[0,j] = 0, for j = 0…n

[Figure: similarity matrix whose first row and first column are filled with zeros.]

If the best alignment up to some point has a negative score, it’s better to start a new one, rather than extend the old one.

Don’t penalize gaps on left and right ends!


Criteria of Finding a Local Alignment

Find the entries with maximum values in the similarity matrix

For each such entry, construct a local alignment

See next example

We may also be interested in near-optimal alignments


Applying DP to Local Alignment

Similarity Matrix Computation:

$a[i,j] = \max\{\, a[i,j-1] + g,\ a[i-1,j-1] + sb(i,j),\ a[i-1,j] + g,\ 0 \,\}$

Similarity Matrix

         A   T   G   C   A   G   T
     0   0   0   0   0   0   0   0
A    0   1   0   0   0   1   0   0
T    0   0   2   1   0   0   1   1
A    0   1   1   2   1   1   0   1
A    0   1   1   1   2   2   1   0
G    0   0   1   2   1   2   3   2
T    0   0   1   1   2   1   2   4

A T G C A G T
A T - A A G T

A T G C A G T
A T A - A G T

A T G C
A A G T


Local Alignment using DP

Substitution matrix sb, with g = -2:

     A   T   C   G
A    1  -1  -1  -1
T   -1   1  -1  -1
C   -1  -1   1  -1
G   -1  -1  -1   1

$a[i,j] = \max\{\, a[i,j-1] + g,\ a[i-1,j-1] + sb(i,j),\ a[i-1,j] + g,\ 0 \,\}$

Similarity Matrix

         T   G   A   T   G   G   A   G   G   T
     0   0   0   0   0   0   0   0   0   0   0
G    0   0   1   0   0   1   1   0   1   1   0
A    0   0   0   2   0   0   0   2   0   0   0
T    0   1   0   0   3   1   0   0   1   0   1
A    0   0   0   1   1   2   0   1   0   0   0
G    0   0   1   0   0   2   3   1   2   1   0
G    0   0   1   0   0   1   3   1   2   3   1

First two cells:

a[1,1] = max( 0 + (-2), 0 + (-1), 0 + (-2), 0 ) = 0
a[1,2] = max( 0 + (-2), 0 + (+1), 0 + (-2), 0 ) = 1

Local alignments ending at the entries with the maximum value 3:

T G A T G G A G G T
            A G G

T G A T - G G A G G T
  G A T A G G

T G A T G G A G G T
  G A T A G

T G A T G G A G G T
  G A T

How to Break a Tie?

Should one report all?

Or, report only one?


Extension to the Basic DP Method

Improving space complexity

Introduce general gap functions

That is, a run of consecutive spaces is more likely than the same number of isolated spaces

Affine gap functions: w(k) = h + gk, where k is the number of consecutive spaces, h is charged once per gap and g once per space
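
The effect of an affine gap function can be seen by comparing one run of spaces with the same number of isolated spaces. In the sketch below, w(k) is treated as a (negative) score contribution, consistent with g = -1 in the earlier examples; the values h = -3 and g = -1 and both function names are assumptions chosen only for illustration.

```python
def linear_gap(k, g=-1):
    """Score of k consecutive spaces when every space costs g."""
    return g * k


def affine_gap(k, h=-3, g=-1):
    """Affine gap function w(k) = h + g*k; an empty gap scores 0."""
    return h + g * k if k > 0 else 0


# One gap of three spaces versus three separate gaps of one space each.
print(linear_gap(3), 3 * linear_gap(1))   # -3 -3
print(affine_gap(3), 3 * affine_gap(1))   # -6 -12
```

With the linear function both cases score the same, while the affine function favours the single longer gap.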
