DNA, RNA and protein are an alien language ... We try to cryptographically attack this language ... we want to decipher both its meaning and its history …

We do not have to understand the language to identify patterns: “klaatu barada nikto”



Fortunately, the genetic code is alphabetic, which makes it amenable to string comparison and pattern recognition.


Pairwise Sequence Alignment


Pairwise Sequence Alignment

• Principles of pairwise sequence comparison
    • global / local alignments
    • scoring systems
    • gap penalties

• Methods of pairwise sequence alignment
    • window-based methods
    • dynamic programming approaches


Pairwise Sequence Alignment: How to?

Sequence 1: T A C A T T A C G T A C

Sequence 2: A T T C A C A T A


Dotplot:

[Figure: dotplot with Sequence 1 (T A C A T T A C G T A C) on the horizontal axis and Sequence 2 (A T T C A C A T A) on the vertical axis]

A dotplot gives an overview of all possible alignments


Dotplot:

[Figure: dotplot of Sequence 1 (T A C A T T A C G T A C, horizontal) against Sequence 2 (A T T C A C A T A, vertical), with one diagonal picked out as one possible alignment]

In a dotplot each diagonal corresponds to a possible (ungapped) alignment
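The link between diagonals and ungapped alignments can be sketched in a few lines of Python. This is only an illustration: the function name `diagonal_matches` and the offset convention k = i - j are assumptions made here, not part of the slides.

```python
def diagonal_matches(seq1, seq2):
    """Count identical positions on every diagonal of the dotplot.

    A dot sits at (i, j) when seq1[i] == seq2[j]. All dots with the same
    offset k = i - j lie on one diagonal, i.e. one ungapped alignment in
    which seq2 is shifted k positions to the right relative to seq1.
    """
    counts = {}
    for i, a in enumerate(seq1):
        for j, b in enumerate(seq2):
            if a == b:
                counts[i - j] = counts.get(i - j, 0) + 1
    return counts


counts = diagonal_matches("TACATTACGTAC", "ATTCACATA")
best_offset = max(counts, key=counts.get)
print(best_offset, counts[best_offset])
```

The diagonal with the most dots is the best ungapped alignment under a simple identity score; a real scoring system would weight the matches instead of counting them.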



Window-based Approaches

• Word Size

• Window / Stringency


Word Size Algorithm

[Figure: the sequences T A C G G T A T G and A C A G T A T C compared word by word, building up a dotplot]

Word Size = 3
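A minimal sketch of the word size method, assuming that a dot is placed wherever the two sequences share an identical word of the given length. The function name `word_matches` and the use of 0-based start positions are choices made for this illustration.

```python
def word_matches(seq1, seq2, word_size=3):
    """Return the (i, j) start positions of identical words in seq1 and seq2."""
    hits = []
    for i in range(len(seq1) - word_size + 1):
        for j in range(len(seq2) - word_size + 1):
            # Exact match only: a single mismatch inside the word suppresses the dot.
            if seq1[i:i + word_size] == seq2[j:j + word_size]:
                hits.append((i, j))
    return hits


# The two sequences shown on the slide, with word size 3.
print(word_matches("TACGGTATG", "ACAGTATC"))  # [(4, 3), (5, 4)]
```

The two hits are the shared words GTA and TAT, which fall on the same diagonal.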


Window / Stringency

[Figure: the sequences T A C G G T A T G and T C A G T A T C compared window by window, building up a dotplot]

Window = 5 / Stringency = 4
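The window/stringency rule can be sketched the same way, assuming that a dot is placed when at least `stringency` of the `window` compared positions are identical. The function name `window_matches` is hypothetical.

```python
def window_matches(seq1, seq2, window=5, stringency=4):
    """Return (i, j) window start positions with at least `stringency` identities."""
    hits = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            identical = sum(
                a == b for a, b in zip(seq1[i:i + window], seq2[j:j + window])
            )
            if identical >= stringency:
                hits.append((i, j))
    return hits


# The two sequences shown on the slide, with window 5 and stringency 4.
print(window_matches("TACGGTATG", "TCAGTATC"))  # [(2, 1), (3, 2), (4, 3)]
```

Setting `stringency` equal to `window` reduces this to the word size method, since every position in the window must then match.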


Considerations

• The window/stringency method is more sensitive than the word size method (mismatches inside a window are permitted).

• The smaller the window, the larger the share of chance (non-specific) matches.

• With large windows, the sensitivity for short sequences is reduced.

• Insertions/deletions are not treated explicitly.


Insertions / Deletions in a Dotplot

[Figure: dotplot of Sequence 1 (T A C T G T T C A T, horizontal) against Sequence 2 (T A C T G T C A T, vertical)]

T A C T G - T C A T
| | | | |   | | | |
T A C T G T T C A T


Dotplot (Window = 130 / Stringency = 9)

[Figure: dotplot of two hemoglobin chains; both axes are labelled "Hemoglobin -chain"]

Output of the programs Compare and DotPlot


Dotplot (Window = 18 / Stringency = 10)

[Figure: dotplot of two hemoglobin chains; both axes are labelled "Hemoglobin -chain"]

Output of the programs Compare and DotPlot


Pairwise Sequence Alignment

• Principles of pairwise sequence comparison
    • global / local alignments
    • scoring systems
    • gap penalties

• Methods of pairwise sequence alignment
    • window-based approaches
    • dynamic programming approaches
        • Needleman and Wunsch
        • Smith and Waterman


Dynamic Programming

An automatic procedure that finds the best alignment, that is, the alignment with the optimal score under the chosen parameters.

Recursive solutions: we solve smaller problems first and use those solutions to solve larger problems. Intermediate solutions are stored in a table (matrix).


Basic principles of dynamic programming

- Initialization of the alignment matrix: the scoring model

- Stepwise calculation of score values (creation of an alignment path matrix)

- Backtracking (evaluation of the optimal path)


Initialization of Matrix (BLOSUM50): a similarity score for each pair of residues

     H   E   A   G   A   W   G   H   E   E
P   -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A   -2  -1   5   0   5  -3   0  -2  -1  -1
W   -3  -3  -3  -3  -3  15  -3  -3  -3  -3
H   10   0  -2  -2  -2  -3  -2  10   0   0
E    0   6  -1  -3  -1  -3  -3   0   6   6
A   -2  -1   5   0   5  -3   0  -2  -1  -1
E    0   6  -1  -3  -1  -3  -3   0   6   6


Needleman and Wunsch (global alignment)

Sequence 1: H E A G A W G H E E
Sequence 2: P A W H E A E

Scoring parameters: BLOSUM50 matrix

Gap penalty: Linear gap penalty of 8


Creation of an alignment path matrix

Idea: Build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences

• Construct matrix F indexed by i and j (one index for each sequence)

• F(i,j) is the score of the best alignment between the initial segment x_1...x_i of x and the initial segment y_1...y_j of y

• Build F(i,j) recursively beginning with F(0,0) = 0

Optimal global alignment:

H E A G A W G H E - E
- - P - A W - H E A E


Creation of an alignment path matrix

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

Optimal global alignment:

H E A G A W G H E - E
- - P - A W - H E A E


Creation of an alignment path matrix

$$
F(i,j) = \max \begin{cases}
F(i-1,j-1) + s(x_i, y_j) \\
F(i-1,j) - d \\
F(i,j-1) - d
\end{cases}
$$

[Figure: cell F(i,j) is reached from F(i-1,j-1) with score s(x_i, y_j), or from F(i-1,j) or F(i,j-1) with gap penalty -d]

H E A G A W G H E - E
- - P - A W - H E A E


Creation of an alignment path matrix

• If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j)

• Three possibilities:

    • x_i and y_j are aligned, F(i,j) = F(i-1,j-1) + s(x_i, y_j)

    • x_i is aligned to a gap, F(i,j) = F(i-1,j) - d

    • y_j is aligned to a gap, F(i,j) = F(i,j-1) - d

• The best score up to (i,j) will be the largest of the three options
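The three options translate almost directly into code. The sketch below fills F for the example sequences, using the BLOSUM50 excerpt and the linear gap penalty d = 8 given on the slides. The names `SCORES`, `s` and `fill_matrix` are chosen here for illustration, and the boundary cells are set to -i d and -j d (leading gaps).

```python
X = "HEAGAWGHEE"  # Sequence 1, indexed by i
Y = "PAWHEAE"     # Sequence 2, indexed by j
D = 8             # linear gap penalty d

# BLOSUM50 excerpt from the slides: one row per residue of Y,
# one column per position of X.
SCORES = {
    "P": [-2, -1, -1, -2, -1, -4, -2, -2, -1, -1],
    "A": [-2, -1, 5, 0, 5, -3, 0, -2, -1, -1],
    "W": [-3, -3, -3, -3, -3, 15, -3, -3, -3, -3],
    "H": [10, 0, -2, -2, -2, -3, -2, 10, 0, 0],
    "E": [0, 6, -1, -3, -1, -3, -3, 0, 6, 6],
}


def s(i, j):
    """Score s(x_i, y_j), with 1-based indices as on the slides."""
    return SCORES[Y[j - 1]][i - 1]


def fill_matrix():
    """Fill F(i, j) with the three-way recurrence."""
    n, m = len(X), len(Y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * D  # x_1..x_i aligned to gaps only
    for j in range(1, m + 1):
        F[0][j] = -j * D  # y_1..y_j aligned to gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(i, j),  # x_i aligned with y_j
                F[i - 1][j] - D,            # x_i aligned to a gap
                F[i][j - 1] - D,            # y_j aligned to a gap
            )
    return F


F = fill_matrix()
print(F[1][1], F[2][1], F[1][2], F[2][2])  # -2 -9 -10 -3
print(F[len(X)][len(Y)])                   # 1, the optimal global score
```

Here F[i][j] is stored with i as the first index, so a row of this Python list corresponds to a column of the matrix as drawn on the slides.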


Creation of an alignment path matrix

Boundary conditions

F(i, 0) = -i d

F(0, j) = -j d

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8
A  -16
W  -24
H  -32
E  -40
A  -48
E  -56


Stepwise calculation of score values

$$
F(i,j) = \max \begin{cases}
F(i-1,j-1) + s(x_i, y_j) \\
F(i-1,j) - d \\
F(i,j-1) - d
\end{cases}
$$

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9
A  -16  -10   -3
W  -24
H  -32
E  -40
A  -48
E  -56

Scores used: s(H,P) = -2, s(E,P) = -1, s(H,A) = -2, s(E,A) = -1

F(1,1) = max{ F(0,0) + s(x_1,y_1) = 0 - 2 = -2;  F(0,1) - d = -8 - 8 = -16;  F(1,0) - d = -8 - 8 = -16 } = -2

F(2,1) = max{ F(1,0) + s(x_2,y_1) = -8 - 1 = -9;  F(1,1) - d = -2 - 8 = -10;  F(2,0) - d = -16 - 8 = -24 } = -9

F(1,2) = max{ F(0,1) + s(x_1,y_2) = -8 - 2 = -10;  F(0,2) - d = -16 - 8 = -24;  F(1,1) - d = -2 - 8 = -10 } = -10

F(2,2) = max{ F(1,1) + s(x_2,y_2) = -2 - 1 = -3;  F(1,2) - d = -10 - 8 = -18;  F(2,1) - d = -9 - 8 = -17 } = -3


Backtracking

Starting from the final cell F(10,7) = 1 and stepping back to F(0,0), the optimal path passes through the cells with these values (listed from F(0,0) to the final cell):

0, -8, -16, -17, -25, -20, -5, -13, -3, 3, -5, 1

Optimal global alignment:

H E A G A W G H E - E
- - P - A W - H E A E
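Backtracking can be sketched as follows: from the final cell, move to whichever neighbour produced the stored value and record the corresponding column of the alignment. The function names and the tie-breaking order (diagonal first, then a gap in Sequence 2, then a gap in Sequence 1) are assumptions of this sketch; with other tie-breaking rules an equally good alignment may come out.

```python
# BLOSUM50 excerpt from the slides; columns follow the residues in COLS.
COLS = "HEAGAWGHEE"
ROWS = {
    "P": [-2, -1, -1, -2, -1, -4, -2, -2, -1, -1],
    "A": [-2, -1, 5, 0, 5, -3, 0, -2, -1, -1],
    "W": [-3, -3, -3, -3, -3, 15, -3, -3, -3, -3],
    "H": [10, 0, -2, -2, -2, -3, -2, 10, 0, 0],
    "E": [0, 6, -1, -3, -1, -3, -3, 0, 6, 6],
}


def blosum50(a, b):
    """s(a, b) for a residue a of Sequence 1 and a residue b of Sequence 2."""
    return ROWS[b][COLS.index(a)]


def needleman_wunsch(x, y, score, d):
    """Return (optimal score, aligned x, aligned y) for a linear gap penalty d."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)

    # Backtracking: walk from F(n, m) to F(0, 0), re-deriving each step.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


total, row1, row2 = needleman_wunsch("HEAGAWGHEE", "PAWHEAE", blosum50, 8)
print(total)  # 1
print(row1)   # HEAGAWGHE-E
print(row2)   # --P-AW-HEAE
```

Re-deriving each step from F avoids storing a separate pointer matrix; storing pointers during the fill is the other common design.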


Smith and Waterman (local alignment)

Two differences:

1. The recurrence takes a fourth option, 0:

$$
F(i,j) = \max \begin{cases}
0 \\
F(i-1,j-1) + s(x_i, y_j) \\
F(i-1,j) - d \\
F(i,j-1) - d
\end{cases}
$$

2. An alignment can now end anywhere in the matrix

Example:
Sequence 1: H E A G A W G H E E
Sequence 2: P A W H E A E

Scoring parameters: Log-odds ratios
Gap penalty: Linear gap penalty of 8


Smith Waterman alignment

        H   E   A   G   A   W   G   H   E   E
    0   0   0   0   0   0   0   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0   0
A   0   0   0   5   0   5   0   0   0   0   0
W   0   0   0   0   2   0  20  12   4   0   0
H   0  10   2   0   0   0  12  18  22  14   6
E   0   2  16   8   0   0   4  10  18  28  20
A   0   0   8  21  13   5   0   4  10  20  27
E   0   0   6  13  18  12   4   0   4  16  26

Optimal local alignment (score 28):

A W G H E
A W - H E

Traceback path (cell values): 0, 5, 20, 12, 22, 28
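A compact sketch of the local variant, under the same assumptions as the slides' example (BLOSUM50 excerpt, linear gap penalty 8). The function names are chosen here for illustration. Compared with the global algorithm, the fill takes 0 as a fourth option, the traceback starts at the largest cell, and it stops at the first cell whose value is 0.

```python
# BLOSUM50 excerpt from the slides; columns follow the residues in COLS.
COLS = "HEAGAWGHEE"
ROWS = {
    "P": [-2, -1, -1, -2, -1, -4, -2, -2, -1, -1],
    "A": [-2, -1, 5, 0, 5, -3, 0, -2, -1, -1],
    "W": [-3, -3, -3, -3, -3, 15, -3, -3, -3, -3],
    "H": [10, 0, -2, -2, -2, -3, -2, 10, 0, 0],
    "E": [0, 6, -1, -3, -1, -3, -3, 0, 6, 6],
}


def blosum50(a, b):
    """s(a, b) for a residue a of Sequence 1 and a residue b of Sequence 2."""
    return ROWS[b][COLS.index(a)]


def smith_waterman(x, y, score, d):
    """Return (best local score, aligned part of x, aligned part of y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # boundary cells stay 0
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_i, best_j = F[i][j], i, j

    # Backtracking from the maximum until a cell with value 0 is reached.
    top, bottom = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("HEAGAWGHEE", "PAWHEAE", blosum50, 8))
# (28, 'AWGHE', 'AW-HE')
```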


Extended Smith & Waterman

To get multiple local alignments:

• delete regions around best path

• repeat backtracking


Extended Smith & Waterman

[Figure: the Smith-Waterman matrix with the region around the best path (the cells 5, 20, 12, 22, 28 and the neighbouring cells that follow from them) lifted out]


Extended Smith & Waterman

[Figure: the remaining matrix after the region around the best path has been deleted]

Second best local alignment (score 21):

H E A
H E A

Traceback path (cell values): 0, 10, 16, 21
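One way to sketch the "delete and repeat" idea is shown below. It is not necessarily the procedure used to produce the slides: as a simplifying assumption, only the cells on each reported path are forced to 0, and the whole matrix is then recomputed before the next traceback. The function names are hypothetical.

```python
# BLOSUM50 excerpt from the slides; columns follow the residues in COLS.
COLS = "HEAGAWGHEE"
ROWS = {
    "P": [-2, -1, -1, -2, -1, -4, -2, -2, -1, -1],
    "A": [-2, -1, 5, 0, 5, -3, 0, -2, -1, -1],
    "W": [-3, -3, -3, -3, -3, 15, -3, -3, -3, -3],
    "H": [10, 0, -2, -2, -2, -3, -2, 10, 0, 0],
    "E": [0, 6, -1, -3, -1, -3, -3, 0, 6, 6],
}


def score(a, b):
    """s(a, b) for a residue a of Sequence 1 and a residue b of Sequence 2."""
    return ROWS[b][COLS.index(a)]


def local_alignments(x, y, d, count):
    """Return up to `count` non-intersecting local alignments, best first."""
    blocked = set()  # cells used by alignments already reported
    found = []
    for _ in range(count):
        F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
        best, i, j = 0, 0, 0
        for a in range(1, len(x) + 1):
            for b in range(1, len(y) + 1):
                if (a, b) in blocked:
                    continue  # deleted cell: its value stays 0
                F[a][b] = max(0,
                              F[a - 1][b - 1] + score(x[a - 1], y[b - 1]),
                              F[a - 1][b] - d,
                              F[a][b - 1] - d)
                if F[a][b] > best:
                    best, i, j = F[a][b], a, b
        if best == 0:
            break  # nothing left to report
        top, bottom = [], []
        while i > 0 and j > 0 and F[i][j] > 0:
            blocked.add((i, j))
            if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
                top.append(x[i - 1])
                bottom.append(y[j - 1])
                i, j = i - 1, j - 1
            elif F[i][j] == F[i - 1][j] - d:
                top.append(x[i - 1])
                bottom.append("-")
                i -= 1
            else:
                top.append("-")
                bottom.append(y[j - 1])
                j -= 1
        found.append((best, "".join(reversed(top)), "".join(reversed(bottom))))
    return found


print(local_alignments("HEAGAWGHEE", "PAWHEAE", 8, 2))
# [(28, 'AWGHE', 'AW-HE'), (21, 'HEA', 'HEA')]
```

Recomputing the full matrix on every round costs O(nm) per alignment; more economical schemes recompute only the cells affected by the deleted path.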


Further Extensions of Dynamic Programming

• Overlap matches

• Alignment with affine gap scores
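For orientation, the difference between the linear gap penalty used in the examples and an affine gap score can be written as below. The symbols g (gap length) and e (gap-extension penalty) do not appear on the slides and are introduced here as assumptions; d is the gap penalty used so far, which becomes the gap-open penalty in the affine case.

$$
\gamma_{\text{linear}}(g) = -g\,d
\qquad\qquad
\gamma_{\text{affine}}(g) = -d - (g-1)\,e
$$

With e < d, one long gap is penalised less than several short gaps of the same total length.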


Pairwise Sequence Alignment

• Pairwise sequence comparison
    • global / local alignments
    • parameters
    • scoring systems
    • insertions / deletions

• Methods of pairwise sequence alignment
    • dotplot
    • windows-based methods
    • dynamic programming
    • algorithm complexity


End.of.pa.irwise..sequence
| | | | |
align.ment.cours.e


Multiple Alignment
Progressive Alignment: step 1

Methods of Pairwise Comparison

Programs perform global alignments:

• Needleman & Wunsch: (Pileup, Tree, Clustal)

• Word Size Method: (Clustal)

• X. Huang (MAlign) (modified N-W)


Multiple Alignment
Progressive Alignment: step 2

Construction of a Guide Tree

[Figure: 5 x 5 similarity matrix for sequences 1 to 5]

Similarity Matrix: displays scores of all sequence pairs.

The similarity matrix is transformed into a distance matrix . . .


Multiple Alignment
Progressive Alignment: step 2

Construction of a Guide Tree

[Figure: the distance matrix is converted into a guide tree with leaves 1 to 5]

Neighbour-Joining Method or

UPGMA (unweighted pair group method of arithmetic averages)
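A minimal UPGMA sketch for building a guide tree from a distance matrix follows. The distances are invented for illustration only (the slides give no values), the function name `upgma` is hypothetical, and the tree is returned as nested tuples without branch lengths.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide tree by repeatedly merging the two closest clusters.

    dist maps frozenset({a, b}) to the distance between a and b.
    The distance from a merged cluster to any other cluster is the
    average of its members' distances, weighted by cluster size.
    """
    dist = dict(dist)
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    while len(sizes) > 1:
        a, b = min(combinations(sizes, 2), key=lambda pair: dist[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        for other in sizes:
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


# Hypothetical pairwise distances between five sequences.
raw = {
    ("1", "2"): 4, ("1", "3"): 5, ("1", "4"): 9, ("1", "5"): 10,
    ("2", "3"): 3, ("2", "4"): 8, ("2", "5"): 9,
    ("3", "4"): 8, ("3", "5"): 9,
    ("4", "5"): 2,
}
distances = {frozenset(pair): value for pair, value in raw.items()}
tree = upgma(["1", "2", "3", "4", "5"], distances)
print(tree)  # (('4', '5'), ('1', ('2', '3')))
```

The nesting of the tuples gives the order in which a progressive aligner would combine the sequences: the innermost pairs are aligned first.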


Multiple Alignment
Progressive Alignment: step 3

Multiple Alignment

[Figure: guide tree with leaves 1 to 5]


Multiple Alignment
Progressive Alignment: step 3

Columns - once aligned - are never changed

Aligned pair:

G T C C G - C A G G
T T - C G C C - G G

Sequence to add:

T T A C T T C C A G G

Result:

T T A C T T C C A G G
G T C C G - - C A G G
T T - C G C - C - G G


. . . and new gaps are inserted.


Multiple Alignment
Progressive Alignment: step 3

Columns - once aligned - are never changed

First aligned group:

T T A C T T C C A G G
G T C C G - - C A G G
T T - C G C - C - G G

Second aligned group:

A T C T - - C A A T
C T G T C C C T A G

Result:

T T A C T T C C A G G
G T C C G - - C A G G
T T - C G C - C - G G
A T C - T - - C A A T
C T G - T C C C T A G
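The rule that aligned columns are never changed means that a new gap is always inserted as a whole column, in every row of an already aligned group at once. A small sketch, with a hypothetical helper name and 0-based column positions:

```python
def insert_gap_column(alignment, position):
    """Insert a gap at the same column in every row of an aligned group."""
    return [row[:position] + "-" + row[position:] for row in alignment]


# The second aligned group from the slide receives one new gap column
# so that it lines up with the first group.
group = ["ATCT--CAAT", "CTGTCCCTAG"]
print(insert_gap_column(group, 3))
# ['ATC-T--CAAT', 'CTG-TCCCTAG']
```

Which column receives the gap is decided by the alignment step between the two groups; this helper only applies that decision consistently to all rows.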


Sub-sequence alignments


A K-means like clustering problem


Clustering resulting model


Clustering predictions


Assignments

• Describe a pairwise alignment with a different gap penalty.

• Provide an example and perform a multiple global alignment. Describe the recipe.

• Provide an example and perform a multiple alignment of subsequences. Describe the recipe.

• Order of the algorithms (polynomial, exponential, NP)


Algorithmic Complexity

How does an algorithm's performance, in CPU time and required memory, scale with the size of the problem?

Needleman & Wunsch

• Storing (n+1) x (m+1) numbers

• Each number costs a constant number of calculations to compute (three sums and a max)

• Algorithm takes O(nm) memory and O(nm) time

• Since n and m are usually comparable: O(n^2)