Sequence Alignment Algorithms

DEKM book

Notes from Dr. Bino John and Dr. Takis Benos

Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 2015. 4. 30. · FASTA algorithm (cntd) • The idea: a high scoring match alignment is very likely to contain a short


To Do

• Global alignment

• Local alignment

• Gaps
  – Affine Gaps
  – Algorithm (blackboard)

• Statistical Significance
  – Notes (blackboard)

• Read up on database searches
  – BLAST
  – FASTA
  – CS tricks: suffix tree, …

• PSSMs and Multiple Sequence Alignments


Why compare sequences?

• Given a new sequence, infer its function based on similarity to another sequence

• Find important molecular regions – conserved across species


Why compare sequences? (cntd)

• Determine the evolutionary constraints at work

• Find mutations in a population or family of genes

• Find similar-looking sequences in a database

• Find secondary/tertiary structure of a sequence of interest – molecular modeling using a template (homology modeling)


Sequence alignment

• Are two sequences related?
  – Align the sequences or parts of them
  – Decide whether the alignment arose by chance or reflects an evolutionary relationship

• Issues:
  – What sorts of alignments to consider?
  – How to score an alignment and hence rank alignments?
  – Which algorithm finds good alignments?
  – How to evaluate the significance of an alignment?


Dynamic Programming

We apply dynamic programming when:

• There is only a polynomial number of subproblems
  – Align x1…xi to y1…yj

• Original problem is one of the subproblems
  – Align x1…xM to y1…yN

• Each subproblem is easily solved from smaller subproblems


Global alignment

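[Figure: DP matrix M with rows indexed by S (i-1, i) and columns by T (j-1, j); cell M(i,j) is filled from its three neighbours M(i-1,j-1), M(i,j-1) and M(i-1,j).]

$$
M_{i,j} = \max
\begin{cases}
M_{i-1,j-1} + \mathrm{Score}(S_i, T_j) \\
M_{i,j-1} + \gamma \\
M_{i-1,j} + \gamma
\end{cases}
$$

Score(Si, Tj): substitution score taken from a DNA matrix, PAM or BLOSUM

γ: gap penalty

Needleman & Wunsch, 1970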
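The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `needleman_wunsch` and the default scores (match = 1, mismatch = -1, gap = -2) are assumptions chosen for the example, and a simple match/mismatch rule stands in for a full PAM or BLOSUM matrix.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap score (hypothetical helper).

    Returns (optimal score, aligned s, aligned t).
    """
    def sub(a, b):
        # Stand-in for Score(Si, Tj); a real run would look up a matrix.
        return match if a == b else mismatch

    m, n = len(s), len(t)
    # M[i][j] = best score for aligning s[:i] with t[:j]
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * gap
    for j in range(1, n + 1):
        M[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),
                          M[i][j - 1] + gap,
                          M[i - 1][j] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + sub(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return M[m][n], "".join(reversed(a)), "".join(reversed(b))
```

Both loops visit every cell once, so time and space are O(NM) for sequences of lengths N and M.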


Alignment: adding scores (cntd)

Score(match) = 1

Score(mismatch) = 0

Score(gap) = 0


Alignment: adding scores


Alignment: adding scores (cntd)

Alignment:

(Seq #1) A
         |
(Seq #2) A


Alignment: adding scores (cntd)

Alignment:

(Seq #1) T A
           |
(Seq #2) - A


Alignment: adding scores (cntd)

Alignment:

(Seq #1) G A A T T C A G T T A
         |   |   | |   |     |
(Seq #2) G G A - T C - G - - A

6 matches, 1 mismatch, 4 gaps
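The tally above can be checked mechanically. The sketch below scores a finished alignment column by column; the function name `score_alignment` is hypothetical, and the default scores are the ones given on the earlier slide (match = 1, mismatch = 0, gap = 0).

```python
def score_alignment(row1, row2, match=1, mismatch=0, gap=0):
    """Count column types in a gapped alignment and return (score, counts)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            counts["gap"] += 1        # one residue aligned to a gap symbol
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    total = (match * counts["match"]
             + mismatch * counts["mismatch"]
             + gap * counts["gap"])
    return total, counts


total, counts = score_alignment("GAATTCAGTTA", "GGA-TC-G--A")
print(total, counts)
```

With these scores only matches contribute, so the alignment scores 6.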


Local alignment

Given two sequences, S and T, find two substrings (contiguous subsequences), s and t, whose alignment has the highest score among all such pairs.

Question: Why do we need local alignment, if we have the global one?


Local alignment: an example

EGR4_HUMAN KA [FACPVESCVRSFARSDELNRHLRIH] TGHKP [FQCRICLRNFSRSDHLTSHVRTH] TGEKP [FACDV--CGRRFARSDEKKRHSKVH]

EGR4_RAT KA [FACPVESCVRTFARSDELNRHLRIH] TGHKP [FQCRICLRNFSRSDHLTTHVRTH] TGEKP [FACDV--CGRRFARSDEKKRHSKVH]

EGR3_HUMAN RP [HACPAEGCDRRFSRSDELTRHLRIH] TGHKP [FQCRICMRSFSRSDHLTTHIRTH] TGEKP [FACEF--CGRKFARSDERKRHAKIH]

EGR3_RAT RP [HACPAEGCDRRFSRSDELTRHLRIH] TGHKP [FQCRICMRSFSRSDHLTTHIRTH] TGEKP [FACEF--CGRKFARSDERKRHAKIH]

EGR1_HUMAN RP [YACPVESCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDI--CGRKFARSDERKRHTKIH]

EGR1_MOUSE RP [YACPVESCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDI--CGRKFARSDERKRHTKIH]

EGR1_RAT RP [YACPVESCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDI--CGRKFARSDERKRHTKIH]

EGR1_BRARE RP [YACPVETCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACEI--CGRKFARSDERKRHTKIH]

EGR2_RAT RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH]

EGR2_XENLA RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH]

EGR2_MOUSE RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH]

EGR2_HUMAN RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH]

EGR2_BRARE RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDF--CGRKFARSDERKRHTKIH]

MIG1_KLULA -- [-------------------------] ---RP [YVCPICQRGFHRLEHQTRHIRTH] TGERP [HACDFPGCSKRFSRSDELTRHRRIH]

MIG1_KLUMA -- [-------------------------] ---RP [YMCPICHRGFHRLEHQTRHIRTH] TGERP [HACDFPGCAKRFSRSDELTRHRRIH]

MIG1_YEAST -- [-------------------------] ---RP [HACPICHRAFHRLEHQTRHMRIH] TGEKP [HACDFPGCVKRFSRSDELTRHRRIH]

MIG2_YEAST -- [-------------------------] ---RP [FRCDTCHRGFHRLEHKKRHLRTH] TGEKP [HHCAFPGCGKSFSRSDELKRHMRTH]

[ ] :* [. * * * * * :* . *:* *] ***:* [. * * : *:**** .** : *]


Local alignment (cntd)

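[Figure: DP matrix M with rows indexed by S (i-1, i) and columns by T (j-1, j); cell M(i,j) is filled from its three neighbours.]

$$
M_{i,j} = \max
\begin{cases}
M_{i-1,j-1} + \mathrm{Score}(S_i, T_j) \\
M_{i,j-1} + \gamma \\
M_{i-1,j} + \gamma \\
0
\end{cases}
$$

Score(Si, Tj): similarity score taken from a DNA matrix, PAM or BLOSUM

γ: gap penalty

Smith & Waterman, 1981

Similarity scoring, expected value:

• negative for random alignments

• positive for highly similar sequences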


The Smith-Waterman Algorithm

1. Initialization

F(0,0) = F(0,j) = F(i,0) = 0

2. Iteration

for i = 1,…,M
  for j = 1,…,N
    - calculate optimal F(i,j)
    - store Ptr(i,j)

3. Termination

• Find the end of the best alignment with F_OPT = max_{i,j} F(i,j) and trace back, OR

• Find all alignments with F(i,j) > threshold and trace back
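The three steps can be sketched as follows. This is an illustrative version under stated assumptions: the name `smith_waterman` and the scores (match = 2, mismatch = -1, gap = -2) are made up for the example, only the single best alignment is reported (the first termination option), and the traceback recomputes which move produced each cell instead of storing Ptr(i,j).

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap score (hypothetical helper).

    Returns (best score, aligned part of s, aligned part of t).
    """
    def sub(a, b):
        return match if a == b else mismatch

    m, n = len(s), len(t)
    # 1. Initialization: first row and column are all zero.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    # 2. Iteration: same recurrence as the global case, floored at 0.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),
                          F[i][j - 1] + gap,
                          F[i - 1][j] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # 3. Termination: trace back from the maximum until a zero cell is reached.
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + sub(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))
```

Reporting every alignment above a threshold (the second termination option) would need one traceback per qualifying cell and some rule for discarding overlapping results; that is left out here.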


Local vs. global alignment


Local vs. global alignment (cntd)


Local alignment (cntd)

Characteristics of local alignments:

• The alignment can start/end at any point in the matrix.

• No negative values in the alignment matrix: every cell is floored at 0.

• The expected (mean) score of the scoring matrix (e.g. PAM, BLOSUM) over random residue pairs should be negative, but the matrix must contain some positive scores.
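The last condition is easy to check numerically. The sketch below computes the expected score of a random residue pair; the function name `expected_score`, the uniform DNA base frequencies and the two scoring schemes are assumptions picked for illustration.

```python
def expected_score(freqs, score):
    """Expected score of a pair of residues drawn independently from freqs."""
    return sum(freqs[a] * freqs[b] * score(a, b)
               for a in freqs for b in freqs)


uniform_dna = {base: 0.25 for base in "ACGT"}   # assumed background frequencies

def plus_minus_one(a, b):
    return 1 if a == b else -1                  # match +1, mismatch -1

def match_only(a, b):
    return 1 if a == b else 0                   # match +1, mismatch 0

print(expected_score(uniform_dna, plus_minus_one))  # -0.5
print(expected_score(uniform_dna, match_only))      # 0.25
```

The first scheme has a negative expectation and some positive entries, so it is usable for local alignment. The second has a positive expectation, so random extensions would keep raising the score and local alignments would tend to grow to the full sequence length.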


Scoring the gaps more accurately

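• A naive model

Gap penalty is linear in the gap length

Nature “prefers” to place gaps where other gaps exist

• Convex gap penalty function

γ(n+1) - γ(n) ≤ γ(n) - γ(n-1)

(γ(n) here is the magnitude of the penalty for a gap of length n, so each additional gap position costs no more than the previous one)

Time O(N²M), space O(NM) (assume N > M)

[Figures: plots of γ(n) against gap length n for the linear and the convex penalty]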


Scoring gaps: affine gaps

• Affine gaps: a compromise between linear and convex gap penalties

γ(n) = -d - e * (n-1)

d: gap initiation penalty

e: gap extension penalty

[Figure: plot of γ(n) against gap length n, with d and e marked]
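The slides leave the affine-gap algorithm to the blackboard, so the sketch below is not taken from the lecture. It assumes the widely used three-matrix formulation (one matrix per column type: residue-residue, gap in T, gap in S) and computes only the optimal global score. The names `affine_gap` and `affine_global_score`, and the values d = 3, e = 1, match = 1, mismatch = -1, are illustrative assumptions.

```python
NEG_INF = float("-inf")

def affine_gap(n, d=3, e=1):
    """gamma(n) = -d - e*(n-1): score of one gap of length n >= 1."""
    if n < 1:
        raise ValueError("gap length must be at least 1")
    return -d - e * (n - 1)

def affine_global_score(s, t, match=1, mismatch=-1, d=3, e=1):
    """Optimal global alignment score under affine gaps (three-state DP)."""
    m, n = len(s), len(t)
    # M: s[i-1] aligned to t[j-1]; X: s[i-1] aligned to a gap;
    # Y: t[j-1] aligned to a gap. Unreachable states hold -infinity.
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    X = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = affine_gap(i, d, e)
    for j in range(1, n + 1):
        Y[0][j] = affine_gap(j, d, e)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Opening a gap costs d; extending an existing one costs e.
            X[i][j] = max(M[i - 1][j] - d, X[i - 1][j] - e, Y[i - 1][j] - d)
            Y[i][j] = max(M[i][j - 1] - d, Y[i][j - 1] - e, X[i][j - 1] - d)
    return max(M[m][n], X[m][n], Y[m][n])


# One gap of length 3 is cheaper than three separate gaps of length 1.
print(affine_gap(3), 3 * affine_gap(1))   # -5 -9
```

Each cell is still filled in constant time, so the affine model keeps the O(NM) running time of the linear model while favouring a few long gaps over many short ones.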


Database searches


DNA and protein databases

• EMBL/GenBank/DDBJ database of nucleic acids


DNA and protein databases

• EMBL/GenBank/DDBJ database of nucleic acids (cntd)


DNA and protein databases

• SWISS-PROT & TrEMBL database of proteins


Database searches

• Database searching consists of many pairwise alignments combined in one search.

• It helps determine the function and the evolutionary relationships of a sequence.

• Heuristic algorithms are used instead of DP. Why?

• Size of SWISS-PROT + TrEMBL (Rel. 9.5):

3.9M entries or 1,276M residues.

• Exact algorithms take O(NM) time for sequences of lengths N and M, which is too slow to run against every database entry.

• Heuristic methods can look at a small fraction of the search space that will include all (or most) of the high-scoring pairs.


BLAST algorithm

• Basic Local Alignment Search Tool - The method:

• For each “word” (of fixed-length) in the query sequence, make a list of all neighbouring “words” that score above some threshold.

• Scan the database for these words.

• Perform (ungapped) “hit extension” until score < threshold.

• Stop at maximum scoring extension.


BLAST algorithm (cntd)

• An example:

Query: CPICHRAFHRLEHQTRHMRIHTGEKPHAC

Word: HMR

Neighbouring words, scored against HMR with BLOSUM62:

HMR   18
HHR   -2+13
HIR   +1+13
HAR   -1+13

Selection: HMR, HIR, …
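The selection step can be reproduced with a few lines of Python. The sketch below hard-codes only the BLOSUM62 entries this example needs: H-H = 8 and R-R = 5 (the "+13" on the slide), M-M = 5, and the three M substitutions shown. The threshold of 12 and the function names are assumptions made for the illustration; the slide does not state the threshold it used.

```python
# Subset of BLOSUM62, limited to the pairs used in this example.
BLOSUM62_SUBSET = {
    ("H", "H"): 8, ("R", "R"): 5, ("M", "M"): 5,
    ("M", "H"): -2, ("M", "I"): 1, ("M", "A"): -1,
}

def pair_score(a, b):
    """Symmetric lookup in the partial matrix above."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    if (b, a) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(b, a)]
    raise KeyError(f"pair {a}/{b} is not in the example subset")

def word_score(word, candidate):
    """Sum of position-wise substitution scores between two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(word, candidate))

def neighbours(word, candidates, threshold):
    """Keep the candidates that score above the threshold against word."""
    return [c for c in candidates if word_score(word, c) > threshold]


candidates = ["HMR", "HHR", "HIR", "HAR"]
print([word_score("HMR", c) for c in candidates])    # [18, 11, 14, 12]
print(neighbours("HMR", candidates, threshold=12))   # ['HMR', 'HIR']
```

A real implementation would enumerate all 20³ three-letter words against the full matrix; the candidate list here is restricted to the four words shown on the slide.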


BLAST algorithm (cntd)

• An example:

Query: CPICHRAFHRLEHQTRHMRIHTGEKPHAC
                       H+R
Sbjct: CPLCDKAFHRLEHQTRHIRTHTGEKPHAC


BLAST algorithm (cntd)

• An example:

Query: CPICHRAFHRLEHQTRHMRIHTGEKPHAC
       CP+C +AFHRLEHQTRH+R HTGEKPHAC
Sbjct: CPLCDKAFHRLEHQTRHIRTHTGEKPHAC
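Ungapped hit extension can be sketched as below. Several choices here are assumptions, not the lecture's specification: a simple identity score (+1 match, -1 mismatch) replaces BLOSUM62, extension in each direction stops once the running score falls more than `drop` below the best score seen so far, and the names `extend_hit` and `_best_extension` are hypothetical. The result is always trimmed back to the maximum-scoring extension.

```python
def _best_extension(pairs, score, drop):
    """Walk along residue pairs; return (best gain, length giving that gain)."""
    best, best_len, run = 0, 0, 0
    for k, (a, b) in enumerate(pairs, start=1):
        run += score(a, b)
        if run > best:
            best, best_len = run, k
        elif best - run > drop:
            break                     # fell too far below the best: give up
    return best, best_len

def extend_hit(query, subject, q_start, s_start, word_len,
               match=1, mismatch=-1, drop=2):
    """Extend a word hit without gaps in both directions.

    Returns (score, query segment, subject segment).
    """
    def score(a, b):
        return match if a == b else mismatch

    q_end, s_end = q_start + word_len, s_start + word_len
    seed = sum(score(a, b) for a, b in
               zip(query[q_start:q_end], subject[s_start:s_end]))
    right, r_len = _best_extension(zip(query[q_end:], subject[s_end:]),
                                   score, drop)
    left, l_len = _best_extension(zip(reversed(query[:q_start]),
                                      reversed(subject[:s_start])),
                                  score, drop)
    return (seed + left + right,
            query[q_start - l_len:q_end + r_len],
            subject[s_start - l_len:s_end + r_len])


query = "CPICHRAFHRLEHQTRHMRIHTGEKPHAC"
subject = "CPLCDKAFHRLEHQTRHIRTHTGEKPHAC"
hit = query.index("HMR")              # the seed word from the example
print(extend_hit(query, subject, hit, hit, 3))
```

Under identity scoring the left end of the reported segment stops at the AFHR… stretch, because the conservative substitutions I/L and R/K that BLOSUM62 rewards (the "+" signs in the midline above) count as plain mismatches here.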


BLAST algorithm (cntd)

• The idea: a high-scoring alignment is very likely to contain a short stretch of very high-scoring matches.

• Word length: 3 (proteins) and 11 (DNA).

• HSP (high-scoring segment pair): multiple HSPs can be reported for each database entry.

• Gapped alignments: more recent BLAST versions perform gapped alignments.


BLAST flavours

Program   Query              Database
BLASTN    DNA                DNA
BLASTP    Protein            Protein
BLASTX    DNA (translated)   Protein
TBLASTN   Protein            DNA (translated)
TBLASTX   DNA (translated)   DNA (translated)

TBLASTX: DNA query to DNA database via translation


FASTA algorithm

• The method:

• For each pair of sequences (query, subject), identify all identical “word” matches of (fixed) length.

• Look for diagonals with many mutually supporting “word” matches.

• The best diagonals are used to extend the word matches to find the maximal scoring (ungapped) regions.

• Join ungapped regions, using gap costs.

• Align the two (sub)regions using full dynamic programming techniques.
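The first two steps of the method can be sketched as follows: index the query's words, collect identical word matches, and count them per diagonal (i - j). The function names, the word length k = 2 and the toy sequences are assumptions for illustration; the later rescoring, joining and DP steps are not shown.

```python
from collections import Counter, defaultdict

def word_hits(query, subject, k=2):
    """Return all (i, j) with query[i:i+k] == subject[j:j+k]."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)      # word -> positions in the query
    hits = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            hits.append((i, j))
    return hits

def best_diagonals(hits, top=3):
    """Rank diagonals (i - j) by the number of word hits lying on them."""
    return Counter(i - j for i, j in hits).most_common(top)


hits = word_hits("HEAGAWGHEE", "PAWHEAE", k=2)
print(sorted(hits))              # [(0, 3), (1, 4), (4, 1), (7, 3)]
print(best_diagonals(hits, 1))   # [(-3, 2)]
```

Two hits share diagonal -3 (HE at query 0 / subject 3 and EA at query 1 / subject 4), so that diagonal would be the first candidate for ungapped extension. Building the word index once makes the lookup proportional to the number of hits instead of comparing every pair of positions.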


FASTA algorithm (cntd)


FASTA algorithm (cntd)

• The idea: a high-scoring alignment is very likely to contain a short stretch of identities.

• Word length: 2 (proteins) and 4-6 (DNA).

• HSP (high-scoring segment pair): usually one (extended) gapped alignment is presented.


FASTA flavours

Program   Query     Database
FASTA3    DNA       DNA
FASTA3    Protein   Protein
FASTX3    DNA       Protein
TFASTA3   Protein   DNA