68
Bioinformatics Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute

Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

Bioinformatics

Sequence Analysis: Part I. Pairwisealignment and database searching

Fran Lewitter, Ph.D.Head, BiocomputingWhitehead Institute

Page 2: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 2

Bioinformatics Definitions“The use of computational methods to make biologicaldiscoveries.”

Fran Lewitter

Page 3: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 2

Bioinformatics Definitions“The use of computational methods to make biologicaldiscoveries.”

Fran Lewitter

“An interdisciplinary field involving biology, computerscience, mathematics, and statistics to analyze biologicalsequence data, genome content, and arrangement, and topredict the function and structure of macromolecules.”

David Mount

Page 4: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 3

Course SyllabusJan 7 Sequence Analysis I. Pairwise alignments, database searching including

BLAST (FL) [1, 2, 3]Jan 14 Sequence Analysis II. Database searching (continued), Pattern searching(FL)[7]Jan 21 No Class - Martin Luther King HolidayJan 28 Sequence Analysis III. Hidden Markov models, gene finding algorithms (FL)[8]

Feb 4 Computational Methods I. Genomic Resources and Unix (GB)Feb 11 Computational Methods II. Sequence analysis with Perl. (GB)Feb 18 No Class - President's BirthdayFeb 25 Computational Methods III. Sequence analysis with Perl and BioPerl (GB)

Mar 4 Proteins I. Multiple sequence alignments, phylogenetic trees (RL) [4, 6]Mar 11 Proteins II. Profile searches of databases, revealing protein motifs (RL) [9]Mar 18 Proteins III.Structural Genomics:structural comparisons and predictions (RL)

Mar 25 Microarrays: designing chips, clustering methods (FL)

Page 5: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 4

Course Information

• Lectures• Text book• Supplemental reading• Homework• Course project• Office hours - M 2-4, W 2-4• http://fladda.wi.mit.edu/bioinfo• Mailing list, [email protected], Subject: course

Page 6: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 5

Topics to Cover

• Introduction• Scoring alignments• Alignment methods• Significance of alignments• Database searching methods• Demo

Page 7: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 6

Topics to Cover

• Introduction– Why do alignments?– A bit of history– Definitions

• Scoring alignments• Alignment methods• Significance of alignments• Database searching methods• Demo

Page 8: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 7

Doolittle RF, Hunkapiller MW, Hood LE, DevareSG, Robbins KC, Aaronson SA, Antoniades

HN. Science 221:275-277, 1983.

Simian sarcoma virus onc gene, v-sis, is derivedfrom the gene (or genes) encoding a platelet-derived

growth factor.

Page 9: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 8

Cancer Gene Found

Homology to bacterial and yeast genes shed newlight on human disease process

Page 10: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 9

Evolutionary Basis of SequenceAlignment

• Similarity - observable quantity, such as percent identity

• Homology - conclusion drawn from datathat two genes share a commonevolutionary history; no metric is associatedwith this

Page 11: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 10

Some Definitions

An alignment is a mutual arrangement of twosequences, which exhibits where the twosequences are similar, and where they differ.An optimal alignment is one that exhibits themost correspondences and the leastdifferences. It is the alignment with thehighest score. May or may not bebiologically meaningful.

Page 12: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 11

Alignment Methods

• Global alignment - Needleman-Wunsch(1970) maximizes the number of matchesbetween the sequences along the entirelength of the sequences.

• Local alignment - Smith-Waterman (1981)is a modification of the dynamicprogramming algorithm gives the highestscoring local match between two sequences.

Page 13: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 12

Alignment Methods

Global vs Local

Modular proteins

Fn2 EGF Fn1 EGF Kringle CatalyticF12

EGFFn1 Kringle CatalyticKringlePLAT

Page 14: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 13

Database Searching Methods:Local Alignments

FASTA

Smith-Waterman

Gapped BLAST

OriginalBLAST

SENSITIVITY

SPEE

D

OriginalFASTA

Page 15: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 14

Possible AlignmentsA: T C A G A C G A G T GB: T C G G A G C T G

Page 16: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 14

Possible AlignmentsA: T C A G A C G A G T GB: T C G G A G C T G

I. T C A G A C G A G T GT C G G A - - G C T G

Page 17: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 14

Possible AlignmentsA: T C A G A C G A G T GB: T C G G A G C T G

I. T C A G A C G A G T GT C G G A - - G C T G

II. T C A G A C G A G T G

T C G G A - G C - T G

Page 18: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 14

Possible AlignmentsA: T C A G A C G A G T GB: T C G G A G C T G

I. T C A G A C G A G T GT C G G A - - G C T G

II. T C A G A C G A G T G

T C G G A - G C - T G III. T C A G A C G A G T G

T C G G A - G - C T G

Page 19: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 15

Topics to Cover

• Introduction• Scoring alignments

– Nucleotide vs Proteins• Alignment methods• Significance of alignments• Database searching methods• Demo

Page 20: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 16

Amino Acid SubstitutionMatrices

PAM - point accepted mutation based onglobal alignment [evolutionary model]

BLOSUM - block substitutions based on localalignments [similarity among conservedsequences]

Page 21: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 17

Substitution Matrices

BLOSUM 30

BLOSUM 62

BLOSUM 80

% identity

PAM 250 (80)

PAM 120 (66)

PAM 90 (50)

% change

Lesschange

Page 22: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 18

Part of BLOSUM 62 MatrixC S T P A G N

C 9 S -1 4T -1 1 5P -3 -1 -1 7A 0 1 0 -1 4G -3 0 -2 -2 0 6N -3 1 0 -2 -2 0

Log-odds = obs freq of aa substitutions freq expected by chance

Page 23: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 18

Part of BLOSUM 62 MatrixC S T P A G N

C 9 S -1 4T -1 1 5P -3 -1 -1 7A 0 1 0 -1 4G -3 0 -2 -2 0 6N -3 1 0 -2 -2 0

Log-odds = obs freq of aa substitutions freq expected by chance

Page 24: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 18

Part of BLOSUM 62 MatrixC S T P A G N

C 9 S -1 4T -1 1 5P -3 -1 -1 7A 0 1 0 -1 4G -3 0 -2 -2 0 6N -3 1 0 -2 -2 0

Log-odds = obs freq of aa substitutions freq expected by chance

Page 25: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 18

Part of BLOSUM 62 MatrixC S T P A G N

C 9 S -1 4T -1 1 5P -3 -1 -1 7A 0 1 0 -1 4G -3 0 -2 -2 0 6N -3 1 0 -2 -2 0

Log-odds = obs freq of aa substitutions freq expected by chance

Page 26: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 19

Part of PAM 250 MatrixC S T P A G N

C 1 2 S 0 2T -2 1 3P -3 1 0 6A -2 1 1 1 2G -3 1 0 -1 1 5N -4 1 0 -1 0 0

Log-odds = pair in homologous proteins pair in unrelated proteins by chance

Page 27: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 19

Part of PAM 250 MatrixC S T P A G N

C 1 2 S 0 2T -2 1 3P -3 1 0 6A -2 1 1 1 2G -3 1 0 -1 1 5N -4 1 0 -1 0 0

Log-odds = pair in homologous proteins pair in unrelated proteins by chance

Page 28: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 19

Part of PAM 250 MatrixC S T P A G N

C 1 2 S 0 2T -2 1 3P -3 1 0 6A -2 1 1 1 2G -3 1 0 -1 1 5N -4 1 0 -1 0 0

Log-odds = pair in homologous proteins pair in unrelated proteins by chance

Page 29: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 19

Part of PAM 250 MatrixC S T P A G N

C 1 2 S 0 2T -2 1 3P -3 1 0 6A -2 1 1 1 2G -3 1 0 -1 1 5N -4 1 0 -1 0 0

Log-odds = pair in homologous proteins pair in unrelated proteins by chance

Page 30: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 20

Gap Penalties

• Insertion and Deletions (indels)• Affine gap costs - a scoring system for gaps

within alignments that charges a penalty forthe existence of a gap and an additional per-residue penalty proportional to the gap’slength

Page 31: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 21

Example of simple scoringsystem for nucleic acids

• Match = +1 (ex. A-A, T-T, C-C, G-G)• Mismatch = -1 (ex. A-T, A-C, etc)• Gap opening = - 2• Gap extension = -1

Page 32: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 21

Example of simple scoringsystem for nucleic acids

• Match = +1 (ex. A-A, T-T, C-C, G-G)• Mismatch = -1 (ex. A-T, A-C, etc)• Gap opening = - 2• Gap extension = -1

T C A G A C G A G T GT C G G A - - G C T G

Page 33: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 21

Example of simple scoringsystem for nucleic acids

• Match = +1 (ex. A-A, T-T, C-C, G-G)• Mismatch = -1 (ex. A-T, A-C, etc)• Gap opening = - 2• Gap extension = -1

T C A G A C G A G T GT C G G A - - G C T G

Page 34: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 21

Example of simple scoringsystem for nucleic acids

• Match = +1 (ex. A-A, T-T, C-C, G-G)• Mismatch = -1 (ex. A-T, A-C, etc)• Gap opening = - 2• Gap extension = -1

T C A G A C G A G T GT C G G A - - G C T G+1 +1 -1 +1 +1 -2 -1 -1 -1 +1 +1 = 0

Page 35: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 22

Scoring for BLAST 2 SequencesScore = 94.0 bits (230), Expect = 6e-19Identities = 45/101 (44%), Positives = 54/101 (52%), Gaps = 7/101 (6%)

Query: 204 YTGPFCDV----DTKASCYDGRGLSYRGLARTTLSGAPCQPWASEATYRNVTAEQ---AR 256 Y+ FC + + CY G G +YRG T SGA C PW S V Q A+Sbjct: 198 YSSEFCSTPACSEGNSDCYFGNGSAYRGTHSLTESGASCLPWNSMILIGKVYTAQNPSAQ 257

Query: 257 NWGLGGHAFCRNPDNDIRPWCFVLNRDRLSWEYCDLAQCQT 297 GLG H +CRNPD D +PWC VL RL+WEYCD+ C TSbjct: 258 ALGLGKHNYCRNPDGDAKPWCHVLKNRRLTWEYCDVPSCST 298

Position 1: Y - Y = 7Position 2: T - S = 1Position 3: G - S = 0Position 4: P - E = -1 . . .Position 9: - - P = -11Position 10: - - A = -1

. . . Sum 230

Based onBLOSUM62

Page 36: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 23

Topics to Cover• Introduction• Scoring alignments• Alignment methods

– Dot matrix analysis– Exhaustive methods; Dynamic programming algorithm

(Smith-Waterman (Local), Needleman-Wunsch(Global)

– Heuristic methods; Approximate methods; word or k-tuple (FASTA, BLAST)

• Significance of alignments• Database searching methods• Demo

Page 37: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 24

Database Searching Methods:Local Alignments

FASTA

Smith-Waterman

Gapped BLAST

OriginalBLAST

SENSITIVITY

SPEE

D

OriginalFASTA

Page 38: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 25

Dot Matrix Comparison

CoFaX11

Window Size = 8 Scoring Matrix: pam250 matrixMin. % Score = 50Hash Value = 2

100 200 300 400 500 600

100

200

300

400

500

F1EKK

Cata

lytic

CatalyticF2 E F1 E K

Page 39: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 26

Dot Matrix Comparison

FLO11

Window Size = 16 Scoring Matrix: pam250 matrixMin. % Score = 60Hash Value = 2

200 400 600 800 1000 1200

200

400

600

800

1000

1200

Page 40: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 26

Dot Matrix Comparison

FLO11

Window Size = 16 Scoring Matrix: pam250 matrixMin. % Score = 60Hash Value = 2

200 400 600 800 1000 1200

200

400

600

800

1000

1200

FLO11

Window Size = 16 Scoring Matrix: pam250 matrixMin. % Score = 60Hash Value = 2

950 1000 1050 1100 1150 1200 1250 1300 1350

900

1000

1100

1200

1300

Page 41: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 26

Dot Matrix Comparison

FLO11

Window Size = 16 Scoring Matrix: pam250 matrixMin. % Score = 60Hash Value = 2

200 400 600 800 1000 1200

200

400

600

800

1000

1200

FLO11

Window Size = 16 Scoring Matrix: pam250 matrixMin. % Score = 60Hash Value = 2

950 1000 1050 1100 1150 1200 1250 1300 1350

900

1000

1100

1200

1300

FLO11

Window Size = 16 Scoring Matrix: pam250 matrixMin. % Score = 60Hash Value = 2

200 220 240 260 280 300 320 340 360 380

200

220

240

260

280

300

320

340

360

380

400

Page 42: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 27

Dynamic Programming

Page 43: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 27

Dynamic Programming

• Provides very best or optimal alignment

Page 44: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 27

Dynamic Programming

• Provides very best or optimal alignment• Compares every pair of characters (e.g.

bases or amino acids) in the two sequences

Page 45: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 27

Dynamic Programming

• Provides very best or optimal alignment• Compares every pair of characters (e.g.

bases or amino acids) in the two sequences• Puts in gaps and mismatches

Page 46: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 27

Dynamic Programming

• Provides very best or optimal alignment• Compares every pair of characters (e.g.

bases or amino acids) in the two sequences• Puts in gaps and mismatches• Maximum number of matches between

identical or related characters

Page 47: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 27

Dynamic Programming

• Provides very best or optimal alignment• Compares every pair of characters (e.g.

bases or amino acids) in the two sequences• Puts in gaps and mismatches• Maximum number of matches between

identical or related characters• Generates a score and statistical assessment

Page 48: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 28

Dynamic Programming

-27A-24A-21T-18G-15C-12A-9T-6T-3G

-18-15-12-9-6-30GapTAATATGap

Match = +2

Mismatch = -1

Gap = -3

Page 49: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 28

Dynamic Programming

-27A-24A-21T-18G-15C-12A-9T-6T-3G

-18-15-12-9-6-30GapTAATATGap

Match = +2

Mismatch = -1

Gap = -3

Page 50: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 28

Dynamic Programming

-27A-24A-21T-18G-15C-12A-9T-6T-3G

-18-15-12-9-6-30GapTAATATGap

GT

= -1

- TG -

T-- G

= -6

= -6 -1

Match = +2

Mismatch = -1

Gap = -3

Page 51: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 28

Dynamic Programming

-27A-24A-21T-18G-15C-12A-9T-6T-3G

-18-15-12-9-6-30GapTAATATGap

-1 -6-6 -1

Match = +2

Mismatch = -1

Gap = -3

Page 52: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 28

Dynamic Programming

-27A-24A-21T-18G-15C-12A-9T-6T-3G

-18-15-12-9-6-30GapTAATATGap

-1 -6-6 -1

Match = +2

Mismatch = -1

Gap = -3

Page 53: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 28

Dynamic Programming

-27A-24A-21T-18G-15C-12A-9T-6T-3G

-18-15-12-9-6-30GapTAATATGap

-1 -6-6 -1

-4 -9-4 -4

Match = +2

Mismatch = -1

Gap = -3

Page 54: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 28

Dynamic Programming

-27A-24A-21T-18G-15C-12A-9T-6T-3G

-18-15-12-9-6-30GapTAATATGap

-1 -6-6 -1

-4 -9-4 -4

Match = +2

Mismatch = -1

Gap = -3

Page 55: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 28

Dynamic Programming

-27A-24A-21T-18G-15C-12A-9T-6T-3G

-18-15-12-9-6-30GapTAATATGap

-1 -6-6 -1

-4 -9-4 -4

-8 -5-13 -5

Match = +2

Mismatch = -1

Gap = -3

Page 56: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 28

Dynamic Programming

-27A-24A-21T-18G-15C-12A-9T-6T-3G

-18-15-12-9-6-30GapTAATATGap

-1 -6-6 -1

-4 -9-4 -4

-8 -5-13 -5

Match = +2

Mismatch = -1

Gap = -3

Page 57: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 28

Dynamic Programming

-27A-24A-21T-18G-15C-12A-9T-6T-3G

-18-15-12-9-6-30GapTAATATGap

-1 -6-6 -1

-4 -9-4 -4

-8 -5-13 -5

-6 -6-5 -5

Match = +2

Mismatch = -1

Gap = -3

Page 58: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 28

Dynamic Programming

-27A-24A-21T-18G-15C-12A-9T-6T-3G

-18-15-12-9-6-30GapTAATATGap

-1 -6-6 -1

-4 -9-4 -4

-8 -5-13 -5

-6 -6-5 -5

Match = +2

Mismatch = -1

Gap = -3

Page 59: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 29

Dynamic Programming

-2-7-12-17-22-27A-3-5-4-9-14-19-24A0-5-7-6-11-16-21T0-2-4-6-8-13-18G-21-1-3-10-15C-4-12-3-2-7-12A-6-6-30-2-4-9T

-11-8-5-2-2-1-6T-16-13-10-7-3G-18-15-12-9-6-300TAATAT0

-1 -6-6 -1

-4 -9-4 -4

-8 -5-13 -5

-6 -6-5 -5

Page 60: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 30

Dynamic Programming

-2-7-12-17-22-27A-3-5-4-9-14-19-24A0-5-7-6-11-16-21T0-2-4-6-8-13-18G-21-1-3-10-15C-4-12-3-2-7-12A-6-6-30-2-4-9T

-11-8-5-2-2-1-6T-16-13-10-7-3G-18-15-12-9-6-300TAATAT0

-1 -6-6 -1

-4 -9-4 -4

-8 -5-13 -5

-6 -6-5 -5

- T - A - - T A A T

G T T A C G T A A -

- - T A - - T A A T

Page 61: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 31

Global vs Local Alignment

Examples of aligning the same two proteins bothglobally and locally.

See Chapter 3, example 1 on the online site forBioinformatics by Mount.

Page 62: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 32

Original “Ungapped” BLASTAlgorithm

• To improve speed, use a word based hashingscheme to index database

• Limit search for similarities to only the regionnear matching words

• Use Threshold parameter to rate neighbor words

• Extend match left and right to search for highscoring alignments

Page 63: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 33

Original BLAST AlgorithmQuery word (W=3)

Query: GSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMPQG 18 PHG 13PEG 15 PMG 13PNG 13 PTG 12PDG 13 Etc.

NeighborhoodScore threshold(T=13)

Query: 325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA+LA++L+ TP G R++ +W+ P+ D + ER I A

Sbjct: 290 TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA

Neighborhoodwords

Page 64: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 34

Recent BLAST Refinements

• “two-hit” method for extending word pairs• Gapped alignments• Iterate with position-specific matrix (PSI-

BLAST)• Pattern-hit initiated BLAST (PHI-BLAST)

Page 65: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 35

Gapped BLAST

15(+) > 1322(•) > 11

From: NucleicAcids Research,1997, Vol. 25,

No. 173389–3402

Page 66: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 35

Gapped BLAST

15(+) > 1322(•) > 11

From: NucleicAcids Research,1997, Vol. 25,

No. 173389–3402

Page 67: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 36

Gapped BLAST

From: NucleicAcids Research,1997, Vol. 25,

No. 173389–3402

Page 68: Sequ en ce An alysis: Part I. Pairwise alignment and ...barc.wi.mit.edu/education/bioinfo2003/lecture1-color.pdfBioinformatics Sequ en ce An alysis: Part I. Pairwise alignment and

WIBR Bioinformatics Course, © Whitehead Institute, 2002 37

Programs to Compare twosequences

Macintosh– MacVector - Pustell Protein Matrix (DotPlot)

Web– BLAST 2 Sequences– RepeatFinder– lalign

GCG/Unix– BestFit - Smith-Waterman (randomize)– Gap - Needleman -Wunsch (randomize)– Dotter (dot plot)