42
Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Director Bioinformatics & Research Computing Whitehead Institute

alignment and database searching

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: alignment and database searching

Bioinformatics for Biologists

Sequence Analysis: Part I. Pairwisealignment and database searching

Fran Lewitter, Ph.D.DirectorBioinformatics & Research ComputingWhitehead Institute

Page 2: alignment and database searching

2WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Topics to Cover

• Introduction• Scoring alignments• Alignment methods• Significance of alignments• Database searching methods

Page 3: alignment and database searching

3WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Topics to Cover• Introduction

– Why do alignments?– A bit of history– Definitions

• Scoring alignments• Alignment methods• Significance of alignments• Database searching methods

Page 4: alignment and database searching

4WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Doolittle RF, Hunkapiller MW, Hood LE, DevareSG, Robbins KC, Aaronson SA, Antoniades

HN. Science 221:275-277, 1983.

Simian sarcoma virus onc gene, v-sis, is derivedfrom the gene (or genes) encoding a platelet-

derived growth factor.

Page 5: alignment and database searching

5WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Evolutionary Basis ofSequence Alignment

• Similarity - observable quantity, such aspercent identity

• Homology - conclusion drawn from datathat two genes share a commonevolutionary history; no metric isassociated with this

Page 6: alignment and database searching

6WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Some Definitions• An alignment is a mutual arrangement of

two sequences, which exhibits where thetwo sequences are similar, and where theydiffer.

• An optimal alignment is one that exhibitsthe most correspondences and the leastdifferences. It is the alignment with thehighest score. May or may not bebiologically meaningful.

Page 7: alignment and database searching

7WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Alignment Methods

• Global alignment - Needleman-Wunsch(1970) maximizes the number of matchesbetween the sequences along the entirelength of the sequences.

• Local alignment - Smith-Waterman (1981)is a modification of the dynamicprogramming algorithm giving the highestscoring local match between two sequences.

Page 8: alignment and database searching

8WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Alignment MethodsGlobal vs Local

Modular proteins

Fn2 EGF Fn1 EGF Kringle CatalyticF12

EGFFn1 Kringle CatalyticKringlePLAT

Page 9: alignment and database searching

9WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Local vs Global AlignmentL G P S S K Q T G K G S - S R I W D NL N - I T K S A G K G A I M R L G D A

GLOBAL

- - - - - - - T G K G - - - - - - - -- - - - - - - A G K G - - - - - - - -

LOCALFrom Mount, Bioinformatics, 2004, pg 71

Page 10: alignment and database searching

10WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Possible AlignmentsA: T C A G A C G A G T GB: T C G G A G C T G

I. T C A G A C G A G T GT C G G A - - G C T G

II. T C A G A C G A G T G

T C G G A - G C - T G III. T C A G A C G A G T G

T C G G A - G - C T G

Page 11: alignment and database searching

11WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Topics to Cover

• Introduction• Scoring alignments

– Nucleotide vs Proteins– Definitions

• Alignment methods• Significance of alignments• Database searching methods

Page 12: alignment and database searching

12WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Example of simple scoringsystem for nucleic acids

• Match = +1 (ex. A-A, T-T, C-C, G-G)• Mismatch = -1 (ex. A-T, A-C, etc)• Gap opening = - 5• Gap extension = -2

T C A G A C G A G T GT C G G A - - G C T G+1 +1 -1 +1 +1 -5 -2 -1 -1 +1 +1 = -4

Page 13: alignment and database searching

13WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Possible AlignmentsA:T C A G A C G A G T GB:T C G G A G C T G

I. T C A G A C G A G T GT C G G A - - G C T G

II. T C A G A C G A G T G

T C G G A - G C - T G III. T C A G A C G A G T G

T C G G A - G - C T G

-4

-5

-5

Page 14: alignment and database searching

14WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Amino Acid SubstitutionMatrices

• PAM - point accepted mutation based onglobal alignment [evolutionary model]

• BLOSUM - block substitutions based onlocal alignments [similarity among conservedsequences]

Page 15: alignment and database searching

15WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Substitution MatricesBLOSUM 30

BLOSUM 62

BLOSUM 80

% identity

PAM 250 (80)

PAM 120 (66)

PAM 90 (50)

% change

Lesschange

Page 16: alignment and database searching

16WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Part of BLOSUM 62 MatrixC S T P A G N

C 9 S -1 4T -1 1 5P -3 -1 -1 7A 0 1 0 -1 4G -3 0 -2 -2 0 6N -3 1 0 -2 -2 0

Log-odds = obs freq of aa substitutions freq expected by chance

Page 17: alignment and database searching

17WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Part of PAM 250 MatrixC S T P A G N

C 12 S 0 2T -2 1 3P -3 1 0 6A -2 1 1 1 2G -3 1 0 -1 1 5N -4 1 0 -1 0 0

Log-odds = pair in homologous proteins pair in unrelated proteins by chance

Page 18: alignment and database searching

18WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Gap Penalties

• Insertion and Deletions (indels)• Affine gap costs - a scoring system for gaps

within alignments that charges a penalty forthe existence of a gap and an additional per-residue penalty proportional to the gap’slength

Page 19: alignment and database searching

19WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Scoring for BLAST 2 SequencesScore = 94.0 bits (230), Expect = 6e-19Identities = 45/101 (44%), Positives = 54/101 (52%), Gaps = 7/101 (6%)

Query: 204 YTGPFCDV----DTKASCYDGRGLSYRGLARTTLSGAPCQPWASEATYRNVTAEQ---AR 256 Y+ FC + + CY G G +YRG T SGA C PW S V Q A+Sbjct: 198 YSSEFCSTPACSEGNSDCYFGNGSAYRGTHSLTESGASCLPWNSMILIGKVYTAQNPSAQ 257

Query: 257 NWGLGGHAFCRNPDNDIRPWCFVLNRDRLSWEYCDLAQCQT 297 GLG H +CRNPD D +PWC VL RL+WEYCD+ C TSbjct: 258 ALGLGKHNYCRNPDGDAKPWCHVLKNRRLTWEYCDVPSCST 298

Position 1: Y - Y = 7Position 2: T - S = 1Position 3: G - S = 0Position 4: P - E = -1 . . .Position 9: - - P = -11Position 10: - - A = -1

. . . Sum 230

Based onBLOSUM62

Page 20: alignment and database searching

20WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Topics to Cover• Introduction• Scoring alignments• Alignment methods

– Dot matrix analysis– Exhaustive methods; Dynamic programming algorithm

(Smith-Waterman (Local), Needleman-Wunsch(Global))

– Heuristic methods; Approximate methods; word or k-tuple (FASTA, BLAST, BLAT)

• Significance of alignments• Database searching methods

Page 21: alignment and database searching

21WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Dot Matrix Comparison

CoFaX11

Window Size = 8 Scoring Matrix: pam250 matrixMin. % Score = 50Hash Value = 2

100 200 300 400 500 600

100

200

300

400

500

F1EKK

Cata

lytic

CatalyticF2 E F1 E K

Page 22: alignment and database searching

22WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Dot Matrix Comparison

FLO11

Window Size = 16 Scoring Matrix: pam250 matrixMin. % Score = 60Hash Value = 2

200 400 600 800 1000 1200

200

400

600

800

1000

1200

FLO11

Window Size = 16 Scoring Matrix: pam250 matrixMin. % Score = 60Hash Value = 2

950 1000 1050 1100 1150 1200 1250 1300 1350

900

1000

1100

1200

1300

FLO11

Window Size = 16 Scoring Matrix: pam250 matrixMin. % Score = 60Hash Value = 2

200 220 240 260 280 300 320 340 360 380

200

220

240

260

280

300

320

340

360

380

400

Page 23: alignment and database searching

23WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Dynamic Programming• Provides very best or optimal alignment• Compares every pair of characters (e.g. bases or

amino acids) in the two sequences• Puts in gaps and mismatches• Maximum number of matches between identical

or related characters• Generates a score and statistical assessment• Nice example of global alignment using N-W:

http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dynamic.html

Page 24: alignment and database searching

24WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

BLAST Algorithm (1990)“Ungapped” alignment

• To improve speed, use a word based hashingscheme to index database

• Limit search for similarities to only the regionnear matching words

• Use Threshold parameter to rate neighbor words

• Extend match left and right to search for highscoring alignments

Page 25: alignment and database searching

25WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Original BLAST Algorithm (1990)Query word (W=3)

Query: GSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMPQG 18 PHG 13PEG 15 PMG 13PNG 13 PTG 12PDG 13 Etc.

NeighborhoodScorethreshold(T=13)

Query: 325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA+LA++L+ TP G R++ +W+ P+ D + ER I A

Sbjct: 290 TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA

Neighborhoodwords

Page 26: alignment and database searching

26WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

BLAST Refinements (1997)

• “two-hit” method for extending word pairs• Gapped alignments• Additional algorithms

– Iterate with position-specific matrix (PSI-BLAST)

– Pattern-hit initiated BLAST (PHI-BLAST)

Page 27: alignment and database searching

27WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Gapped BLAST

15(+) > 1322(•) > 11

(Altschul et al 1997)

Page 28: alignment and database searching

28WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Gapped BLAST

(Altschul et al 1997)

Page 29: alignment and database searching

29WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Programs to Compare twosequences - Unix or Web

NCBIBLAST 2 Sequences

EMBOSSwater - Smith-Watermanneedle - Needleman -Wunschdotmatch (dot plot)einverted or palindrome (inverted repeats)equicktandem or etandem (tandem repeats)

Otherlalign (multiple matching subsegments in two sequences)

Page 30: alignment and database searching

30WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Topics to Cover• Introduction• Scoring alignments• Alignment methods• Significance of alignments

– How strong can alignment be by chance alone?• Database searching methods

Page 31: alignment and database searching

31WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Statistical Significance• Raw Scores - score of an alignment equal to the sum of

substitution and gap scores.• Bit scores - scaled version of an alignment’s raw score

that accounts for the statistical properties of the scoringsystem used.

• E-value - expected number of distinct alignments thatwould achieve a given score by chance. Lower E-value =>more significant.

Page 32: alignment and database searching

32WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Some formulas

E = Kmn e-λS

This is the Expected number of high-scoringsegment pairs (HSPs) with score at least S

for sequences of length m and n.

This is the E value for the score S.

Page 33: alignment and database searching

33WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Topics to Cover• Introduction• Scoring alignments• Alignment methods• Significance of alignments• Database searching methods

– BLAST– BLAST vs. FASTA– BLAT

Page 34: alignment and database searching

34WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Questions• Why do a database search?• What database should be searched?• What alignment algorithm to use?• What do the results mean?• What parameters can be changed?

– Substitution matrices– Statistical significance– Filtering for low complexity

Page 35: alignment and database searching

35WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

BLASTP Results

Page 36: alignment and database searching

36WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

WU-BLAST vs NCBI BLAST• WU-BLAST first for gapped alignments• Use different scoring system for gaps• Report different statistics• WU-BLAST does not filter low-complexity by

default• WU-BLAST looks for and reports multiple

regions of similarity• Results will be different!

Page 37: alignment and database searching

37WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

BLAT• Blast-Like Alignment Tool• Developed by Jim Kent at UCSC• For DNA it is designed to quickly find sequences of >=

95% similarity of length 40 bases or more.• For proteins it finds sequences of >= 80% similarity of

length 20 amino acids or more.• DNA BLAT works by keeping an index of the entire

genome in memory - non-overlapping 11-mers (< 1 GB ofRAM)

• Protein BLAT uses 4-mers (~ 2 GB)

Page 38: alignment and database searching

38WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

FASTA• Index "words" and locate identities• Rescore best 10 regions• Find optimal subset of initial regions that

can be joined to form single alignment• Align highest scoring sequences using

Smith-Waterman

Page 39: alignment and database searching

39WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Page 40: alignment and database searching

40WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Basic Searching Strategies

• Search early and often• Use specialized databases• Use multiple matrices• Use filters• Consider Biology

Page 41: alignment and database searching

41WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Exercises1. Working with Entrez at NCBI

1. Limits, history, preview2. Batch Entrez

2. Pairwise alignment and database searching1. Self comparison with dottup2. Local vs. global alignment3. Sequence revision history, ReadSeq and BL2SEQ4. BLAT searching5. Comparing BLAST, WU-BLAST, and FASTA6. The Lost World7. Redo exercises on hebrides

Page 42: alignment and database searching

42WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

References1. Class web site:• http://jura.wi.mit.edu/bio/education/bioinfo2005/seq

2. Books• David Mount - Bioinformatics: Sequence and

Genome Analysis, 2nd edition, CSHL Press, 2004.• Andreas D. Baxevanis and B. F. Francis Ouellette

(editors) - Bioinformatics: A Practical Guide to theAnalysis of Genes and Proteins, 3rd edition, Wiley,2004.