Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,

Fundamentals of PredictingBiology from Sequence (Part1):

Pairwise alignment and databasesearching

MGH/MIND April 2005

Fran Lewitter, Ph.D.Director, Bioinformatics & Research ComputingWhitehead Institute for Biomedical Researchhttp://jura.wi.mit.edu/bio

2WI Intro to Sequence Analysis, © Whitehead Institute, April 2005

Bioinformatics Definitions“The use of computational methods to make biologicaldiscoveries.”

Fran Lewitter

“An interdisciplinary field involving biology, computerscience, mathematics, and statistics to analyze biologicalsequence data, genome content, and arrangement, and topredict the function and structure of macromolecules.”

David Mount


The Old Way


Topics to Cover• Introduction

– Why do alignments?– A bit of history– Definitions

• Scoring alignments• Alignment methods• Significance of alignments• Database searching methods• Pattern searching


Doolittle RF, Hunkapiller MW, Hood LE, DevareSG, Robbins KC, Aaronson SA, Antoniades

HN. Science 221:275-277, 1983.

Simian sarcoma virus onc gene, v-sis, is derivedfrom the gene (or genes) encoding a platelet-

derived growth factor.


Evolutionary Basis ofSequence Alignment

• Similarity - observable quantity, such aspercent identity

• Homology - conclusion drawn from datathat two genes share a commonevolutionary history; no metric isassociated with this


Some Definitions• An alignment is a mutual arrangement of

two sequences, which exhibits where thetwo sequences are similar, and where theydiffer.

• An optimal alignment is one that exhibitsthe most correspondences and the leastdifferences. It is the alignment with thehighest score. May or may not bebiologically meaningful.


Alignment Methods

• Global alignment - Needleman-Wunsch(1970) maximizes the number of matchesbetween the sequences along the entirelength of the sequences.

• Local alignment - Smith-Waterman (1981)is a modification of the dynamicprogramming algorithm giving the highestscoring local match between two sequences.


Alignment MethodsGlobal vs Local

Modular proteins

Fn2 EGF Fn1 EGF Kringle CatalyticF12

EGFFn1 Kringle CatalyticKringlePLAT


Local vs Global AlignmentL G P S S K Q T G K G S - S R I W D NL N - I T K S A G K G A I M R L G D A

GLOBAL

- - - - - - - T G K G - - - - - - - -- - - - - - - A G K G - - - - - - - -

LOCALFrom Mount, Bioinformatics, 2004, pg 71


Possible AlignmentsA: T C A G A C G A G T GB: T C G G A G C T G

I. T C A G A C G A G T GT C G G A - - G C T G

II. T C A G A C G A G T G

T C G G A - G C - T G III. T C A G A C G A G T G

T C G G A - G - C T G


Topics to Cover• Introduction• Scoring alignments

– Nucleotide vs Proteins– Definitions

• Alignment methods• Significance of alignments• Database searching methods• Pattern searching


Example of simple scoringsystem for nucleic acids

• Match = +1 (ex. A-A, T-T, C-C, G-G)• Mismatch = -1 (ex. A-T, A-C, etc)• Gap opening = - 5• Gap extension = -2

T C A G A C G A G T GT C G G A - - G C T G+1 +1 -1 +1 +1 -5 -2 -1 -1 +1 +1 = -4


Possible AlignmentsA:T C A G A C G A G T GB:T C G G A G C T G

I. T C A G A C G A G T GT C G G A - - G C T G

II. T C A G A C G A G T G

T C G G A - G C - T G III. T C A G A C G A G T G

T C G G A - G - C T G

-4

-5

-5


Gap Penalties

• Insertion and Deletions (indels)• Affine gap costs - a scoring system for gaps

within alignments that charges a penalty forthe existence of a gap and an additional per-residue penalty proportional to the gap’slength


Amino Acid SubstitutionMatrices

• PAM - point accepted mutation based onglobal alignment [evolutionary model]

• BLOSUM - block substitutions based onlocal alignments [similarity among conservedsequences]


Substitution MatricesBLOSUM 30

BLOSUM 62

BLOSUM 80

% identity

PAM 250 (80)

PAM 120 (66)

PAM 90 (50)

% change

Lesschange


Part of BLOSUM 62 MatrixC S T P A G N

C 9 S -1 4T -1 1 5P -3 -1 -1 7A 0 1 0 -1 4G -3 0 -2 -2 0 6N -3 1 0 -2 -2 0

Log-odds = obs freq of aa substitutions freq expected by chance


Part of PAM 250 MatrixC S T P A G N

C 12 S 0 2T -2 1 3P -3 1 0 6A -2 1 1 1 2G -3 1 0 -1 1 5N -4 1 0 -1 0 0

Log-odds = pair in homologous proteins pair in unrelated proteins by chance


Scoring for BLAST 2 SequencesScore = 94.0 bits (230), Expect = 6e-19Identities = 45/101 (44%), Positives = 54/101 (52%), Gaps = 7/101 (6%)

Query: 204 YTGPFCDV----DTKASCYDGRGLSYRGLARTTLSGAPCQPWASEATYRNVTAEQ---AR 256 Y+ FC + + CY G G +YRG T SGA C PW S V Q A+Sbjct: 198 YSSEFCSTPACSEGNSDCYFGNGSAYRGTHSLTESGASCLPWNSMILIGKVYTAQNPSAQ 257

Query: 257 NWGLGGHAFCRNPDNDIRPWCFVLNRDRLSWEYCDLAQCQT 297 GLG H +CRNPD D +PWC VL RL+WEYCD+ C TSbjct: 258 ALGLGKHNYCRNPDGDAKPWCHVLKNRRLTWEYCDVPSCST 298

Position 1: Y - Y = 7Position 2: T - S = 1Position 3: G - S = 0Position 4: P - E = -1 . . .Position 9: - - P = -11Position 10: - - A = -1

. . . Sum 230

Based onBLOSUM62


Topics to Cover• Introduction• Scoring alignments• Alignment methods

– Dot matrix analysis– Exhaustive methods; Dynamic programming algorithm

(Smith-Waterman (Local), Needleman-Wunsch(Global))

– Heuristic methods; Approximate methods; word or k-tuple (FASTA, BLAST, BLAT)

• Significance of alignments• Database searching methods• Pattern searching


Dot Matrix Comparison

CoFaX11

Window Size = 8 Scoring Matrix: pam250 matrixMin. % Score = 50Hash Value = 2

100 200 300 400 500 600

100

200

300

400

500

F1EKK

Cata

lytic

CatalyticF2 E F1 E K


Dot Matrix Comparison

FLO11


200 400 600 800 1000 1200

200

400

600

800

1000

1200

FLO11


950 1000 1050 1100 1150 1200 1250 1300 1350

900

1000

1100

1200

1300

FLO11


200 220 240 260 280 300 320 340 360 380

200

220

240

260

280

300

320

340

360

380

400


BLAST Algorithm (1990)“Ungapped” alignment

• To improve speed, use a word based hashingscheme to index database

• Limit search for similarities to only the regionnear matching words

• Use Threshold parameter to rate neighbor words

• Extend match left and right to search for highscoring alignments


Original BLAST Algorithm (1990)Query word (W=3)

Query: GSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMPQG 18 PHG 13PEG 15 PMG 13PNG 13 PTG 12PDG 13 Etc.

NeighborhoodScorethreshold(T=13)

Query: 325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA+LA++L+ TP G R++ +W+ P+ D + ER I A

Sbjct: 290 TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA

Neighborhoodwords


BLAST Refinements (1997)

• “two-hit” method for extending word pairs• Gapped alignments• Additional algorithms

– Iterate with position-specific matrix (PSI-BLAST)

– Pattern-hit initiated BLAST (PHI-BLAST)


Gapped BLAST

15(+) > 1322(•) > 11

(Altschul et al 1997)


Gapped BLAST

(Altschul et al 1997)


Programs to Compare twosequences - Unix or Web

NCBIBLAST 2 Sequences

EMBOSSwater - Smith-Watermanneedle - Needleman -Wunschdotmatch (dot plot)einverted or palindrome (inverted repeats)equicktandem or etandem (tandem repeats)

Otherlalign (multiple matching subsegments in two sequences)


Topics to Cover• Introduction• Scoring alignments• Alignment methods• Significance of alignments

– How strong can alignment be by chance alone?• Database searching methods• Pattern searching


Statistical Significance• Raw Scores - score of an alignment equal to the sum of

substitution and gap scores.• Bit scores - scaled version of an alignment’s raw score

that accounts for the statistical properties of the scoringsystem used.

• E-value - expected number of distinct alignments thatwould achieve a given score by chance. Lower E-value =>more significant.


Some formulas

E = Kmn e-λS

This is the Expected number of high-scoringsegment pairs (HSPs) with score at least S

for sequences of length m and n.

This is the E value for the score S.


Topics to Cover• Introduction• Scoring alignments• Alignment methods• Significance of alignments• Database searching methods

– BLAST– BLAST vs. FASTA– BLAT

• Pattern searching


Questions• Why do a database search?• What database should be searched?• What alignment algorithm to use?• What do the results mean?• What parameters can be changed?

– Substitution matrices– Statistical significance– Filtering for low complexity


BLASTP Results


WU-BLAST vs NCBI BLAST• WU-BLAST first for gapped alignments• Use different scoring system for gaps• Report different statistics• WU-BLAST does not filter low-complexity by

default• WU-BLAST looks for and reports multiple

regions of similarity• Results will be different!


BLAT• Blast-Like Alignment Tool• Developed by Jim Kent at UCSC• For DNA it is designed to quickly find sequences of >=

95% similarity of length 40 bases or more.• For proteins it finds sequences of >= 80% similarity of

length 20 amino acids or more.• DNA BLAT works by keeping an index of the entire

genome in memory - non-overlapping 11-mers (< 1 GB ofRAM)

• Protein BLAT uses 4-mers (~ 2 GB)


FASTA• Index "words" and locate identities• Rescore best 10 regions• Find optimal subset of initial regions that

can be joined to form single alignment• Align highest scoring sequences using

Smith-Waterman



Basic Searching Strategies

• Search early and often• Use specialized databases• Use multiple matrices• Use filters• Consider Biology


Topics to Cover• Introduction• Scoring alignments• Alignment methods• Significance of alignments• Database searching methods• Pattern searching

– PSI-BLAST– PHI-BLAST– Finding patterns


PSI-BLAST• Position Specific Iterative BLAST uses a profile (or position

specific scoring matrix, PSSM) that is constructed (automatically)from a multiple alignment of the highest scoring hits in an initialBLAST search.

• The PSSM is generated by calculating position-specific scores foreach position in the alignment. Highly conserved positions receivehigh scores and weakly conserved positions receive scores near zero.

• The profile is used to perform a second (etc.) BLAST search and theresults of each "iteration" is used to refine the profile. This iterativesearching strategy results in increased sensitivity.


Start with a BLASTP search


PSI-BLAST - Iteration 1


PSSM from PSI-BLAST

POSITIONS

A R N D C Q E G H I L K M F P S T W Y V

1 0 2 3 2 4 1 1 4 3 0 3 3 7 3 3 2 1 0 1 2

2 6 0 3 3 5 4 0 3 2 5 0 1 2 2 4 1 3 2 4 2

3 4 3 0 3 3 1 3 2 4 2 3 2 5 0 1 2 1 0 5 7

4 3 2 3 2 4 9 3 3 5 4 0 3 2 5 1 2 2 4 1 2

5 0 1 2 2 4 1 6 3 3 1 3 2 0 4 8 3 1 0 3 0

6 4 3 2 …

• …

• …

N

Aminoacids


Pattern Hit Initiated (PHI)-BLAST

>HUMAN MSH2

MAVQPKETLQLESAAEVGFVRFFQGMPEKPTTTVRLFDRGDFYTAHGEDALLAAREVFKTQGVIKYMGPAGAKNLQSVVLSKMNFESFVKDLLLVRQYRVEVYKNRAGNKASKENDWYLAYKASPGNLSQFEDILFGNNDMSASIGVVGVKMSAVDGQRQVGVGYVDSIQRKLGLCEFPDNDQFSNLEALLIQIGPKECVLPGGETAGDMGKLRQIIQRGGILITERKKADFSTKDIYQDLNRLLKGKKGEQMNSAVLPEMENQVAVSSLSAVIKFLELLSDDSNFGQFELTTFDFSQYMKLDIAAVRALNLFQGSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDLLRRFPDLNRLAKKFQRQAANLQDCYRLYQGINQLPNVIQALEKHEGKHQKLLLAVFVTPLTDLRSDFSKFQEMIETTLDMDQVENHEFLVKPSFDPNLSELREIMNDLEKKMQSTLISAARDLGLDPGKQIKLDSSAQFGYYFRVTCKEEKVLRNNKNFSTVDIQKNGVKFTNSKLTSLNEEYTKNKTEYEEAQDAIVKEIVNISSGYVEPMQTLNDVLAQLDAVVSFAHVSNGAPVPYVRPAILEKGQGRIILKASRHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPCESAEVSIVDCILARVGAGDSQLKGVSTFMAEMLETASILRSATKDSLIIIDELGRGTSTYDGFGLAWAISEYIATKIGAFCMFATHFHELTALANQIPTVNNLHVTALTTEETLTMLYQVKKGVCDQSFGIHVAELANFPKHVIECAKQKALELEEFQYIGESQGYDIMEPAAKKCYLEREQGEKIIQEFLSKVKQMPFTEMSEENITIKLKQLKAEVIAKNNSFVNEIISRIKVTTDNA mismatch

repair proteins mutSfamily signature


PHI-BLAST


Pattern Searching

allow one mismatch, deletion andinsertion

p1=6...6 3..8 p1[1,1,1]

exact 6 character repeat separated byup to 8

p1=6...6 3...8 p1

hairpin with GAGA as the loopp1=6...8 GAGA ~p1

[ ] Acceptable { } Not acceptableX(2) 2 of anything in a rowX(2,4) from 2 to four of of anything

[DE](2)HS{P}X(2)PX(2,4)CExact pattern: TATAATATAA

4 purines followed by 4 pyrimidinesRRRRYYYY or R(4)Y(4)


Pattern Searching Programs

EMBOSS programs; web and Unixfuzznuc,fuzzprot,fuzztrans,dreg

scan_for_matches patfile < inputfilePatscan


Sequence of Note 1 GCGTTGCTGG CGTTTTTCCA TAGGCTCCGC 31 CCCCCTGACG AGCATCACAA AAATCGACGC 61 GGTGGCGAAA CCCGACAGGA CTATAAAGAT…………..1371 GTAAAGTCTG GAAACGCGGA AGTCAGCGCC

“Here you see the actual structure of a smallfragment of dinosaur DNA,” Wu said. “Notice thesequence is made up of four compounds -adenine, guanine, thymine and cytosine. Thisamount of DNA probably contains instructions tomake a single protein - say, a hormone or anenzyme. The full DNA molecule contains threebillion of these bases. If we looked at a screen likethis once a second, for eight hours a day, it’d stilltake more than two years to look at the entireDNA strand. It’s that big.” (page 103)


DinoDNA "Dinosaur DNA" fromCrichton's THE LOST WORLD p. 135

GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGATAAGGACGACGTGTGGCAGCTCCCGCAGAGGATTCACTGGAAGTGCATTACCTATCCCATGGGAGCCATGGAGTTCGTGGCGCTGGGGGGGCCGGATGCGGGCTCCCCCACTCCGTTCCCTGATGAAGCCGGAGCCTTCCTGGGGCTGGGGGGGGGCGAGAGGACGGAGGCGGGGGGGCTGCTGGCCTCCTACCCCCCCTCAGGCCGCGTGTCCCTGGTGCCGTGGCAGACACGGGTACTTTGGGGACCCCCCAGTGGGTGCCGCCCGCCACCCAAATGGAGCCCCCCCACTACCTGGAGCTGCTGCAACCCCCCCGGGGCAGCCCCCCCCATCCCTCCTCCGGGCCCCTACTGCCACTCAGCAGCGCCTGCGGCCTCTACTACAAAC…………………………


>Erythroid transcription factor (NF-E1 DNA-binding protein)

Query: 121 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 300 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLGSbjct: 1 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60

Query: 301 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT 480 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV NCGATSbjct: 61 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT 116

Query: 481 ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT 660 ATPLWRRDGTGHYLCN ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS NCQTSbjct: 117 ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT 169

Query: 661 STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 840 STTTLWRRSPMGDPVCN ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG Sbjct: 170 STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 226

Query: 841 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 1020 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF Sbjct: 227 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 286

Query: 1021 FGGGAGGYTAPPGLSPQI 1074 FGGGAGGYTAPPGLSPQISbjct: 287 FGGGAGGYTAPPGLSPQI 304


>Erythroid transcription factor (NF-E1 DNA-binding protein)

Query: 121 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 300 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLGSbjct: 1 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60

Query: 301 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT 480 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV NCGATSbjct: 61 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT 116

Query: 481 ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT 660 ATPLWRRDGTGHYLCN ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS NCQTSbjct: 117 ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT 169

Query: 661 STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 840 STTTLWRRSPMGDPVCN ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPGSbjct: 170 STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 226

Query: 841 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 1020 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGFSbjct: 227 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 286

Query: 1021 FGGGAGGYTAPPGLSPQI 1074 FGGGAGGYTAPPGLSPQISbjct: 287 FGGGAGGYTAPPGLSPQI 304


References1. Class web site:• http://jura.wi.mit.edu/bio/education/bioinfo2005

2. Books• David Mount - Bioinformatics: Sequence and

Genome Analysis, 2nd edition, CSHL Press, 2004.• Andreas D. Baxevanis and B. F. Francis Ouellette

(editors) - Bioinformatics: A Practical Guide to theAnalysis of Genes and Proteins, 3rd edition, Wiley,2004.

Documents

Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,