Doug Raiford Lesson 5

PSI-BLAST and Multiple Sequence Alignments

Page 1: PSI-BLAST and Multiple Sequence Alignments

Doug Raiford Lesson 5

Page 2: PSI-BLAST and Multiple Sequence Alignments

Left off…
• Dynamic programming methods
  ▪ Needleman-Wunsch (global alignment)
  ▪ Smith-Waterman (local alignment)
• BLAST

Time complexity:
• Fixed: best
• Linear: next best
• Polynomial (n^2): not bad
• Exponential (3^n): very bad

Page 3: PSI-BLAST and Multiple Sequence Alignments

• BLAST is fast (linear)
• But not as sensitive

[Figure: speed versus sensitivity]

Page 4: PSI-BLAST and Multiple Sequence Alignments

Similarity matrix
• Especially useful with amino acids
  ▪ Some amino acids have similar chemical characteristics
• Similarity to all 8,000 possible 3-mers is calculated
  ▪ Usually ~50 are above a threshold
  ▪ All of these ~50 are considered hits when searching

Matrices
• PAM (Point Accepted Mutation): built from observed substitution rates in closely related proteins
• BLOSUM (BLOcks SUbstitution Matrix): built from observed substitution rates in evolutionarily divergent proteins
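
The neighborhood-word step on this slide (score every possible 3-mer against a query 3-mer, keep those above a threshold) might be sketched as follows. The scoring function `toy_score`, the chemical groupings, and the threshold are illustrative assumptions rather than a real PAM or BLOSUM matrix, so the number of hits will not match the ~50 quoted above.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Hypothetical chemical groupings, used only to build a toy similarity score.
GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P"]
GROUP_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}

def toy_score(a, b):
    """Assumed scores: identical = 5, same chemical group = 1, otherwise -2."""
    if a == b:
        return 5
    return 1 if GROUP_OF[a] == GROUP_OF[b] else -2

def neighborhood(word, threshold):
    """Return every k-mer whose summed per-position score reaches the threshold."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append("".join(candidate))
    return hits

all_words = ["".join(w) for w in product(AMINO_ACIDS, repeat=3)]
neighbors = neighborhood("LKV", threshold=11)
print(len(all_words), len(neighbors))
```

With a real substitution matrix the same loop applies; only `toy_score` would be swapped for a matrix lookup.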

Page 5: PSI-BLAST and Multiple Sequence Alignments

PSI-BLAST (Position-Specific Iterated BLAST)
• Align using the default similarity matrix
• At each query position, build a Position-Specific Scoring Matrix (PSSM) based on the observed search and alignment results
• Repeat with the new matrix until the results no longer change
• Builds sensitivity by specifying the allowed similarity at each position
• Slower than BLAST, but still faster than a full local alignment
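
As a rough illustration of the PSSM-building step, the sketch below turns a set of aligned hits into per-position log-odds scores. It assumes gap-free hits of equal length, a uniform background frequency, and a simple additive pseudocount; PSI-BLAST itself uses sequence weighting and matrix-derived pseudocounts, and the names `build_pssm` and `score` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_hits, pseudocount=1.0):
    """Per-position log2-odds scores from equal-length, gap-free hits.

    Assumptions: uniform background (1/20) and an additive pseudocount.
    """
    length = len(aligned_hits[0])
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned_hits) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_hits)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm

def score(pssm, sequence):
    """Sum the position-specific score of each residue in the sequence."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))

hits = ["MKVL", "MKIL", "MRVL", "MKVL"]  # hypothetical hits from one search round
pssm = build_pssm(hits)
print(round(score(pssm, "MKVL"), 2), round(score(pssm, "GGGG"), 2))
```

In an iterative search, the matrix from one round would be used to score the database in the next round, stopping once the set of hits no longer changes.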

Page 6: PSI-BLAST and Multiple Sequence Alignments

Central to bioinformatics

Needed for:
• Phylogeny
• Protein function
• Protein structure
  ▪ Structure function
• Drug discovery

Page 7: PSI-BLAST and Multiple Sequence Alignments

• Some parts of proteins are very important for maintaining function
  ▪ These must be similar from species to species
• Can we spot these regions through alignment?

atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag
acctcgatacgtgccgcaggagatcaggactttcacct--tggatcatgcgaccgtacctac
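
One simple way to spot a conserved region in a pairwise alignment is to look for the longest run of identical columns. The sketch below assumes the alignment above consists of two rows of 62 columns each; the function name `longest_conserved_run` is hypothetical.

```python
def longest_conserved_run(row_a, row_b):
    """Return (start, length) of the longest run of identical, non-gap columns."""
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, (a, b) in enumerate(zip(row_a, row_b)):
        if a == b and a != "-":
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    return best_start, best_len

# The two alignment rows from the slide, assumed to be 62 columns each.
row_a = "atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag"
row_b = "acctcgatacgtgccgcaggagatcaggactttcacct--tggatcatgcgaccgtacctac"

start, length = longest_conserved_run(row_a, row_b)
print(start, length, row_a[start:start + length])
```

For these two rows the longest run covers 24 columns starting at column 11 (0-based): `tgccgcaggagatcaggactttca`.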

Page 8: PSI-BLAST and Multiple Sequence Alignments

Conserved regions are often near:
• Active sites
• Ligand binding sites (docking)
• Protein-to-protein interfaces
• Regions important for tertiary structure

Ligand: a small molecule that is the target of a protein, e.g. O2 is the ligand for hemoglobin
Substrate: a molecule upon which an enzyme acts

Page 9: PSI-BLAST and Multiple Sequence Alignments

• What if we look at more proteins?
  ▪ Would that increase our confidence?
• But how do we go about performing multiple sequence alignment?

atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag
acctccatacgtgccccaggagatctggactttcacc---tggatcatgcgaccgtacctac
t-atgg-t-cgtgccgcaggagatcaggactttca-gt--g-aatcatctgg-cgc--c-aa
t--tcgt-ac-tgccccaggagatctggactttcaaa---ca-atcatgcgcc-g-tc-tat
aattccgtacgtgccgcaggagatcaggactttcag-t--a-tatcatctgtc-ggc--tag

Page 10: PSI-BLAST and Multiple Sequence Alignments

Hyper-dimensional dynamic programming
• Becomes exponential with respect to the number of sequences
• O(n^L), with n = sequence length and L = number of sequences
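
To get a feel for that growth, the short sketch below counts the cells in a full dynamic-programming lattice, assuming every sequence has the same length n and each dimension has n + 1 entries (one extra for the leading gap row or column). The function name `dp_cells` is hypothetical.

```python
def dp_cells(n, num_sequences):
    """Cells in a full dynamic-programming lattice: (n + 1) per dimension."""
    return (n + 1) ** num_sequences

# Sequence length 100, increasing numbers of sequences.
for num_sequences in (2, 3, 5, 10):
    print(num_sequences, dp_cells(100, num_sequences))
```

Two sequences of length 100 need about ten thousand cells; three already need about a million.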

Page 11: PSI-BLAST and Multiple Sequence Alignments

ClustalW: cluster-alignment
• Determine all pair-wise distances
  ▪ Fast: number of l-mer matches
  ▪ Slower: full global alignments
• Start with the closest pair and align them
• Then align the next closest sequence to those two
• And so on
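
The ordering part of the steps above can be sketched as follows. This is a greedy simplification, not ClustalW's actual guide-tree construction: it only decides the order in which sequences would be added, using an assumed distance of 1 minus the Jaccard similarity of the two l-mer sets. The function names and toy sequences are hypothetical.

```python
from itertools import combinations

def kmers(seq, k=3):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_distance(a, b, k=3):
    """Assumed distance: 1 - Jaccard similarity of the two k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    return 1.0 - len(ka & kb) / len(ka | kb)

def progressive_order(seqs, k=3):
    """Order in which a greedy progressive aligner would add the sequences."""
    names = list(seqs)
    dist = {frozenset(pair): kmer_distance(seqs[pair[0]], seqs[pair[1]], k)
            for pair in combinations(names, 2)}
    # Start with the closest pair.
    first = min(combinations(names, 2), key=lambda pair: dist[frozenset(pair)])
    order = list(first)
    remaining = [name for name in names if name not in order]
    while remaining:
        # Next: the sequence closest to anything already aligned.
        nxt = min(remaining,
                  key=lambda n: min(dist[frozenset((n, m))] for m in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

seqs = {
    "s1": "ATGCCGCAGGAGATC",
    "s2": "ATGCCGCAGGAGATG",
    "s3": "ATGCCGTTGGAGATC",
    "s4": "TTTTTTTTAAAAAAA",
}
print(progressive_order(seqs))
```

A full implementation would replace the l-mer distance with global alignment scores where accuracy matters more than speed, and would align each new sequence against the growing alignment.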

Page 12: PSI-BLAST and Multiple Sequence Alignments

• Profile: a matrix of real values representing the probability of each amino acid at each position in a corresponding multiple sequence alignment
• Sequences are aligned to a profile with a modification of the Smith-Waterman algorithm
  ▪ The degree to which an amino acid is preferred at a position is the degree of match between the profile and the sequence

Consensus   1 M.ERS.HLPEG.PFAAALSGARFAAQSSGN.ASVL..DWNVLP.E 38
              | : : : || : ::::: : |: | ::|: : | :
OPSD_XENLA  1 MNG.GTE..EGPN.NFYVP.PMS...SN.NKTGVVRSP.P..PFD 33

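
A minimal sketch of the profile idea: build per-column residue probabilities from a multiple sequence alignment, read off a consensus, and score a sequence by how strongly the profile prefers each of its residues. The toy alignment and function names are hypothetical, and the scoring is gap-free; it does not implement the modified Smith-Waterman alignment mentioned above.

```python
from collections import Counter

def build_profile(msa):
    """Per-column residue probabilities; gap characters ('-' or '.') are skipped."""
    profile = []
    for column in zip(*msa):
        counts = Counter(c for c in column if c not in "-.")
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile

def consensus(profile):
    """Most probable residue per column ('.' where a column holds only gaps)."""
    return "".join(max(col, key=col.get) if col else "." for col in profile)

def profile_score(profile, seq):
    """Sum, over positions, of the profile probability of the residue seen there."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, seq))

msa = ["MKVLA", "MKILA", "MRVLG", "MKVLA"]  # hypothetical toy alignment
profile = build_profile(msa)
print(consensus(profile),
      profile_score(profile, "MKVLA"),
      profile_score(profile, "MRILG"))
```

A sequence that follows the majority residue in every column scores highest; each departure from the preferred residue lowers the score in proportion to how rare that residue is in the column.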

Page 13: PSI-BLAST and Multiple Sequence Alignments

• Mistakes made early in a progressive approach are propagated throughout the process
  ▪ Once aligned, sequences are not revisited
• Iterative methods were devised to revisit earlier alignments
• The newest version of ClustalW (version 2) includes iteration

Other MSA apps
• T-Coffee
• PSalign
• DIALIGN
• MUSCLE

Page 14: PSI-BLAST and Multiple Sequence Alignments

Height of letter represents how prevalent that letter is at that position

Page 15: PSI-BLAST and Multiple Sequence Alignments
Page 16: PSI-BLAST and Multiple Sequence Alignments

• Scores are affected by sequence lengths
  ▪ To compare scores across different query lengths, the scores need to be normalized
• The term “bit” comes from the fact that probabilities are stored as log2 values (binary, bit)
  ▪ This is done so that values can be added across the length of the sequence instead of multiplied
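
The point about adding instead of multiplying can be checked with a few lines. The per-position odds ratios below are made-up values chosen only for illustration.

```python
import math

# Hypothetical per-position odds ratios (observed / expected by chance).
odds = [4.0, 2.0, 0.5, 8.0]

# Multiplying the raw odds ...
product_of_odds = math.prod(odds)

# ... gives the same result as adding their log2 values (scores in bits).
bits_per_position = [math.log2(o) for o in odds]
total_bits = sum(bits_per_position)

print(bits_per_position, total_bits, math.log2(product_of_odds))
```

Working in log space also avoids the very small products that arise when many probabilities are multiplied along a long sequence.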