Doug Raiford
Lesson 5: PSI-BLAST and Multiple Sequence Alignments
Dynamic programming methods
• Needleman-Wunsch (global alignment)
• Smith-Waterman (local alignment)
BLAST
Fixed: best
Linear: next best
Polynomial (n^2): not bad
Exponential (3^n): very bad
BLAST is fast (linear), but not as sensitive
Trade-off: speed vs. sensitivity
Similarity matrix
• Especially useful with amino acids
• Some amino acids have similar chemical characteristics
• Similarity of each query word to all 8,000 possible 3-mers (20^3) is calculated
• Usually ~50 are above a threshold
• All of these ~50 are considered hits when searching
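The neighborhood-word idea above can be sketched in a few lines. This is a minimal illustration only: `toy_score` and its chemical groupings are assumptions made for this example, not PAM or BLOSUM values, so the number of hits it produces will differ from the ~50 quoted for a real matrix.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical chemical groupings, chosen only for illustration.
GROUPS = ["AVLIM", "FWY", "ST", "NQ", "DE", "KRH", "C", "G", "P"]
GROUP_OF = {aa: idx for idx, group in enumerate(GROUPS) for aa in group}


def toy_score(a, b):
    """Toy similarity: identical > same chemical group > unrelated."""
    if a == b:
        return 5
    if GROUP_OF[a] == GROUP_OF[b]:
        return 2
    return -2


def neighborhood(word, threshold):
    """Score `word` against every possible word of the same length and
    keep those scoring at or above `threshold`, best first."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(x, y) for x, y in zip(word, candidate))
        if score >= threshold:
            hits.append(("".join(candidate), score))
    return sorted(hits, key=lambda hit: -hit[1])


if __name__ == "__main__":
    for neighbor, score in neighborhood("LKD", 9)[:5]:
        print(neighbor, score)
```

With a 3-letter word the loop visits all 8,000 candidates, which is why the neighborhood can be precomputed cheaply for every word in the query.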
Matrices
• PAM (Point Accepted Mutation): built from observed substitution rates in closely related proteins
• BLOSUM (BLOcks SUbstitution Matrix): built from observed substitution rates in evolutionarily divergent proteins
PSI-BLAST (Position-Specific Iterated BLAST)
• Align using the default similarity matrix
• At each query position, build a Position-Specific Scoring Matrix (PSSM) from the observed search and alignment results
• Repeat with the new matrix until the results no longer change
• Builds sensitivity by specifying the allowed similarity at each position
• Slower, but still faster than local alignment (Smith-Waterman)
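The PSSM-building step can be sketched as follows. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes the hits are already aligned to the query with no gaps, uses a uniform background frequency, and adds a flat pseudocount. The names `build_pssm` and `score_segment` are hypothetical, and the database search that would be repeated with the new matrix is left out.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits, pseudocount=1.0):
    """Return one dict per query position mapping residue -> log2-odds score.

    Assumes equal-length, gap-free hits and a uniform background (1/20).
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for pos in range(len(aligned_hits[0])):
        counts = Counter(seq[pos] for seq in aligned_hits)
        total = len(aligned_hits) + pseudocount * len(AMINO_ACIDS)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of `segment` against the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    hits = ["MKVLA", "MKILA", "MRVLG", "MKVLA"]  # made-up example hits
    pssm = build_pssm(hits)
    print(round(score_segment(pssm, "MKVLA"), 2))
    print(round(score_segment(pssm, "MRILG"), 2))
```

Because each position gets its own column of scores, a residue that is tolerated at one position can be penalized at another, which a single similarity matrix cannot express.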
PSI-BLAST
Multiple sequence alignment is central to bioinformatics
Needed for
• Phylogeny
• Protein function
• Protein structure
  ▪ Structure → function
• Drug discovery
Some parts of proteins are very important for maintaining function
• Must be similar from species to species
• Can we spot these regions through alignment?
atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag
acctcgatacgtgccgcaggagatcaggactttcacct--tggatcatgcgaccgtacctac
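One simple way to spot a conserved region in the pairwise alignment above is to look for the longest run of identical, ungapped columns. The sketch below does only that; the function name `longest_conserved_run` is hypothetical, and real tools typically use softer conservation scores than exact identity.

```python
SEQ_A = "atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag"
SEQ_B = "acctcgatacgtgccgcaggagatcaggactttcacct--tggatcatgcgaccgtacctac"


def longest_conserved_run(a, b):
    """Return (start, end) of the longest stretch of identical,
    non-gap columns in two aligned sequences (end is exclusive)."""
    best = (0, 0)
    start = None
    for i, (x, y) in enumerate(zip(a, b)):
        if x == y and x != "-":
            if start is None:
                start = i
            if i + 1 - start > best[1] - best[0]:
                best = (start, i + 1)
        else:
            start = None
    return best


if __name__ == "__main__":
    start, end = longest_conserved_run(SEQ_A, SEQ_B)
    print(start, end, SEQ_A[start:end])
```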
Conserved regions are often near
• Active sites
• Ligand binding sites (docking)
• Protein-to-protein interfaces
• Regions important for tertiary structure
Ligand: small molecule, target of a protein, e.g. O2 is the ligand for hemoglobin
Substrate: a molecule upon which an enzyme acts
What if we look at more proteins to increase our confidence?
But how do we go about performing multiple sequence alignment?
atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag
acctccatacgtgccccaggagatctggactttcacc---tggatcatgcgaccgtacctac
t-atgg-t-cgtgccgcaggagatcaggactttca-gt--g-aatcatctgg-cgc--c-aa
t--tcgt-ac-tgccccaggagatctggactttcaaa---ca-atcatgcgcc-g-tc-tat
aattccgtacgtgccgcaggagatcaggactttcag-t--a-tatcatctgtc-ggc--tag
Hyper-dimensional dynamic programming
• Becomes exponential with respect to the number of sequences
• O(n^L), with n = sequence length and L = number of sequences
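To make the O(n^L) growth concrete, the short sketch below counts the cells in the L-dimensional table and the number of predecessor cells each one consults. It assumes all sequences have the same length n; the function names are hypothetical.

```python
def dp_cells(n, num_seqs):
    """Cells in the DP table for `num_seqs` sequences of length `n`
    (each axis has n + 1 entries, including the empty prefix)."""
    return (n + 1) ** num_seqs


def predecessors_per_cell(num_seqs):
    """Each cell looks back along every non-empty subset of axes."""
    return 2 ** num_seqs - 1


if __name__ == "__main__":
    for num_seqs in range(2, 6):
        print(num_seqs, dp_cells(100, num_seqs), predecessors_per_cell(num_seqs))
```

Even at n = 100, going from two sequences to five takes the table from about ten thousand cells to about ten billion, which is why exact multiple alignment is rarely attempted directly.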
Progressive alignment
• Determine all pairwise distances
  ▪ Fast: number of l-mer matches
  ▪ Slower: full global alignments
• Start with the closest pair and align them
• Then align the next closest sequence to those two
• And so on
ClustalW: cluster-alignment
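The ordering part of a progressive approach can be sketched with the fast l-mer distance. This is a simplified illustration and not ClustalW's actual guide-tree construction: it only decides the order in which sequences would be added, does not perform the alignments themselves, and the distance formula and function names are assumptions made for this example.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Set of all length-k substrings (l-mers) of `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """1 minus the fraction of shared l-mers, relative to the smaller set."""
    ka, kb = kmers(a, k), kmers(b, k)
    return 1.0 - len(ka & kb) / min(len(ka), len(kb))


def progressive_order(seqs, k=3):
    """Start with the closest pair, then repeatedly add the sequence
    closest to anything already chosen. `seqs` maps name -> sequence."""
    names = list(seqs)
    dist = {
        frozenset(pair): kmer_distance(seqs[pair[0]], seqs[pair[1]], k)
        for pair in combinations(names, 2)
    }
    order = sorted(min(dist, key=dist.get))
    remaining = [name for name in names if name not in order]
    while remaining:
        nearest = min(
            remaining,
            key=lambda n: min(dist[frozenset((n, m))] for m in order),
        )
        order.append(nearest)
        remaining.remove(nearest)
    return order


if __name__ == "__main__":
    example = {  # made-up sequences
        "a": "ACGTACGTAC",
        "b": "ACGTACGTAG",
        "c": "TTTTGGGGCC",
        "d": "ACGTTTTTGG",
    }
    print(progressive_order(example))
```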
Profile: a matrix of real values representing the probability of each amino acid at each position in a corresponding multiple sequence alignment
• Aligning to a profile is a modification of the Smith-Waterman algorithm
• The degree to which an amino acid is preferred is the degree of match between the profile and the sequence
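A minimal sketch of the Smith-Waterman modification: the substitution-matrix lookup is replaced by a lookup in the profile column for the current position. The profile here is assumed to hold per-position scores (for example log-odds derived from the probabilities) rather than raw probabilities, and the linear gap penalty, the default score for unlisted residues, and the function name are all assumptions made for illustration.

```python
def profile_local_align(profile, seq, gap=-2.0, default=-1.0):
    """Best local alignment score of `seq` against `profile`.

    `profile` is a list of dicts (one per position) mapping residue -> score.
    Residues missing from a column receive `default`.
    """
    rows, cols = len(profile) + 1, len(seq) + 1
    table = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            # Position-specific score instead of a fixed similarity matrix.
            match = table[i - 1][j - 1] + profile[i - 1].get(seq[j - 1], default)
            table[i][j] = max(
                0.0,                      # local alignment: restart allowed
                match,
                table[i - 1][j] + gap,    # gap in the sequence
                table[i][j - 1] + gap,    # gap in the profile
            )
            best = max(best, table[i][j])
    return best


if __name__ == "__main__":
    toy_profile = [  # made-up scores
        {"M": 2.0},
        {"K": 1.5, "R": 1.0},
        {"V": 2.0, "I": 1.5},
    ]
    print(profile_local_align(toy_profile, "GGMKVGG"))
    print(profile_local_align(toy_profile, "GGMRIGG"))
```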
Consensus   1 M.ERS.HLPEG.PFAAALSGARFAAQSSGN.ASVL..DWNVLP.E 38
              | : : : || : ::::: : |: | ::|: : | :
OPSD_XENLA  1 MNG.GTE..EGPN.NFYVP.PMS...SN.NKTGVVRSP.P..PFD 33
Mistakes made early in a progressive approach are propagated throughout the process
• Once aligned, sequences are not revisited
• Iterative methods were devised to revisit earlier alignments
• The newest version of ClustalW (version 2) includes iteration
Other MSA apps
• T-Coffee
• PSalign
• DIALIGN
• MUSCLE
Sequence logo: the height of each letter represents how prevalent that letter is at that position
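Letter heights for one alignment column can be computed as sketched below. This assumes the common convention in which the whole stack is as tall as the column's information content (in bits) and each letter takes a share proportional to its frequency. It ignores gaps and the small-sample correction, and the function name is hypothetical.

```python
import math
from collections import Counter


def logo_heights(column, alphabet_size=4):
    """Map each residue in `column` to its letter height in bits.

    alphabet_size is 4 for DNA and 20 for protein. Gaps ('-') are skipped.
    """
    residues = [c for c in column if c != "-"]
    counts = Counter(residues)
    total = len(residues)
    freqs = {res: n / total for res, n in counts.items()}
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    information = math.log2(alphabet_size) - entropy
    return {res: p * information for res, p in freqs.items()}


if __name__ == "__main__":
    print(logo_heights("ggggg"))  # fully conserved column
    print(logo_heights("gcggc"))  # partly conserved column
    print(logo_heights("acgt"))   # no preference at all
```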
Database Searches
Scores are affected by sequence lengths
• To compare scores across different query lengths, they need to be normalized
The term “bit” comes from the fact that probabilities are stored as log2 values (binary, bit)
• Done so that values can be added across the length of the sequence instead of multiplied
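Both points can be sketched briefly. The first function shows that summing log2 values gives the same result as taking log2 of the product. The second applies the standard Karlin-Altschul normalization, bit score = (lambda * S - ln K) / ln 2, where lambda and K depend on the scoring system. The parameter values in the usage lines are placeholders chosen for illustration, not values taken from the slides.

```python
import math


def log2_sum(probabilities):
    """Add log2 values instead of multiplying raw probabilities."""
    return sum(math.log2(p) for p in probabilities)


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score to bits (Karlin-Altschul form)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    probs = [0.5, 0.25, 0.125]
    print(log2_sum(probs), math.log2(0.5 * 0.25 * 0.125))
    # lam and k below are illustrative placeholders only.
    print(round(bit_score(100, lam=0.3, k=0.1), 1))
```

Working in log space also avoids numerical underflow, since the product of many small probabilities quickly becomes too small to represent.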