PSI-BLAST and Multiple Sequence Alignments

Doug Raiford, Lesson 5

Dynamic programming methods
• Needleman-Wunsch (global alignment)
• Smith-Waterman (local alignment)

BLAST

• Fixed: best
• Linear: next best
• Polynomial (n^2): not bad
• Exponential (3^n): very bad

• BLAST is fast (linear)
• But not as sensitive

[Figure: speed versus sensitivity]

Similarity matrix
• Especially with amino acids
• Some amino acids have similar chemical characteristics

• Similarity of each query 3-mer to all 8,000 possible 3-mers (20^3) is calculated
• Usually ~50 are above a threshold
• All of these ~50 are considered hits when searching
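The neighborhood idea above can be sketched in a few lines. This is only an illustration: the scoring function `toy_score` and its chemical groupings are hypothetical stand-ins for a real PAM or BLOSUM matrix, and the threshold is arbitrary, so the number of hits will not match the ~50 quoted for real searches.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical toy grouping by rough chemical character; a real search
# would use a PAM or BLOSUM matrix instead.
GROUPS = ["AVLIM", "FWY", "ST", "NQ", "DE", "KRH", "C", "G", "P"]
GROUP_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}


def toy_score(a, b):
    """Toy similarity: identical > same chemical group > unrelated."""
    if a == b:
        return 5
    if GROUP_OF[a] == GROUP_OF[b]:
        return 2
    return -2


def neighborhood(word, threshold):
    """Return every 3-mer whose summed score against `word` meets the threshold."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append("".join(candidate))
    return hits


all_words = 20 ** 3                        # 8,000 possible 3-mers
neighbors = neighborhood("LKD", threshold=9)
print(all_words, len(neighbors))
```

Every word in `neighbors` would be treated as a hit for the query word "LKD", which is how similar but non-identical database words can still seed an alignment.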

Matrices
• PAM (Point Accepted Mutation): built from observed substitution rates in closely related proteins
• BLOSUM (BLOcks SUbstitution Matrix): built from observed substitution rates in evolutionarily divergent proteins

PSI-BLAST (Position Specific Iterative)

Align using default similarity matrix

At each query location build a Position Specific Scoring Matrix (PSSM) based upon observed search and alignment results

Repeat with new matrix until results no longer change

Build sensitivity by specifying allowed similarity at each position

Slower, but still faster than local alignment
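A minimal sketch of the PSSM-building step, under simplifying assumptions: the hits are already aligned to the query without insertions, the background frequency is a uniform 1/20, and a flat pseudocount is used. The function names `build_pssm` and `score` are made up for this example; the actual PSI-BLAST procedure is more elaborate (for instance, it weights sequences and uses data-derived pseudocounts).

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits, pseudocount=1.0):
    """Build a log-odds PSSM (one dict per column) from equal-length aligned hits.

    Assumes a uniform background of 1/20 per amino acid; gap characters
    ('-') are ignored when counting.
    """
    background = 1.0 / len(AMINO_ACIDS)
    length = len(aligned_hits[0])
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_hits if seq[col] != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(pssm[i][aa] for i, aa in enumerate(segment))


# Hypothetical hits from a first search round, aligned to a 5-residue query.
hits = ["MKVLA", "MRVLA", "MKILS", "MKVLA"]
pssm = build_pssm(hits)
print(round(score(pssm, "MKVLA"), 2), round(score(pssm, "WWWWW"), 2))
```

In an iterative search, the matrix from one round would replace the default similarity matrix in the next round, and the loop would stop once the set of hits no longer changes.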

PSI-BLAST

Central to bioinformatics

Needed for
• Phylogeny
• Protein function
• Protein structure
  ▪ Structure → function
• Drug discovery

Some parts of proteins are very important to maintain function

• Must be similar from species to species
• Can we spot these regions through alignment?

atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag
acctcgatacgtgccgcaggagatcaggactttcacct--tggatcatgcgaccgtacctac

Conserved regions are often
• Near active sites
• Ligand binding sites (docking)
• Protein-to-protein interfaces
• Important regions for tertiary structure

Ligand: a small molecule that is the target of a protein, e.g. O2 is the ligand for hemoglobin
Substrate: a molecule upon which an enzyme acts

What if we look at more proteins?
• Increase our confidence?
• But how to go about performing multiple sequence alignment?

atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag
acctccatacgtgccccaggagatctggactttcacc---tggatcatgcgaccgtacctac
t-atgg-t-cgtgccgcaggagatcaggactttca-gt--g-aatcatctgg-cgc--c-aa
t--tcgt-ac-tgccccaggagatctggactttcaaa---ca-atcatgcgcc-g-tc-tat
aattccgtacgtgccgcaggagatcaggactttcag-t--a-tatcatctgtc-ggc--tag
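One way to spot conserved regions in an alignment like the one above is to flag the columns where every sequence carries the same base. The sketch below uses the five aligned sequences from this slide; the helper names and the minimum run length of 4 are arbitrary choices made for the example.

```python
alignment = [
    "atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag",
    "acctccatacgtgccccaggagatctggactttcacc---tggatcatgcgaccgtacctac",
    "t-atgg-t-cgtgccgcaggagatcaggactttca-gt--g-aatcatctgg-cgc--c-aa",
    "t--tcgt-ac-tgccccaggagatctggactttcaaa---ca-atcatgcgcc-g-tc-tat",
    "aattccgtacgtgccgcaggagatcaggactttcag-t--a-tatcatctgtc-ggc--tag",
]


def conserved_columns(rows):
    """Return one boolean per column: True where every row has the same non-gap base."""
    return [len(set(col)) == 1 and col[0] != "-" for col in zip(*rows)]


def conserved_runs(flags, min_length=4):
    """Return (start, end) index pairs (end exclusive) of conserved runs."""
    runs, start = [], None
    for i, flag in enumerate(flags + [False]):   # sentinel closes a final run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    return runs


flags = conserved_columns(alignment)
for start, end in conserved_runs(flags):
    print(start, end, alignment[0][start:end])
```

Requiring perfect agreement is the strictest possible criterion; with more sequences a softer measure (for example, the fraction of rows sharing the most common base) is usually more practical.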

Hyper-dimensional dynamic programming
• Becomes exponential with respect to number of sequences
• O(n^L), with n = sequence length and L = number of sequences
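To make the exponential growth concrete, the following sketch counts the cells in a full dynamic programming lattice for several sequences of equal length. The function name `dp_cells` and the length of 300 are illustrative choices; each dimension has n + 1 positions because of the leading gap row or column.

```python
def dp_cells(n, num_sequences):
    """Number of cells in a full DP lattice for `num_sequences` sequences of length n."""
    return (n + 1) ** num_sequences


# Table size for sequences of length 300 as more sequences are added.
for count in (2, 3, 5, 10):
    print(count, dp_cells(300, count))
```

Two sequences need on the order of 10^5 cells, while ten sequences of the same length would need on the order of 10^24, which is why exact multi-dimensional alignment is impractical beyond a handful of sequences.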

• Determine all pair-wise distances
  ▪ Fast: number of l-mer matches
  ▪ Slower: full global alignments
• Start with the closest pair and align them
• Then align the next closest to those two
• And so on...
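The ordering step of the progressive approach can be sketched as follows, using the fast l-mer option for distances. This is a simplified illustration, not ClustalW's actual procedure (ClustalW builds a guide tree); the Jaccard-style distance, the greedy joining rule, and the example sequences are all assumptions made for the sketch, and the alignment step itself is omitted.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """1 - Jaccard similarity of the two k-mer sets (0 = identical sets)."""
    ka, kb = kmers(a, k), kmers(b, k)
    return 1.0 - len(ka & kb) / len(ka | kb)


def join_order(seqs, k=3):
    """Greedy order: closest pair first, then repeatedly the sequence
    nearest (by minimum distance) to anything already joined."""
    names = list(seqs)
    dist = {frozenset(pair): kmer_distance(seqs[pair[0]], seqs[pair[1]], k)
            for pair in combinations(names, 2)}
    first = min(dist, key=dist.get)
    order = sorted(first)
    remaining = [n for n in names if n not in first]
    while remaining:
        nxt = min(remaining,
                  key=lambda n: min(dist[frozenset((n, m))] for m in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical input sequences.
seqs = {
    "A": "ACGTACGTGG",
    "B": "ACGTACGTGC",
    "C": "ACGTTTGTGC",
    "D": "TTTTGGGGCC",
}
print(join_order(seqs))
```

Counting shared l-mers avoids running a full global alignment for every pair, which is the speed advantage of the fast option.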

ClustalW: cluster-alignment

Profile: matrix of real values, representing the probability of amino acids at each position in a corresponding multiple sequence alignment

A modification of the Smith/Waterman algorithm
• Degree to which an aa is preferred is the degree of match between the profile and the sequence

Consensus   1 M.ERS.HLPEG.PFAAALSGARFAAQSSGN.ASVL..DWNVLP.E 38
              | : : : || : ::::: : |: | ::|: : | :
OPSD_XENLA  1 MNG.GTE..EGPN.NFYVP.PMS...SN.NKTGVVRSP.P..PFD 33

• Mistakes made early in a progressive approach are propagated throughout the process
• Once aligned, sequences are not revisited
• Iterative methods were devised to revisit earlier alignments
• Newest version of ClustalW (version 2) includes iteration

Other MSA apps
• T-Coffee
• PSalign
• DIALIGN
• MUSCLE

Height of letter represents how prevalent that letter is at that position
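A sketch of how letter heights could be computed for one alignment column, assuming the usual sequence-logo convention (this convention is an assumption here, since the slide only says that height reflects prevalence): the column's information content is log2 of the alphabet size minus its entropy, and each letter's height is its frequency times that information content. No small-sample correction is applied.

```python
import math
from collections import Counter


def letter_heights(column, alphabet_size=4):
    """Heights (in bits) of each letter in one alignment column.

    Column information R = log2(alphabet_size) - entropy, and each
    letter gets frequency * R. Gaps ('-') are ignored.
    """
    counts = Counter(c for c in column if c != "-")
    total = sum(counts.values())
    if total == 0:                      # column made only of gaps
        return {}
    freqs = {c: n / total for c, n in counts.items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    info = math.log2(alphabet_size) - entropy
    return {c: f * info for c, f in freqs.items()}


print(letter_heights("aaaaa"))   # fully conserved DNA column
print(letter_heights("aatat"))   # partly conserved column
print(letter_heights("acgt"))    # no preference at all
```

Under this convention a fully conserved DNA column yields one tall letter, while a column with no preference yields letters of essentially zero height.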

Database Searches

• Scores are affected by sequence lengths
• If we want scores that can be compared across different query lengths, we need to normalize
• The term “bit” comes from the fact that probabilities are stored as log2 values (binary, bit)
  ▪ Done so that scores can be added across the length of the sequence instead of multiplied
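The two points above can be illustrated with a short sketch. The first function shows that summing log2 values gives the same result as taking log2 of the product. The second applies the standard BLAST normalization S' = (λS − ln K) / ln 2; the values passed for `lam` and `k` below are placeholders chosen only so the example runs, since the real parameters depend on the scoring system.

```python
import math


def log2_sum_score(probabilities):
    """Sum of log2 values; equals log2 of the product of the probabilities."""
    return sum(math.log2(p) for p in probabilities)


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


probs = [0.5, 0.25, 0.125]
# Adding logs and multiplying probabilities agree.
print(log2_sum_score(probs), math.log2(math.prod(probs)))

# lam and k are placeholder values, not parameters of any real matrix.
print(round(bit_score(100, lam=0.3, k=0.1), 1))
```

Working in log space also avoids numerical underflow, since the product of many small probabilities quickly becomes too small to represent.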
