
Bio2 - The University of Edinburgh · Lecture 3: Heuristics, Databases; Multiple Sequence Alignment; Protein Families and Databases


Armstrong, 2005 Bioinformatics 2

Bio2

Lecture 3

Heuristics, Databases; Multiple Sequence Alignment; Protein Families and Databases

Heuristic Methods

• FASTA

• BLAST

• Gapped BLAST

• PSI-BLAST

Assumptions for Heuristic Approaches

• Even linear time complexity is a problem for large genomes

• Databases can often be pre-processed to a degree

• Substitutions are more likely than gaps

• Homologous sequences contain many substitutions without gaps; these ungapped stretches can be used to help find start points in alignments

FASTA

Pearson and Lipman (1988) Improved tools for biological sequence comparison. PNAS 85: 2444-2448

• Compares a query string against a single text string (so searching a sequence database means many such comparisons)

• Based on the assumption that a good local alignment is likely to have some exact matching subsequences

• The algorithm looks for these subsequences first.

Dot-plot alignment

• We can find good subsequences just by looking for diagonal runs of matched bases:

  rows (top to bottom):    c t t g c c t g g a
  columns (left to right): g t g c c c t g a a

• Mark identical hits:

  . . . * * * . . . .  c
  . * . . . . * . . .  t
  . * . . . . * . . .  t
  * . * . . . . * . .  g
  . . . * * * . . . .  c
  . . . * * * . . . .  c
  . * . . . . * . . .  t
  * . * . . . . * . .  g
  * . * . . . . * . .  g
  . . . . . . . . * *  a
  g t g c c c t g a a

• Find diagonal runs:

  [Figure: the same dot plot with the diagonal runs of hits highlighted]

• Compare to the DP alignment:

  [Figure: the same dot plot with the dynamic programming alignment path overlaid]
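The dot plot and its diagonal runs can be computed directly. The sketch below is only an illustration of the idea on the two sequences from the slide; the function names `dot_plot` and `diagonal_runs` and the `min_len` parameter are made up for this example and are not part of FASTA.

```python
def dot_plot(rows, cols):
    """Mark every cell where the row base equals the column base."""
    return [[r == c for c in cols] for r in rows]


def diagonal_runs(rows, cols, min_len=2):
    """Return (row_start, col_start, length) for each maximal exact diagonal run."""
    runs = []
    for i in range(len(rows)):
        for j in range(len(cols)):
            if rows[i] != cols[j]:
                continue
            # Skip cells that continue a run started further up the diagonal.
            if i > 0 and j > 0 and rows[i - 1] == cols[j - 1]:
                continue
            length = 0
            while (i + length < len(rows) and j + length < len(cols)
                   and rows[i + length] == cols[j + length]):
                length += 1
            if length >= min_len:
                runs.append((i, j, length))
    return runs


vertical = "cttgcctgga"    # sequence written down the side of the plot
horizontal = "gtgccctgaa"  # sequence written along the bottom

for base, row in zip(vertical, dot_plot(vertical, horizontal)):
    print(" ".join("*" if hit else "." for hit in row), base)
print(" ".join(horizontal))

for run in diagonal_runs(vertical, horizontal):
    print(run)
```

For these two sequences the longest exact runs have length 4 (tgcc and cctg). Runs are reported with 0-based start positions.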

FASTA Definitions

• ktup:
  – (k respective tuples) an integer value which specifies the word length used to find matching substrings
  – Standard 4-6 for DNA
  – Standard 1 or 2 for proteins
  – Shorter is more sensitive but slower
  – Target databases can be preprocessed into ktup-sized chunks before queries are run.

• hot spots:
  – The matching ktup-length substrings
  – Consecutive hot spots are located along the diagonal
  – See the dot plot for an example of length-4 hot spots
  – Often close to the dynamic programming solution

• diagonal run:
  – A sequence of nearby hot spots on the same diagonal
  – i.e. spaces between hot spots are allowed

• init1:
  – The best scoring run

• initn:
  – The best local alignment
  – Combination of good diagonal runs and indels/gaps between them.

FASTA Process

1. Look for hot spots:

• This stage can be done using a look-up table or a hash.

• Pre-process the database and store the location of each possible ktup (AA: 20^2 = 400 words for ktup = 2; DNA: 4^6 = 4096 words for ktup = 6)

• Move a ktup-sized window along the query sequence and record the position of matching locations in the database.
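The look-up stage can be sketched as follows. This is a minimal illustration using a Python dictionary as the hash, not FASTA's actual implementation; `build_ktup_index`, `find_hot_spots` and `group_by_diagonal` are hypothetical names, and the sequences are the ones from the dot-plot example.

```python
from collections import defaultdict


def build_ktup_index(text, ktup):
    """Map every ktup-length word in `text` to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(text) - ktup + 1):
        index[text[pos:pos + ktup]].append(pos)
    return index


def find_hot_spots(query, index, ktup):
    """Slide a ktup window along the query; return (query_pos, text_pos) pairs."""
    hits = []
    for q in range(len(query) - ktup + 1):
        for t in index.get(query[q:q + ktup], []):
            hits.append((q, t))
    return hits


def group_by_diagonal(hits):
    """Hot spots with the same (text_pos - query_pos) lie on the same diagonal."""
    diagonals = defaultdict(list)
    for q, t in hits:
        diagonals[t - q].append(q)
    return dict(diagonals)


index = build_ktup_index("gtgccctgaa", ktup=2)        # pre-processed "database"
hits = find_hot_spots("cttgcctgga", index, ktup=2)    # query scanned once
for diagonal, query_positions in sorted(group_by_diagonal(hits).items()):
    print(diagonal, query_positions)
```

Building the index costs one pass over the database and can be reused for every query, which is why pre-processing pays off.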

2. Find best diagonal runs:

• Each hot spot gets a positive score.

• The distance between hot spots gets a negative score that depends on its length

• The score of a diagonal run is the sum of these hot-spot and distance scores

• FASTA finds and stores the 10 best diagonal runs
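One way to picture this scoring step: walk along the hot spots of a single diagonal, add a fixed reward per hot spot, subtract a penalty per unmatched position between them, and keep the best-scoring stretch. The values `hit_score=5` and `gap_cost=1` and the function name are assumptions for illustration only; they are not FASTA's real parameters.

```python
def best_run_on_diagonal(positions, ktup, hit_score=5, gap_cost=1):
    """Best-scoring stretch of hot spots on one diagonal.

    positions: sorted query start positions of the hot spots on the diagonal.
    Returns (score, (first_query_pos, last_query_pos)) or (0, None) if empty.
    """
    best_score, best_span = 0, None
    run_score, run_start, previous = 0, None, None
    for pos in positions:
        if previous is not None:
            gap = max(0, pos - (previous + ktup))   # unmatched positions in between
            extended = run_score - gap * gap_cost + hit_score
        else:
            extended = None
        if extended is not None and extended >= hit_score:
            run_score = extended                    # joining the earlier run pays off
        else:
            run_score, run_start = hit_score, pos   # start a fresh run here
        previous = pos
        if run_score > best_score:
            best_score, best_span = run_score, (run_start, pos + ktup - 1)
    return best_score, best_span


print(best_run_on_diagonal([2, 3, 4, 8], ktup=2))
print(best_run_on_diagonal([2, 3, 4, 8], ktup=2, gap_cost=10))
```

With a small gap cost the distant hot spot at position 8 is absorbed into the run; with a large gap cost it is left out.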

3. Compute init1 & filter:

• Diagonal runs specify a potential alignment

• Evaluate properly using a substitution matrix

• Define the best scoring run as init1

• Discard any runs that score much lower

4. Combine diagonal runs and compute initn:

• Take the ‘good alignments’ from the previous stage

• Now allow gaps/indels

• Combine them into a single, better scoring alignment
  – Construct a directed weighted graph
    • vertices are the runs
    • edge weights represent gap penalties
  – Find the best path through the graph = initn

5. Find the best local alignment

• Use the ‘alignments’ from the previous stage to define a narrow band through the search space

• Go through that band using a dynamic programming approach

• The size of the band depends on the ktup value

• The best local alignment found in this stage is called opt

6. Compare the alignments

• Take the opt or initn scores for each sequence in the database

• Rank according to score

• Use a full dynamic programming algorithm to align the query sequence with the highest-ranking result sequences

FASTA Programs

• fasta3: scans a protein or DNA sequence library for similar sequences

• fastx3/fasty3: compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward and reverse frames

• tfastx3/tfasty3: compare a protein to a translated DNA data bank

• fasts3: compares linked peptides to a protein databank

• fastf3: compares mixed peptides to a protein databank

FASTA Summary

• The alignment produced is not always optimal

• The resulting scores usually compare very well with the dynamic programming solutions

• FASTA is much faster than ordinary dynamic programming algorithms

BLAST

Altschul, Gish, Miller, Myers and Lipman (1990) Basic local alignment search tool. J Mol Biol 215:403-410

• Developed on the ideas of FASTA

• Integrates the substitution matrix in the first stage of finding the hot spots

• Faster hot spot finding

BLAST definitions

• Given two strings S1 and S2

• A segment pair is a pair of equal-length substrings of S1 and S2 aligned without gaps

• A locally maximal segment is a segment whose alignment score (without gaps) cannot be improved by extending or shortening it.

• A maximal segment pair (MSP) in S1 and S2 is a segment pair with the maximum score over all segment pairs.

BLAST Process

• Parameters:
  – w: word length (substrings)
  – t: threshold for selecting interesting alignment scores

• 1. Find all the w-length substrings from the database with an alignment score > t
  – Each of these (similar to a hot spot in FASTA) is called a hit
  – Does not have to be identical
  – Scored using the substitution matrix and the score compared to the threshold t (which determines the number found)
  – Word size can therefore be longer without losing sensitivity: AA 3-7 and DNA ~12
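A toy version of this first stage is sketched below. To stay self-contained it scores DNA words with a simple match/mismatch rule (+5/-4, chosen for illustration) instead of a real substitution matrix, and it enumerates candidate words by brute force; the names `word_score`, `neighbourhood` and `find_hits` are hypothetical.

```python
from itertools import product


def word_score(a, b, match=5, mismatch=-4):
    """Ungapped score of two equal-length words under a toy scoring rule."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def neighbourhood(query_word, threshold, alphabet="acgt"):
    """All words of the same length whose score against query_word exceeds threshold."""
    return sorted(
        "".join(letters)
        for letters in product(alphabet, repeat=len(query_word))
        if word_score(query_word, "".join(letters)) > threshold
    )


def find_hits(query, database, w, threshold):
    """Return (query_pos, database_pos) for every w-length word pair scoring > threshold."""
    hits = []
    for q in range(len(query) - w + 1):
        words = set(neighbourhood(query[q:q + w], threshold))
        for d in range(len(database) - w + 1):
            if database[d:d + w] in words:
                hits.append((q, d))
    return hits


print(neighbourhood("acg", threshold=5))
print(find_hits("acg", "ttacatt", w=3, threshold=5))
```

Lowering the threshold enlarges the neighbourhood, so more (non-identical) words count as hits; this is how the threshold t controls the number of hits found.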

• 2. Extend hits:
  – extend each hit to a locally maximal segment
  – extension of the initial w-sized hit may increase or decrease the score
  – terminate the extension once the score has dropped by more than a threshold below the best score seen
  – keep the best ones: the high-scoring segment pairs (HSPs)

• This first version of BLAST did not allow gaps.
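The extension step can be sketched as an ungapped walk in both directions from the seed, remembering the best score seen and giving up once the running score falls too far below it. The scoring values and the `x_drop` cut-off are illustrative assumptions, and `extend_hit` is a hypothetical name, not a BLAST API.

```python
def extend_hit(query, subject, q_start, s_start, w, match=5, mismatch=-4, x_drop=10):
    """Extend a w-length seed without gaps.

    Returns (score, query_start, subject_start, length) of the best segment pair found.
    """
    def score(a, b):
        return match if a == b else mismatch

    seed = sum(score(query[q_start + k], subject[s_start + k]) for k in range(w))

    def extend(q, s, step):
        """Walk from (q, s) in direction `step`; return (best gain, positions kept)."""
        best_gain, best_len, gain, length = 0, 0, 0, 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            gain += score(query[q], subject[s])
            length += 1
            if gain > best_gain:
                best_gain, best_len = gain, length
            elif best_gain - gain > x_drop:
                break                      # dropped too far below the best score
            q += step
            s += step
        return best_gain, best_len

    right_gain, right_len = extend(q_start + w, s_start + w, +1)
    left_gain, left_len = extend(q_start - 1, s_start - 1, -1)
    return (seed + left_gain + right_gain,
            q_start - left_len, s_start - left_len,
            w + left_len + right_len)


print(extend_hit("ggacgtacgg", "ttacgtactt", q_start=4, s_start=4, w=3))
```

Here the 3-base seed `gta` grows into the 6-base segment `acgtac`, and the mismatching flanks are trimmed because they never improve the score.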

(Improved) BLAST

Altschul, Madden, Schaffer, Zhang, Zhang, Miller & Lipman (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25:3389-3402

• Improved algorithms allowing gaps
  – these have superseded the older version of BLAST
  – two versions: Gapped BLAST and PSI-BLAST

(Improved) BLAST Process

• Find words or hot spots
  – search each diagonal for two w-length words with score >= t
  – further extension is restricted to just these initial words
  – we reduce the threshold t to allow more initial words to progress to the next stage

• Allow local alignments with gaps
  – allow the words to merge by introducing gaps
  – each new alignment comprises two words with a number of gaps
  – unlike FASTA, does not restrict the search to a narrow band
  – as only two-word hits are extended, the new BLAST is about 3x faster

PSI-BLAST

• Iterative version of BLAST for searching for protein domains
  – Uses a dynamic (position-specific) substitution matrix
  – Start with a normal BLAST search
  – Take the results and use these to ‘tweak’ the matrix
  – Re-run the BLAST search until no new matches occur

• Good for finding distantly related sequences but has a high frequency of false-positive hits

BLAST Programs

• blastp compares an amino acid query sequence against a protein sequence database.

• blastn compares a nucleotide query sequence against a nucleotide sequence database.

• blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database.

• tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.

• tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. (SLOW)

Go try them out!

• Links to NCBI and EBI are on the course web site

• Some test sequences will be posted on the course web site

Alignment Heuristics

• Dynamic programming gives better (optimal) alignments but is too slow for database searches

• FASTA and BLAST are based on several assumptions about good alignments
  – substitutions are more likely than gaps
  – good alignments have runs of identical matches

• FASTA is good for DNA sequences but slower

• BLAST is better for amino acid sequences, pretty good for DNA, and the fastest.

Biological Databases (sequences)

• Introduction to Sequence Databases

• Overview of primary query tools and the databases they use (e.g. databases used by BLAST and FASTA)

• Demonstration of common queries

• Interpreting the results

• Overview of annotated ‘meta’ or ‘curated’ databases

• Introduction to Multiple Sequence Alignment

DNA Sequence Databases

• Raw DNA (and RNA) sequence

• Submitted by authors

• Patent, EST and genomic sequences

• Large degree of redundancy

• Little annotation

• Annotation and sequence errors!

Main DNA DBs

• GenBank: US

• EMBL: EU

• DDBJ: Japan

• Celera Genomics: commercial DB

EMBL

• Sources for sequence include:
  – Direct submission: on-line submission tools
  – Genome sequencing projects
  – Scientific literature: DB curators and editorially imposed submission
  – Patent applications
  – Other genomic databases, especially GenBank

International Nucleotide Sequence Database Collaboration

• Partners are EMBL, GenBank & DDBJ

• Each collects sequence from a variety of sources

• New additions to any of the three databases are shared with the others on a daily basis.


Limited annotation

• Unique accession number

• Submitting author(s)

• Brief annotation if available

• Source (cDNA, EST, genomic etc)

• Species

• Reference or Patent details

EMBL file tags

ID - identification (begins each entry; 1 per entry)
AC - accession number (>=1 per entry)
SV - new sequence identifier (>=1 per entry)
DT - date (2 per entry)
DE - description (>=1 per entry)
KW - keyword (>=1 per entry)
OS - organism species (>=1 per entry)
OC - organism classification (>=1 per entry)
OG - organelle (0 or 1 per entry)
RN - reference number (>=1 per entry)
RC - reference comment (>=0 per entry)
RP - reference positions (>=1 per entry)
RX - reference cross-reference (>=0 per entry)
RA - reference author(s) (>=1 per entry)
RT - reference title (>=1 per entry)
RL - reference location (>=1 per entry)
DR - database cross-reference (>=0 per entry)
FH - feature table header (0 or 2 per entry)
FT - feature table data (>=0 per entry)
CC - comments or notes (>=0 per entry)
XX - spacer line (many per entry)
SQ - sequence header (1 per entry)
bb - (blanks) sequence data (>=1 per entry)
// - termination line (ends each entry; 1 per entry)
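Because every line starts with a two-letter code, an entry can be split into fields with very little code. The sketch below assumes the usual flat-file layout (code in the first two columns, data starting at column 6) and uses a made-up entry that is not a real database record; `parse_embl_entry` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict


def parse_embl_entry(text):
    """Group the lines of one entry by line code; return (fields, sequence)."""
    fields = defaultdict(list)
    sequence = []
    for line in text.splitlines():
        code = line[:2]
        if code == "//":                 # termination line ends the entry
            break
        if code == "XX" or not line.strip():
            continue                     # spacer lines carry no data
        if code == "  ":
            # Sequence lines: blocks of bases followed by a running base count.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        else:
            fields[code].append(line[5:].strip())
    return dict(fields), "".join(sequence)


# Made-up entry for illustration only.
example = "\n".join([
    "ID   EXAMPLE1; made-up entry for illustration",
    "XX",
    "AC   X00000;",
    "XX",
    "DE   Hypothetical example sequence",
    "XX",
    "SQ   Sequence 20 BP;",
    "     gtgccctgaa cttgcctgga        20",
    "//",
])

fields, sequence = parse_embl_entry(example)
print(fields["AC"], fields["DE"], sequence)
```

For real work an established parser is a better choice; this only shows how the tag structure makes the format easy to process.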

Database size:

• 16,759,535,577 bases (27/1/02)
• 35,602,556,374 bases (17/1/03)
• 53,958,991,118 bases (24/1/04)

[Figures: bases by organism for 2002, 2003 and 2004]

17 Subdivisions

ESTs                     EST
Bacteriophage            PHG
Fungi                    FUN
Genome survey            GSS
High Throughput cDNA     HTC
High Throughput Genome   HTG
Human                    HUM
Invertebrates            INV
Mus musculus             MUS
Organelles               ORG
Other Mammals            MAM
Other Vertebrates        VRT
Plants                   PLN
Prokaryotes              PRO
Rodents                  ROD
STSs                     STS
Synthetic                SYN
Unclassified             UNC
Viruses                  VRL

ESTs

• Expressed Sequence Tags
  – short mRNA samples from tissues
  – cloned and sequenced
  – single read
  – approx 1/3 of the database

HTG

• High throughput genomic sequences
  – Partial sequences obtained during genome sequencing.
  – Around 1/3 of the database

Specialist DNA Databases

• Usually focus on a single organism or small related group

• Much higher degree of annotation

• Linked more extensively to accessory data
  – Species specific:
    • Drosophila: FlyBase
    • C. elegans: AceDB
  – Other examples include Mitochondrial DNA, Parasite Genome DB

FlyBase (flybase.bio.indiana.edu)

• Includes the entire annotated genome, searchable by BLAST or by text queries

• Also includes a detailed ontology or standard nomenclature for Drosophila

• Also provides information on all literature, researchers, mutations, genetic stocks and technical resources.

• Full mirror at EBI

Protein DBs

• Primary Sequence DBs
  – Swiss-Prot, TrEMBL, GenPept

• Protein Structure DBs
  – PDB, MSD

• Protein Domain Homology DBs
  – InterPro, CluSTr


Swiss-Prot

• Consists of protein sequence entries

• Contains high-quality annotation

• Is non-redundant

• Cross-referenced to many other databases

• 104,559 sequences in Jan 02

• 120,960 sequences in Jan 03

Swiss-Prot by Species

Number  Frequency  Species
1       8950       Homo sapiens (Human)
2       6028       Mus musculus (Mouse)
3       4891       Saccharomyces cerevisiae (Baker's yeast)
4       4835       Escherichia coli
5       3403       Rattus norvegicus (Rat)
6       2385       Bacillus subtilis
7       2286       Caenorhabditis elegans
8       2106       Schizosaccharomyces pombe (Fission yeast)
9       1836       Arabidopsis thaliana (Mouse-ear cress)
10      1773       Haemophilus influenzae
11      1730       Drosophila melanogaster (Fruit fly)
12      1528       Methanococcus jannaschii
13      1471       Escherichia coli O157:H7
14      1378       Bos taurus (Bovine)
15      1370       Mycobacterium tuberculosis

~20%

~13%

TrEMBL

• Computer-annotated protein DB

• Translations of all coding sequences in the EMBL DNA database

• All sequences already in Swiss-Prot are removed

• November 01: 636,825 peptides

• Jan 17th 2003: 728,713 peptides

• TrEMBL new is a weekly update

• GenPept is the GenBank equivalent

Database Search Methods

• Text based searching of annotations and related data: SRS, Entrez

• Sequence based searching: BLAST, FASTA, MPSearch


SRS

• Sequence Retrieval System– Powerful search of EMBL annotation

– Linked to over 80 other data sources

– Also includes results from automated searches

SRS data sources

• Primary Sequence: EMBL, SwissProt

• References/Literature: Medline

• Protein Homology: Prosite, Prints

• Sequence Related: Blocks, UTR, Taxonomy

• Transcription Factor: TFACTOR, TFSITE

• Search Results: BLAST, FASTA, CLUSTALW

• Protein Structure: PDB

• Also: Mutations, Pathways, other specialist DBs


Entrez

• Text based searching at NCBI’s Genbank

• Very simple and easy to use

• Not as flexible or extendable as SRS

• No user customisation

Sequence Based Searching

• Queries:
  – DNA query against DNA db
  – Translated DNA query against Protein db
  – Translated DNA query against translated DNA db
  – Protein query against translated DNA db
  – Protein query against Protein db

• BLAST & FASTA

BLAST

Version   Query              DB
blastn    DNA                DNA
blastp    Peptide            Peptide
blastx    DNA (translated)   Peptide
tblastn   Peptide            DNA (translated)
tblastx   DNA (translated)   DNA (translated)

BLAST Key Parameters

Database: Which DNA/Protein db to use.

Program: Blastp/Blastn etc

Matrix: Substitution score matrix e.g. Blosum62

Expect: Cut-off E value for significant scores

Scores: How many results to summarise

Alignments: How many full alignments to provide

Open Gap: Penalty for opening a new gap

Extend Gap: Penalty for extending a gap by 1

Main FASTA Programs

fasta3           DNA-DNA or protein-protein similarity search
fastx3           translated DNA against protein DB
tfastx3          protein against translated DNA DB
fasty3, tfasty3  slower but better versions of fastx3/tfastx3 that allow frameshifts within codons

Generally use FASTA for nucleotide searches, BLAST for protein searches

FASTA Key Parameters

Database: Which DNA/Protein db to use.

Program: fastx3, tfasty3 etc

Matrix: Substitution score matrix e.g. Blosum50

KTUP: Word length to use in search

Scores: How many results to summarise

Alignments: How many full alignments to provide

Open Gap: Penalty for opening a new gap

Extend Gap: Penalty for extending a gap by 1

Initial Strategies

• Use a good server with up-to-date databases

• Run BLAST as a first choice (it's quick)

• If appropriate, translated DNA or protein searches are better.

• Refine using FASTA, Smith-Waterman (SW) programs or protein prediction packages

Scores

• The raw scores returned by BLAST and FASTA are not in themselves all that useful.

• The E-value (expect) is the number of hits scoring at least this well that you would expect to find by chance in that query, i.e. the expected number of false positives. A low E-value indicates a higher confidence level

P value

• The probability of the observed score (the probability that it happened by chance) can be calculated from E:

P = 1 - e^(-E)
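The conversion from E to P is a one-liner. The sketch below uses `math.expm1` so that very small E-values do not lose precision; the function name `p_value` is just for this example.

```python
import math


def p_value(e_value):
    """P = 1 - exp(-E), computed as -expm1(-E) to stay accurate for tiny E."""
    return -math.expm1(-e_value)


for e in (10.0, 1.0, 0.01, 1e-20):
    print(e, p_value(e))
```

For small E the two numbers are almost identical (P is approximately E), which is why E-values and P-values are often used interchangeably for strong hits.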


Secondary Databases

• PDB

• Pfam

• PRINTS

• PROSITE

• ProDom

• SMART

• TIGRFAMs

PDB

• Molecular Structure Database (EBI)

• Contains the 3D structure coordinates of ‘solved’ protein sequences
  – X-ray crystallography
  – NMR spectra

• 19,749 protein structures


Multiple Sequence Alignment

• What and Why?

• Dynamic Programming Methods

• Heuristic Methods

• A further look at Protein Domains

Multiple Alignment

• Normally applied to proteins

• Can be used for DNA sequences

• Finds the common alignment of >2 sequences.

• Suggests a common evolutionary source between related sequences based on similarity
  – Can be used to identify sequencing errors

Multiple Alignment of DNA

• Take multiple sequencing runs

• Find overlaps
  – variation of ends-free alignment

• Locate cloning or sequencing errors

• Derive a consensus sequence

• Derive a confidence degree per base

Consensus Sequences

• Look at several aligned sequences and derive the most common base for each position.
  – Several ways of representing consensus sequences
  – Many consensus sequences fail to represent the variability at each base position.
  – Largely replaced by Sequence Logos but the term is often mis-applied
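A simple majority-vote consensus can be sketched in a few lines. The aligned sequences below are invented for illustration (they are not real TATA-box data), and returning the frequency next to each base is just one possible way to keep some of the per-position variability.

```python
from collections import Counter


def consensus(aligned):
    """Most common character per column, with its frequency as a crude confidence."""
    result = []
    for column in zip(*aligned):
        base, count = Counter(column).most_common(1)[0]
        result.append((base, count / len(column)))
    return result


aligned = ["tataaa", "tatata", "tataaa", "tattaa"]   # made-up example alignment
calls = consensus(aligned)
print("".join(base for base, _ in calls))
print([round(freq, 2) for _, freq in calls])
```

The plain consensus string hides the fact that some columns are less certain than others; the frequencies make that visible.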

Sequence Logos

• Example, from an alignment of the TATA box in yeast genes:

[Figure: sequence logo of the yeast TATA box alignment]

We now have a confidence level for each base at each position
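In a sequence logo the height of each stack is the information content of that column in bits. A minimal sketch of that calculation for DNA is shown below; it omits the small-sample correction that logo software normally applies, and `information_content` is a name chosen for this example.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=4):
    """Bits of information in one alignment column: log2(alphabet) minus entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


for column in ("tttt", "aatt", "acgt"):
    print(column, information_content(column))
```

A fully conserved DNA column carries 2 bits, a column with all four bases equally frequent carries none.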

Multiple Alignment of Proteins

• Identify Protein Families

• Find conserved Protein Domains

• Predict evolutionary precursor sequences

• Predict evolutionary trees

Protein Families

• Proteins are complex structures built from functional and structural sub-units
  – When studying protein families it is evident that some regions are more heavily conserved than others.
  – These regions are generally important for the structure or function of the protein
  – Multiple alignment can be used to find these regions
  – These regions can form a signature to be used in identifying the protein family or functional domain.

Protein Domains

• Evolution conserves sequence patterns due to functional and structural constraints.

• Different methods have been applied to the analysis of these regions.

• Domains are also known by a range of other names: motifs, patterns, prints, blocks

Multiple Alignment

• OK, we now have an idea WHY we want to try to do this

• What does a multiple alignment look like?

• How could we do multiple alignments?

• What are the practical implications?

Multiple alignment table

P EICHFLHTR EGNQHAAER NSASGVLF
EICHFLHTC EGNQHAAEL NSASGVLFS
EISQFVDVQ DSNPNAAQL ALASKVLFA
EITQHLFLQ VKKQILDKV YCPPEASS
VQ EITQHLFLQ VKKQILDKI YCPPEAS2

A consensus character is the one that minimises the distance between it and all the other characters in the column

Conservative or identical residues are highlighted

Scoring Multiple Alignments

• We need to score columns with more than 2 bases or residues:

ColumnCost(S, C, A, P, P) = 24

Multiple alignments are usually scored on cost/difference rather than similarity

Column Costs

• Several strategies exist for calculating the column cost in a multiple alignment

• Simplest is to sum the pairwise costs of each base/residue pair in the column using a matrix (e.g. PAM250).

• Gap scoring rules can be applied to these as well.

Scoring Multiple Alignments

• Score = (S,C)+(S,A)+(S,P)+(S,P)+(C,A)+(C,P)+(C,P)+(A,P)+(A,P)+(P,P)

ColumnCost(S, C, A, P, P) = 24

Known as the sum-of-pairs scoring method

Sum-of-pairs cost method (SP)

• Score = (S,-)+(S,A)+(S,P)+(S,P)+(-,A)+(-,P)+(-,P)+(A,P)+(A,P)+(P,P)

ColumnCost(S, -, A, P, P) = 24

Still works with gaps using whatever gap penalty you want

Multiple Alignment Cost

• Sum of pairs is a simple method to get a score for each column in a multiple alignment

• Based on matrices and gap penalties used for pairwise sequence alignment

• The score of the alignment is the sum of the column scores
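The sum-of-pairs calculation can be sketched as below. The pair costs used here (0 for identical characters, 1 for a mismatch, 2 against a gap) are placeholder values for illustration; they are not PAM250-derived and will not reproduce the column cost quoted on the slides.

```python
from itertools import combinations


def pair_cost(a, b, mismatch=1, gap=2):
    """Toy pairwise cost: identical characters (including gap-gap) are free."""
    if a == b:
        return 0
    if a == "-" or b == "-":
        return gap
    return mismatch


def column_cost(column, cost=pair_cost):
    """Sum the pairwise cost over every pair of characters in one column."""
    return sum(cost(a, b) for a, b in combinations(column, 2))


def alignment_cost(rows, cost=pair_cost):
    """The cost of an alignment is the sum of its column costs."""
    return sum(column_cost(column, cost) for column in zip(*rows))


print(column_cost("SCAPP"))
print(column_cost("S-APP"))
print(alignment_cost(["SS", "C-", "AA", "PP", "PP"]))
```

A column of 5 characters contributes 10 pairs; in general k sequences give k(k-1)/2 pairs per column.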

Optimal Multiple Alignment

• The best alignment is generally the one with the lowest score (i.e. least difference)
  – depends on the scoring rules used.

• Like pairwise cases, each alignment represents a path through a matrix

• For multiple alignment, the matrix is n-dimensional
  – where n = number of sequences

(Murata, Richardson and Sussman 1999)

Contrasting pairwise and multiple alignments

Let's compare pairwise with three sequences.

• Pairwise: from cell (0,0) there are 3 possible moves: (1,-), (-,1), (1,1)

• Three sequences: from cell (0,0,0) there are 7 possible moves: (1,-,-), (-,1,-), (-,-,1), (1,1,-), (1,-,1), (-,1,1), (1,1,1)

NP-Completeness

• A problem is solvable in polynomial time if an algorithm exists that runs in O(n^c)
  – c: some constant
  – n: size of the input

• Pairwise alignment is solvable in polynomial time, O(n^2)

• More difficult problems are NP-complete

Multiple alignment complexity

• For k sequences of average length n

• The k-dimensional matrix has (n+1)^k cells to compute.

• Each entry can be computed in O(2^k) time

• Running time of the overall algorithm is:

O((2n)^k)

• The real problem hits when considering protein sequences, which average ~400 residues
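To get a feel for these numbers, the sketch below counts the cells of the k-dimensional matrix and the moves into each cell, using n = 400 as on the slide. Each cell has 2^k - 1 predecessors (every sequence either advances or takes a gap, excluding the all-gap step), which is where the O(2^k) work per cell comes from. The function names are chosen for this example.

```python
from itertools import product


def moves(k):
    """All steps in a k-dimensional DP grid: 1 = advance a residue, 0 = gap.

    The all-gap step is excluded because it makes no progress.
    """
    return [step for step in product((0, 1), repeat=k) if any(step)]


def dp_cells(n, k):
    """Number of cells for k sequences of length n."""
    return (n + 1) ** k


def dp_work(n, k):
    """Cells multiplied by the number of predecessor cells examined for each."""
    return dp_cells(n, k) * len(moves(k))


for k in range(2, 6):
    print(k, len(moves(k)), dp_cells(400, k), dp_work(400, k))
```

Going from 2 to 3 sequences of length 400 raises the cell count from about 1.6 x 10^5 to about 6.4 x 10^7, and each further sequence multiplies it by roughly 400 again.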

MA: Dynamic Programming

• We can use dynamic programming in some small cases.

• For x sequences, build an x-dimensional hypercube.

• Solve as before using gap and substitution penalties, but remembering that there are more routes to each cell in the hypercube

• Space complexity is huge:
  – O(n^k) cells for k sequences of average length n

• Computational complexity is huge

• In practice the DP method is only feasible for small numbers of short strings

NF1 fragment

gi|13959414|sp|P97526|NF1_RAT 107 P EICHFLHTR EGNQHAAER NSASGVLF SCNN----- ---FNAVFR ISTRLQEL
gi|15313417|ref|XP_053355.1| 109 EICHFLHTC EGNQHAAEL NSASGVLFS SCNN----- ---FNAVFS ISTRLQELT
gi|17738199|ref|NP_524502.1| 109 EISQFVDVQ DSNPNAAQL ALASKVLFA SQNH----- ---FSAVFN ISARIQELT
gi|1170925|sp|P46662|MERL_MOUS 97 VQ EITQHLFLQ VKKQILDKV YCPPEASS ASYA----- ---VQAKGD YDPSVHK
gi|18044276|gb|AAH20257.1|AAH2 95 VQ EITQHLFLQ VKKQILDKI YCPPEAS2 ASYA----- ---VQAKGD YDPSVHK
gi|18307455|emb|CAD21515.1| 121 PAQDLYHDDIISITRSTLIRVSAGSVNLVLDALLGLLEDLARPYRSIFNHPPLVLLSEIY
consensus 121 eif hv l kh aler as l s v k lsel

gi|13959414|sp|P97526|NF1_RAT 152 V CSEDNVDVD IELLQYIN --VDCAKLR LLKETAFKK ALKKVAQLV INSLEKAF
gi|15313417|ref|XP_053355.1| 155 CSEDNVDVH IELLQYIN- --VDCAKLK LLKETAFKF ALKKVAQLA INSLEKAFW
gi|17738199|ref|NP_524502.1| 155 CSEENPDYN IELIQHID- --MDMIKLT LLQETITKF S-KRAPPLI LYSLEKAIW
gi|1170925|sp|P46662|MERL_MOUS 142 GF LAQEELLKR VINLYQMS --MWEERTA WYAEHRGAR DEAEMEYKI AQDLEMY
gi|18044276|gb|AAH20257.1|AAH2 139 GF LAQEELLKR VINLYQM2 --MWEERTA WYAEHRGAR DEAEMEYKI AQDLEMY
gi|18307455|emb|CAD21515.1| 181 ALELIADCCASHWGPSHTENGLGDSYALSSAQRAALPPPKPLDDVLVRRTFEVIKLLFDP
consensus 181 g lenedvv r liq yn m k l k dka i el if

gi|13959414|sp|P97526|NF1_RAT 203 N WVENYPDE KLYQIPQTM AECAGKLFL VDGFAESTR -KAAVWPLI ILLILCPE
gi|15313417|ref|XP_053355.1| 206 WVENYPDEF KLYQIPQTD AECAEKLFD VDGFAESTK -KAAVWPLQ ILLILCPEI
gi|17738199|ref|NP_524502.1| 205 WIEYHPQEF DLQRGTNRD STCWEPLMD VEYFKTENK SKTLVWPLQ LLLILNPSC
gi|1170925|sp|P46662|MERL_MOUS 193 VN YFTIRNKS ELLLGVDLG LHIYDPERL TPKISFPNE IRNISYSKE FTIKPLD
gi|18044276|gb|AAH20257.1|AAH2 189 VN YFAIRNK2 ELLLGVDLG LHIYDPERL TPKISFPNE IRNISYSKE FTIKPLD
gi|18307455|emb|CAD21515.1| 241 YPDGYTLPAKTLLDETTAKPAAVRLPDEVMRTPVSTPSNEPLESSHLLQLHATAIEAHVK
consensus 241 y i dk le iig dma dldrv k s v k ilii ple

Center Star Method

• Given a set of strings, define the center string Sc as the string that minimises the sum of distances from all other sequences.
  – Find Sc
  – Consecutively add on the other sequences so that the alignment of each with Sc is optimal.
  – Add spaces where needed to all pre-aligned sequences

• The center star alignment costs no more than twice the cost of the true dynamic programming solution
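Choosing the center string only needs pairwise distances. The sketch below uses plain edit distance (unit costs for substitution, insertion and deletion) as the distance, which is an assumption made for this example; any pairwise alignment cost could be substituted. The later steps of the method (adding sequences and inserting spaces) are not shown.

```python
def edit_distance(a, b):
    """Unit-cost edit distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,               # delete x
                               current[j - 1] + 1,            # insert y
                               previous[j - 1] + (x != y)))   # match or substitute
        previous = current
    return previous[-1]


def center_string(sequences):
    """The sequence with the smallest total distance to all the others."""
    return min(sequences, key=lambda s: sum(edit_distance(s, t) for t in sequences))


sequences = ["aggt", "acgt", "acct", "tcgt"]   # made-up example sequences
print(center_string(sequences))
```

Finding the center takes k(k-1)/2 pairwise comparisons for k sequences, which is far cheaper than filling a k-dimensional matrix.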


Iterative pairwise alignment

• In the center star method we align the chosen center string with all the others in no particular order.

• Often some of the other sequences will be closer to each other and form clusters

• The tricky part is deciding how to define "close" and how to cluster them


How do we cluster sequences?


Building trees

• Need to define how the sequences are related to one another.

• Most methods use the distances between pairs in the set of sequences.

• The key choice is how to define the distance score.


Clustering Methods

• Unweighted Pair Group Method using Arithmetic averages, or UPGMA

• Simple and based on distance pairs

• Each stage joins two clusters creating a new node


An example tree

[Figure: example tree with leaves 1, 2 and 3]

Sequences 1 and 2 are the most closely related. Each sequence lies on its own leaf.


UPGMA

• Assign each sequence to its own cluster

• Create a leaf at height zero for each cluster

• Determine the two closest clusters

• Align the two sequences to define a new cluster at the next level up.

• Remove the two pre-existing clusters and start over.

• End at two clusters
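
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the input is a precomputed table of pairwise distances, the output is a nested tuple `(left, right, height)`, and the loop runs on until a single root remains instead of stopping at two clusters. The name `upgma` and the example distances are hypothetical.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a UPGMA tree.

    `dist` maps frozenset({a, b}) to the distance between leaves a and b.
    Returns nested tuples of the form (left, right, height).
    """
    # Map each current cluster (a leaf name or a subtree) to its leaves.
    clusters = {name: [name] for name in names}

    def average(c1, c2):
        # Arithmetic mean of all leaf-to-leaf distances between two clusters.
        pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average(*p))
        height = average(c1, c2) / 2  # new node sits halfway between the two
        leaves = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2, height)] = leaves
    return next(iter(clusters))


dist = {
    frozenset(("1", "2")): 2,
    frozenset(("1", "3")): 6,
    frozenset(("2", "3")): 6,
}
tree = upgma(["1", "2", "3"], dist)
print(tree)
```

With these distances, sequences 1 and 2 are joined first and sequence 3 is attached at the root.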


Drawbacks of UPGMA

[Figure: the correct tree for sequences 1 to 4, shown beside the incorrect tree produced by UPGMA]


Nearest Neighbour

• Similar to the UPGMA algorithm

• UPGMA works on distance between sequence pairs alone

• Nearest Neighbour compensates for the path through the tree to correct situations where distance alone would incorrectly pair two sequences
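
A hedged Python sketch of the difference, assuming the slide refers to the neighbour-joining criterion of Saitou and Nei (the method ClustalW uses). For each pair it subtracts the total distance of both members to every other sequence, so a pair is not joined merely because both lie on short branches. The distances below are invented for illustration: they come from a tree in which (1, 2) and (3, 4) are the true neighbours but sequences 2 and 4 sit on long branches.

```python
from itertools import combinations


def d(dist, a, b):
    return dist[frozenset((a, b))]


def closest_pair(names, dist):
    """Pair with the smallest raw distance (the choice UPGMA makes first)."""
    return min(combinations(names, 2), key=lambda p: d(dist, *p))


def nj_pair(names, dist):
    """Pair minimising Q(i, j) = (n - 2) * d(i, j) - r_i - r_j."""
    n = len(names)
    # r[a] is the summed distance from a to every other sequence.
    r = {a: sum(d(dist, a, b) for b in names if b != a) for a in names}
    return min(combinations(names, 2),
               key=lambda p: (n - 2) * d(dist, *p) - r[p[0]] - r[p[1]])


names = ["1", "2", "3", "4"]
dist = {
    frozenset(("1", "2")): 5, frozenset(("3", "4")): 5,
    frozenset(("1", "3")): 3, frozenset(("1", "4")): 6,
    frozenset(("2", "3")): 6, frozenset(("2", "4")): 9,
}
print(closest_pair(names, dist))  # raw distance pairs 1 with 3
print(nj_pair(names, dist))       # corrected criterion pairs true neighbours
```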


Back to multiple alignment heuristics


Iterative Pairwise Alignment

• Can be used as a strategy for growing groups of profiles from multiple sequences

• This approach uses pairwise alignment scores to add one additional sequence at a time to a growing multiple alignment.


Iterative Pairwise Alignment

• First align all pairs of strings where one is already in a multiple alignment and one is not yet aligned.

• Find the closest matches.

• Align the unassigned sequence with the family profile of the closest group

• Realign the group and get a new profile.


Feng-Doolittle

• Feng-Doolittle 1987 Journal of Molecular Evolution 25:351-360

• The key principle is that the two most similar sequences in a multiple alignment are the most recently diverged.

• Therefore the pairwise alignment of these two sequences is the most reliable of the entire group

• Gaps present in that alignment should therefore be preserved in the multiple alignment.


Feng-Doolittle

• Calculate the pairwise alignment scores for each pair of sequences

• Construct a tree using distances derived from these scores

• Traverse the nodes of the tree in order of addition (most similar first)

• Progressively align the sequences, starting with the most similar

• Once a gap is established in the multiple alignment it stays.
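
The slides do not say how alignment scores become distances. One form usually quoted for Feng-Doolittle normalises the observed score between a random-alignment baseline and the self-alignment maximum, then takes the negative log. The Python sketch below assumes that form; the function name, argument names and example scores are hypothetical.

```python
import math


def feng_doolittle_distance(s_obs, s_rand, s_self_a, s_self_b):
    """Distance D = -ln(S_eff), with S_eff = (S_obs - S_rand) / (S_max - S_rand).

    s_obs    : score of the pairwise alignment of sequences a and b
    s_rand   : expected score for aligning shuffled versions of a and b
    s_self_* : score of each sequence aligned to itself (S_max is their mean)
    """
    s_max = (s_self_a + s_self_b) / 2
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    return -math.log(s_eff)


# Assumed example scores: halfway between random and identical.
distance = feng_doolittle_distance(s_obs=60, s_rand=10, s_self_a=100, s_self_b=120)
print(distance)
```

Identical sequences give a distance of zero, and the distance grows without bound as the observed score approaches the random baseline.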


ClustalW

• Uses a modification of the Feng-Doolittle algorithm

• Very common software package for multiple alignment


ClustalW

• Starts by calculating pairwise alignments and converting scores to distances

• Uses a neighbour joining algorithm to build a tree from the distances

• Aligns sequences to each other

• Aligns sequences to profiles

• Aligns profiles to profiles

• Can output a multiple alignment as well as a predicted evolutionary tree


MSA

• Exploits the fact that the alignment paths of closely related sequences lie close to the main diagonal of the DP table.

• Estimates a good solution, then removes cells of the hypercube that the optimal alignment path could not feasibly pass through.


CAP

• Contig Assembly Program

• Designed to optimise alignments between multiple DNA sequences that are suspected to overlap.

• Uses a fast heuristic prescreen then finishes using a dynamic programming approach.


CAP

• Take all the sequences and split them into short fragments

• Eliminate fragment pairs that could not possibly overlap

• The dynamic programming algorithm is used to find the maximal scoring overlaps

• Scores are weighted so that sequencing errors are low cost and mutations higher


Consensus Sequences

• The consensus sequence is the concatenation of the consensus characters

• The alignment error of the multiple alignment is the sum of the distance costs of each consensus character in the consensus sequence.


Scoring Multiple Alignments

• Distance from Consensus

– In each column, count the number of characters that are different from the consensus sequence.

• Sum of Pairs (covered already)

– Sum the pairwise distances between all sequence pairs
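
Both scores can be sketched in a few lines of Python. This assumes a unit mismatch cost (0 for identical characters, 1 otherwise, with a gap treated as an ordinary character) and takes the most frequent character in each column as the consensus; the toy alignment is invented for illustration.

```python
from collections import Counter
from itertools import combinations

alignment = ["ACG-T", "ACGAT", "TCG-T"]


def distance_from_consensus(rows):
    """Count, over all columns, the characters that differ from the consensus."""
    total = 0
    for column in zip(*rows):
        consensus_count = Counter(column).most_common(1)[0][1]
        total += len(column) - consensus_count
    return total


def sum_of_pairs(rows):
    """Sum the mismatch counts over every pair of aligned rows."""
    return sum(sum(x != y for x, y in zip(a, b))
               for a, b in combinations(rows, 2))


print(distance_from_consensus(alignment))
print(sum_of_pairs(alignment))
```

A real scoring scheme would normally replace the 0/1 cost with a substitution matrix and explicit gap penalties.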


Scoring Multiple Alignments

• Evolutionary Tree alignment

– The weight of the lightest tree that can be constructed from the sequences

– The weight is defined as the number of changes between two adjacent nodes in the tree, summed over all such pairs.


Consensus Sequences

• Given an optimal alignment between >2 sequences, how do we find the consensus sequence?

• Take a multiple alignment in columns of characters


Multiple alignment table

P EICHFLHTR EGNQHAAER NSASGVLF
EICHFLHTC EGNQHAAEL NSASGVLFS
EISQFVDVQ DSNPNAAQL ALASKVLFA
EITQHLFLQ VKKQILDKV YCPPEASS
VQ EITQHLFLQ VKKQILDKI YCPPEAS2

The consensus character is the one that minimises the distance between it and all the other characters in the column
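
Under a unit mismatch cost, the character that minimises the summed distance to a column is simply the most frequent one. A minimal Python sketch under that assumption (a substitution matrix would need an explicit minimisation over candidate characters instead); the example rows are invented.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character per column, concatenated into one string."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*rows))


rows = ["ACGT", "ACGA", "TCGT"]
print(consensus(rows))
```

Columns with tied counts have no unique consensus character under this cost, so a real tool needs a tie-breaking rule.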


Profiles, domains etc


Profiles

• Given a sequence, we often want to assign the sequence to a family of known sequences

• We often also want to assign a subsequence to a family of subsequences.


Profiles

• Examples include assigning a gene/protein to a known gene/protein family, e.g.

– G protein-coupled receptors

– actins

– globins


Profiles

• Also we may wish to find known protein domains or motifs that give us clues about structure and function

– Phosphorylation sites (regulated site)

– Leucine zipper (DNA binding)

– EF hand (calcium binding)


Creating Profiles

• Aligning a sequence to a single member of the family is not optimal

• Create profiles of the family members and test how similar the sequence is to the profile.

• A profile of a multiply aligned protein family gives us letter frequencies per column.
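
A minimal Python sketch of turning a multiple alignment into per-column letter frequencies. The function name and the toy DNA alignment are illustrative; real profiles usually add pseudocounts so that unseen letters do not get probability zero.

```python
from collections import Counter


def build_profile(rows):
    """Return one {letter: frequency} dictionary per alignment column."""
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({letter: n / len(column) for letter, n in counts.items()})
    return profile


rows = ["ACGT", "ACGA", "TCGT", "ACGT"]
profile = build_profile(rows)
for position, frequencies in enumerate(profile):
    print(position, frequencies)
```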


Matching sequences to profiles

• We can define a distance/similarity cost for a base in each sequence being present at any location based on the probabilities in the profile.

• We define costs for opening and extending gaps in the sequence or profile.

• Therefore we can essentially treat the alignment of a sequence to a profile as a pairwise alignment and use dynamic programming algorithms to find and score the optimal alignments.


Hidden Markov Models

• Widely used in bioinformatics

• Based on probabilistic models that can be derived by training against real data

• Common uses:

– Protein profiles

– Gene prediction

– Protein structure prediction


Protein profiles

• Multiple alignments can be used to give a consensus sequence.

• The columns of characters above each entry in the consensus sequence can be used to derive a table of probabilities for any amino acid or base at that position.


Protein profiles

• The table of percentages forms a profile of the protein or protein subsequence.

• With a gap scoring approach, sequence similarity to a profile can be calculated.

• The alignment and similarity of a sequence / profile pair can be calculated using a dynamic programming algorithm.


Protein profiles

• Alternative approaches use statistical techniques to assess the probability that the sequence belongs to a family of related sequences.

• This is calculated by multiplying the probabilities for amino acid x occurring at position y along the sequence/profile.
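
As a small illustration in Python, assuming an ungapped match of a sequence against a profile of the same length. The three-column profile is invented, and letters absent from a column are given probability zero here, which real tools avoid with pseudocounts.

```python
import math

# Hypothetical profile: one {letter: probability} table per position.
profile = [
    {"A": 0.75, "T": 0.25},
    {"C": 1.0},
    {"G": 0.5, "A": 0.5},
]


def sequence_probability(seq, profile):
    """Product over positions y of P(letter x at position y)."""
    return math.prod(column.get(letter, 0.0)
                     for letter, column in zip(seq, profile))


print(sequence_probability("ACG", profile))
print(sequence_probability("TCA", profile))
print(sequence_probability("GCG", profile))
```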


Probabilistic models

• Protein sequences average over 300 residues in length.

• The probability of a given amino acid at random is 0.05 (1 in 20).

• Multiplying many low probabilities together can cause underflow errors.

• Move into log space:

– Take the log of the probabilities and sum.
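
A short Python demonstration using the numbers on this slide: 300 positions, each with probability 0.05. The direct product falls below the smallest representable double and collapses to zero, while the sum of logs stays well within range.

```python
import math

p = 0.05
length = 300

# Direct multiplication: the true value is about 1e-390, which underflows.
product = 1.0
for _ in range(length):
    product *= p

# Log space: add log-probabilities instead of multiplying probabilities.
log_sum = sum(math.log(p) for _ in range(length))

print(product)
print(log_sum)
```

Two sequences can still be ranked by comparing their log scores, because the logarithm preserves order.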


HMMs

• A hidden Markov model (HMM) is a refinement of this approach:

• HMMs can be visualised as finite state machines with a begin and an end state.

• FSMs move through a series of states, emitting some kind of output either at the end or during a transition from one state to another.


Protein profile HMMs

• In the profile of a protein sequence, there are effectively 3 states the model can be in:

– 1. Match (exact or substitution)

– 2. Insertion

– 3. Deletion


Scoring profile HMMs

• The score of a sequence is the product of the probabilities that describe the path taken through the model used to recreate the sequence.

• Again, a log transformation allows the log of the probabilities to be summed rather than the probabilities multiplied.


Calculating the path

• However, normally the path is not known and has to be calculated.

• In most cases, there are multiple paths through the model that can produce a protein sequence.

• The most likely path is calculated using the Viterbi algorithm.


The Viterbi algorithm

• Solved by creating a matrix.

• The columns are the states of the model

– show insertion and match states

– deletions do not emit sequence so can be ignored

• The rows are formed by the sequence


Viterbi algorithm

• The probability that the first amino acid was emitted by state I0 is calculated and forms the origin of the table.

• The probabilities that amino acid 2 is emitted in state M1 or I1 are entered into the matrix.

• The maximum probability is calculated and a pointer back to I0 is set up.

• Repeat until the matrix is filled.

• Repeat until the matrix is filled.
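
The matrix fill and the back-pointers can be sketched in Python for a generic HMM. This is not a full profile HMM: it assumes a toy two-state model (one "match" and one "insert" state, no silent delete states) with invented probabilities, and works in log space as recommended earlier.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Return (most probable state path, log-probability of that path)."""
    # score[s]: log-probability of the best path ending in state s.
    score = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    pointers = []  # one {state: best previous state} table per later symbol
    for symbol in obs[1:]:
        new_score, back = {}, {}
        for s in states:
            prev = max(states, key=lambda r: score[r] + math.log(trans[r][s]))
            new_score[s] = (score[prev] + math.log(trans[prev][s])
                            + math.log(emit[s][symbol]))
            back[s] = prev
        pointers.append(back)
        score = new_score
    # Trace back from the best final state.
    last = max(states, key=score.get)
    path = [last]
    for back in reversed(pointers):
        path.append(back[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical model parameters, for illustration only.
states = ["match", "insert"]
start = {"match": 0.6, "insert": 0.4}
trans = {"match": {"match": 0.7, "insert": 0.3},
         "insert": {"match": 0.4, "insert": 0.6}}
emit = {"match": {"A": 0.5, "C": 0.4, "G": 0.1},
        "insert": {"A": 0.1, "C": 0.3, "G": 0.6}}

path, log_p = viterbi("ACG", states, start, trans, emit)
print(path, math.exp(log_p))
```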


Scoring HMMs

• Once the most probable path is known, the probability of a sequence given the model can be approximated by multiplying all probabilities along the optimal path.

• The forward algorithm is similar, except that it sums over all paths instead of taking the maximum, so no pointers are maintained
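
A Python sketch of the forward algorithm on a toy two-state model with invented probabilities. It uses plain probabilities for readability, which is only safe for short sequences; longer ones need scaling or log-space arithmetic.

```python
import math


def forward(obs, states, start, trans, emit):
    """Total probability of `obs`, summed over every path through the model."""
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        # Sum over all predecessor states instead of taking the maximum.
        f = {s: emit[s][symbol] * sum(f[r] * trans[r][s] for r in states)
             for s in states}
    return sum(f.values())


# Hypothetical model parameters, for illustration only.
states = ["match", "insert"]
start = {"match": 0.6, "insert": 0.4}
trans = {"match": {"match": 0.7, "insert": 0.3},
         "insert": {"match": 0.4, "insert": 0.6}}
emit = {"match": {"A": 0.5, "C": 0.4, "G": 0.1},
        "insert": {"A": 0.1, "C": 0.3, "G": 0.6}}

total = forward("ACG", states, start, trans, emit)
print(total)
```

For this model the best single path through "ACG" has probability 0.01512, so the forward total is larger because it also counts every other path.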


What do HMM scores mean?

• The probability that a sequence could be created by a model is related to the score.

• A high score indicates that the sequence was most likely a member of the HMM profile used.

• A low score suggests it was not.


Viterbi cont’d

• Like the dynamic programming algorithms for pairwise alignment, Viterbi gives a global type score.

• By looking for the maximum score in the table you can find local regions that match the model.


Building HMMs

• The Baum-Welch algorithm

• Starts with a guess at the parameters

• The training set is scored through all possible paths through the model

• New emission and transition probabilities are then calculated and the model re-run

• This continues until the parameter change between iterations is below a set threshold.


Tools for HMM profile searches

• Meme and Mast at UCSD (SDSC)

• http://meme.sdsc.edu/

• MEME

– input: a group of sequences

– output: profiles found in those sequences

• MAST

– input: a profile and sequence database

– output: locations of the profile in the database


Summary

• Multiple alignment is used to define and find conserved features within DNA and protein sequences

• Profiles of multiply aligned sequences are a better description and can be searched using pairwise sequence alignment.

• Many different programs and databases available.


Pfam

• Database of protein domains

• Multiple sequence alignments and profile HMMs

• Entries also annotated

• Swiss-Prot DB all pre-searched

• New sequences can be searched as well.

– 2216 in Pfam in January 02

– 5049 in Pfam in January 03


PRINTS

• Database of ‘protein fingerprints’

• Group of motifs that combined can be used to characterise a protein family

• ~10,000 motifs in PRINTS DB

• Provide more info than motifs alone


InterPro

• EBI managed DB

• Incorporates most protein structure DBs

• Unified query interface and a single results output.


InterPro

DATABASE    VERSION  ENTRIES  DATE
SWISS-PROT  40.33    118357   09-NOV-2002
PRINTS      33.0     1650     24-JAN-2002
TREMBL      22.2     732596   15-NOV-2002
PFAM        7.7      4832     17-OCT-2002
PROSITE     17.25    1587     04-NOV-2002
PREFILE     N/A      162      04-NOV-2002
PRODOM      20001.3  1346     28-JAN-2002
SMART       3.4      654      07-OCT-2002
TIGRFAMs    2.1      1614     12-SEP-2002

Next update: March 2003 will include the PIR and SITE databases


Biological Databases

• Range of primary sequence databases available

• Coordinated efforts to make each non-redundant and synchronised

• Many species have their own specialised databases

• Secondary databases catalogue protein structure, protein families, motifs, domains, fingerprints etc.

• Secondary databases often manually curated.

• Recent efforts to combine the range of secondary DBs