A Proteomics Toolkit:

UniProt, InterPro and IntAct Databases at the EBI

Hinxton, U.K.

European Bioinformatics Institute

Created as part of the EMBL in 1992

• To house EMBL Nucleotide Sequence Data Library established in 1980

Today, 3 databases accept primary nucleotide data:

EBI (EMBL): EMBL

CIB (NIG): DDBJ

NCBI (NIH): GenBank

EMBL-EBI maintains the world’s most comprehensive range of molecular databases

European Bioinformatics Institute (http://www.ebi.ac.uk/)

Nucleotide Sequence Database

Database of Protein Families and Domains

ArrayExpress

Alternative Splicing Database

Protein Sequence Database

Molecular Structure Database

Alternative Transcript Diversity

Automatic Annotation of Genomes

Protein Interaction Database

Chemical Entities of Biological Interest

Gene Ontology

Enzyme Database

Database of Biological Processes

http://www.ebi.ac.uk/services/

Roles of Public Domain Databases

To provide stable, long-term sources of basic information

To respond over the long term to the needs of the community

To act as repositories for published information

To bridge the gap between multiple data sources

Protein Databases

UniProt: Database of Protein Sequences

InterPro: Database of Protein Families and Domains

IntAct: Database of Protein Interactions

World's most comprehensive catalogue of information on proteins

Funded mainly by NIH

A central repository of protein sequence and function

Based on the original work of PIR, Swiss-Prot and TrEMBL

UniProt

[Figure: origins of UniProt. Labels: protein sequencing, nucleotide sequencing, EMBL, translated EMBL, annotation, Swiss-Prot, TrEMBL, PIR-PSD, EBI. Swiss-Prot + TrEMBL + PIR-PSD combine into UniProt.]

UniProt Consortium

UniProt

3 Components:

UniProt Knowledgebase (UniProt)

• Central repository for annotated protein sequences
• Swiss-Prot: non-redundant, manually annotated
• TrEMBL: redundant, automatically annotated

UniProt Reference Clusters (UniRef)

• Combines related sequences to speed up searching
• UniRef100, UniRef90, UniRef50

UniProt Archive (UniParc)

• Comprehensive repository for history of sequences
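The UniRef100, UniRef90 and UniRef50 sets group sequences at 100%, 90% and 50% identity. The sketch below illustrates the idea of collapsing related sequences under one representative so that a search touches fewer entries. It is only a toy: the ungapped `identity` measure, the greedy strategy and the example sequences are all assumptions made for illustration, not the procedure UniProt uses.

```python
def identity(a, b):
    """Toy identity: matching positions over the longer length (no alignment, no gaps)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first representative it matches at >= threshold."""
    clusters = []  # list of (representative, members) pairs
    # Longest sequences are visited first, so they become the representatives.
    for seq in sorted(seqs, key=len, reverse=True):
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


# Hypothetical toy sequences, not real UniProt entries.
seqs = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLLLLL", "GGGGGGGGGG"]
for threshold in (1.0, 0.9, 0.5):
    print(threshold, len(greedy_cluster(seqs, threshold)))
```

Lowering the threshold merges more sequences into each cluster, which is why the 50% set is the smallest and fastest to search.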

UniProt Explicit Links

Databases cross-referenced in UniProt

Sequence: EMBL/GenBank/DDBJ, PIR

PTM: GlycoSuiteDB, PhosSite

Structure: HSSP, PDB, MSD

Domains, Sites, Families: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, TIGRFAM

2D-gel Electrophoresis: ANU-2DPAGE, Aarhus/Ghent-2DPAGE, COMPLUYEAST-2DPAGE, ECO2DPAGE, HSC-2DPAGE, MAIZE-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE

Molecular Interaction: IntAct, TRANSFAC

Miscellaneous: Ensembl, GermOnline, Gene Ontology, MEROPS

Organism-Specific: AGD, dbSNP, DictyBase, EcoGene, EchoBASE, FlyBase, GeneDB_Spombe, GeneFarm, Genew, Gramene, HIV, H-InvDB, LegioList, Leproma, ListiList, MaizeDB, MGD, MypuList, OMIM, PhotoList, Reactome, RGD, SagaList, SGD, StyGene, SubtiList, TAIR, TIGR, TubercuList, WormBase, WormPep, ZFIN


Searching UniProt

http://www.ebi.uniprot.org/index.shtml

Search tools include:

• Text Search
• Power Search
• Blast, Fasta and MPsrch
• Links to extra search services (including SRS)

Text-based searching

• Logical operators ‘&’ (and), ‘|’ (or)
• Wildcards and numerical operators not allowed
• Text Search: keyword queries
• Power Search: can search for specific entry lines
• Warehouse Search: link query to other databases

Text Search Results

Each linked to the UniProt entry

Sequence-based searching

• BLAST, Fasta, MPsrch

Sequence Search Results

UniProt entry

Identity score

View alignments

Manipulate multiple data sets

Use Venn diagrams to combine, intersect, or subtract multiple data sets

Build complex data sets
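The combine, intersect and subtract operations on data sets correspond to ordinary set operations. The sketch below shows the three operations on result sets held as Python sets. The set names and their contents are hypothetical, with accession-like strings used only for illustration. It does not call any UniProt service.

```python
# Hypothetical result sets of accession-like strings from separate searches.
text_hits = {"Q00987", "P04637", "P01308"}
blast_hits = {"Q00987", "P04637", "P38936"}
excluded = {"P01308", "P38936"}

combined = text_hits | blast_hits   # combine: entries in either result set
shared = text_hits & blast_hits     # intersect: entries in both result sets
remaining = combined - excluded     # subtract: drop entries from a third set

print(sorted(combined))
print(sorted(shared))
print(sorted(remaining))
```

Chaining these operations is one way to read "build complex data sets": each intermediate result is itself a set that can be combined again.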

UniProt/Swiss-Prot entry for human ubiquitin-protein ligase E3 mdm2

Some literature search engines pull synonyms from UniProt for more complete searching

Merged entries:

• Remove redundancy
• Can still be searched

IntAct Database

Summary of nucleotide data upon which entry is originally based

Structural data associated with entry protein

All the interactions with entry protein

Experimental information

Experimental name

Experimental technique: co-immunoprecipitation

Literature citation used for curation

Taxonomic Reference

Interaction information

Links to interacting protein

IntAct Database

Displays interactions graphically

View all 7 interactions involving MDM2

View all GO interactions involving MDM2

View all InterPro entries associated with MDM2

Expand graph to see network surrounding one protein

Expand graph to see entire network

View interactions associated with both MDM2 and p53

View all proteins in a network associated with a specific GO term

All proteins in red are associated with “negative regulation of cell proliferation”

Genomic location

Complete nucleotide sequence

SNP information

Transcript and protein information

Transcript structure

Interactive map. Can zoom in/out, and move around

Summary and links to information about processes involving this molecule (here cell-cycle checkpoints)

General

Specific

Mendelian Inheritance in Man

Cellular component

Molecular function

Biological process

InterPro Database

• Allow searching for terms
• Linked to GO

Domain organisation

Position of motifs and sites

Positions of variable splicing

Experimental mutation information

Sequencing conflicts

Secondary structure

Easy navigation between UniProt/UniParc/UniRef

Useful for cut/paste into search engines

UniProt/TrEMBL

>2.5 M entries in TrEMBL

Doubled since mid-2004

Doubled since mid-2001

>200 K entries in Swiss-Prot

UniProt

raw data

Curated automated annotation

TrEMBL ??

Swiss-Prot annotation

UniProt/TrEMBL

Redundancy

Automatically maintained

• Automatic clean-up of nucleotide data

• Automatic annotation

• InterPro run and cross-references updated every 2 weeks

Recognises common annotation in related Swiss-Prot entries

Identifies all members of family using InterPro

Swiss-Prot annotated sequences

uncharacterised

Multiple signatures

INTERPRO provides annotation on multiple levels

Feeds back to TrEMBL

Curated Annotation in InterPro

Entry name uses accession number

Automatic annotation through machine learning

Foundations of InterPro

Manual curation

Integration of signatures

Advantages of integrated signatures

• Greater coverage of proteins
• Signature databases specialised: greater coverage of annotation features
• Relationships between signatures: evolutionary context (unique to InterPro)

Characterisation of Protein Sequences

Build up consensus sequences of families, domains, motifs or sites

Conserved signatures

more sequences

BLAST

Basic information

Finding Conserved Signatures

• Pattern

More information

Simplest (limited)

• Profile

• Fingerprint

• Sequence clustering

• HMM

Patterns

Patterns in sequence are written as regular expressions

Often used to define important sites within proteins

PROSITE: best-known pattern database

Patterns

Example: PS00262 Insulin family signature

B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx

A chain xxxxxCCxxxCxxxxxxxxCx

INS_HUMAN

MALWMRLLPL LALLALWGPD PAAAFVNQHL CGSHLVEALY LVCGERGFFY TPKTRREAED LQVGQVELGG GPGAGSLQPL ALEGSLQKRG IVEQ CCTSICSLYQLENYC N

Regular expression:

C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C

Building a pattern signature:

• Sequence alignment
• Define pattern (insulin family motif)
• Extract pattern sequences
• Build regular expression
• Pattern signature
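The PS00262 pattern can be checked against the INS_HUMAN sequence by translating the PROSITE notation into an ordinary regular expression. The sketch below is a minimal converter, written for illustration only. The function name `prosite_to_regex` is hypothetical, and it handles only the elements that appear in this pattern (`x`, `x(n)`, `[...]`, `{...}`), not anchors or the full PROSITE syntax.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        # Split an element such as "x(3)" into its core "x" and repeat "3".
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"      # any residue except those listed
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


regex = prosite_to_regex("C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C")

# INS_HUMAN as printed on the slide, with the display spaces removed.
ins_human = (
    "MALWMRLLPL LALLALWGPD PAAAFVNQHL CGSHLVEALY LVCGERGFFY TPKTRREAED "
    "LQVGQVELGG GPGAGSLQPL ALEGSLQKRG IVEQCCTSIC SLYQLENYCN"
).replace(" ", "")

match = re.search(regex, ins_human)
print(regex)                             # CC[^P].{2}C[STDNEKPI].{3}[LIVMFS].{3}C
print(match.group(), match.start() + 1)  # CCTSICSLYQLENYC 95
```

The match is the same stretch that the slide sets apart in the A chain region, starting at residue 95 of the precursor.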

Fingerprints

Several discrete motifs characterise family

Highly specific matches to small regions of proteins

PRINTS: best-known fingerprint database

Example: PR00107 Phosphocarrier HPr signature

PTHP_ENTFA:

MEKKEFHIVA ETGIHARPA TLLVQTASK FNSDINLEY KGKSVNLKS IMGVMSLGV GQGSDVTITV DGADEAEGMA AIVETLQKEG LAE

PR00107, a fingerprint with three motifs:

1) GIHARPATLLVQTASKF (His phosphorylation site)

2) KGKSVNLKSIMGVMSL (Ser phosphorylation site)

3) LGVGQGSDVTITVDGADE (Conserved site)

Building a fingerprint signature:

• Sequence alignment
• Define motifs
• Extract motif sequences
• Fingerprint signature: motifs 1, 2, 3 in correct order with correct spacing
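The requirement that all motifs of a fingerprint occur in the correct order can be sketched as follows. This is a deliberately simplified illustration: it looks for exact copies of the three PR00107 motifs in PTHP_ENTFA, whereas a real fingerprint search scores each motif against a matrix built from the aligned family. The function name `match_fingerprint` is hypothetical.

```python
def match_fingerprint(sequence, motifs):
    """Return the 0-based start of each motif if all occur in order, else None."""
    starts = []
    position = 0
    for motif in motifs:
        found = sequence.find(motif, position)
        if found == -1:
            return None          # a motif is missing or out of order
        starts.append(found)
        position = found + 1     # the next motif must start further along
    return starts


# PTHP_ENTFA as printed on the slide, with the display spaces removed.
pthp_entfa = (
    "MEKKEFHIVA ETGIHARPA TLLVQTASK FNSDINLEY KGKSVNLKS IMGVMSLGV "
    "GQGSDVTITV DGADEAEGMA AIVETLQKEG LAE"
).replace(" ", "")

motifs = ["GIHARPATLLVQTASKF", "KGKSVNLKSIMGVMSL", "LGVGQGSDVTITVDGADE"]

starts = match_fingerprint(pthp_entfa, motifs)
print(starts)                                        # [12, 37, 52]
print(match_fingerprint(pthp_entfa, motifs[::-1]))   # None
```

The differences between successive start positions give the spacing, which a fuller implementation would compare against the spacing seen in the family alignment.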

Sequence Clustering

Automatic clustering of homologous domains

Used by ProDom database

• Well-characterised domain families
• Align resulting protein domain families (ProDomAlign)
• Automatically cluster homologous domains (MKDOM2)
• Recruit homologous domains (PSI-BLAST)

Profiles

Sequence alignment → scoring matrix (profile) → sequence search

Matrix: frequency of each residue at each position in alignment

[Figure: alignment of Sequences 1 to 7 and the resulting scoring matrix]

Match values are higher for conserved residues

e.g. Position 1: F>Y>L (phenylalanine and tyrosine are closer than leucine)

Problem: insertions and deletions not well accounted for

Can characterise proteins over entire length (need trusted sequence alignment)

Position-specific scoring good for modelling divergent as well as conserved regions
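The idea of a position-specific matrix can be sketched with plain residue frequencies. The example below builds one frequency table per alignment column and scores a query by summing the frequency of its residue at each position. The alignment is a hypothetical toy, and real profiles go further: they weight substitutions between similar residues (as in the F>Y>L example) and handle gaps, which this sketch does not.

```python
from collections import Counter


def build_profile(alignment):
    """Return one {residue: frequency} table per column of an ungapped alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(alignment) for res, n in counts.items()})
    return profile


def score(profile, seq):
    """Sum the column frequency of each residue in seq; unseen residues score 0."""
    return sum(column.get(res, 0.0) for column, res in zip(profile, seq))


# Hypothetical toy alignment, not taken from a real family.
alignment = ["FKLLS", "FRLLS", "YKLIS", "LKVLS"]
profile = build_profile(alignment)

print(profile[0])                 # {'F': 0.5, 'Y': 0.25, 'L': 0.25}
print(score(profile, "FKLLS"))    # 3.75
print(score(profile, "YRVIS"))    # 2.0
```

The fully conserved last column contributes the most to every score, which is the sense in which match values are higher for conserved residues.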

Hidden Markov Models (HMM)

Large scale profiles

Improvements:

• Probability method gauges scoring parameters
• Allows insertions and deletions

Outperform in sensitivity and specificity

More flexible (can use partial alignments)

[Figure: profile HMM built from a sequence alignment. Begin → M1 → M2 → M3 → M4 → End, with insert states I0 to I4 and delete states D1 to D4. M = match state, I = insert state, D = delete state.]
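The state diagram of a profile HMM can be written down as a list of states and allowed transitions. The sketch below generates the textbook layout for a model with n match states: each position k can move on to the next match state, loop through an insert state, or skip ahead through a delete state. The function name is hypothetical, and the full set of transitions shown is an assumption of this sketch. Specific packages may restrict it.

```python
def profile_hmm_topology(n):
    """Return (states, transitions) for a profile HMM with n match states."""
    match = ["Begin"] + [f"M{k}" for k in range(1, n + 1)] + ["End"]
    inserts = [f"I{k}" for k in range(n + 1)]        # I0 .. In
    deletes = [f"D{k}" for k in range(1, n + 1)]     # D1 .. Dn
    states = match + inserts + deletes

    transitions = []
    for k in range(n + 1):
        # States at position k: the match state (Begin when k == 0), Ik and Dk.
        sources = [match[k], f"I{k}"] + ([f"D{k}"] if k >= 1 else [])
        # From position k: go to the next match state (End after Mn),
        # stay in Ik, or skip the next position through D(k+1).
        targets = [match[k + 1], f"I{k}"] + ([f"D{k + 1}"] if k < n else [])
        transitions += [(s, t) for s in sources for t in targets]
    return states, transitions


states, transitions = profile_hmm_topology(4)
print(len(states), len(transitions))   # 15 39
```

Insert-state self-loops such as `("I2", "I2")` model insertions of any length, and chains of delete states model deletions, which is what a plain profile cannot express.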


HMMER2 package: http://hmmer.wustl.edu/

• HMMbuild
• HMMcalibrate
• Database search


HMM databases:

• PIR SUPERFAMILY
• PANTHER
• TIGRFAM
• PFAM
• SMART
• SUPERFAMILY (special case)
• GENE3D (special case)

Domains conserved in sequence

Families conserved in sequence

Domains conserved in structure

SAM Profile HMMs

(http://www.cse.ucsc.edu/research/compbio/sam.html)

SUPERFAMILY + GENE3D

• Homologous Structural Superfamilies
• Proteins related by structure
• May have low sequence identity
• Often only 1 protein in a family with structural information

SAM:

• Start with single seed sequence
• Uses Target99 (T99) script
• Multiple models per superfamily
• Combine results

SAM T99 Profile HMMs

T99 script:

• Single seed sequence (e.g. GIHARPATLLVQTASKF)
• WU-BLASTP search: close homologues and low identity matches
• Initial HMM
• New larger alignment
• Final HMM


Summary of signature methods

Sequence alignment

Single motif method: extract motif pattern (PROSITE)

Multiple motif methods: extract multiple motifs (PRINTS)

Full alignment methods, full sequence:

1) profile (PROSITE)

2) HMM (PFAM, SMART, SUPERFAMILY, TIGRFAM, PIRSF, GENE3D, PANTHER)

Protein Signature Databases

Patterns: Prosite

Fingerprints: Prints

Sequence clustering: ProDom

Profiles: Prosite

HMM: PIR Superfamily, Panther, Tigrfam, Pfam, Smart

T99-SAM HMM: Gene3D, Superfamily

Prints

Fingerprint is a set of motifs

Full length of protein

PR00000

Can identify small conserved regions in divergent proteins

Use different combinations of motifs to describe families and sibling subfamilies

http://umber.sbs.man.ac.uk/dbbrowser/PRINTS/

Prosite Patterns

Pattern is a regular expression

PS00000

Identify various important sites within proteins

Several models characterise enzymes

Used by UniProt to define catalytic sites

• Enzyme catalytic site
• Prosthetic group attachment
• Metal ion binding site
• Cysteines for disulphide bonds
• Protein or molecule binding

http://us.expasy.org/prosite/

Prosite Profiles

Profile: PS00000

Pattern: PS00000

Describe protein families or domains conserved in sequence

Use curated sequence alignments

Accurate

Profile is a multiple alignment with matrix frequencies

http://us.expasy.org/prosite/

ProDom

Sequence clustering method: automatic process (mkdom2)

PD000000

Groups UniProt sequences into (core) domains conserved in sequence

http://protein.toulouse.inra.fr/prodom/current/html/home.php

Pfam

HMMs built with HMMER2

PF00000

Pfam A: manually curated

Pfam B: automatic clustering

Use trusted cut-offs: accurate

Wide coverage of protein families and domains conserved in sequence

http://www.sanger.ac.uk/Software/Pfam/

Only Pfam A used to build signatures in InterPro

Smart

HMM domains using curated sequence alignments of families from PSI-BLAST

SM00000

Primarily describe domains conserved in sequence

Concentrate on signalling proteins, and extracellular and nuclear domains

http://smart.embl-heidelberg.de/

Tigrfams

HMM families built with curated alignments

TIGR00000

Describe protein families (and domains) conserved in sequence and function

Functional classifications using equivalogs (functionally conserved homologues)

Curated trusted cut-off: very accurate

Use phylogenetic trees: accurate family membership

http://www.tigr.org/TIGRFAMs/

PIRSF

http://pir.georgetown.edu/pirsf/

HMM families using computationally defined non-overlapping clusters of sequences

PIRSF000000

Comprehensive protein family database of full-length models

Describe protein families conserved in sequence and domain composition: homeomorphic

Panther

https://panther.appliedbiosystems.com/

HMM families based on phylogenetic trees

PTHR00000

Comprehensive protein family database of full-length models

Provides family classification by functions, processes, pathways and taxonomy

Use phylogenetic trees: define functionally distinct families

Superfamily

HMMs based on SCOP structural superfamilies

Describe protein domains conserved in structure with evidence of common evolutionary origin

Provides information on structural classification

http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/

Good at describing non-contiguous structural domains

SSF00000

Often define structural domain boundaries

Gene3D

HMM domains based on CATH structural superfamily

G3D.0.0.0.0

Provides information on structural classification

http://cathwww.biochem.ucl.ac.uk/latest/index.html

Describe protein domains conserved in structure with evidence of common evolutionary origin

Always define structural domain boundaries


Good at describing non-contiguous structural domains

Specialisation of databases

Prints: Describe sibling families

Prosite: Identify binding and active sites (enzymes)

ProDom: Describe conserved core of domains

Pfam: Wide coverage of domains and families

Smart: Signalling, extracellular & nuclear domains

Tigrfam: Functional classification of equivalogs

PIRSF: Homeomorphs, conserved in domain composition

Panther: Functional families; best at detecting fragments

Superfamily: Structural-based domain classification

Gene3D: Describe structural domain boundaries

Structural Representation in InterPro

MSD: residue-by-residue mapping of PDB sequence to UniProt amino acid position

InterPro sequence-structure comparison

PDB structures displayed as striped patterns

Structural classification in CATH and SCOP

Homology models from Swiss-Model and ModBase

Structural Representation

CATH and SCOP divide PDB structures into domains

Swiss-Model and ModBase predict structure for regions not covered by PDB

Note that one domain is non-contiguous

Sequence-Structure Display

Structural data for specific proteins

Signatures predictive of protein annotation

Searching InterPro

http://www.ebi.ac.uk/interpro/

Search tools include:

• Text Search
• InterProScan (sequence search)
• SRS (multiple database search)

Text Search Results

Direct links to entry

InterProScan search results

Link to InterPro entry

Link to SRS view of InterPro entry

Enables direct searching of other databases in SRS using InterProScan results

Link to signature database

Mouse-over provides signature data: residue position, E-value, accession ID, and name

Single InterPro entry

InterPro Entry

• Groups similar signatures together and provides relationships between signatures
• Provides extensive manual annotation
• Provides links to other databases
• Provides structural information and viewers

Annotation Fields in InterPro

• Name and short name
• Entry type
• Relationships
• GO mapping
• Abstract
• Structural links
• Database links
• Taxonomy
• Examples
• Publications

InterPro entry for the ligand-binding domain of the nuclear hormone receptor

Protein matches

Shows the InterPro entries that match a protein

Shows each individual signature that matches a protein

Shows structural information for protein with links to PDB, CATH, SCOP

Splice variants

Select data set of these proteins

Detailed information

Family, domain, site, repeat

Links to signature databases

Relationships linking different signatures

Mapping to GO terms

Abstract with references

Contains/Found in: describe composition of protein sequences

Parent/Child: family or domain evolutionary hierarchies

Structural links

Database links

Taxonomy

Overlap with other InterPro entries

Examples

References

Powerful Annotation Tool

Integration of signatures

Greater coverage of annotation features

Relationships provide evolutionary context (unique to InterPro)

Increased coverage of proteins

Enhances functional annotation of TrEMBL

Signature databases: direct links to their annotation

Structural information: structural classification, 3-D viewers

GO mapping: large-scale classification using GO terms

Taxonomy: search/download using taxonomy

Database links: to several databases to increase annotation

Coverage

InterPro signatures cover:

90% of UniProt/Swiss-Prot proteins

69% of UniProt/TrEMBL proteins

>2 million matches in InterPro

>13,000 InterPro entries

>22,000 signature methods

Structural coverage in InterPro:

0.6% of proteins have PDB structures

20% of proteins have Swiss-Model structures

63% of proteins have ModBase structures

>9500 PDB structures in InterPro

>300,000 Swiss-Model links in InterPro

>950,000 ModBase links in InterPro

Availability and downloads

Web access

Tool/Databases:

Downloads

ftp site: ftp://ftp.ebi.ac.uk/pub/databases/


2Can Training and Education

Bioinformatics Educational Resource

Information on EBI Databases

On-line tutorials on EBI Databases and tools

Glossary

Guide to bioinformatics resources on the internet

• EBI web services
• Protein structure
• Nucleotide analysis
• Proteomics analysis
• Protein function
• Genome browsing
• Database browsing

http://www.ebi.ac.uk/

http://www.ebi.ac.uk/2can/

http://www.ebi.ac.uk/interpro/

Acknowledgements

Rolf Apweiler

Amos Bairoch

Cathy Wu

+100 annotators

Nicky Mulder

InterPro Team

InterPro Consortium

Henning Hermjakob

IntAct Team
