A Proteomics Toolkit:
UniProt, InterPro and IntAct Databases at the EBI
Hinxton, U.K.
European Bioinformatics Institute (http://www.ebi.ac.uk/)
Created as part of the EMBL in 1992
• To house the EMBL Nucleotide Sequence Data Library, established in 1980
Today, 3 databases accept primary nucleotide data:
• EMBL, at the EBI (EMBL)
• DDBJ, at the CIB (NIG)
• GenBank, at the NCBI (NIH)
European Bioinformatics Institute (http://www.ebi.ac.uk/)
EMBL-EBI maintains the world's most comprehensive range of molecular databases
Nucleotide Sequence Database
Database of Protein Families and Domains
ArrayExpress
Alternative Splicing Database
Protein Sequence Database
Molecular Structure Database
Alternative Transcript Diversity
Automatic Annotation of Genomes
Protein Interaction Database
Chemical Entities of Biological Interest
Gene Ontology
Enzyme Database
Database of Biological Processes
http://www.ebi.ac.uk/services/
Roles of Public Domain Databases
To provide stable, long-term sources of basic information
To respond over the long term to the needs of the community
To act as repositories for published information
To bridge the gap between multiple data sources
Protein Databases
UniProt: Database of Protein Sequences
InterPro: Database of Protein Families and Domains
IntAct: Database of Protein Interactions
UniProt
World's most comprehensive catalogue of information on proteins
Funded mainly by NIH
A central repository of protein sequence and function
Based on the original work of PIR, Swiss-Prot and TrEMBL
[Diagram: how UniProt is assembled. Protein sequencing plus annotation yields Swiss-Prot. Nucleotide sequencing yields EMBL, whose translated entries (translated EMBL) form TrEMBL, which is then annotated. Swiss-Prot + TrEMBL + annotated PIR-PSD = UniProt. Labels: EBI, UniProt Consortium]
UniProt
3 Components:
UniProt Knowledgebase (UniProt)
• Central repository for annotated protein sequences
• Swiss-Prot: non-redundant, manually annotated
• TrEMBL: redundant, automatically annotated
UniProt Reference Clusters (UniRef)
• Combines related sequences to speed up searching
• UniRef100, UniRef90, UniRef50
UniProt Archive (UniParc)
• Comprehensive repository for history of sequences
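The idea behind combining related sequences can be illustrated with a minimal Python sketch. It only groups entries whose sequences are exactly identical, which is roughly the spirit of UniRef100. It is not the UniRef procedure itself, which also handles fragments and builds UniRef90 and UniRef50 by clustering at 90% and 50% identity. The accessions and sequences below are hypothetical toy data.

```python
from collections import defaultdict


def group_identical(entries):
    """Group accessions whose sequences are exactly identical.

    entries: dict mapping accession -> amino acid sequence.
    Returns a sorted list of clusters, each a sorted list of accessions.
    """
    clusters = defaultdict(list)
    for accession, sequence in entries.items():
        # Normalise case so 'mkt...' and 'MKT...' count as the same sequence.
        clusters[sequence.upper()].append(accession)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical accessions and toy sequences, for illustration only.
entries = {
    "ACC001": "MKTAYIAK",
    "ACC002": "MKTAYIAK",
    "ACC003": "MKTAYIAR",
    "ACC004": "mktayiak",
}

clusters = group_identical(entries)
print(clusters)
```

Searching one representative per cluster instead of every member is what makes the clustered sets faster to search.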
UniProt Explicit Links
Databases cross-referenced in UniProt
Sequence: EMBL/GenBank/DDBJ, PIR
PTM: GlycoSuiteDB, PhosSite
Structure: HSSP, PDB, MSD
Domains, Sites, Families: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, TIGRFAM
2D-gel Electrophoresis: ANU-2DPAGE, Aarhus/Ghent-2DPAGE, COMPLUYEAST-2DPAGE, ECO2DPAGE, HSC-2DPAGE, MAIZE-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE
Molecular Interaction: IntAct, TRANSFAC
Miscellaneous: Ensembl, GermOnline, Gene Ontology, MEROPS
Organism-Specific: AGD, dbSNP, DictyBase, EcoGene, EchoBASE, FlyBase, GeneDB_Spombe, GeneFarm, Genew, Gramene, HIV, H-InvDB, LegioList, Leproma, ListiList, MaizeDB, MGD, MypuList, OMIM, PhotoList, Reactome, RGD, SagaList, SGD, StyGene, SubtiList, TAIR, TIGR, TubercuList, WormBase, WormPep, ZFIN
http://www.ebi.ac.uk/services/
Searching UniProt
http://www.ebi.uniprot.org/index.shtml
Search tools include:
• Text Search: keyword queries
• Power Search: can search for specific entry lines
• Warehouse Search: link query to other databases
• Blast, Fasta and MPsrch
• Links to extra search services (including SRS)
Text-based searching
• Logical operators ‘&’ (and), ‘|’ (or)
• Wildcards and numerical operators not allowed
Text Search Results
Each linked to the UniProt entry
Sequence-based searching
• BLAST, Fasta, MPsrch
Sequence Search Results
UniProt entry
Identity score
View alignments
Manipulate multiple data sets
Use Venn diagrams to combine, intersect, or subtract multiple data sets
Build complex data sets
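The combine, intersect and subtract operations map directly onto set operations. The following Python sketch shows the idea on two hypothetical result sets; the entry names are placeholders, not real UniProt accessions, and the web interface itself is not being modelled.

```python
# Two hypothetical result sets, e.g. from a text search and a sequence search.
text_search_hits = {"ENTRY_A", "ENTRY_B", "ENTRY_C"}
sequence_search_hits = {"ENTRY_B", "ENTRY_C", "ENTRY_D"}

# Combine: entries found by either search (union).
combined = text_search_hits | sequence_search_hits

# Intersect: entries found by both searches.
shared = text_search_hits & sequence_search_hits

# Subtract: entries found by the text search but not the sequence search.
text_only = text_search_hits - sequence_search_hits

print(sorted(combined))
print(sorted(shared))
print(sorted(text_only))
```

Chaining these operations is one way to build up the more complex data sets mentioned above.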
UniProt/Swiss-Prot entry for human ubiquitin-protein ligase E3 mdm2
Some literature search engines pull
synonyms from UniProt for more
complete searching
Merged entries:
• Remove redundancy
• Can still be searched
Summary of nucleotide data upon which entry is originally based
Structural data associated with entry protein
IntAct Database: all the interactions with entry protein
IntAct Database
Experimental information
• Experimental name
• Experimental technique: co-immunoprecipitation
• Literature citation used for curation
• Taxonomic reference
Interaction information
• Links to interacting protein
Displays interactions graphically
View all 7 interactions involving MDM2
View all GO interactions involving MDM2
View all InterPro entries associated with MDM2
Expand graph to see network surrounding one protein
Expand graph to see entire network
View interactions associated with both MDM2 and p53
View all proteins in a network associated with a specific GO term
All proteins in red are associated with “negative regulation of cell proliferation”
Genomic location
Complete nucleotide sequence
SNP information
Transcript and protein information
Transcript structure
Interactive map. Can zoom in/out, and move around
Summary and links to information about processes involving this molecule (here cell-cycle checkpoints)
General
Specific
Mendelian Inheritance in Man
Cellular component
Molecular function
Biological process
InterPro Database
• Allow searching for terms
• Linked to GO
Domain organisation
Position of motifs and sites
Positions of variable splicing
Experimental mutation information
Sequencing conflicts
Secondary structure
Easy navigation between UniProt/UniParc/UniRef
Useful for cut/paste into search engines
UniProt/TrEMBL
>2.5 M entries in TrEMBL (doubled since mid-2004)
>200 K entries in Swiss-Prot (doubled since mid-2001)
[Diagram: UniProt raw data receives curated automated annotation in TrEMBL, followed by Swiss-Prot annotation]
UniProt/TrEMBL
Redundancy
Automatically maintained
• Automatic clean-up of nucleotide data
• Automatic annotation
• InterPro run and cross-references updated every 2 weeks
Recognises common annotation in related Swiss-Prot entries
Identifies all members of family using InterPro
Curated Annotation in InterPro
[Diagram: Swiss-Prot annotated sequences and uncharacterised sequences matching multiple signatures; InterPro provides annotation on multiple levels and feeds back to TrEMBL]
Entry name uses accession number
Automatic annotation through machine learning
Foundations of InterPro
• Manual curation
• Integration of signatures
Advantages of integrated signatures
• Greater coverage of proteins
• Signature databases specialised: greater coverage of annotation features
• Relationships between signatures: evolutionary context (unique to InterPro)
Characterisation of Protein Sequences
BLAST: basic information
With more sequences: build up consensus sequences of families, domains, motifs or sites (conserved signatures)
Finding Conserved Signatures
Methods, from the simplest (limited) to those carrying more information:
• Pattern
• Profile
• Fingerprint
• Sequence clustering
• HMM
Patterns
Patterns in sequence: regular expressions
Often used to define important sites within proteins
PROSITE: best-known pattern database
Patterns
Example: PS00262 Insulin family signature
INS_HUMAN
MALWMRLLPL LALLALWGPD PAAAFVNQHL CGSHLVEALY LVCGERGFFY TPKTRREAED LQVGQVELGG GPGAGSLQPL ALEGSLQKRG IVEQCCTSIC SLYQLENYCN
B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx
A chain xxxxxCCxxxCxxxxxxxxCx
[Connecting lines between the cysteines not recovered]
Matched region in INS_HUMAN: CCTSICSLYQLENYC
Regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
Patterns
Sequence alignment
Define pattern: insulin family motif
Extract pattern sequences
Build regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
Pattern signature
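Because a PROSITE pattern is a regular expression in a different notation, it can be translated mechanically. The Python sketch below converts the PS00262 pattern quoted above into Python `re` syntax and scans the INS_HUMAN sequence from the slides. The helper `prosite_to_regex` is a hypothetical name written for this illustration; it covers only the syntax used in this pattern (residues, `x`, `[...]`, `{...}` and repeat counts), not the full PROSITE grammar.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden},
# optionally followed by a repeat count such as (2) or (2,4).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into Python regex syntax."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                # x means any residue
            regex = "."
        elif core.startswith("{"):     # {P} means any residue except P
            regex = "[^" + core[1:-1] + "]"
        else:                          # a single residue or an [ABC] set
            regex = core
        if low is not None:            # repeat count, e.g. x(3)
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


# Sequence as shown on the slides, joined from its blocks of ten residues.
INS_HUMAN = (
    "MALWMRLLPL" "LALLALWGPD" "PAAAFVNQHL" "CGSHLVEALY" "LVCGERGFFY"
    "TPKTRREAED" "LQVGQVELGG" "GPGAGSLQPL" "ALEGSLQKRG" "IVEQCCTSIC"
    "SLYQLENYCN"
)

regex = prosite_to_regex("C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C")
hit = re.search(regex, INS_HUMAN)
print(regex)
print(hit.group(), hit.start() + 1, hit.end())  # 1-based start and end
```

The match falls on the A chain region CCTSICSLYQLENYC, the same stretch highlighted in the slides.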
Fingerprints
Several discrete motifs characterise family
Highly specific matches to small regions of proteins
PRINTS: best-known fingerprint database
Fingerprints
Example: PR00107 Phosphocarrier HPr signature
PTHP_ENTFA:
MEKKEFHIVA ETGIHARPA TLLVQTASK FNSDINLEY KGKSVNLKS IMGVMSLGV GQGSDVTITV
DGADEAEGMA AIVETLQKEG LAE
Sites highlighted: His phosphorylation site, Ser phosphorylation site, conserved site
PR00107: a fingerprint with three motifs
1) GIHARPATLLVQTASKF
2) KGKSVNLKSIMGVMSL
3) LGVGQGSDVTITVDGADE
Fingerprints
Sequence alignment
Define motifs
Extract motif sequences
Fingerprint signature: motifs 1, 2, 3
• Correct order
• Correct spacing
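The "correct order" requirement can be sketched in a few lines of Python. This is a simplification under stated assumptions: it uses exact substring matching as a stand-in, whereas a real fingerprint search scores motifs so that divergent family members can still match. The function name `match_fingerprint` is hypothetical; the sequence and motifs are the PTHP_ENTFA example from the slides.

```python
def match_fingerprint(sequence, motifs):
    """Return the 0-based start of each motif if all occur in the given order.

    Returns None when any motif is missing or out of order.
    """
    positions = []
    search_from = 0
    for motif in motifs:
        start = sequence.find(motif, search_from)
        if start == -1:
            return None
        positions.append(start)
        # The next motif must start further along the sequence.
        search_from = start + 1
    return positions


# PTHP_ENTFA as shown on the slides, joined from its blocks.
PTHP_ENTFA = (
    "MEKKEFHIVA" "ETGIHARPA" "TLLVQTASK" "FNSDINLEY" "KGKSVNLKS"
    "IMGVMSLGV" "GQGSDVTITV" "DGADEAEGMA" "AIVETLQKEG" "LAE"
)

MOTIFS = ["GIHARPATLLVQTASKF", "KGKSVNLKSIMGVMSL", "LGVGQGSDVTITVDGADE"]

positions = match_fingerprint(PTHP_ENTFA, MOTIFS)
# Distance between consecutive motif starts, a crude proxy for spacing.
spacing = [b - a for a, b in zip(positions, positions[1:])]
print(positions, spacing)
```

Supplying the motifs in the wrong order makes the function return `None`, which mirrors the requirement that a fingerprint match respects motif order.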
Sequence Clustering
Automatic clustering of homologous domains
Used by ProDom database
Sequence Clustering
Recruit homologous domains (PSI-BLAST)
Automatically cluster homologous domains (MKDOM2)
Align resulting protein domain families (ProDomAlign)
Well-characterised domain families
Profiles
Sequence alignment
Scoring matrix (frequency of each residue at each position in alignment)
Profile
Sequence search
[Alignment of Sequences 1 to 7 not recovered]
Match values are higher for conserved residues
e.g. Position 1 F>Y>L (phenylalanine is closer to tyrosine than to leucine)
Profiles
Position-specific scoring good for modelling divergent as well as conserved regions
Can characterise proteins over entire length (need trusted sequence alignment)
Problem: insertions and deletions not well accounted for
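As a rough illustration of position-specific scoring, the Python sketch below builds a log-odds matrix from residue frequencies in a toy ungapped alignment and scores candidate segments against it. Everything here is an assumption made for the example: the seven sequences are invented, the background is uniform, and only observed frequencies are used. Real profiles also draw on substitution matrices, which is how residue similarity enters the scoring.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build a log-odds score for each residue at each alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, segment):
    """Sum the position-specific scores of a segment of profile length."""
    if len(segment) != len(profile):
        raise ValueError("segment length must equal profile length")
    return sum(column[aa] for column, aa in zip(profile, segment))


# Invented toy alignment: position 1 holds F four times, Y twice, L once.
alignment = ["FKLV", "FKIV", "FRLV", "FKLV", "YKLV", "YKIV", "LKLV"]
profile = build_profile(alignment)

for candidate in ("FKLV", "YKLV", "LKLV", "WWWW"):
    print(candidate, round(score(profile, candidate), 2))
```

Because the matrix has a fixed number of columns, a segment with an inserted or deleted residue cannot be scored directly, which is the limitation noted above.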
Hidden Markov Models (HMM)
Large scale profiles
Improvements:
• Probability method gauges scoring parameters
• Allows insertions and deletions
Outperform in sensitivity and specificity
More flexible (can use partial alignments)
Hidden Markov Models (HMM)
[Diagram: profile HMM derived from a sequence alignment. Begin, M1, M2, M3, M4, End, with insert states I0 to I4 and delete states D1 to D4. M = match state, I = insert state, D = delete state]
Hidden Markov Models (HMM)
HMMER2 package: http://hmmer.wustl.edu/
HMMbuild
HMMcalibrate
Database search
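The three HMMER2 steps correspond to three command-line programs: `hmmbuild`, `hmmcalibrate` and `hmmsearch`. The Python sketch below only assembles the argument lists in the order they would be run; it does not execute anything. The function name and all file names are hypothetical placeholders, and options are left at their defaults.

```python
def hmmer2_commands(alignment_file, hmm_file, sequence_db):
    """Return the HMMER2 command lines for build, calibrate and search."""
    return [
        # Build a profile HMM from a multiple sequence alignment.
        ["hmmbuild", hmm_file, alignment_file],
        # Calibrate the model so that search E-values are more reliable.
        ["hmmcalibrate", hmm_file],
        # Search a sequence database with the calibrated model.
        ["hmmsearch", hmm_file, sequence_db],
    ]


# Hypothetical file names, for illustration only.
commands = hmmer2_commands("family.aln", "family.hmm", "proteins.fasta")
for command in commands:
    print(" ".join(command))
```

With HMMER2 installed, each list could be passed to `subprocess.run(command, check=True)` in turn.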
Hidden Markov Models (HMM)
HMM databases:
• Families conserved in sequence: PIR SUPERFAMILY, PANTHER, TIGRFAM
• Domains conserved in sequence: PFAM, SMART
• Domains conserved in structure (special case): SUPERFAMILY, GENE3D
SAM Profile HMMs
(http://www.cse.ucsc.edu/research/compbio/sam.html)
SUPERFAMILY + GENE3D
• Proteins related by structure
• Homologous Structural Superfamilies
Often only 1 protein in a family with structural information
May have low sequence identity
SAM:
• Start with single seed sequence
• Uses Target99 (T99) script
Multiple models/superfamily
Combine results
SAM T99 Profile HMMs
T99 script:
[Diagram: a single seed sequence (GIHARPATLLVQTASKF) gives an initial HMM; a WU-BLASTP search recruits close homologues and low identity matches into a new, larger alignment; the result is the final HMM]
Summary of signature methods
Sequence alignment
Single motif method: extract motif pattern (PROSITE)
Multiple motif methods: extract multiple motifs (PRINTS)
Full alignment methods (full sequence):
1) profile (PROSITE)
2) HMM (PFAM, SMART, SUPERFAMILY, TIGRFAM, PIRSF, GENE3D, PANTHER)
Protein Signature Databases
Patterns: Prosite
Fingerprints: Prints
Sequence clustering: ProDom
Profiles: Prosite
HMM: PIR Superfamily, Panther, Tigrfam, Pfam, Smart
T99-SAM HMM: Gene3D, Superfamily
Prints
Fingerprint is a set of motifs
Full length of protein
PR00000
Can identify small conserved regions in divergent proteins
Use different combinations of motifs to describe families and sibling subfamilies
http://umber.sbs.man.ac.uk/dbbrowser/PRINTS/
Prosite Patterns
PS00000
Pattern is a regular expression
Identify various important sites within proteins:
• Enzyme catalytic site
• Prosthetic group attachment
• Metal ion binding site
• Cysteines for disulphide bonds
• Protein or molecule binding
Several models characterise enzymes
Used by UniProt to define catalytic sites
http://us.expasy.org/prosite/
Prosite Profiles
Profile: PS00000
Pattern: PS00000
Profile is a multiple alignment with matrix frequencies
Describe protein families or domains conserved in sequence
Use curated sequence alignments: accurate
http://us.expasy.org/prosite/
ProDom
Sequence clustering method: automatic process (mkdom2)
PD000000
Groups UniProt sequences into (core) domains conserved in sequence
http://protein.toulouse.inra.fr/prodom/current/html/home.php
Pfam
PF00000
HMM models built from HMMER2
Pfam A: manually curated
Pfam B: automatic clustering
Only Pfam A used to build signatures in InterPro
Use trusted cut-offs: accurate
Wide coverage of protein families and domains conserved in sequence
http://www.sanger.ac.uk/Software/Pfam/
Smart
HMM domains using curated sequence alignments of families from psi-blast
SM00000
Primarily describe domains conserved in sequence
Concentrate on signalling proteins, and extracellular and nuclear domains
http://smart.embl-heidelberg.de/
Tigrfams
TIGR00000
HMM families built with curated alignments
Describe protein families (and domains) conserved in sequence and function
Functional classifications using equivalogs (functionally conserved homologues)
Curated trusted cut-off: very accurate
Use phylogenetic trees: accurate family membership
http://www.tigr.org/TIGRFAMs/
PIRSF
PIRSF000000
HMM families using computationally defined non-overlapping clusters of sequences
Comprehensive protein family database of full-length models
Describe protein families conserved in sequence and domain composition: homeomorphic
http://pir.georgetown.edu/pirsf/
Panther
PTHR00000
HMM families based on phylogenetic trees
Comprehensive protein family database of full-length models
Provides family classification by functions, processes, pathways and taxonomy
Use phylogenetic trees: define functionally distinct families
https://panther.appliedbiosystems.com/
Superfamily
HMMs based on SCOP structural superfamilies
Describe protein domains conserved in structure with evidence of common evolutionary origin
Provides information on structural classification
http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/
Good at describing non-contiguous structural domains
SSF00000
Often define structural domain boundaries
Gene3D
G3D.0.0.0.0
HMM domains based on CATH structural superfamily
Describe protein domains conserved in structure with evidence of common evolutionary origin
Provides information on structural classification
Always define structural domain boundaries
Good at describing non-contiguous structural domains
http://cathwww.biochem.ucl.ac.uk/latest/index.html
Specialisation of databases
Prints: Describe sibling families
Prosite: Identify binding and active sites (enzymes)
ProDom: Describe conserved core of domains
Pfam: Wide coverage of domains and families
Smart: Signalling, extracellular & nuclear domains
Tigrfam: Functional classification of equivalogs
PIRSF: Homeomorphs, conserved in domain composition
Panther: Functional families; best at detecting fragments
Superfamily: Structural-based domain classification
Gene3D: Describe structural domain boundaries
Structural Representation in InterPro
MSD: residue-by-residue mapping of PDB sequence to UniProt amino acid position
InterPro sequence-structure comparison:
• PDB structures displayed as striped patterns
• Structural classification in CATH and SCOP
• Homology models from Swiss-Model and ModBase
Structural Representation
CATH and SCOP divide PDB structures into domains
Swiss-Model and ModBase predict structure for regions not covered by PDB
Note that one domain is non-contiguous
Sequence-Structure Display
Structural data for specific proteins
Signatures predictive of protein annotation
Searching InterPro
http://www.ebi.ac.uk/interpro/
Search tools include:
• Text Search
• InterProScan (sequence search)
• SRS (multiple database search)
Text Search Results
Direct links to entry
InterProScan search results
Link to InterPro entry
Link to SRS view of InterPro entry
Enables direct searching of other databases in SRS using InterProScan results
Link to signature database
Mouse-over provides signature data: residue position, E-value, accession ID, and name
Single InterPro entry
InterPro Entry
• Groups similar signatures together and provides relationships between signatures
• Provides extensive manual annotation
• Provides links to other databases
• Provides structural information and viewers
Annotation Fields in InterPro
• Name and short name
• Entry type
• Relationships
• GO mapping
• Abstract
• Structural links
• Database links
• Taxonomy
• Examples
• Publications
InterPro entry for the ligand-binding domain of the nuclear hormone receptor
Protein matches
• Shows the InterPro entries that match a protein
• Shows each individual signature that matches a protein
• Shows structural information for protein with links to PDB, CATH, SCOP
Splice variants
Select data set of these proteins
Detailed information
Family, domain, site, repeat
Links to signature databases
Relationships linking different signatures
Mapping to GO terms
Abstract with references
Contains/Found in: describe composition of protein sequences
Parent/Child: family or domain evolutionary hierarchies
Structural links
Database links
Taxonomy
Overlap with other InterPro entries
Examples
References
Powerful Annotation Tool
Integration of signatures
• Increased coverage of proteins
• Greater coverage of annotation features
• Relationships provide evolutionary context (unique to InterPro)
Enhances functional annotation of TrEMBL
Powerful Annotation Tool
Signature databases: direct links to their annotation
GO mapping: large-scale classification using GO terms
Taxonomy: search/download using taxonomy
Structural information: structural classification, 3-D viewers
Database links: to several databases to increase annotation
Coverage
InterPro signatures cover:
90% of UniProt/Swiss-Prot proteins
69% of UniProt/TrEMBL proteins
>2 million matches in InterPro
>13,000 InterPro entries
>22,000 signature methods
Coverage
Structural coverage in InterPro:
0.6% of proteins have PDB structures
20% of proteins have Swiss-Model structures
63% of proteins have ModBase structures
>9500 PDB structures in InterPro
>300,000 Swiss-Model links in InterPro
>950,000 ModBase links in InterPro
Availability and downloads
Web access (tools/databases): http://www.ebi.ac.uk/services/
Downloads (ftp site): ftp://ftp.ebi.ac.uk/pub/databases/
2Can Training and Education
Bioinformatics Educational Resource
Information on EBI Databases
On-line tutorials on EBI Databases and tools
Glossary
Guide to bioinformatics resources on the internet
EBI web services
Protein structure
Nucleotide analysis
Proteomics analysis
Protein function
Genome browsing Database browsing
http://www.ebi.ac.uk/
http://www.ebi.ac.uk/2can/
http://www.ebi.ac.uk/interpro/
Acknowledgements
Rolf Apweiler
Amos Bairoch
Cathy Wu
+100 annotators
Nicky Mulder
IntAct Team
InterPro Consortium
Henning Hermjakob
InterPro Team