1EMBL Outstation — The European Bioinformatics InstituteEMBL Outstation — The European Bioinformatics Institute
Added-Value Proteome Databases: SWISS-PROT, TrEMBL, InterPro
Large-Scale Characterization of Protein Sequence Data: The Integrative
Approach of SWISS-PROT + TrEMBL
Times are changing
‘Data Waves’
Biological sequences, Mutation, Metabolism, Polymorphism, Signaling, Expression
Size, Complexity, Integration
The Challenge of the Genome Era
Rapidly growing amounts of data that lack experimental determination of biological function increase the need for computational analysis of the data
Need for Bioinformatics
Bioinformatics: 5 years ago.....
Pharmaceutical companies were not interested
Life scientists believed that it was an outlet for failed biologists who like to play with computers
Computer scientists did not even know of its existence
Bioinformatics: today.....
Pharmaceutical companies believe that it is a way to streamline the drug discovery process
Some life scientists believe that it is the solution to all problems in life sciences
Computer scientists find it most useful as a new way to get grants
Bioinformatics: In 5 years.....
Pharmaceutical companies use it routinely as a complement to experimental work
Life scientists use it efficiently and therefore forget that it exists
Computer scientists have jumped on another hot subject
Bioinformatics
is a complement to, but no substitute for, experimental research: it can help to plan experiments, but not replace them
is not cheap
takes a significant amount of time to be any good
Quality control is crucial: Some garbage in, a lot of garbage out!
Materials and Methods
Materials: biological data
Methods: a wide range of computational techniques
Essential in Bioinformatics: Databases as a tool for
computational analysis and data-mining
(with SWISS-PROT being the gold-standard)
SWISS-PROT is a curated protein sequence data bank
established in July 1986 by Amos Bairoch in Geneva
maintained collaboratively with EMBL since June 1987
currently contains 76 000 protein sequence entries
Essential criteria for a sequence data bank
it must be complete with minimal redundancy
it must contain as much up-to-date information as possible on each sequence
all the information items must be retrievable by computer programs in a consistent manner
it should be integrated (cross-referenced) with other sequence related data banks
Integration with other databases
76 000 SWISS-PROT entries
abstracted from > 60 000 references
linked by > 275 000 direct pointers to 30 related or specialized data collections
Integration with other databases
EMBL Nucleotide Sequence Database
PDB
Genomic databases (FlyBase, SubtiList, MaizeDB, EcoGene, LISTA, SGD, StyGene)
2D-Gel databases (ECO2DBASE, SWISS-2DPAGE, Aarhus/Ghent, YEPD, Harefield)
Specialized collections (OMIM, PROSITE, ENZYME, GCRDB, Transfac, HSSP)
Connections between databases
[Diagram of cross-links among the following databases]
EMBL Nucleotide Sequences
SWISS-PROT Protein Sequences
WormPep [C. elegans]
EPD [Euk. Prom.]
FlyBase
SubtiList
MaizeDB
EcoGene [E. coli]
LISTA [Yeast]
Transfac
GCRDb [7TM recep.]
REBASE [Restriction Enzymes]
StyGene [S. typhimurium]
Prosite [Patterns]
ECD [E. coli map]
SWISS-2DPAGE [2D]
Aarhus/Ghent [2D]
ECO2DBASE [2D]
ENZYME [Nomencl.]
DictyDB [D. disco.]
OMIM [Diseases]
YEPD [Yeast]
HSSP [3D simil.]
PDB [3D structures]
SWISS-PROT growth
[Figure: amino acids (millions, axis 0 to 25) per year, 1987 to 1996]
Nucleotide sequence database growth
[Figure: megabases (axis 0 to 600) per year, 1982 to 1996]
The Bottleneck: Annotation
Annotation consists of the description of:
Function(s) of the protein
Post-translational modification(s)
Domains and sites
Secondary structure
Quaternary structure
Similarities to other proteins
Disease(s) associated with deficiency(ies) in the protein
Sequence conflicts, variants, etc.
Annotation sources:
publications that report new sequence data
review articles to periodically update the annotation of families or groups of proteins
external experts
TrEMBL
is a computer-annotated supplement to SWISS-PROT
consists of entries in SWISS-PROT format
contains translations of the coding sequences (CDS) in the EMBL Nucleotide Sequence Database that are not in SWISS-PROT
August 1998: SWISS-PROT 36 + TrEMBL 7
327 000 CDS in corresponding EMBL release
74 000 SWISS-PROT entries
109 000 CDS integrated in SWISS-PROT
the remaining 216 000 CDS were merged whenever possible to reduce redundancy
TrEMBL release 7
194 000 TrEMBL entries
54 000 000 amino acids
linked by > 300 000 direct pointers to 14 related or specialized data collections
The Production of TrEMBL
translation and entry creation
sorting the entries
post-processing the SP-TrEMBL entries
Translation and entry creation
translation of every CDS not yet cross-referenced to SWISS-PROT
parsing of information in EMBL entries into TrEMBL entries
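The translation step can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a CDS that starts in frame; the name `translate_cds` is hypothetical, and this is not the actual TrEMBL production code, which would also have to deal with alternative genetic codes and partial CDS.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; amino acids are listed in TCAG codon order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate an in-frame coding sequence up to the first stop codon.

    Codons containing ambiguous bases are translated as "X".
    A trailing incomplete codon is ignored.
    """
    seq = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate_cds("ATGGCCAAATGA")` reads the codons ATG, GCC, AAA and stops at TGA.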
Sorting the entries
into SP-TrEMBL and REM-TrEMBL
SP-TrEMBL is split into taxonomic divisions
Post-processing
reducing redundancy
enhancing the information content
Improving Automatic Annotation
will streamline flow into TrEMBL
will bring TrEMBL nearer to SWISS-PROT quality
will make the transition from TrEMBL to SWISS-PROT easier
[Diagram labels: SP-TrEMBL, Removal of redundancy, PROSITE pattern searching, Enhancement, Reliable PROSITE matches, Enzyme numbers, Hands-on curation, SWISS-PROT, Hot Spot for Development]
Demands on a system for automated data analysis and annotation
Correctness
Scalability
Updatability
Low level of redundant information
Completeness
Standardized vocabulary
Standardized transfer of annotation from characterized proteins in SWISS-PROT to TrEMBL entries
TrEMBL entry is reliably recognized by a given method as a member of a certain group of proteins
corresponding group of proteins in SWISS-PROT shares certain annotation
common annotation is transferred to the TrEMBL entry and flagged as annotated by similarity
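The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the EBI implementation: the `Entry` class, the function names, and the representation of annotation as a mapping from topic to a set of statements are all hypothetical. Group membership is taken as given, since the slides leave the recognition method open.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    accession: str                 # hypothetical identifier
    groups: set                    # protein groups assigned by some recognition method
    annotation: dict = field(default_factory=dict)  # topic -> set of statements


def common_annotation(members):
    """Return the annotation statements shared by every member of a group."""
    if not members:
        return {}
    shared = {}
    for topic in members[0].annotation:
        values = set.intersection(
            *(set(m.annotation.get(topic, ())) for m in members)
        )
        if values:
            shared[topic] = values
    return shared


def transfer_by_similarity(target, group, swissprot_entries):
    """Copy the group's common annotation to the target, flagged as inferred."""
    if group not in target.groups:
        return target  # target was not recognized as a member: nothing to do
    members = [e for e in swissprot_entries if group in e.groups]
    for topic, values in common_annotation(members).items():
        for value in sorted(values):
            target.annotation.setdefault(topic, set()).add(
                f"{value} (BY SIMILARITY)"
            )
    return target
```

Only statements present in every characterized member are transferred, which keeps the transfer conservative at the cost of missing annotation that applies to most but not all members.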
Environment for Distributed Information Transfer to TrEMBL
(EDITtoTrEMBL)
RuleBase
Analyzers
Dispatchers
EDITtoTrEMBL
EDITtoTrEMBL: RuleBase
SWISS-PROT as source of annotation: correctness and controlled vocabulary
Rules can be semi-automatically and/or manually created
Rules can be updated
EDITtoTrEMBL: Analyzers
Directly implement an algorithm or communicate with external programs
Query other databases
Use rules to add information to TrEMBL entries
EDITtoTrEMBL: Examples of Analyzers
sequence analysis tools (PROSITE, PFAM, PRINTS, TM, Coiled Coils, Signal etc)
sequence similarity searching (FASTA, SW, BLAST)
database scanning/parsing (MGD, FlyBase, ENZYME, etc)
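As an illustration of the simplest kind of sequence analysis tool listed above, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It is a simplified reading of the pattern syntax (elements separated by `-`, `x` for any residue, `[...]` for allowed and `{...}` for forbidden residues, `(n)` or `(n,m)` repeats, `<` and `>` anchors) and does not cover every corner of it. The function names are hypothetical.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern into an equivalent regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):        # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split the element into its core and an optional repeat count.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                    # any residue
            core = "."
        elif core.startswith("{"):         # forbidden residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_matches(pattern: str, sequence: str):
    """Return 0-based start positions of non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in regex.finditer(sequence)]
```

With the N-glycosylation site pattern `N-{P}-[ST]-{P}`, `prosite_to_regex` yields `N[^P][ST][^P]`. Note that `finditer` reports non-overlapping matches only, so a production scanner would need a different search loop to report overlapping sites.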
EDITtoTrEMBL: Dispatchers
Control of annotation flow
Error checking
Removal of redundant information
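One way the three EDITtoTrEMBL components could fit together is sketched below. The slides name the components and their roles but give no interfaces, so every class, method, and type here is an assumption made for illustration: analyzers are modeled as plain functions that map a sequence to a set of recognized group labels, the RuleBase maps a group label to annotation lines, and the Dispatcher runs the analyzers, skips failing ones, and drops duplicate annotation.

```python
from typing import Callable, Dict, List, Set

# Hypothetical analyzer interface: sequence in, recognized group labels out.
Analyzer = Callable[[str], Set[str]]


class RuleBase:
    """Maps a protein group to the annotation lines shared by that group."""

    def __init__(self) -> None:
        self._rules: Dict[str, List[str]] = {}

    def add_rule(self, group: str, annotation: str) -> None:
        self._rules.setdefault(group, []).append(annotation)

    def annotation_for(self, group: str) -> List[str]:
        return list(self._rules.get(group, []))


class Dispatcher:
    """Controls the annotation flow across a list of analyzers."""

    def __init__(self, rulebase: RuleBase, analyzers: List[Analyzer]) -> None:
        self.rulebase = rulebase
        self.analyzers = analyzers

    def annotate(self, sequence: str) -> List[str]:
        results: List[str] = []
        for analyzer in self.analyzers:
            try:
                groups = analyzer(sequence)
            except Exception:
                # Error checking: a failing analyzer must not stop the flow.
                continue
            for group in sorted(groups):
                for line in self.rulebase.annotation_for(group):
                    if line not in results:   # remove redundant information
                        results.append(line)
        return results
```

Keeping analyzers behind a single function signature means a pattern scanner, a similarity search wrapper, or a database lookup can be added without changing the dispatcher.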
Automated post-processing of TrEMBL entries
redundancy removal: currently affects around 20% of the entries
improvement of annotation: currently affects around 25% of the entries
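The redundancy removal step might look like the following minimal Python sketch. The slides do not state the merging criteria, so the rule used here (merge entries whose sequences are identical and come from the same organism) is an assumption, as are the function name and the tuple layout of the input.

```python
def merge_redundant(entries):
    """Merge entries that share organism and sequence.

    `entries` is an iterable of (accession, organism, sequence) tuples.
    Returns one record per distinct (organism, sequence) pair, in order of
    first appearance, listing all accessions that were merged into it.
    """
    merged = {}
    for accession, organism, sequence in entries:
        key = (organism, sequence.upper())
        merged.setdefault(key, []).append(accession)
    return [
        {"accessions": accessions, "organism": organism, "sequence": sequence}
        for (organism, sequence), accessions in merged.items()
    ]
```

Keeping every merged accession in the surviving record means that existing pointers to any of the original entries can still be resolved.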
SWISS-PROT + TrEMBL
complete and up-to-date protein sequence collection
minimal redundancy: SP_TR_NRDB
linked by > 500 000 direct pointers to 30 related or specialized data collections
deeper integration between the EMBL Nucleotide Sequence Database and SWISS-PROT + TrEMBL by using PID numbers
Integrated resource of Protein domain and functional sites (InterPro)
Integration of different pattern recognition methods (PROSITE, PRINTS and PFAM)
Incorporation of new families and domains into InterPro
Enhancing the functional annotation of TrEMBL entries
Enhancing genome annotation
The InterPro project participants
Co-ordinated by EBI (R. Apweiler)
PROSITE (A. Bairoch, P. Bucher)
PRINTS (T. Attwood)
PFAM (R. Durbin, E. Birney, A. Bateman, E. Sonnhammer)
PRODOM (D. Kahn)
PRATT (I. Jonassen)
GENE-IT (J.-J. Codani)
LION bioscience AG (R. Schneider)
1.9.1998: SWISS-PROT ceased
to be in the public domain
What has changed
No changes for academic users
Almost no restrictions on the redistribution of SWISS-PROT by academic servers or software companies
Commercial users are required to pay yearly subscription fees. These fees will be used to complement the existing grants in order to provide stable long-term funding
Credits
SWISS-PROT at EBI: Rolf Apweiler, Sergio Contrino, Wolfgang Fleischmann, Gill Fraser, Henning Hermjakob, Viv Junker, Alexander Kanapin, Youla Karavidopoulou, Evguenia Kriventseva, Fiona Lang, Claire O'Donovan, Michele Magrane, Maria Jesus Martin, Nicoletta Mitaritonna, Steffen Moeller, Evgenui Zdobnov
Collaborators: Amos Bairoch, Jean-Jacques Codani, Keith Tipton, Marvin Edelman, Compugen, Paracel, Sue Povey and Julia White, MGD, FlyBase, Neil Rawlings, Network of > 200 external experts