Biomolecular databases
Bioinformatics
Jacques van Helden
[email protected]
FORMER ADDRESS (1999-2011): Université Libre de Bruxelles, Belgium. Bioinformatique des Génomes et des Réseaux (BiGRe lab) http://www.bigre.ulb.ac.be/
NEW ADDRESS (since Nov 1st, 2011): Université d’Aix-Marseille, France. Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM Unit U1090) http://tagc.univ-mrs.fr/
Contents
- Examples of biological databases
  - Nucleic sequences: GenBank, EMBL, and DDBJ
  - Protein sequences: UniProt
  - The Gene Ontology (GO) project
- Issues and perspectives for biological databases
Examples of biomolecular databases
- Sequence and structure databases
  - Protein sequences (UniProt)
  - DNA sequences (EMBL, GenBank, DDBJ)
  - 3D structures (PDB)
  - Structural motifs (CATH)
  - Sequence motifs (PROSITE, PRODOM)
- Genome sequences and annotations
  - Genome-specific databases (SGD, FlyBase, AceDB, PlasmoDB, ...)
  - Multiple genomes (Integr8, NCBI, KEGG, TIGR, ...)
- Molecular functions
  - Transcriptional regulation (TRANSFAC, RegulonDB, InteractDB)
  - Enzymatic catalysis (Expasy, LIGAND/KEGG, BRENDA)
  - Transport (YTPdb)
- Biological processes
  - Metabolic pathways (EcoCyc, LIGAND/KEGG, Biocatalysis/biodegradation)
  - Signal transduction pathways (CSNdb, Transpath)
  - Protein-protein interactions (DIP, BIND, MINT)
  - Gene networks (GeneNet, FlyNets)
Databases of databases
- There are hundreds of databases related to molecular biology and biochemistry, and new ones are created every year.
- Every year, the first issue of Nucleic Acids Research is dedicated to biological databases.
  - http://nar.oupjournals.org/
  - 2011 issue: http://nar.oxfordjournals.org/content/39/suppl_1
- The same journal maintains a database of databases: the Molecular Biology Database Collection.
  - http://www.oxfordjournals.org/nar/database/c/
- Some bioinformatics centres maintain multiple databases, with cross-links between them. The SRS server at EBI holds an impressive collection of databases.
  - http://srs.ebi.ac.uk/
Nucleic sequence databases: GenBank, EMBL, and DDBJ
Okubo et al. (2006) NAR 34: D6-D9
Nucleic sequence databases
- Before publishing an article that deals with a sequence, scientific journals require the sequence to be deposited in a reference database.
- There are 3 main repositories for nucleic acid sequences.
- Sequences deposited in any of these 3 databases are automatically synchronized with the 2 others.
Adapted from Didier Gonze
The sequencing pace
- Nucleic sequences
  - GenBank (April 2011) http://www.ncbi.nlm.nih.gov/genbank/
    - 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions
    - 191,401,393,188 bases in 62,715,288 sequence records in the Whole Genome Shotgun (WGS) division
- Entire genomes
  - GOLD Release V.2 (Oct 2011) contains ~2000 completely sequenced genomes.
  - http://www.genomesonline.org/gold_statistics.htm
- Protein sequences
  - Essentially obtained by translation of putative genes in nucleic sequences (almost no direct protein sequencing).
  - UniProtKB/TrEMBL (2011) contains 17 million protein sequences.
  - http://www.ebi.ac.uk/swissprot/sptr_stats/index.html
Size of the nucleotide database
EMBL Nucleotide Sequence Database: Release Notes - Release 113, September 2012
http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html

Class                                       entries      nucleotides
--------------------------------------------------------------------
CON: Constructed                          7,236,371  359,112,791,043
EST: Expressed Sequence Tag              73,715,376   40,997,082,803
GSS: Genome Sequence Scan                34,528,104   21,985,922,905
HTC: High Throughput cDNA sequencing        491,770      594,229,662
HTG: High Throughput Genome sequencing      152,599   25,159,746,658
PAT: Patents                             24,364,832   12,117,896,594
STD: Standard                            13,920,617   37,665,112,606
STS: Sequence Tagged Site                 1,322,570      636,037,867
TSA: Transcriptome Shotgun Assembly       8,085,693    5,663,938,279
WGS: Whole Genome Shotgun                88,288,431  305,661,696,545
                                        -----------  ---------------
Total                                   252,106,363  450,481,663,919

Division                                    entries      nucleotides
--------------------------------------------------------------------
ENV: Environmental Samples               30,908,230   14,420,391,278
FUN: Fungi                                6,522,586   11,614,472,226
HUM: Human                               32,094,500   38,072,362,804
INV: Invertebrates                       31,907,138   52,527,673,643
MAM: Other Mammals                       40,012,731  145,678,620,711
MUS: Mus musculus                        11,745,671   19,701,637,499
PHG: Bacteriophage                            8,511       85,549,111
PLN: Plants                              52,428,994   55,570,452,118
PRO: Prokaryotes                          2,808,489   28,807,572,238
ROD: Rodents                              6,554,012   33,326,106,733
SYN: Synthetic                            4,045,013      782,174,055
TGN: Transgenic                             285,307      849,743,891
UNC: Unclassified                         8,617,225    4,957,442,673
VRL: Viruses                              1,358,528    1,518,575,082
VRT: Other Vertebrates                   22,809,428   42,568,889,857
                                        -----------  ---------------
Total                                   252,106,363  450,481,663,919
Genbank (NCBI - USA) http://www.ncbi.nlm.nih.gov/Genbank/
The EMBL Nucleotide Sequence Database (EBI - UK) http://www.ebi.ac.uk/embl/
DDBJ - DNA Data Bank of Japan http://www.ddbj.nig.ac.jp/
Size of the nucleic sequence databases

Database  URL                           Sequences  Bases (without shotgun)  Bases (including shotgun)  Organisms
DDBJ      http://www.ddbj.nig.ac.jp/    2.0E+06    1.7E+09                  -                          -
EMBL      http://www.ebi.ac.uk/embl/    -          -                        1.0E+11                    2.0E+05
GenBank   http://www.ncbi.nlm.nih.gov/  4.6E+07    5.1E+10                  1.0E+11                    2.1E+05

- Summary of database contents for the 3 main databases of nucleic sequences.
- Source: NAR database issue January 2006.
UniProt : protein sequences and functional annotations
UniProt - the Universal Protein Resource http://www.uniprot.org/
- Database content (Sept 2012)
  - UniProtKB:
    - 24,532,088 entries
    - Translation of EMBL coding sequences (non-redundant with Swiss-Prot)
  - UniProtKB/Swiss-Prot section (reviewed):
    - 537,505 entries
    - annotation by experts
    - high information content
    - many references to the literature
    - good reliability of the information
  - The rest (about 98% of the entries, given the counts above)
    - Automatic annotation by sequence similarity.
- Features
  - The most comprehensive protein database in the world.
  - A huge team: >100 annotators + developers.
  - Annotation by experts: annotators are specialized in different types of proteins or organisms.
  - Recognized worldwide as an essential resource.
- References
  - Bairoch et al. The SWISS-PROT protein sequence data bank. Nucleic Acids Res (1991) vol. 19 Suppl pp. 2247-9
  - The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res (2008). Database Issue.
Number of entries (polypeptides) in Swiss-Prot
http://www.expasy.org/sprot/relnotes/relstat.html
Taxonomic distribution of the sequences
Within Eukaryotes
UniProt example - Human Pax-6 protein Header : name and synonyms
UniProt example - Human Pax-6 protein Human-based annotation by specialists
UniProt example - Human Pax-6 protein Structured annotation : keywords and Gene Ontology terms
UniProt example - Human Pax-6 protein Protein interactions; Alternative products
UniProt example - Human Pax-6 protein Detailed description of regions, variations, and secondary structure
UniProt example - Human Pax-6 protein Peptidic sequence
UniProt example - Human Pax-6 protein References to original publications
UniProt example - Human Pax-6 protein Cross-references to many databases (fragment shown)
3D Structure of macromolecules
PDB - The Protein Data Bank http://www.rcsb.org/pdb/
Genome browsers
EnsEMBL Genome Browser (Sanger Institute + EBI) http://www.ensembl.org/
UCSC Genome Browser (University of California, Santa Cruz - USA) http://genome.ucsc.edu/
Human gene Pax6 aligned with vertebrate genomes
UCSC Genome Browser (University of California, Santa Cruz - USA) http://genome.ucsc.edu/
Drosophila gene eyeless (homolog of Pax6) aligned with insect genomes
UCSC Genome Browser (University of California, Santa Cruz - USA) http://genome.ucsc.edu/
Drosophila 120 kb chromosomal region covering the Achaete-Scute Complex
ECR Browser http://ecrbrowser.dcode.org/
EnsEMBL - Example: Drosophila gene Pax6 http://www.ensembl.org/
Comparative genomics
Integr8 - access to complete genomes and proteomes http://www.ebi.ac.uk/integr8/
Integr8 - genome summaries http://www.ebi.ac.uk/integr8/
Integr8 - clusters of orthologous genes (COGs) http://www.ebi.ac.uk/integr8/
Integr8 - clusters of paralogous genes http://www.ebi.ac.uk/integr8/
Databases of protein domains
Prosite - protein domains, families and functional sites http://www.expasy.ch/prosite/
Prosite - aligned sequences and logo http://www.expasy.ch/prosite/
- Some of the sequences that were used to build the Prosite profile for the Zn(2)-C6 fungal-type DNA-binding domain (ZN2_CY6_FUNGAL_2, PS50048).
- The sequence logo (below) indicates the level of conservation of each residue in each column of the alignment.
- Note the 6 cysteines, characteristic of this domain.
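The conservation shown by a sequence logo can be sketched numerically. The example below is a minimal illustration, assuming the usual definition of column information content (log2 of the alphabet size minus the Shannon entropy of the column), no small-sample correction, and no gap handling. The toy alignment is invented for the example and is not taken from Prosite.

```python
import math
from collections import Counter

def column_information(alignment, alphabet_size=20):
    """Return the information content (in bits) of each alignment column.

    Assumption: information = log2(alphabet_size) - Shannon entropy,
    without any small-sample correction and without gap handling.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    max_bits = math.log2(alphabet_size)
    info = []
    for i in range(length):
        # Count how often each residue occurs in column i.
        counts = Counter(seq[i] for seq in alignment)
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        info.append(max_bits - entropy)
    return info

# Hypothetical toy alignment: columns 1 and 4 are fully conserved cysteines.
toy_alignment = ["CAKC", "CSRC", "CAKC", "CTRC"]
bits = column_information(toy_alignment)
for position, value in enumerate(bits, start=1):
    print(f"column {position}: {value:.2f} bits")
```

Fully conserved columns reach the maximum of log2(20) bits, which is why the cysteine columns stand out as the tallest stacks in a logo.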
Prosite - Example of profile matrix http://www.expasy.ch/prosite/
Prosite - Example of sequence logo http://www.expasy.ch/prosite/
Prosite - Example of domain signature http://www.expasy.ch/prosite/
- The domain signature is a string-based pattern representing the residues that are characteristic of a domain.
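To make the idea of a string-based signature concrete, the sketch below converts a pattern written in PROSITE pattern syntax into a Python regular expression. It covers only the basic syntax elements (`x`, `[..]`, `{..}`, repetition counts, `<` and `>` anchors). The function name `prosite_to_regex` and the short cysteine pattern are assumptions made for this example; the pattern is not the actual Prosite signature of any domain.

```python
import re

# One pattern element: a residue specification plus an optional repeat count.
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":               # any residue
            core = "."
        elif residue.startswith("{"):    # any residue except those listed
            core = "[^" + residue[1:-1] + "]"
        else:                            # single residue or [allowed set]
            core = residue
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)

# Illustrative pattern only (three spaced cysteines).
regex = prosite_to_regex("C-x(2)-C-x(6)-C")
hit = re.search(regex, "MKCAACDEFGHIC")
print(regex, hit.group(0) if hit else None)
```

Scanning a protein sequence with the resulting expression reports where the signature occurs, which is essentially what a pattern search does.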
PFAM (Sanger Institute - UK) http://pfam.sanger.ac.uk/ Protein families represented by multiple sequence alignments and hidden Markov models (HMMs)
CATH - Protein Structure Classification http://www.cathdb.info/
- CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels:
  - Class (C)
  - Architecture (A)
  - Topology (T)
  - Homologous superfamily (H)
- The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures, which include computational techniques, empirical and statistical evidence, literature review and expert analysis.
- References
  - Orengo et al. The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res (1999) vol. 27 (1) pp. 275-9
  - Cuff et al. The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res (2008) pp.
CATH - Protein Structure Classification http://www.cathdb.info/
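The four-level hierarchy lends itself to a dotted numeric code with one number per level (C.A.T.H). The sketch below is a minimal illustration, assuming such a four-number code. The helper names are hypothetical, the example codes are arbitrary values chosen for the demonstration, and only the four main class names are listed.

```python
from typing import NamedTuple

# Names of the four main CATH classes (assumption: other classes are ignored here).
CLASS_NAMES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}

class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    superfamily: int

def parse_cath_code(code):
    """Split a dotted C.A.T.H code into its four numeric levels."""
    parts = code.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))

def shared_level(a, b):
    """Return the deepest level at which two domains are classified together."""
    levels = ["Class", "Architecture", "Topology", "Homologous superfamily"]
    deepest = "none"
    for name, x, y in zip(levels, a, b):
        if x != y:
            break
        deepest = name
    return deepest

# Arbitrary example codes, used only to exercise the functions.
first = parse_cath_code("3.30.70.100")
second = parse_cath_code("3.30.70.270")
print(CLASS_NAMES[first.class_], "/ shared level:", shared_level(first, second))
```

Two domains that share the first three numbers but differ in the fourth have the same topology but belong to different homologous superfamilies.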
InterPro (EBI - UK) http://www.ebi.ac.uk/interpro/
InterPro (EBI - UK) Antennapedia-like Homeobox (entry IPR001827)
The Gene Ontology (GO) database
Ontology definition
- Ontology: the part of metaphysics that focuses on being as being, independently of its particular determinations. Source (translated from French): Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1993.
The "bio-ontologies"
- A response to the problem of inconsistent annotations
  - Controlled vocabulary
  - Hierarchical classification of the terms of the controlled vocabulary
- E.g.: The Gene Ontology
  - molecular function ontology
  - process ontology
  - cellular component ontology
Gene ontology: processes
Gene ontology: molecular functions
Gene ontology: cellular components
Gene Ontology Database http://www.geneontology.org/
Gene Ontology Database (http://www.geneontology.org/)
Example: methionine biosynthetic process
Status of GO annotations (NAR DB issue 2006)
- Term definitions
  - Biological process terms: 9,805
  - Molecular function terms: 7,076
  - Cellular component terms: 1,574
  - Sequence Ontology terms: 963
- Genomes with annotation: 30
  - Excludes annotations from UniProt, which represent 261 annotated proteomes.
- Annotated gene products
  - Total: 1,618,739
  - Electronic only: 1,460,632
  - Manually curated: 158,107
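The balance between electronic and manual annotation follows directly from the three counts above. The short calculation below is only a worked check of those figures; the variable names are chosen for this example.

```python
# Counts quoted on the slide (NAR database issue 2006).
total = 1_618_739
electronic_only = 1_460_632
manually_curated = 158_107

# The two categories add up exactly to the total.
assert electronic_only + manually_curated == total

electronic_share = electronic_only / total
manual_share = manually_curated / total
print(f"electronic only:  {electronic_share:.1%}")   # 90.2%
print(f"manually curated: {manual_share:.1%}")       # 9.8%
```

About nine out of ten annotated gene products therefore carried only computationally inferred annotations at that time.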
QuickGO (http://www.ebi.ac.uk/QuickGO/)
- Web site: http://www.ebi.ac.uk/QuickGO/
- A user-friendly Web interface to the Gene Ontology.
- Graphical display of the hierarchical relationships between terms.
- Convenient browsing between classes.
Remarks on "bio-ontologies"
- Improvement compared to free text
  - controlled vocabulary (choice among synonyms)
  - hierarchical relationships between the concepts
- Nothing to do with the philosophical concept of ontology
  - A "bio-ontology" is usually nothing more than a taxonomical classification of the terms of a controlled vocabulary
- Multiple possible classification criteria
  - e.g. compartment subtypes (plasma membrane is a membrane)
  - e.g. compartment locations (nucleus is inside cytoplasm, which is inside plasma membrane)
- To be useful, should remain purpose-based
  - each biologist might wish to define his/her own classification based on his/her needs and scope of interest
  - impossible to define a unifying standard for all biologists
- No representation of molecular interactions
  - relationships between objects are only hierarchical, not horizontal or cyclic
  - e.g. does not describe which genes are the targets of a given transcription factor
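The two classification criteria mentioned above (subtype versus location) can be sketched as typed edges in a small graph. The class `TinyOntology` and the relation labels `is_a` and `inside` are assumptions made for this illustration; the terms are the ones used in the remarks above, and the structure is not the actual Gene Ontology.

```python
from collections import defaultdict

class TinyOntology:
    """A toy controlled vocabulary with typed hierarchical relations."""

    def __init__(self):
        # (child term, relation type) -> set of parent terms
        self._edges = defaultdict(set)

    def add(self, child, relation, parent):
        self._edges[(child, relation)].add(parent)

    def ancestors(self, term, relation):
        """All terms reachable from `term` by following one relation type."""
        found, stack = set(), [term]
        while stack:
            current = stack.pop()
            for parent in self._edges.get((current, relation), ()):
                if parent not in found:
                    found.add(parent)
                    stack.append(parent)
        return found

onto = TinyOntology()
onto.add("plasma membrane", "is_a", "membrane")       # subtype criterion
onto.add("nucleus", "inside", "cytoplasm")            # location criterion
onto.add("cytoplasm", "inside", "plasma membrane")

print(onto.ancestors("plasma membrane", "is_a"))
print(onto.ancestors("nucleus", "inside"))
```

Both relation types are purely hierarchical: nothing in this structure can state which genes a transcription factor regulates, which is the limitation noted in the last remark.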
What is biological function ?
- A general definition
  - Function: the characteristic action or role of an element (e.g. an organ) within a whole (often opposed to structure). Source (translated from French): Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1982.
- Function and gene ontology
  - Understanding function requires establishing the link between a molecular activity and the context in which it takes place (process).
  - Multifunctionality
    - The same activity can play different roles in different processes.
      - Example: the scute gene in Drosophila melanogaster encodes a transcription factor (activity) involved in sex determination, determination of neural precursors, and Malpighian tubules (3 processes).
    - Multiple activities of the same protein in a given process
      - Example: PutA in Escherichia coli contains 2 enzymatic domains (enzymatic activities) + a DNA-binding domain (DNA-binding transcription factor) -> 3 molecular activities in the same process (proline utilization).
Small compounds, reactions and metabolic pathways
LIGAND - Small compounds and metabolic reactions
KEGG - Kyoto Encyclopedia of Genes and Genomes
EcoCyc, BioCyc and MetaCyc - Metabolic pathways
Protein interaction networks and transduction pathways
Microarray databases
Human genome resources
HapMap http://www.hapmap.org/
- The International HapMap Project is a multi-country effort to identify and catalog genetic similarities and differences in human beings.
- Associations between genetic variations (SNPs, ...) and diseases, and responses to pharmaceuticals.
Issues for biomolecular databases
Issues for biological databases
- Dealing with biological complexity
- Data content
  - Coverage
  - Information content
- Data quality
  - Data structure
  - Consistency
- Query capabilities
- Interfaces
  - User interfaces
  - Programmatic interfaces
- Annotation
- Funding
Towards biological complexity
- The main databases currently available are focussed on one type of molecular entity: nucleic sequences, proteins, compounds, ...
- This type of organization is very convenient as long as the information to be represented is simple (e.g. DNA sequences, structures of small molecules and macromolecules).
- It becomes more difficult if we want to represent
  - the interactions between biological objects,
  - the integration of various elements in a biological process (metabolic pathways, protein interaction networks, regulatory networks, ...),
  - complex concepts such as "biological function".
Data content
- Scope of the database
  - types of biological objects represented
- Number of entries
  - coverage of the current knowledge
- Information content
  - level of detail in the description of the biological objects
- References to the source of information
Data quality
- Data consistency
  - always use the same name to indicate the same object (this seems trivial, but it is unfortunately still not always the case)
  - even better: define an ID for each object, and allow it to be retrieved by any of its synonyms
  - avoid spelling mistakes
- Data structure
  - distinct fields for distinct attributes of the biological objects
- Reliability
  - Evidence? Level of confidence?
  - Assignment of function by similarity
    - recursive process -> propagation of errors
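The recommendation to give each object one ID and make it retrievable by any synonym can be sketched as a small lookup structure. The class `NameRegistry`, the identifier `OBJ:0001`, and the case-insensitive matching rule are all assumptions made for this example; the name variants are the spellings of Pax6 that appear in these slides.

```python
class NameRegistry:
    """Map every accepted name of an object to a single stable identifier."""

    def __init__(self):
        self._primary = {}   # identifier -> primary name
        self._index = {}     # normalized name or synonym -> identifier

    @staticmethod
    def _normalize(name):
        # Assumption: names are compared without regard to case or outer spaces.
        return name.strip().lower()

    def register(self, object_id, primary_name, synonyms=()):
        self._primary[object_id] = primary_name
        for name in (object_id, primary_name, *synonyms):
            key = self._normalize(name)
            existing = self._index.setdefault(key, object_id)
            if existing != object_id:
                # The same name must never point to two different objects.
                raise ValueError(f"{name!r} already refers to {existing}")

    def resolve(self, name):
        """Return the identifier for any known name, or None if unknown."""
        return self._index.get(self._normalize(name))

    def primary_name(self, object_id):
        return self._primary[object_id]

registry = NameRegistry()
registry.register("OBJ:0001", "PAX6", synonyms=("Pax-6", "Pax6"))  # hypothetical ID

for query in ("pax6", "Pax-6", "PAX6"):
    print(query, "->", registry.resolve(query))
```

Rejecting a name that already belongs to another object is what keeps the vocabulary consistent as the database grows.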
Query capabilities
- Browsing (click and read)
- Simple search
  - select records with some constraints
- More elaborate search
  - select specific fields of some records with constraints on some fields (~SQL SELECT)
- Complex querying
  - ability to return an answer that results from a "live" computation, and was not part of any record of the database
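The three levels of search can be illustrated with SQL on a toy table. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table layout and the three records are invented for the example and do not correspond to any real database schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein (id TEXT PRIMARY KEY, name TEXT, organism TEXT, sequence TEXT)"
)
# Invented toy records.
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [
        ("P1", "toy protein A", "Homo sapiens", "MKCAAC"),
        ("P2", "toy protein B", "Homo sapiens", "MSTC"),
        ("P3", "toy protein C", "Drosophila melanogaster", "MKKLV"),
    ],
)

# Simple search: whole records that satisfy a constraint.
simple = conn.execute(
    "SELECT * FROM protein WHERE organism = ?", ("Homo sapiens",)
).fetchall()

# More elaborate search: specific fields, with constraints on other fields.
elaborate = conn.execute(
    "SELECT id, name FROM protein WHERE organism = ? AND sequence LIKE ?",
    ("Homo sapiens", "MK%"),
).fetchall()

# Complex query: the answer is computed at query time and stored in no record.
computed = conn.execute(
    "SELECT organism, COUNT(*), AVG(LENGTH(sequence)) "
    "FROM protein GROUP BY organism ORDER BY organism"
).fetchall()

print(simple, elaborate, computed, sep="\n")
```

The last query returns per-organism counts and average sequence lengths, values that exist nowhere in the stored records.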
Interfaces
- User interfaces
  - user-friendly
  - convenient browsing
  - intuitive query forms
  - visualization (graphical output)
- Programmatic interfaces
  - communication with external programs:
    - other databases (concept of distributed database)
    - analysis tools
Annotation
- Problem
  - The flow of available data is increasing exponentially
- Strategies
  - internal curators
  - selected external experts
  - public submission
  - computer-based extraction of information from biological texts
Funding
- Public funding
  - Problem: it is easier to obtain public funds for creating a new database than for maintaining or expanding existing resources
- Private funding
  - Industrial companies are
    - ready to invest in good data and good query capabilities
    - interested in academic expertise
- Solutions
  - All users pay (per query for example)
    - Note: academic users are funded by public funds anyway
  - Hybrid solution
    - access is free for academic users, not for companies
    - companies can buy the whole database and install it in-house (+ add their own private data)
    - the academia-industry interface is often ensured by a spinoff company