Ulf Schmitz, Data exchange standards and ontologies 1
www. .uni-rostock.de
Systems Biology
Data exchange standards and ontologies
Systems Biology and Bioinformatics Group
www.sbi.informatik.uni-rostock.de
Outline
1. The need for data exchange formats
2. Standards and de facto standards in SB
3. Why XML as framework?
4. Ontologies
5. OWL, RDF, OBO, Protégé
6. Minimum Information Required – suggestions
7. Standards for graphical representation
8. Outlook
The need for data exchange formats
• Rapid increase in experimental data (high throughput)
• Quick comparison, analysis and integration of that data is required
• This results in a need for standardized formats for representation of those results
Number of entries in EMBL (current 63,713,453)
The need for data exchange formats
[Brazma, 2006]
• setup
• protocol
• results
The need for data exchange formats
There is a demand for standardized formats for:
• experimental data
  – for reproducibility of experiments
  – annotation in DBs and use in data analysis tools
• standardized names for metabolites, reactions and enzymes
• mathematical models
  – description, accessibility and exchange
• standardized graphical representation of networks
  – as with electronic circuits
[Klipp, Survey]
Standards and de facto standards in SB
Name | Ver. | Purpose | Tools | Data
SBML (Systems Biology Markup Language) | 2.2 | Format for representing models of biochemical reaction networks | Supported by over 100 software systems | Data available from many databases (e.g. KEGG, Reactome, JWS, Biomodels)
PSI MI (Proteomics Standards Initiative - Molecular Interaction) | 2.5 | Standard for data representation of protein-protein interactions | Tools for viewing and analysis | Datasets available from many sources, for instance IntAct, DIP, BIND
BioPAX (Biological Pathways Exchange) | 2 | Format for biological pathway data | Existing tools for OWL such as Protégé | Datasets available from Reactome
CellML | 1.1 | Supports the definition of models of cellular and subcellular processes | Tools for publication, visualization, creation and simulation | CellML Model Repository (~240 models), www.cellml.org
CML (Chemical Markup Language) | 2.2 | Interchange of chemical information (atomic, molecular and crystallographic information, compounds, structure, publications) | Molecular browsers, editors | BioCYC, www.biocyc.org
EMBLxml | 1.0 | Nucleotide sequence information | API support in BioJavaX | EMBL, www.ebi.ac.uk/embl
MathML | 2.0 | Representation of mathematical formulas | Browsers | http://www.w3.org/Math
In total, there are more than 80 standards within Systems Biology.
Standards and de facto standards in SB
Name | Ver. | Purpose | Tools | Data
INSD-seq (International Nucleotide Sequence Database Collaboration) | 1.4 | Representation for sequence records | API support in BioJavaX | EMBL and GenBank
Seq-entry | n/a | NCBI uses ASN.1 for the storage and retrieval of data such as nucleotide and protein sequences | SRI’s BioWarehouse and Protein Structure Factory’s ORFer | Entrez
BSML (Bioinformatic Sequence Markup Language) | 3.1 | Facilitate the interchange of data for more efficient communication within the life sciences community | LabBook’s Genomic Browser and Sequence Viewer, converters | Previously provided by EMBL
HUP-ML (Human Proteome Markup Language) | 0.8 | Markup language for proteome data | HUP-ML Editor | -
MAGE-ML (MicroArray and Gene Expression) | 1.1 | Microarray gene expression data | Converters | ArrayExpress, www.ebi.ac.uk/arrayexpress
MzXML (mass spectrometric) | 2.1 | Common file format for MassSpec data | Converters, viewers | PeptideAtlas, Sashimi, Open Proteomics Database
AGML (Annotated Gel Markup Language) | 2.0 | To model the concept of annotated gel (AG) for 2-DE results | Visualizer | AGML
XML based standards
In biology, HTML is used for data publishing, database browsing, data gathering, data submission and analysis.
• HTML is very user friendly:
  – browsing the web is almost instinctive and requires minimal training time
• The language is very easy to learn:
  – knowledge of a few self-explanatory tag names and an understanding of a very simple syntax is enough to write good web pages
• But HTML has limits, because it is basically dedicated to human browsing:
  – it has a static structure and doesn’t provide semantic features; it does not differentiate between different data types
• Conclusion:
  – XML is an extensible and easy-to-use format for information representation in biological applications
Why XML as framework?
• The eXtensible Markup Language (XML) is derived from SGML (Standard Generalized Markup Language)
  – SGML is the international standard for defining descriptions of the structure and contents of different types of electronic documents
• XML is an emerging standard for structuring documents, notably for the World Wide Web
  – XML allows the definition of a set of tags to be applied to one or many documents
  – these tags define elements in the document
• XML-based standards have been found to be most useful as a data language for bioinformatics
  – for data interchange between databases and other sources of data
  – this goes hand in hand with the development of ontologies
[Achard et al., 2000]
XML
XML documents consist of elements, which are textual data structured by tags.
An element consists of a start/end tag pair, optional or mandatory attributes defined as key/value pairs, and the data between those tags.
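The anatomy of an element can be illustrated with a minimal sketch using Python's standard xml.etree.ElementTree module. The XML fragment is hypothetical and invented for illustration; it does not follow any of the standards listed above.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment, made up for this example only
doc = '<species id="s1" name="glucose">C6H12O6</species>'

element = ET.fromstring(doc)

print(element.tag)     # tag name shared by the start and end tag
print(element.attrib)  # attributes as key/value pairs
print(element.text)    # data between the start and end tag
```

The three printed values correspond to the three parts named above: the tag pair, the attributes and the enclosed data.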
XML Pros and Cons
• Pros:
  – XML is highly flexible
  – human readable
  – internet oriented, has rich capabilities for linking data, useful for interconnecting databases
  – provides an open framework for defining standard specifications
• Cons:
  – overhead of text-based data formats in data parsing, storage and transmission
  – source can be read and edited with any editor
  – the expressiveness of the XML data model would probably not be sufficient for molecular biology
Alternative formats for the management and exchange of bioinformatics data:
• Flat files (e.g. flat file libraries from EMBL, GenBank, DDBJ or Swiss-Prot)
• ASN.1: Abstract Syntax Notation One (used at the NCBI for exporting GenBank data)
• CORBA: the Common Object Request Broker Architecture
• Java RMI: Remote Method Invocation
• OODBMS: Object-Oriented Database Management System
XML based ontology languages
• Ontology
  – A system for describing knowledge; a conceptualization of a domain of interest, usually made up of any or all of the following: concepts (classes), relations, attributes, constraints, objects, values.
• RDF
  – Resource Description Framework, a W3C Recommendation; allows description of basic relationships between objects (subject-predicate-object semantics).
• OWL
  – Web Ontology Language, a W3C Recommendation; an extension of RDF to support ontologies. It provides semantics for classes and subclasses, instances, and relationships.
• OBO
  – The Open Biomedical Ontologies (OBO) Foundry is a collaborative experiment to produce well-structured vocabularies; it introduces a new paradigm for biomedical ontology development.
• Protégé
  – Protégé ontology and knowledge base editor. A software tool to build an ontology and manage instances of classes defined in that ontology.
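The subject-predicate-object idea, and the class/subclass semantics an ontology language adds on top of it, can be sketched with plain Python tuples. This is a toy illustration, not an RDF or OWL library: the `ex:` identifiers and both helper functions are hypothetical.

```python
# Hypothetical statements; the ex: identifiers are made up for illustration.
triples = [
    ("ex:Hexokinase", "rdf:type", "ex:Enzyme"),
    ("ex:Enzyme", "rdfs:subClassOf", "ex:Protein"),
    ("ex:Hexokinase", "ex:catalyzes", "ex:GlucosePhosphorylation"),
]

def objects(subject, predicate):
    """Return all objects of triples matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def is_instance_of(individual, cls):
    """Check class membership, following subclass links upward."""
    pending = objects(individual, "rdf:type")
    seen = set()
    while pending:
        current = pending.pop()
        if current == cls:
            return True
        if current not in seen:
            seen.add(current)
            pending.extend(objects(current, "rdfs:subClassOf"))
    return False

print(is_instance_of("ex:Hexokinase", "ex:Protein"))
```

Each tuple is one statement. The membership check shows why explicit subclass relations matter: the fact that the individual belongs to the broader class is never stated directly, but it can be derived.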
Ontologies
Domain | Prefix | Files | Format
Biological imaging methods | FBbi | image.obo | OBO
Biological process | GO | gene ontology.obo | OBO
Cell type | CL | cell.obo | OBO
Cellular component | GO | gene ontology.obo | OBO
Drosophila development | FBdv | fly development.obo | OBO
Event (INOH pathway ontology) | IEV | event.obo | OBO
Evidence codes | ECO | evidence code.obo | OBO
eVOC (Expressed Sequence Annotation for Humans) | EV | evoc.obo.tar (v2.7) | OBO
FlyBase Controlled Vocabulary | FBcv | flybase controlled vocabulary.obo | OBO
Human disease | DOID | human disease.obo | OBO
Ontologies
Domain | Prefix | Files | Format
Mammalian phenotype | MP | mammalian phenotype.obo | OBO
MESH | MESH | mesh.obo | OBO
Microarray experimental conditions | MO | MGEDOntology.owl | OWL
Molecular function | GO | gene ontology.obo | OBO
Multiple alignment | RO | mao.obo | OBO
NCBI organismal classification | taxon | taxonomy.dat | plain text
OBO relationship types | OBO_REL | ro.obo | OBO
Pathway ontology | PW | pathway.obo | OBO
Protein domain | IPR | InterPro FTP directory | http://www.w3.org/XML/
Protein-protein interaction | MI | psi-mi.obo | OBO
Proteomics data and process provenance | ProPreO | ProPreO.owl | OWL
Sequence types and features | SO | so.obo | OBO
Systems Biology | SBO | SBO_OWL.owl | OWL
UniProt taxonomy | - | Organism identification code list | plain text
New standards defining the minimal required contents
• MIAME – Minimum Information About a Microarray Experiment
• MIAPE – Minimum Information About a Proteomics Experiment
• MIRIAM – Minimum Information Requested In the Annotation of biochemical Models
One common suggestion among these requirements is to store metadata using a controlled vocabulary (defined in ontologies) instead of free text.
Other requirements are:
• information about participating substances
• organisms
• literature references
MIRIAM - Minimum Information Requested In the Annotation of biochemical Models
• many of the published models in biology are lost to the community because they are either not made available or insufficiently characterized to allow them to be reused
• the lack of a standard description format, the lack of stringent reviewing and authors’ carelessness are the main causes of incomplete model descriptions
• quantitative models will be useful only if their access and reuse is made easy for all scientists
• rules for creating quantitative models of biological systems:
  – use standardized, structured formats for encoding biological models (SBML, CellML)
  – annotate models on public repositories (Biomodels Database, Sigpath, EcoCyc, CellML repository, JWS Online, RegulonDB, DOQCS)
  – the model, when instantiated within a suitable simulation environment, must be able to produce all relevant results given in the reference description
  – annotations to be included in the model (use CellML metadata or the SBML simple annotation scheme):
    • preferred name of the model
    • citation of the reference description
    • name and contact information for the model creators
    • date and time of creation
    • a precise statement about the terms of distribution (‘public domain’, ‘copyrighted’, ‘freely distributable’, ‘confidential’)
[Novere 2005]
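As a rough sketch of how the annotation list above could be checked automatically, the following Python snippet reports which required annotations are missing from a model's metadata. The field names and the example record are hypothetical labels chosen for this illustration; they are not the CellML metadata or SBML annotation vocabulary.

```python
# Hypothetical field names standing in for the annotations listed above;
# they are NOT the actual SBML or CellML metadata vocabulary.
REQUIRED_ANNOTATIONS = [
    "model_name",
    "reference_citation",
    "creators",
    "creation_date",
    "terms_of_distribution",
]

def missing_annotations(annotations):
    """Return the required fields that are absent or empty."""
    return [f for f in REQUIRED_ANNOTATIONS if not annotations.get(f)]

# Made-up example record with the citation left out on purpose
example = {
    "model_name": "Example glycolysis model",
    "creators": [{"name": "A. Modeller", "contact": "a.modeller@example.org"}],
    "creation_date": "2007-01-01T12:00:00",
    "terms_of_distribution": "public domain",
}

print(missing_annotations(example))
```

A check like this only tests for presence. Whether the model reproduces the results of its reference description still has to be verified by simulation.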
SBML validator
MIRIAM example model
Through standardization of the model curation process, it will be possible to create resources that are as significant to systems biology as resources like Ensembl are to genomics.
Survey about standards in SB
[Klipp 2005]
Standards for graphical representation
• CellDesigner – often used tool to visualize biochemical reaction networks
• SBGN – Systems Biology Graphical Notation
  – attempt to develop standards for graphical representation
• Molecular Interaction Map (Kohn maps)
There is a need for a graphical formalism that covers fundamental biochemical processes and that can be uniquely mapped
1. to mathematical objects such as ordinary differential equations (ODE) or stochastic simulation schemes, and
2. to a textual description.
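To make the two mappings concrete, here is a minimal Python sketch in which one reaction object yields both a textual description and an ODE right-hand side. The `Conversion` class, the mass-action rate law, the rate constant and the fixed-step Euler integrator are all assumptions made for this illustration; none of them comes from SBGN or CellDesigner.

```python
from dataclasses import dataclass

@dataclass
class Conversion:
    """Irreversible conversion of one species into another (mass action assumed)."""
    substrate: str
    product: str
    k: float  # rate constant, arbitrary illustrative value

    def describe(self):
        # Mapping 2: textual description of the reaction
        return f"{self.substrate} -> {self.product} (rate {self.k} * [{self.substrate}])"

    def derivatives(self, conc):
        # Mapping 1: ODE right-hand side, d[S]/dt = -k[S] and d[P]/dt = +k[S]
        rate = self.k * conc[self.substrate]
        return {self.substrate: -rate, self.product: rate}

def euler(reaction, conc, dt, steps):
    """Integrate the reaction's ODEs with fixed-step forward Euler."""
    conc = dict(conc)
    for _ in range(steps):
        d = reaction.derivatives(conc)
        for name in d:
            conc[name] += dt * d[name]
    return conc

r = Conversion("A", "B", k=0.5)
print(r.describe())
final = euler(r, {"A": 1.0, "B": 0.0}, dt=0.01, steps=1000)
print(final)
```

Because both outputs are generated from the same object, the text and the equations cannot drift apart, which is the point of requiring a unique mapping.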
A. Funahashi and H. Kitano modification of Kohn maps
Notation of the process diagram:
• State transition – changes the state of modification rather than activation
• Activation
• Inhibition
• Translocation of module
• Dashed line indicates active state of a molecule
• Specific state of molecular species
Molecular Interaction Maps (MIM)
• Characteristics:
  – Each molecule is shown in only one location
    • All interactions and modifications can be traced from one point
    • Molecules can be located from an index of map coordinates
  – In “Cell Cycle eMIMs” (interactive MIMs), molecules serve as links to additional sources of information (PubMed, Gene Cards, MedMiner)
Symbols and conventions used in eMIMs
• Proteins A and B can bind to each other. The node represents the A:B complex.
• Multimolecular complex: x is A:B; y is (A:B):C. Endlessly extendable.
Reactions:
• Covalent modification of protein A. A can exist in a phosphorylated state.
• Cleavage of a covalent bond: dephosphorylation of A by a phosphatase.
• Stoichiometric conversion of A to B.
Symbols and conventions used in eMIMs
Reactions:
• Transport of A from cytosol to nucleus. The dot represents A after transport to the nucleus.
• Formation of a homodimer. The dot on the right represents a copy of A. The dot on the line represents the homodimer A:A.
Contingencies:
• Enzymatic stimulation of a reaction
• Enzymatic stimulation of a reaction in trans
• Stimulation of a process. A bar indicates necessity.
• Inhibition
• Transcriptional activation
• Transcriptional inhibition
Molecular Interaction Map (eMIM)
Take home message
• When developing or using tools, consider which data exchange formats they should be able to handle (import/export).
• When doing experiments, consider annotating them in an appropriate data exchange format to make them reusable and reproducible for others.
• When defining a new data exchange format, keep existing ontologies in mind; they help to find a common vocabulary within the community.
• When communicating your scientific procedures and results, check whether you and your collaborators already share a common language or whether you need a common vocabulary defined in an ontology. (Don’t hesitate to create one with the help of Protégé.)
Literature
• Strömbäck, L., Hall, D. and Lambrix, P.: A review of standards for data exchange within systems biology. Proteomics, 2007, 7, 857-867
• Achard, F., Vaysseix, G. and Barillot, E.: XML, bioinformatics and data integration. Bioinformatics, 2000, 17, 2, 115-125
• Brazma, A., Krestyaninova, M. and Sarkans, U.: Standards for systems biology. Nature Reviews Genetics, 2006, 7, 593-605
• Novere, N. et al.: Minimum information requested in the annotation of biochemical models (MIRIAM). Nature Biotechnology, 2005, 23, 12, 1509-1515
• Klipp, E., Liebermeister, W., Helbig, A., Kowald, A. and Schaber, J.: Standards in Computational Systems Biology. 2005
Data exchange standards and ontologies
Thanks for your attention!