
Ulf Schmitz, Data exchange standards and ontologies 1

Systems Biology
Data exchange standards and ontologies

Ulf Schmitz
[email protected]

Systems Biology and Bioinformatics Group
www.sbi.informatik.uni-rostock.de


Outline

1. The need for data exchange formats

2. Standards and de facto standards in SB

3. Why XML as framework?

4. Ontologies

5. OWL, RDF, OBO, Protégé

6. Minimum Information Required – suggestions

7. Standards for graphical representation

8. Outlook


The need for data exchange formats

• Rapid increase in experimental data (high throughput)
• Quick comparison, analysis and integration of that data is required
• This results in a need for standardized formats for representation of those results

Number of entries in EMBL (current 63,713,453)


The need for data exchange formats

[Brazma, 2006]

• setup
• protocol
• results


The need for data exchange formats

There is a demand for standardized formats for:

• experimental data
  – for reproducibility of experiments
  – annotation in DBs and use in data analysis tools
• standardized names for metabolites, reactions and enzymes
• mathematical models
  – description, accessibility and exchange
• standardized graphical representation of networks
  – like with electronic circuits

[Klipp, Survey]


Standards and de facto standards in SB

• SBML (Systems Biology Markup Language), version 2.2
  – Purpose: format for representing models of biochemical reaction networks
  – Tools: supported by over 100 software systems
  – Data: available from many databases (e.g. KEGG, Reactome, JWS, Biomodels)

• PSI MI (Proteomics Standards Initiative - Molecular Interaction), version 2.5
  – Purpose: standard for data representation of protein-protein interactions
  – Tools: tools for viewing and analysis
  – Data: datasets available from many sources, for instance IntAct, DIP, BIND

• BioPAX (Biological Pathways Exchange), version 2
  – Purpose: format for biological pathway data
  – Tools: existing tools for OWL such as Protégé
  – Data: datasets available from Reactome

• CellML, version 1.1
  – Purpose: supports the definition of models of cellular and subcellular processes
  – Tools: tools for publication, visualization, creation and simulation
  – Data: CellML Model Repository (~240 models), www.cellml.org

• CML (Chemical Markup Language), version 2.2
  – Purpose: interchange of chemical information (atomic, molecular and crystallographic information, compounds, structure, publications)
  – Tools: molecular browsers, editors
  – Data: BioCYC, www.biocyc.org

• EMBLxml, version 1.0
  – Purpose: nucleotide sequence information
  – Tools: API support in BioJavaX
  – Data: EMBL, www.ebi.ac.uk/embl

• MathML, version 2.0
  – Purpose: representation of mathematical formulas
  – Tools: browsers
  – Data: http://www.w3.org/Math

• INSD-seq (International Nucleotide Sequence Database Collaboration), version 1.4
  – Purpose: representation for sequence records
  – Tools: API support in BioJavaX
  – Data: EMBL and GenBank

• Seq-entry, version n/a
  – Purpose: NCBI uses ASN.1 for the storage and retrieval of data such as nucleotide and protein sequences
  – Tools: SRI’s BioWarehouse and Protein Structure Factory’s ORFer
  – Data: Entrez

• BSML (Bioinformatic Sequence Markup Language), version 3.1
  – Purpose: facilitate the interchange of data for more efficient communication within the life sciences community
  – Tools: LabBook’s Genomic Browser and Sequence Viewer, converters
  – Data: previously provided by EMBL

• HUP-ML (Human Proteome Markup Language), version 0.8
  – Purpose: markup language for proteome data
  – Tools: HUP-ML Editor

• MAGE-ML (MicroArray and Gene Expression), version 1.1
  – Purpose: microarray gene expression data
  – Tools: converters
  – Data: ArrayExpress, www.ebi.ac.uk/arrayexpress

• MzXML (mass spectrometric), version 2.1
  – Purpose: common file format for mass spectrometry data
  – Tools: converters, viewers
  – Data: PeptideAtlas, Sashimi, Open Proteomics Database

• AGML (Annotated Gel Markup Language), version 2.0
  – Purpose: to model the concept of annotated gel (AG) for 2-DE results
  – Tools: Visualizer AGML

In total, there are more than 80 standards within systems biology.


XML based standards

In biology, HTML is used for data publishing, database browsing, data gathering, data submission and analysis.

• HTML is very user friendly:
  – browsing the web is almost instinctive and requires minimal training time
• The language is very easy to learn:
  – knowledge of a few self-explanatory tag names and an understanding of a very simple syntax is enough to write good web pages
• But HTML has limits, because it is basically dedicated to human browsing:
  – it has a static structure and doesn’t provide semantic features, as it does not differentiate between different data types
• Conclusion:
  – XML is an extensible and easy-to-use format for information representation in biological applications


Why XML as framework?

• The eXtensible Markup Language (XML) is derived from SGML (Standard Generalized Markup Language)
  – SGML is the international standard for defining descriptions of the structure and contents of different types of electronic documents
• XML is an emerging standard for structuring documents, notably for the World Wide Web
  – XML allows the definition of a set of tags to be applied to one or many documents
  – these tags define elements in the document
• XML-based standards have been found to be most useful as a data language for bioinformatics
  – for data interchange between databases and other sources of data
  – this goes hand in hand with the development of ontologies

[Achard et al., 2000]


XML

XML documents consist of elements, which are textual data structured by tags.

An element consists of a start/end tag pair, some optional or mandatory attributes defined as key/value pairs, and the data between those tags.
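The element structure described above can be illustrated with a minimal sketch in Python, using the standard library module xml.etree.ElementTree. The fragment and its tag and attribute names (model, species, id, compartment) are made up for illustration and do not follow any of the standards listed earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration only; the tag and attribute
# names are assumptions and do not follow a particular standard.
DOCUMENT = """
<model name="toy_model">
  <species id="A" compartment="cytosol">glucose</species>
  <species id="B" compartment="cytosol">pyruvate</species>
</model>
"""


def describe(xml_text):
    """Return (tag, attributes, text) for every <species> element."""
    root = ET.fromstring(xml_text)
    return [(el.tag, dict(el.attrib), el.text) for el in root.iter("species")]


if __name__ == "__main__":
    for tag, attributes, text in describe(DOCUMENT):
        # tag = element name, attributes = key/value pairs,
        # text = data between the start and end tag
        print(tag, attributes, text)
```

Each tuple corresponds to one element: the tag name, the attributes as key/value pairs, and the data enclosed by the start and end tags.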


XML Pros and Cons

• Pros:
  – XML is highly flexible
  – human readable
  – internet oriented, with rich capabilities for linking data, which is useful for interconnecting databases
  – provides an open framework for defining standard specifications
• Cons:
  – overhead of text-based data formats in data parsing, storage and transmission
  – source can be read and edited with any editor
  – the expressiveness of the XML data model would probably not be sufficient for molecular biology

Alternative formats for the management and exchange of bioinformatics data:

• Flat files (e.g. flat file libraries from EMBL, GenBank, DDBJ or Swiss-Prot)
• ASN.1 – Abstract Syntax Notation One (used at the NCBI for exporting GenBank data)
• CORBA – the Common Object Request Broker Architecture
• Java RMI – Remote Method Invocation
• OODBMS – object-oriented database management system


XML-based ontology languages

• Ontology
  – A system for describing knowledge, a conceptualization of a domain of interest usually made up of any or all of the following: concepts (classes), relations, attributes, constraints, objects, values.
• RDF
  – Resource Description Framework, a W3C standard, allows description of basic relationships between objects (subject-predicate-object semantics).
• OWL
  – Web Ontology Language, a W3C standard, is an extension of RDF to support ontologies. It provides semantics for classes and subclasses, instances, and relationships.
• OBO
  – The Open Biomedical Ontologies (OBO) Foundry is a collaborative experiment to produce well-structured vocabularies; it introduces a new paradigm for biomedical ontology development.
• Protégé
  – Protégé ontology and knowledge base editor. A software tool to build an ontology and manage instances of classes defined in that ontology.
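The subject-predicate-object idea behind RDF, and the class/subclass semantics that OWL adds, can be sketched in plain Python. This is only an illustration under stated assumptions: it uses neither an RDF library nor an OWL reasoner, and the class names and the predicates is_a and catalyses are hypothetical.

```python
# Hypothetical mini-ontology stored as (subject, predicate, object) triples.
# All names are illustrative assumptions, not taken from a real ontology.
TRIPLES = {
    ("Hexokinase", "is_a", "Kinase"),
    ("Kinase", "is_a", "Enzyme"),
    ("Enzyme", "is_a", "Protein"),
    ("Hexokinase", "catalyses", "GlucosePhosphorylation"),
}


def objects(subject, predicate, triples=TRIPLES):
    """All objects o for which (subject, predicate, o) is stated."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def ancestors(cls, triples=TRIPLES):
    """All superclasses reachable by following is_a transitively."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in objects(current, "is_a", triples):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


if __name__ == "__main__":
    print(sorted(ancestors("Hexokinase")))
```

The subclass lookup follows is_a links transitively, which is the simplest kind of inference that class hierarchies in an ontology make possible.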


Ontologies

Domain | Prefix | Files | Format
Biological imaging methods | FBbi | image.obo | OBO
Biological process | GO | gene ontology.obo | OBO
Cell type | CL | cell.obo | OBO
Cellular component | GO | gene ontology.obo | OBO
Drosophila development | FBdv | fly development.obo | OBO
Event (INOH pathway ontology) | IEV | event.obo | OBO
Evidence codes | ECO | evidence code.obo | OBO
eVOC (Expressed Sequence Annotation for Humans) | EV | evoc.obo.tar (v2.7) | OBO
FlyBase Controlled Vocabulary | FBcv | flybase controlled vocabulary.obo | OBO
Human disease | DOID | human disease.obo | OBO
Mammalian phenotype | MP | mammalian phenotype.obo | OBO
MESH | MESH | mesh.obo | OBO
Microarray experimental conditions | MO | MGEDOntology.owl | OWL
Molecular function | GO | gene ontology.obo | OBO
Multiple alignment | RO | mao.obo | OBO
NCBI organismal classification | taxon | taxonomy.dat | plain text
OBO relationship types | OBO_REL | ro.obo | OBO
Pathway ontology | PW | pathway.obo | OBO
Protein domain | IPR | InterPro FTP directory | XML (http://www.w3.org/XML/)
Protein-protein interaction | MI | psi-mi.obo | OBO
Proteomics data and process provenance | ProPreO | ProPreO.owl | OWL
Sequence types and features | SO | so.obo | OBO
Systems Biology | SBO | SBO_OWL.owl | OWL
UniProt taxonomy | | Organism identification code list | plain text
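Most of the ontologies listed above are distributed as OBO flat files, which consist of stanzas such as [Term] followed by tag: value lines. The following is a minimal sketch of reading such stanzas in Python. The sample text is a short illustrative excerpt in the style of the Gene Ontology file, and the parser handles only the simple cases shown here (it ignores header lines and other stanza types and does not cover the full OBO format).

```python
# Illustrative excerpt in OBO flat-file style; treat the content as an example.
SAMPLE = """\
format-version: 1.2

[Term]
id: GO:0008150
name: biological_process
namespace: biological_process

[Term]
id: GO:0009987
name: cellular process
namespace: biological_process
is_a: GO:0008150 ! biological_process
"""


def parse_obo_terms(text):
    """Return one dict per [Term] stanza, mapping each tag to a list of values."""
    terms = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif line and current is not None and ":" in line:
            tag, _, value = line.partition(":")
            # Drop the trailing "! comment" part that OBO lines may carry.
            value = value.split("!", 1)[0].strip()
            current.setdefault(tag.strip(), []).append(value)
    return terms


if __name__ == "__main__":
    for term in parse_obo_terms(SAMPLE):
        print(term["id"][0], term["name"][0], term.get("is_a", []))
```

Values are kept as lists because tags such as is_a may occur more than once within a stanza.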

New standards defining the minimal required contents

• MIAME – Minimum Information About a Microarray Experiment
• MIAPE – Minimum Information About a Proteomics Experiment
• MIRIAM – Minimum Information Requested In the Annotation of biochemical Models

One common suggestion among these requirements is to store metadata according to a controlled vocabulary (defined in ontologies) instead of free text.

Other requirements are:
• information about participating substances
• organisms
• literature references

MIRIAM - Minimum Information Requested In the Annotation of biochemical Models

• Many of the published models in biology are lost for the community because they are either not made available or are insufficiently characterized to allow them to be reused.
• The lack of a standard description format, the lack of stringent reviewing and authors’ carelessness are the main causes of incomplete model descriptions.
• Quantitative models will be useful only if their access and reuse is made easy for all scientists.
• Rules for creating quantitative models of biological systems:
  – use standardized, structured formats for encoding biological models (SBML, CellML)
  – annotate models on public repositories (Biomodels Database, Sigpath, EcoCyc, CellML repository, JWS Online, RegulonDB, DOQCS)
  – the model, when instantiated within a suitable simulation environment, must be able to produce all relevant results given in the reference description
  – annotations to be included in the model (use CellML metadata or the SBML simple annotation scheme):
    • preferred name of the model
    • citation of the reference description
    • name and contact information for the model creators
    • date and time of creation
    • a precise statement about the terms of distribution (‘public domain’, ‘copyrighted’, ‘freely distributable’, ‘confidential’)

[Novere 2005]

SBML validator
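The annotation list above lends itself to a simple completeness check. The sketch below is an assumption-laden illustration, not an implementation of MIRIAM or of any SBML validator: the field names (name, reference, creators, created, terms_of_distribution) are hypothetical keys chosen to mirror the five items in the list, and the example values are placeholders.

```python
# Hypothetical field names mirroring the five annotation items listed above.
REQUIRED_ANNOTATIONS = (
    "name",                   # preferred name of the model
    "reference",              # citation of the reference description
    "creators",               # name and contact information of the creators
    "created",                # date and time of creation
    "terms_of_distribution",  # e.g. 'public domain', 'copyrighted'
)


def missing_annotations(annotation):
    """Return the required fields that are absent or empty."""
    return [key for key in REQUIRED_ANNOTATIONS if not annotation.get(key)]


if __name__ == "__main__":
    # Placeholder values only; not taken from a real model.
    example = {
        "name": "Example model",
        "reference": "placeholder citation",
        "creators": [],
        "created": "2007-01-01T00:00:00",
    }
    print(missing_annotations(example))
```

A check of this kind only tests for presence; whether the values are meaningful still has to be judged by a curator.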


MIRIAM example model

through standardization of the model curation process, it will be possible to create resources that are as significant to systems biology as resources like Ensembl are to genomics


Survey about standards in SB

[Klipp 2005]


Standards for graphical representation

• CellDesigner – often used tool to visualize biochemical reaction networks

• SBGN – Systems Biology Graphical Notation
  – attempt to develop standards for graphical representation

• Molecular Interaction Map (Kohn maps)

there is a need for a graphical formalism that covers fundamental biochemical processes and that can be uniquely mapped

1. to mathematical objects such as ordinary differential equations (ODE) or stochastic simulation schemes, and

2. to a textual description.
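To make the first mapping concrete, the following sketch turns a small reaction network into the right-hand sides of ordinary differential equations. It assumes mass-action kinetics, which the slides do not prescribe, and the species names and rate constants are hypothetical.

```python
# Each reaction is (reactants, products, rate constant k), with
# stoichiometric coefficients as dict values. Mass-action kinetics
# is an assumption made for this illustration.
REACTIONS = [
    ({"A": 1, "B": 1}, {"C": 1}, 2.0),  # A + B -> C
    ({"C": 1}, {"A": 1, "B": 1}, 0.5),  # C -> A + B
]


def derivatives(conc, reactions=REACTIONS):
    """Return d[species]/dt for the given concentrations."""
    dxdt = {species: 0.0 for species in conc}
    for reactants, products, k in reactions:
        # Mass-action rate: k times the product of reactant concentrations.
        rate = k
        for species, n in reactants.items():
            rate *= conc[species] ** n
        for species, n in reactants.items():
            dxdt[species] -= n * rate
        for species, n in products.items():
            dxdt[species] += n * rate
    return dxdt


if __name__ == "__main__":
    print(derivatives({"A": 1.0, "B": 2.0, "C": 4.0}))
```

Because every arrow in the network contributes exactly one rate term, the same reaction list could also be rendered as a diagram or as text, which is the kind of unique mapping the formalism asks for.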

A. Funahashi and H. Kitano modification of Kohn maps

Notation of the process diagram:

• State transition – changes the state of modification rather than activation
• Activation
• Inhibition
• Translocation of module
• Dashed line indicates active state of a molecule
• Specific state of molecular species


Molecular Interaction Maps (MIM)

• Characteristics:
  – Each molecule is shown in only one location
    • All interactions and modifications can be traced from one point
    • Molecules can be located from an index of map coordinates
  – In “Cell Cycle eMIMs” (interactive MIMs), molecules serve as links to additional sources of information (PubMed, Gene Cards, MedMiner)


Symbols and conventions used in eMIMs

• Proteins A and B can bind to each other. The node represents the A:B complex.
• Multimolecular complex: x is A:B; y is (A:B):C. Endlessly extendable.

Reactions:

• Covalent modification of protein A. A can exist in a phosphorylated state.
• Cleavage of a covalent bond: dephosphorylation of A by a phosphatase.
• Stoichiometric conversion of A to B.


Symbols and conventions used in eMIMs

Reactions:

• Transport of A from cytosol to nucleus. The dot represents A after transport to the nucleus.
• Formation of a homodimer. The dot on the right represents a copy of A. The dot on the line represents the homodimer A:A.

Contingencies:

• Enzymatic stimulation of a reaction
• Enzymatic stimulation of a reaction in trans
• Stimulation of a process. Bar indicates necessity.
• Inhibition
• Transcriptional activation
• Transcriptional inhibition


Molecular Interaction Map (eMIM)


Take home message

• When developing or using tools, consider which data exchange formats they should be able to handle (import/export).

• When doing experiments, consider annotating them in an appropriate data exchange format to make them reusable and reproducible for others.

• When defining a new data exchange format, keep existing ontologies in mind; they help to find a common vocabulary within the community.

• When communicating your scientific procedures and results, check whether you and your collaborators share a common language, or whether a common vocabulary needs to be defined in an ontology. (Don’t hesitate to create one with the help of Protégé.)


Literature

• Strömbäck, L., Hall, D. and Lambrix, P.: A review of standards for data exchange within systems biology. Proteomics 2007, 7, 857–867

• Achard, F., Vaysseix, G. and Barillot, E.: XML, bioinformatics and data integration. Bioinformatics 2000, 17, 2, 115–125

• Brazma, A., Krestyaninova, M. and Sarkans, U.: Standards for systems biology. Nature Reviews Genetics 2006, 7, 593–605

• Le Novère, N. et al.: Minimum information requested in the annotation of biochemical models (MIRIAM). Nature Biotechnology 2005, 23, 12, 1509–1515

• Klipp, E., Liebermeister, W., Helbig, A., Kowald, A. and Schaber, J.: Standards in Computational Systems Biology. 2005


Data exchange standards and ontologies

Thanks for your attention!


Appendix

BioPAX


PSI MI