Towards a integrated network of data and services for the life sciences
Michel Dumontier, Ph.D., Associate Professor of Bioinformatics, Carleton University: Department of Biology, School of Computer Science, Institute of Biochemistry, Ottawa Institute of Systems Biology, Ottawa-Carleton Institute of Biomedical Engineering

2010 CASCON - Towards a integrated network of data and services for the life sciences

DESCRIPTION

Modern biological knowledge discovery requires access to machine-understandable data that can be searched, retrieved, and subsequently analyzed using a wide array of analytical software and services. The Semantic Automated Discovery and Integration (SADI) framework is a set of conventions that formalize web service inputs and outputs using OWL ontologies, enabling the automatic discovery and invocation of Semantic Web services. In this talk, I will walk through a worked example of the design and deployment of chemical Semantic Web services using the Chemistry Development Kit (CDK), chemical descriptors from the Chemical Information Ontology (CHEMINF), and the Semanticscience Integrated Ontology (SIO) as a unifying, upper-level ontology of basic types and relations. I will discuss how one can use the SADI-enabled SHARE client to reason about data obtained from Bio2RDF, the largest linked open data project, and automatically invoke chemical Semantic Web services to determine a chemical's drug-likeness. If you want to see the potential of the Semantic Web being realized, this talk is for you.

Page 1: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Towards a integrated network of data and services for the life sciences

Michel Dumontier, Ph.D.
Associate Professor of Bioinformatics

Carleton University

Department of Biology
School of Computer Science

Institute of Biochemistry
Ottawa Institute of Systems Biology

Ottawa-Carleton Institute of Biomedical Engineering

Page 2: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Finding the right information to answer a question is hard and sometimes requires a sophisticated workflow

Page 3: 2010 CASCON - Towards a integrated network of data and services for the life sciences
Page 4: 2010 CASCON - Towards a integrated network of data and services for the life sciences

What if we could answer a question by automatically building a knowledge base using both data and services?

Page 5: 2010 CASCON - Towards a integrated network of data and services for the life sciences

The Semantic Web is a web of knowledge.

It is about standards for publishing, sharing and querying knowledge drawn from diverse sources

It enables the answering of sophisticated questions

Page 6: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Is caffeine a drug-like molecule?

Page 7: 2010 CASCON - Towards a integrated network of data and services for the life sciences

To answer this question we need to know:

• what ‘drug-like molecule’ really means
• caffeine’s molecular structure
• how to use the structural information to compute the attributes
• whether caffeine satisfies the requirements of being ‘drug-like’

Is caffeine a drug-like molecule?

Page 8: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Lipinski Rule of Five

• Rule of thumb for druglikeness (orally active in humans)
(4 rules with multiples of 5)
– mass of less than 500 Daltons
– fewer than 5 hydrogen bond donors
– fewer than 10 hydrogen bond acceptors
– a partition coefficient value between -5 and 5

We need a more formal (machine understandable) description of a ‘drug-like molecule’ which specifies values for chemical descriptors
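As a first step toward a machine-understandable description, the rule of thumb above can be sketched as a plain predicate over the four descriptors. This is only an illustration, not the OWL definition used in the talk. The thresholds follow the slide's wording (the commonly cited formulation allows up to 5 donors and 10 acceptors and caps logP at 5). The class and function names are hypothetical, and the acceptor count used for caffeine is an assumed illustrative value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """The four chemical descriptors the rule set refers to (hypothetical type)."""
    mass_daltons: float
    h_bond_donors: int
    h_bond_acceptors: int
    log_p: float


def is_drug_like(d: Descriptors) -> bool:
    """Apply the four thresholds as worded on the slide."""
    return (
        d.mass_daltons < 500
        and d.h_bond_donors < 5
        and d.h_bond_acceptors < 10
        and -5 < d.log_p < 5
    )


# Caffeine: mass and donor count are standard values; the acceptor count is an
# assumed illustrative value; the logP is the value reported by the logP
# service elsewhere in this talk.
caffeine = Descriptors(
    mass_daltons=194.19,
    h_bond_donors=0,
    h_bond_acceptors=6,
    log_p=-0.4311,
)

print(is_drug_like(caffeine))
```

A predicate like this still hides where the descriptor values come from, which is the gap the ontology and the services are meant to close.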

Page 9: 2010 CASCON - Towards a integrated network of data and services for the life sciences

ontology as a strategy to formally represent knowledge

Page 10: 2010 CASCON - Towards a integrated network of data and services for the life sciences

The Web Ontology Language (OWL) Has Explicit Semantics

Can therefore be used to capture knowledge in a machine understandable way

Page 11: 2010 CASCON - Towards a integrated network of data and services for the life sciences

The Chemical Information Ontology (CHEMINF)

• 100+ chemical descriptors
• 50+ chemical qualities
• Relates descriptors to their specifications and to the software that generated them (along with the running parameters and the algorithms that they implement)
• Contributors: Nico Adams, Leonid Chepelev, Michel Dumontier, Janna Hastings, Egon Willighagen, Peter Murray-Rust, Christoph Steinbeck

http://semanticchemistry.googlecode.com

Page 12: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Molecular structure can be represented using a SMILES string, which is a common representation of the chemical graph

ball & stick model for caffeine

SMILES string for caffeine

Cn1cnc2n(C)c(=O)n(C)c(=O)c12

Page 13: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Lipinski Rule of Five

• Empirically derived ruleset for druglikeness (4 rules with multiples of 5)
– mass of less than 500 Daltons
– fewer than 5 hydrogen bond donors
– fewer than 10 hydrogen bond acceptors
– a partition coefficient value between -5 and 5

• A formal description using OWL:

Page 14: 2010 CASCON - Towards a integrated network of data and services for the life sciences

What we then need are services that will consume SMILES strings and annotate the molecule with the required chemical descriptors

then we can reason about whether it satisfies the drug-likeness definition

Page 15: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Semantic Automated Discovery and Integration

http://sadiframework.org

Mark Wilkinson, UBC
Michel Dumontier, Carleton University
Christopher Baker, UNB

SADI is a framework to create Semantic Web services using OWL classes as service inputs and outputs

Page 16: 2010 CASCON - Towards a integrated network of data and services for the life sciences

SADI

• OWL classes in SADI are local to individual services
– They should uniquely specify the service inputs and outputs (they have exactly the right restrictions)
– One service’s world-view can conflict with another’s, but a client can use any or all of them

• Maximize interoperability by reusing types and relations

Page 17: 2010 CASCON - Towards a integrated network of data and services for the life sciences

CASCON: Nov 3, 2010

Semanticscience Integrated Ontology (SIO)

• OWL2 ontology
• 800 classes covering basic types (physical, processual, informational) with an emphasis on biological entities
• 129 basic relations (mereological, participatory, attribute/quality, spatial, temporal and representational)
• axioms can be used by reasoners to generate inferences for consistency checking, classification and answering questions about life science knowledge
• embodies emerging ontology design patterns
• dereferenceable URIs
• searchable in the NCBO BioPortal

http://semanticscience.org/ontology/sio.owl

Page 18: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Create code stubs using the ontology

• Publish the ontology to a web-accessible location
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl

• Make sure that the class names are resolvable (easy when using the hash notation)
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#smiles-molecule
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#logp-molecule
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#hbdc-molecule
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#hdba-molecule
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#lipinksi-druglike-molecule

• Download/checkout the code
http://sadiframework.org

• Run the code generator – specify the URIs that correspond to input and output types
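Why the hash notation makes class names easy to resolve can be shown with a short sketch. An HTTP client strips the fragment before sending a request, so every class URI in a hash namespace resolves to the one published ontology document. The example uses only Python's standard library, and the variable names are illustrative.

```python
from urllib.parse import urldefrag

# Ontology location and one class URI, both taken from the list above.
ONTOLOGY = "http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl"
class_uri = ONTOLOGY + "#smiles-molecule"

# Split the class URI into the document that is actually fetched and the
# fragment that names the class inside that document.
document_url, fragment = urldefrag(class_uri)

print(document_url)
print(fragment)
```

Publishing the single `.owl` file is therefore enough to make all of its class names dereferenceable.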

Page 19: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Implement the functionality

• Java version
– Uses Jena to manipulate the RDF graph
– Uses Maven to build from the command line or Eclipse; invokes Jetty for service testing

• Chemistry
– We used the Chemistry Development Kit (CDK) to implement 4 services

Page 20: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Responds to a GET operation by providing the service description in RDF

conforms to Feta (BioMoby, myGrid)

curl http://cbrass.biordf.net/logpdc/logpc

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:j.0="http://www.mygrid.org.uk/mygrid-moby-service#">
  <rdf:Description rdf:about="">
    <j.0:hasServiceDescriptionText>no description</j.0:hasServiceDescriptionText>
    <j.0:hasServiceNameText rdf:datatype="http://www.w3.org/2001/XMLSchema#string">logpc</j.0:hasServiceNameText>
    <j.0:hasOperation rdf:resource="#operation"/>
    <rdf:type rdf:resource="http://www.mygrid.org.uk/mygrid-moby-service#serviceDescription"/>
  </rdf:Description>
  <rdf:Description rdf:about="#input">
    <j.0:objectType rdf:resource="http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#smilesmolecule"/>
    <rdf:type rdf:resource="http://www.mygrid.org.uk/mygrid-moby-service#parameter"/>
  </rdf:Description>
  <rdf:Description rdf:about="#operation">
    <j.0:outputParameter rdf:resource="#output"/>
    <j.0:inputParameter rdf:resource="#input"/>
    <rdf:type rdf:resource="http://www.mygrid.org.uk/mygrid-moby-service#operation"/>
  </rdf:Description>
  <rdf:Description rdf:about="#output">
    <j.0:objectType rdf:resource="http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#alogpsmilesmolecule"/>
    <rdf:type rdf:resource="http://www.mygrid.org.uk/mygrid-moby-service#parameter"/>
  </rdf:Description>
</rdf:RDF>
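A client that wants to know what the service consumes and produces only needs the two objectType entries from this description. The sketch below reads them from a trimmed copy of the response using Python's standard XML parser. It relies on the exact RDF/XML layout shown here, so a real client would more likely use an RDF library such as Jena; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MYGRID = "http://www.mygrid.org.uk/mygrid-moby-service#"

# Trimmed copy of the service description returned by the GET request above.
DESCRIPTION = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:j.0="http://www.mygrid.org.uk/mygrid-moby-service#">
  <rdf:Description rdf:about="#input">
    <j.0:objectType rdf:resource="http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#smilesmolecule"/>
  </rdf:Description>
  <rdf:Description rdf:about="#output">
    <j.0:objectType rdf:resource="http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#alogpsmilesmolecule"/>
  </rdf:Description>
</rdf:RDF>"""


def parameter_types(rdf_xml: str) -> dict:
    """Map each described node (e.g. '#input') to its objectType URI."""
    root = ET.fromstring(rdf_xml)
    types = {}
    for node in root.findall(f"{{{RDF}}}Description"):
        object_type = node.find(f"{{{MYGRID}}}objectType")
        if object_type is not None:
            about = node.get(f"{{{RDF}}}about")
            types[about] = object_type.get(f"{{{RDF}}}resource")
    return types


types = parameter_types(DESCRIPTION)
print(types["#input"])
print(types["#output"])
```

The two URIs are the OWL classes that a matchmaking client compares against the data it already holds and the data it still needs.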

Page 21: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Responds to a POST containing service input with a service output in RDF

Input (caffeine.rdf):

<rdf:RDF
    xmlns="http://semanticscience.org/sadi/ontology/caffeine.rdf#"
    xmlns:so="http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:sio="http://semanticscience.org/resource/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <so:smilesmolecule rdf:about="http://semanticscience.org/sadi/ontology/caffeine.rdf#m">
    <sio:SIO_000008 rdf:resource="http://semanticscience.org/sadi/ontology/caffeine.rdf#msmiles"/>
  </so:smilesmolecule>
  <sio:CHEMINF_000018 rdf:about="http://semanticscience.org/sadi/ontology/caffeine.rdf#msmiles">
    <sio:SIO_000300 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Cn1cnc2n(C)c(=O)n(C)c(=O)c12</sio:SIO_000300>
  </sio:CHEMINF_000018>
</rdf:RDF>

Request:

curl --data @caffeine.rdf http://cbrass.biordf.net/logpdc/logpc

Output (excerpt):

<rdf:Description rdf:about="http://semanticscience.org/sadi/ontology/caffeine.rdf#mdalogp">
  <rdf:type rdf:resource="http://semanticscience.org/resource/CHEMINF_000251"/>
  <j.0:SIO_000300 rdf:datatype="http://www.w3.org/2001/XMLSchema#double">-0.4311000000000006</j.0:SIO_000300>
</rdf:Description>
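The same POST can be prepared programmatically. The sketch below builds the input document for an arbitrary SMILES string and wraps it in a request object without sending it, since the endpoint on the slide may no longer be reachable. The datatype is written as a full URI because RDF/XML does not expand prefixes inside attribute values. The Content-Type header and the function name are assumptions, not something the slides specify.

```python
import urllib.request
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SERVICE_URL = "http://cbrass.biordf.net/logpdc/logpc"  # endpoint from the slide
BASE = "http://semanticscience.org/sadi/ontology/caffeine.rdf#"

# Input document modelled on caffeine.rdf, with the SMILES string as a slot.
TEMPLATE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:so="http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#"
    xmlns:sio="http://semanticscience.org/resource/">
  <so:smilesmolecule rdf:about="{base}m">
    <sio:SIO_000008 rdf:resource="{base}msmiles"/>
  </so:smilesmolecule>
  <sio:CHEMINF_000018 rdf:about="{base}msmiles">
    <sio:SIO_000300 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">{smiles}</sio:SIO_000300>
  </sio:CHEMINF_000018>
</rdf:RDF>"""


def build_request(smiles: str) -> urllib.request.Request:
    """Return an unsent POST request carrying the RDF/XML service input."""
    body = TEMPLATE.format(base=BASE, smiles=escape(smiles))
    ET.fromstring(body)  # raises ParseError if the body is not well-formed XML
    return urllib.request.Request(
        SERVICE_URL,
        data=body.encode("utf-8"),
        # Assumed media type; curl --data would send a form content type.
        headers={"Content-Type": "application/rdf+xml"},
    )


request = build_request("Cn1cnc2n(C)c(=O)n(C)c(=O)c12")
print(request.get_method())
```

Passing the request to `urllib.request.urlopen` would perform the call, assuming the service is still online.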

Page 22: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Now what?

Page 23: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Semantic Health and Research Environment

SHARE is an application that executes (SPARQL) queries as workflows over SADI services

Page 24: 2010 CASCON - Towards a integrated network of data and services for the life sciences

“Reckoning”

dynamic discovery of instances of OWL classes through synthesis and invocation of a Web Service workflow capable of generating data described by the OWL class restrictions, followed by reasoning to classify the data into that ontology
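The idea can be illustrated with a deliberately small toy, which is not how SHARE is implemented. A class definition lists the properties a member must have, missing properties are filled in by invoking whichever registered service produces them, and the completed record is then classified. The service registry, the stub lookup values, and all names below are hypothetical; real SADI services compute descriptors, and real matchmaking reasons over OWL restrictions.

```python
from typing import Callable, Dict, List, Tuple

# (property the service needs, property it produces, the service itself)
Service = Tuple[str, str, Callable[[str], float]]

CAFFEINE = "Cn1cnc2n(C)c(=O)n(C)c(=O)c12"

# Stub services: lookups standing in for real descriptor calculations.
SERVICES: Dict[str, Service] = {
    "mass": ("smiles", "mass", lambda smiles: {CAFFEINE: 194.19}[smiles]),
    "logp": ("smiles", "logp", lambda smiles: {CAFFEINE: -0.4311}[smiles]),
}

# Toy class definition: each required property with the test its value must pass.
CLASS_DEFINITION: Dict[str, Callable[[float], bool]] = {
    "mass": lambda value: value < 500,
    "logp": lambda value: -5 < value < 5,
}


def reckon(record: dict, definition: dict) -> Tuple[bool, List[str]]:
    """Invoke services for missing properties, then classify the record."""
    data = dict(record)
    invoked = []
    for prop in definition:
        if prop in data:
            continue
        # Discovery: pick a service whose output is the missing property and
        # whose input is something we already hold.
        for name, (needs, produces, call) in SERVICES.items():
            if produces == prop and needs in data:
                data[prop] = call(data[needs])
                invoked.append(name)
                break
    is_member = all(
        prop in data and test(data[prop]) for prop, test in definition.items()
    )
    return is_member, invoked


print(reckon({"smiles": CAFFEINE}, CLASS_DEFINITION))
```

The caller supplies only the SMILES string and the class definition; which services run is derived from what the definition requires.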

Page 25: 2010 CASCON - Towards a integrated network of data and services for the life sciences

ChEBI has (non-SW) data!

Page 26: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Bio2RDF provides ChEBI in RDF

Page 27: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Bio2RDF is now serving over 40 billion triples of linked biological data

Page 28: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Bio2RDF covers the major biological databases

Page 29: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Bio2RDF is part of a growing web of linked data

“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”

Page 30: 2010 CASCON - Towards a integrated network of data and services for the life sciences

something you can look up or search for with rich descriptions

Page 31: 2010 CASCON - Towards a integrated network of data and services for the life sciences

SPARQL is the new cool kid on the query block

SQL SPARQL

Page 32: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Query for log p

Page 33: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Page 34: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Query: Is caffeine a drug-like molecule?

Page 35: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Benefits

• Data remains distributed – as the internet was meant to be!

• Data is not “exposed” as a SPARQL endpoint
– greater provider control over computational resources

• Service invocation is straightforward, and matchmaking is done by reasoning about ontology-based input/output descriptions

Page 36: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Summary

• Semantic Web technologies offer tantalizing new opportunities to publish, share and query data and services
• Bio2RDF provides linked life science data
• SADI is a framework for providing Semantic Web services
• SHARE allows us to simultaneously query and reason about data and services represented using RDF/OWL

Page 37: 2010 CASCON - Towards a integrated network of data and services for the life sciences

Acknowledgements

This research is supported by The Heart + Stroke Foundation of BC and Yukon, Microsoft Research, The Canadian Institutes of Health Research, The Natural Sciences and Engineering Research Council of Canada and CANARIE.

Marc-Alexandre Nolin & Francois Belleau (Bio2RDF)

Leo Chepelev (implementing the services)

Luke McCarthy (SADI technical support)

Mark Wilkinson (vision and leadership)

Chris Baker (lipidomics)

CHEMINF Group
Leo Chepelev
Janna Hastings
Egon Willighagen
Nico Adams