ChEBI User Group Meeting: June 24, 2010
We’re all SMILES! Building Chemical Semantic Web Services
with SADI, ChEBI, and CHEMINF
Michel Dumontier, Ph.D.
Associate Professor of Bioinformatics
Carleton University
Department of Biology
School of Computer Science
Institute of Biochemistry
Ottawa Institute of Systems Biology
Ottawa-Carleton Institute of Biomedical Engineering
Syntactic Web… It takes a lot of digging to get answers
Surface web: 167 terabytes
Deep web: 91,000 terabytes
545-to-one
We need to get to the deep web and tap into the global web of structured knowledge
The Semantic Web is the new global web of knowledge
It is about standards for publishing, sharing and querying knowledge drawn from diverse sources
It makes it possible to answer sophisticated questions using background knowledge
Goals
• Provision chemical data on the Web
• Find cheminformatic services that will consume the data
• Answer questions about chemicals by reasoning over essential chemical knowledge
Is caffeine a drug-like molecule?
Lipinski Rule of Five
• Rule of thumb for druglikeness (orally active in humans)
(4 rules with multiples of 5)
– Less than 500 daltons
– Less than 5 hydrogen bond donors
– Less than 10 hydrogen bond acceptors
– A partition coefficient value between -5 and 5
• We need a more formal (machine understandable) description
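As a first step toward a machine-understandable description, the four thresholds can be written as a plain predicate. This is a minimal sketch, not the OWL formalization the talk builds toward: the `Descriptors` class and its field names are hypothetical, and the thresholds follow the slide's wording. The caffeine values are illustrative (mass from the formula C8H10N4O2, acceptors counted as N and O atoms, logP rounded from the service response shown later in the talk).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Descriptor values for one molecule (hypothetical field names)."""
    mass_da: float   # molecular mass in daltons
    hbd_count: int   # hydrogen bond donor count
    hba_count: int   # hydrogen bond acceptor count
    logp: float      # partition coefficient (log P)


def is_lipinski_druglike(d: Descriptors) -> bool:
    """Apply the four thresholds as worded on the slide."""
    return (
        d.mass_da < 500
        and d.hbd_count < 5
        and d.hba_count < 10
        and -5 < d.logp < 5
    )


# Illustrative values for caffeine; see the lead-in for where they come from.
caffeine = Descriptors(mass_da=194.19, hbd_count=0, hba_count=6, logp=-0.4311)
print(is_lipinski_druglike(caffeine))  # True
```

The hard part is not the comparison but obtaining the four descriptor values, which is what the services described in the rest of the talk provide.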
Formal Ontology as a Strategy
The Web Ontology Language (OWL) Has Explicit Semantics
Can therefore be used to capture knowledge in a machine understandable way
Lipinski Rule of Five
• Empirically derived ruleset for druglikeness
(4 rules with multiples of 5)
– Less than 500 daltons
– Less than 5 hydrogen bond donors
– Less than 10 hydrogen bond acceptors
– A partition coefficient value between -5 and 5
• A formal description using OWL:
To calculate these attributes, we need access to a computable representation
of the molecular structure
ball & stick model for caffeine
The chemical graph specifies the type and connectivity of atoms in molecules. It describes
a part of chemical structure
SMILES strings are common representations of the chemical graph
ball & stick model for caffeine
SMILES string for caffeine
Cn1cnc2n(C)c(=O)n(C)c(=O)c12
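To show that the SMILES string really does carry the atoms of the chemical graph, the sketch below counts heavy atoms with a regular expression. It is an illustration only: it handles just the unbracketed "organic subset" symbols and ignores bonds, rings, and hydrogens, so it is no substitute for a real parser such as the CDK used later in the talk.

```python
import re
from collections import Counter

# Matches atoms of the SMILES organic subset only; bracket atoms such as
# [Na+] or [nH] are deliberately out of scope for this sketch.
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_counts(smiles: str) -> Counter:
    """Count heavy atoms per element, folding aromatic (lowercase) symbols
    into their element."""
    return Counter(symbol.capitalize() for symbol in ATOM_PATTERN.findall(smiles))


caffeine = "Cn1cnc2n(C)c(=O)n(C)c(=O)c12"
counts = heavy_atom_counts(caffeine)
print(dict(counts))  # {'C': 8, 'N': 4, 'O': 2}
```

The counts agree with the heavy atoms of caffeine's molecular formula, C8H10N4O2.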
Chemical descriptors
• Chemical descriptors are data (quantities or values) that provide information about substances, molecular entities, and their parts (rings, atoms, bonds, etc).
• They may enumerate material parts, or quantify or describe qualities, functions or dispositions
• Often used to build Quantitative Structure-Activity Relationship (QSAR) models
• Example descriptors:
– Mass values
– Partition coefficients
– Heats of formation
– Aromaticity values
– Molecular formulas
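A mass value is one of the simplest descriptors to compute. The sketch below derives it from a molecular formula, assuming a flat formula without parentheses or charges and a small hypothetical lookup table of abridged standard atomic weights covering only the elements needed here.

```python
import re

# Abridged standard atomic weights, only for the elements used in this example.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_mass(formula: str) -> float:
    """Sum atomic weights over a flat formula such as 'C8H10N4O2'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # A missing count means a single atom of that element.
        total += ATOMIC_WEIGHT[symbol] * (int(count) if count else 1)
    return total


print(round(molecular_mass("C8H10N4O2"), 2))  # 194.19
```

Caffeine comes out at about 194.19 daltons, well under the 500 dalton threshold of the Rule of Five.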
The Chemical Information Ontology (CHEMINF)
• 100 chemical descriptors
• 50 chemical qualities
• Relates descriptors to their specifications, the software that generated them (along with the running parameters, and the algorithms that they implement)
• Contributors: Nico Adams, Leonid Chepelev, Michel Dumontier, Janna Hastings, Egon Willighagen, Peter Murray-Rust, Christoph Steinbeck
http://semanticchemistry.googlecode.com
CHEMINF provides the vocabulary to define an input (SMILES-annotated molecule) and an output
(molecule annotated with a descriptor)
Ultimately, the goal is to use an OWL reasoner to reason about the attributes to determine
whether the compound is drug-like
Semantic Automated Discovery and Integration
http://sadiframework.org
Mark Wilkinson, UBC
Michel Dumontier, Carleton University
Christopher Baker, UNB
SADI is a framework to create Semantic Web services using OWL classes as service inputs and outputs
SADI
• OWL classes in SADI are local to individual services
– They should uniquely specify the service inputs and outputs (with exactly the right restrictions)
– One service’s world-view can conflict with another’s, but a client can use any or all
• Maximize interoperability by reusing types and relations
Create code stubs using the ontology
• Publish the ontology to a web-accessible location
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl
• Make sure that the class names are resolvable (easy when using the hash notation)
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#smiles-molecule
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#logp-molecule
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#hbdc-molecule
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#hdba-molecule
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#lipinksi-druglike-molecule
• Download/checkout the code
http://sadiframework.org
• Run the code generator – specify the URIs that correspond to input and output types
Implement the functionality
• Java version
– Uses Jena to manipulate the RDF graph
– Uses Maven to build from command-line or Eclipse; invokes Jetty for service testing
• Chemistry
– We used the Chemistry Development Kit (CDK) to implement 4 services
Working with the service (GET)
• Responds to a GET by providing the service description in RDF
– conforms to Feta (BioMoby, myGrid)
curl http://cbrass.biordf.net/logpdc/logpc

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:j.0="http://www.mygrid.org.uk/mygrid-moby-service#">
  <rdf:Description rdf:about="">
    <j.0:hasServiceDescriptionText>no description</j.0:hasServiceDescriptionText>
    <j.0:hasServiceNameText rdf:datatype="http://www.w3.org/2001/XMLSchema#string">logpc</j.0:hasServiceNameText>
    <j.0:hasOperation rdf:resource="#operation"/>
    <rdf:type rdf:resource="http://www.mygrid.org.uk/mygrid-moby-service#serviceDescription"/>
  </rdf:Description>
  <rdf:Description rdf:about="#input">
    <j.0:objectType rdf:resource="http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#smilesmolecule"/>
    <rdf:type rdf:resource="http://www.mygrid.org.uk/mygrid-moby-service#parameter"/>
  </rdf:Description>
  <rdf:Description rdf:about="#operation">
    <j.0:outputParameter rdf:resource="#output"/>
    <j.0:inputParameter rdf:resource="#input"/>
    <rdf:type rdf:resource="http://www.mygrid.org.uk/mygrid-moby-service#operation"/>
  </rdf:Description>
  <rdf:Description rdf:about="#output">
    <j.0:objectType rdf:resource="http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#alogpsmilesmolecule"/>
    <rdf:type rdf:resource="http://www.mygrid.org.uk/mygrid-moby-service#parameter"/>
  </rdf:Description>
</rdf:RDF>
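A client mostly cares about two facts in this description: which OWL class the service consumes and which it produces. The sketch below pulls both out of an abridged copy of the description above using only the Python standard library. The function name `parameter_types` is hypothetical, and a real client would more likely use an RDF library than raw XML parsing.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MYGRID = "http://www.mygrid.org.uk/mygrid-moby-service#"
ONT = "http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#"

# Abridged copy of the description returned by the GET request above.
SERVICE_DESCRIPTION = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:j.0="{MYGRID}">
  <rdf:Description rdf:about="#input">
    <j.0:objectType rdf:resource="{ONT}smilesmolecule"/>
  </rdf:Description>
  <rdf:Description rdf:about="#output">
    <j.0:objectType rdf:resource="{ONT}alogpsmilesmolecule"/>
  </rdf:Description>
</rdf:RDF>
"""


def parameter_types(rdf_xml: str) -> dict:
    """Map each described node (e.g. '#input') to the class URI named by
    its objectType property."""
    root = ET.fromstring(rdf_xml)
    types = {}
    for description in root.findall(f"{{{RDF}}}Description"):
        object_type = description.find(f"{{{MYGRID}}}objectType")
        if object_type is not None:
            about = description.get(f"{{{RDF}}}about")
            types[about] = object_type.get(f"{{{RDF}}}resource")
    return types


types = parameter_types(SERVICE_DESCRIPTION)
print(types["#input"])
print(types["#output"])
```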
Working with the service (POST)
• Responds to a POST with service output (process an input file)
Input file (caffeine.rdf):

<rdf:RDF
    xmlns="http://semanticscience.org/sadi/ontology/caffeine.rdf#"
    xmlns:so="http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:sio="http://semanticscience.org/resource/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <so:smilesmolecule rdf:about="http://semanticscience.org/sadi/ontology/caffeine.rdf#m">
    <sio:SIO_000008 rdf:resource="http://semanticscience.org/sadi/ontology/caffeine.rdf#msmiles"/>
  </so:smilesmolecule>
  <sio:CHEMINF_000018 rdf:about="http://semanticscience.org/sadi/ontology/caffeine.rdf#msmiles">
    <sio:SIO_000300 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Cn1cnc2n(C)c(=O)n(C)c(=O)c12</sio:SIO_000300>
  </sio:CHEMINF_000018>
</rdf:RDF>

Request:

curl --data @caffeine.rdf http://cbrass.biordf.net/logpdc/logpc

Response (excerpt):

<rdf:Description rdf:about="http://semanticscience.org/sadi/ontology/caffeine.rdf#mdalogp">
  <rdf:type rdf:resource="http://semanticscience.org/resource/CHEMINF_000251"/>
  <j.0:SIO_000300 rdf:datatype="http://www.w3.org/2001/XMLSchema#double">-0.4311000000000006</j.0:SIO_000300>
</rdf:Description>
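The same exchange can be sketched from a script. The code below prepares the POST without sending it and reads the numeric value out of a response. Two things are assumptions rather than facts from the slides: the `application/rdf+xml` media type, and the binding of the `j.0` prefix to the SIO namespace, which the excerpt does not show. The helper names are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

SERVICE_URL = "http://cbrass.biordf.net/logpdc/logpc"
SIO = "http://semanticscience.org/resource/"


def build_request(rdf_xml: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST carrying the input RDF document."""
    return urllib.request.Request(
        SERVICE_URL,
        data=rdf_xml.encode("utf-8"),
        # Assumption: the service accepts RDF/XML under this media type.
        headers={"Content-Type": "application/rdf+xml"},
    )


def extract_values(rdf_xml: str) -> list:
    """Collect every SIO_000300 literal in a response as a float."""
    root = ET.fromstring(rdf_xml)
    return [float(node.text) for node in root.iter(f"{{{SIO}}}SIO_000300")]


# Response excerpt from the slide, wrapped in an rdf:RDF element.
# Assumption: the j.0 prefix is bound to the SIO namespace.
RESPONSE = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:j.0="http://semanticscience.org/resource/">
  <rdf:Description rdf:about="http://semanticscience.org/sadi/ontology/caffeine.rdf#mdalogp">
    <rdf:type rdf:resource="http://semanticscience.org/resource/CHEMINF_000251"/>
    <j.0:SIO_000300 rdf:datatype="http://www.w3.org/2001/XMLSchema#double">-0.4311000000000006</j.0:SIO_000300>
  </rdf:Description>
</rdf:RDF>
"""

request = build_request(
    "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'/>"
)
print(request.get_method())  # POST
print(extract_values(RESPONSE))
```

Passing the prepared request to `urllib.request.urlopen` would perform the call; it is left out here so the sketch runs without network access.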
Publish and Register the service
http://sadiframework.org/registry
Now what?
Semantic Health and Research Environment
SHARE is an application that executes SPARQL queries as workflows over SADI services
“Reckoning”
dynamic discovery of instances of OWL classes through synthesis and invocation of a Web Service workflow capable of generating data described by the OWL class restrictions, followed by reasoning to classify the data
into that ontology
SPARQL is the new cool kid on the query block
SQL SPARQL
SHARE
• SPARQL engine
– triple patterns are matched against service descriptions
– knowledge base is dynamically populated
– queries can contain OWL classes, which are expanded to the required triple patterns
– query is optimized to minimize the number of service calls and the amount of data sent over the network
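The matching step can be pictured as a lookup from predicates to services. The sketch below is a toy model under stated assumptions, not SHARE's actual algorithm: the registry contents, predicate names, and the `plan_calls` function are all hypothetical. It selects one service per predicate that the knowledge base cannot yet answer and skips predicates whose data is already present.

```python
from collections import defaultdict

# Hypothetical registry: each service advertises the predicates it can
# attach to its input.
REGISTRY = {
    "logpc": ["hasLogP"],
    "hbdc": ["hasHBondDonorCount"],
    "hbac": ["hasHBondAcceptorCount"],
}


def index_by_predicate(registry):
    """Invert the registry so services can be found by predicate."""
    index = defaultdict(list)
    for service, predicates in registry.items():
        for predicate in predicates:
            index[predicate].append(service)
    return index


def plan_calls(triple_patterns, index, known_predicates):
    """Return the services to invoke, in query order, skipping predicates
    whose data is already in the knowledge base."""
    plan = []
    for _subject, predicate, _object in triple_patterns:
        if predicate in known_predicates:
            continue
        for service in index.get(predicate, []):
            if service not in plan:
                plan.append(service)
    return plan


query = [
    ("?m", "hasSMILES", "?s"),
    ("?m", "hasLogP", "?lp"),
    ("?m", "hasHBondDonorCount", "?d"),
]
plan = plan_calls(query, index_by_predicate(REGISTRY), known_predicates={"hasSMILES"})
print(plan)  # ['logpc', 'hbdc']
```

Only the two services whose predicates appear in the query are selected, which is the sense in which the number of service calls is kept small.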
ChEBI has data!
Bio2RDF provides ChEBI in RDF
Bio2RDF now serving over 40 billion triples of linked biological data
An increasing amount of machine understandable chemical data
Dataset     Source
ChEBI       Bio2RDF
PubChem     Bio2RDF
DrugBank    Bio2RDF
KEGG        Bio2RDF
PDB         Bio2RDF
PharmGKB    Bio2RDF
CTD         Bio2RDF
TCM         LODD
Medicare    LODD
SIDER       LODD
ChEMBL      LODD
DailyMed    LODD
Query for log p
Query: Is caffeine a drug-like molecule?
SADI
• Describe the input and output using OWL-DL classes
• Subject of input and output must be the same
• Web services indexed by predicates
• Biocatalogue will list SADI-compliant services
• Taverna plugin to work with SADI services
• Protégé 4.1 plugin to create SADI services
• Simplified migration path for existing web services (java, perl)
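The requirement that input and output share a subject means a service only ever decorates the node it was given. A minimal sketch of that check follows, assuming triples are modelled as plain tuples. The function name and the output triples are illustrative rather than taken from a real service response, apart from the URIs and the value that appear in the caffeine example.

```python
def describes_same_subject(input_node, output_triples):
    """True when the output graph attaches at least one statement to the
    node that was supplied as input."""
    return any(subject == input_node for subject, _p, _o in output_triples)


BASE = "http://semanticscience.org/sadi/ontology/caffeine.rdf#"
molecule = BASE + "m"

# Illustrative output: the input molecule gains an attribute whose value is
# the computed logP.
output = [
    (molecule, "sio:SIO_000008", BASE + "mdalogp"),
    (BASE + "mdalogp", "rdf:type", "sio:CHEMINF_000251"),
    (BASE + "mdalogp", "sio:SIO_000300", -0.4311000000000006),
]

print(describes_same_subject(molecule, output))  # True
```

Because every service extends the same node, a client can merge the outputs of several services into one description of the molecule.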
Benefits
• Data remains distributed – no warehouse!
• Data is not “exposed” as a SPARQL endpoint
– greater provider-control over computational resources
• Yet data appears to be a SPARQL endpoint… no modification of SPARQL or reasoner required.
Join Us!
• SADI and CardioSHARE are Open Source
• Come join us – we’re having a lot of fun!!
http://sadiframework.org
Acknowledgements
This research is supported by The Heart + Stroke Foundation of BC and Yukon, Microsoft Research, The Canadian Institutes of Health Research, The Natural Sciences and Engineering Research Council of Canada and CANARIE.
Leonid Chepelev (implementing the services)
Luke McCarthy (technical support)
Mark Wilkinson (vision and leadership)
CHEMINF Group
Janna Hastings
Nico Adams
Egon Willighagen