Evaluating scientific hypotheses using the SPARQL Inferencing Notation

Alison Callahan and Michel Dumontier

Department of Biology, Carleton University

ESWC2012::HyQue-SPIN


Evaluating a hypothesis and its claims against experimental data is an essential scientific activity. However, this task is increasingly challenging given the ever-growing volume of publications and data sets. Towards addressing this challenge, we previously developed HyQue, a system for hypothesis formulation and evaluation. HyQue uses domain-specific rulesets to evaluate hypotheses based on well-understood scientific principles. However, because scientists may apply differing scientific premises when exploring a hypothesis, flexibility is required in both crafting and executing rulesets to evaluate hypotheses. Here, we report on an extension of HyQue that incorporates rules specified using the SPARQL Inferencing Notation (SPIN). Hypotheses, background knowledge, queries, results and now rulesets are represented and executed using Semantic Web technologies, enabling users to explicitly trace a hypothesis to its evaluation as Linked Data, including the data and rules used by HyQue. We demonstrate the use of HyQue to evaluate hypotheses concerning the yeast galactose gene system.




Page 4: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

Uncovering all the evidence to support or refute a hypothesis is becoming increasingly difficult and requires a lot of digging around

Page 5: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

Continuous growth in research outputs


Source: http://www.nlm.nih.gov/bsd/stats/cit_added.html

http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=151

Page 6: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

Semantic Web technologies for biological knowledge management and discovery

• Capability to publish, link, retrieve and query de-centralized data

• A powerful integrative platform across data, ontology and services

• Formal knowledge representation allows for automated reasoning

• Massive growth in dataset availability, and soon, in application development

Page 7: Evaluating scientific hypotheses using the SPARQL Inferencing Notation


A rapidly growing web of linked data

“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”

Page 8: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

Bio2RDF covers the major biological databases

Page 9: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

BioPortal gives up-to-date access to bio-ontologies

Page 10: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

Mark Wilkinson, UBC

Michel Dumontier, Carleton University

Christopher Baker, UNB

The Semantic Automated Discovery and Integration (SADI) framework makes it easy to create Semantic Web Services using OWL classes as service inputs and outputs

http://sadiframework.org

~700 bioinformatic services as of May 29, 2012

SADI provides access to Semantic Web Services

Page 11: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

HyQue

HyQue is the Hypothesis query and evaluation system

• A platform for knowledge discovery

• Facilitates hypothesis formulation and evaluation

• Leverages Semantic Web technologies to provide access to facts, expert knowledge and web services

• Conforms to a simplified event-based model

• Supports evaluation against positive and negative findings

• Transparent and reproducible evidence prioritization

• Provenance across all elements of hypothesis testing

– trace a hypothesis to its evaluation, including the data and rules used

Callahan A, Dumontier M, Shah NH. HyQue: evaluating hypotheses using Semantic Web technologies. J Biomed Semantics. 2011 May 17;2 Suppl 2:S3.

Page 12: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

HyQue

• Background knowledge as OWL ontologies

hypotheses (HO), processes/events (GO), measurement values (SIO), units (UO), evidence (ECO), molecules (ChEBI), biopolymers (SO), etc

• Facts as RDF data

model organism data - genes and their chromosomal location, proteins and their functions, localization and participation in interactions, complexes, pathways, biological processes, etc

• Evaluation rules defined using SPIN

Domain-specific rules - scores based on external knowledge

System rules - scores based on hypothesis structure

Callahan A, Dumontier M. Evaluating scientific hypotheses using the SPARQL Inferencing Notation. Extended Semantic Web Conference (ESWC 2012). Heraklion, Crete. May 27-31, 2012.

Page 13: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

HyQue Architecture

Page 14: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

A HyQue hypothesis is a collection of propositions

• proposition: “a statement expressing something true or false”

• HyQue propositions specify events

• complex propositions can be formulated using logical operators (AND, OR, XOR…) or decomposed using component relations

HyQue hypothesis ≡ ‘proposition’ that ‘specifies’ only ‘event’

HyQue hypothesis ≡ ‘proposition’ that ‘has component part’ only (‘proposition’ that ‘specifies’ only ‘event’)

Page 15: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

Event-based data model

HyQue events denote a phenomenon involving two objects: ‘agent’ and ‘target’. In addition, we can specify the context of this event (e.g. located in the nucleus, or under some genetic background).

Event

‘has agent’ agent

‘has target’ target

‘is located in’ location

‘is negated’ boolean


Currently supported events

1. protein-protein binding

2. protein-nucleic acid binding

3. molecular activation

4. molecular inhibition

5. gene induction

6. gene repression

7. transport
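The event model above can be sketched as a small data structure. This is only an illustration in Python, not HyQue's implementation (HyQue stores events as RDF): the class name `Event`, its field names and the `describe` helper are assumptions made for this example, while the list of event types is taken from the slide.

```python
from dataclasses import dataclass
from typing import Optional

# Event types listed on the slide as currently supported.
SUPPORTED_EVENTS = (
    "protein-protein binding",
    "protein-nucleic acid binding",
    "molecular activation",
    "molecular inhibition",
    "gene induction",
    "gene repression",
    "transport",
)


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for a HyQue event: agent, target and context."""
    event_type: str                 # one of SUPPORTED_EVENTS
    agent: str                      # 'has agent', e.g. "Gal4p"
    target: str                     # 'has target', e.g. "GAL1"
    location: Optional[str] = None  # 'is located in'; None when unspecified
    negated: bool = False           # 'is negated'

    def __post_init__(self):
        # Reject event types outside the supported list.
        if self.event_type not in SUPPORTED_EVENTS:
            raise ValueError(f"unsupported event type: {self.event_type}")

    def describe(self) -> str:
        """Return a compact, human-readable rendering of the event."""
        prefix = "NOT " if self.negated else ""
        where = f" in {self.location}" if self.location else ""
        return f"{prefix}{self.event_type}({self.agent} -> {self.target}){where}"


if __name__ == "__main__":
    print(Event("gene induction", "Gal4p", "GAL1", location="nucleus").describe())
```

Keeping negation as a field rather than a separate event type mirrors the ‘is negated’ boolean in the model, so "does not inhibit" statements reuse the inhibition event.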

Page 16: Evaluating scientific hypotheses using the SPARQL Inferencing Notation


Example Hypothesis

• HyQue’s demonstrative knowledge base is focused on galactose metabolism and regulation.

The paper describes a union of hypotheses:

(Gal4p induces the expression of GAL1 AND

Gal4p induces the expression of GAL7 AND

Gal3p induces the expression of GAL2)

OR

(Gal4p induces the expression of GAL7 AND

Gal80p induces the expression of GAL7 AND

Gal80p does not inhibit the activity of Gal4p

WHEN GAL3 is over-expressed)
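The AND/OR structure of the example hypothesis can be written as a nested tree. The sketch below is illustrative only: the tuple encoding, the shortened event labels and the function names are assumptions, and so is the way scores are combined (mean for AND, maximum for OR), which the slides do not specify. The per-event scores passed in are placeholders, not HyQue results.

```python
from statistics import mean

# (operator, children) tuples; leaves are strings naming an event (labels shortened).
HYPOTHESIS = ("OR", [
    ("AND", ["Gal4p induces GAL1", "Gal4p induces GAL7", "Gal3p induces GAL2"]),
    ("AND", ["Gal4p induces GAL7", "Gal80p induces GAL7",
             "Gal80p does not inhibit Gal4p WHEN GAL3 is over-expressed"]),
])


def leaves(node):
    """Yield every event statement in the tree, left to right."""
    if isinstance(node, str):
        yield node
    else:
        _operator, children = node
        for child in children:
            yield from leaves(child)


def combine(node, event_scores, and_fn=mean, or_fn=max):
    """Fold per-event scores up the tree.

    and_fn and or_fn are ASSUMED combinators, swappable by the caller.
    """
    if isinstance(node, str):
        return event_scores[node]
    operator, children = node
    child_scores = [combine(c, event_scores, and_fn, or_fn) for c in children]
    return and_fn(child_scores) if operator == "AND" else or_fn(child_scores)


if __name__ == "__main__":
    placeholder_scores = {leaf: 0.5 for leaf in leaves(HYPOTHESIS)}
    print(combine(HYPOTHESIS, placeholder_scores))
```

Passing the combinators as parameters keeps the tree representation independent of any particular scoring policy.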

Page 17: Evaluating scientific hypotheses using the SPARQL Inferencing Notation


User Interface with auto-completion

http://hyque.semanticscience.org

Users don’t need to know RDF to formulate hypotheses

Page 18: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

Hypothesis RDF Representation

[Diagram: a hypothesis ‘has component part’ proposition; a proposition ‘specifies’ event]

:h rdf:type hyque:Hypothesis ;

hyque:has-component-part :p1 .

:p1 rdf:type hyque:Proposition ;

hyque:specifies :e1 .

:e1 rdf:type hyque:Event .
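To see how the hypothesis, proposition and event are linked, the triples above can be walked with a few lines of plain Python. This is a sketch that treats prefixed names as opaque strings; the helper names `objects` and `events_of` are invented for the example, and a real system would use an RDF library and a SPARQL query instead.

```python
# The Turtle statements above as (subject, predicate, object) tuples.
TRIPLES = [
    (":h", "rdf:type", "hyque:Hypothesis"),
    (":h", "hyque:has-component-part", ":p1"),
    (":p1", "rdf:type", "hyque:Proposition"),
    (":p1", "hyque:specifies", ":e1"),
    (":e1", "rdf:type", "hyque:Event"),
]


def objects(triples, subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def events_of(triples, hypothesis):
    """Follow has-component-part, then specifies, to reach a hypothesis's events."""
    events = []
    for proposition in objects(triples, hypothesis, "hyque:has-component-part"):
        events.extend(objects(triples, proposition, "hyque:specifies"))
    return events


if __name__ == "__main__":
    print(events_of(TRIPLES, ":h"))
```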

Page 19: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

Event RDF representation


event: Gal4p positively regulates the expression of GAL1

:e1 rdf:type hyque:Event ;

# positive regulation of gene expression

rdf:type <http://bio2rdf.org/go:0010628>;

hyque:agent <http://bio2rdf.org/sgd:Gal4p> ;

hyque:target <http://bio2rdf.org/sgd:GAL1> ;

hyque:is_negated "0";

….

Page 20: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

event: Gal4p positively regulates the expression of GAL1

HyQue’s SPIN rules retrieve event data, and then score it and the overall hypothesis

HyQue currently contains 63 SPIN rules to evaluate hypotheses: 18 system rules and 45 domain-specific rules

Page 21: Evaluating scientific hypotheses using the SPARQL Inferencing Notation


Combination of system and domain rules to retrieve and score data, and add new triples

:e1 a go:0010628 ;
    hyque:agent sgd:Gal4p ;
    hyque:target sgd:GAL1 ;
    hyque:is_negated "0" .

Event - induction SPIN induction rule

Page 22: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

SPIN System Rule : Link Hypothesis to Evaluation

CONSTRUCT {
    ?this ‘has attribute’ ?hypothesisEval .
    ?hypothesisEval a ‘evaluation’ .
    ?hypothesisEval ‘obtained from’ ?propositionEval .
    ?hypothesisEval ‘has value’ ?hypothesisEvalScore .
}
WHERE {
    ?this ‘has component part’ ?proposition .
    ?proposition ‘has attribute’ ?propositionEval .
    BIND(:calculateHypothesisScore(?this) AS ?hypothesisEvalScore) .
    BIND(IRI(fn:concat(afn:namespace(?this), afn:localname(?this), "_", "evaluation")) AS ?hypothesisEval) .
}
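What the CONSTRUCT rule above produces can be imitated in plain Python: derive an evaluation IRI from the hypothesis IRI by appending "_evaluation", then emit the linking triples. This is a sketch under assumptions. The namespace/local-name split at the last "#" or "/" only approximates what `afn:namespace` and `afn:localname` do, the `http://example.org/` IRIs are hypothetical, and the score is passed in rather than computed.

```python
def split_iri(iri):
    """Split an IRI at its last '#' or '/' (an approximation, see lead-in)."""
    cut = max(iri.rfind("#"), iri.rfind("/")) + 1
    return iri[:cut], iri[cut:]


def evaluation_iri(hypothesis_iri):
    """Mirror fn:concat(namespace, localname, "_", "evaluation")."""
    namespace, localname = split_iri(hypothesis_iri)
    return namespace + localname + "_" + "evaluation"


def link_evaluation(hypothesis_iri, proposition_eval_iris, score):
    """Build the triples the CONSTRUCT template would add for one hypothesis."""
    evaluation = evaluation_iri(hypothesis_iri)
    triples = [
        (hypothesis_iri, "has attribute", evaluation),
        (evaluation, "a", "evaluation"),
        (evaluation, "has value", score),
    ]
    # One 'obtained from' link per proposition evaluation matched in WHERE.
    triples += [(evaluation, "obtained from", p) for p in proposition_eval_iris]
    return triples


if __name__ == "__main__":
    for triple in link_evaluation("http://example.org/hyque#h1",
                                  ["http://example.org/hyque#p1_evaluation"], 0.8):
        print(triple)
```

Deriving the evaluation IRI from the hypothesis IRI makes the link deterministic, so re-running the rule yields the same node rather than a fresh one.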


Page 23: Evaluating scientific hypotheses using the SPARQL Inferencing Notation
Page 24: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

SPIN Domain Rule: Score experimental evidence of Gene Expression Induction Event

SELECT ?induceEventScore
WHERE {
    BIND (:calculateInduceAgentTypeScore(?arg1) AS ?agentTypeScore) .
    BIND (:calculateInduceAgentFunctionTypeScore(?arg1) AS ?agentFunctionTypeScore) .
    BIND (:calculateInduceTargetTypeScore(?arg1) AS ?targetTypeScore) .
    BIND (:calculateInduceLogicalOperatorScore(?arg1) AS ?logicalOperatorScore) .
    BIND (:calculateInduceEventLocationScore(?arg1) AS ?eventLocationScore) .
    BIND (:penalizeNegation(?arg1) AS ?negationScore) .
    BIND (5 AS ?maxScore) .
    BIND (((?agentTypeScore + ?agentFunctionTypeScore + ?targetTypeScore + ?logicalOperatorScore + ?eventLocationScore + ?negationScore) / ?maxScore) AS ?induceEventScore) .
}


Page 25: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

HyQue domain rules CALCULATE a quantitative measure of evidence for an event

‘induce’ rule (maximum score: 5):

– Is event negated?

• If yes, subtract 2

– Is event of type ‘induce’?

• If yes, add 1; if no, subtract 1

– Is agent of type ‘protein’ or ‘RNA’?

• If yes, add 1; if of type ‘gene’, subtract 1

– Is target of type ‘gene’?

• If yes, add 1; if no, subtract 1

– Does agent have known ‘transcription factor activity’?

• If yes, add 1

– Is event located in the ‘nucleus’?

• If yes, add 1; if no, subtract 1

GO:0010628

CHEBI:36080

SO:0000236

GO:0003700

GO:0005634
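The ‘induce’ rule listed above can be sketched as a plain scoring function. This is an illustration in Python rather than the SPIN rule itself: the class and field names are invented for the example, types are compared as strings instead of ontology terms, and an unknown location is assumed to count as "not in the nucleus". The example inputs are hypothetical and not drawn from HyQue's knowledge base.

```python
from dataclasses import dataclass

MAX_SCORE = 5  # maximum score stated for the 'induce' rule


@dataclass
class InduceEvidence:
    """Hypothetical bundle of the facts the 'induce' rule inspects."""
    is_induce_event: bool
    agent_type: str                      # e.g. "protein", "RNA", "gene"
    target_type: str                     # e.g. "gene"
    agent_is_transcription_factor: bool
    location: str = ""                   # "" when unknown (assumed to score as 'no')
    negated: bool = False


def induce_event_score(evidence):
    """Apply each criterion in turn and normalise by the maximum score."""
    score = 0
    if evidence.negated:
        score -= 2
    score += 1 if evidence.is_induce_event else -1
    if evidence.agent_type in ("protein", "RNA"):
        score += 1
    elif evidence.agent_type == "gene":
        score -= 1
    score += 1 if evidence.target_type == "gene" else -1
    if evidence.agent_is_transcription_factor:
        score += 1
    score += 1 if evidence.location == "nucleus" else -1
    return score / MAX_SCORE


if __name__ == "__main__":
    full = InduceEvidence(True, "protein", "gene", True, "nucleus")
    print(induce_event_score(full))
```

With every criterion satisfied the function returns the maximum of 5/5; each missing or contradicting fact lowers the normalised score by one or two fifths.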

Page 26: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

SPIN rule, outcome and score for a GAL gene induction event


4/5 = 0.80

Page 27: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

Rules can be customized to draw on more evidence, but at a cost if that evidence is not found

• calculateInhibitEventScore does not take into account the physical location of the event participants

• Experimental evidence suggests that physical location in the context of an inhibition event is important

• Inhibition of Gal4p activity by Gal80p is known to take place in the nucleus, yet this inhibition is interrupted when Gal80p is bound by Gal3p, which is typically found in the cytoplasm

Adding a new rule to consider location weakens the event due to lack of data (score 0.87 -> 0.78)


(Gal4p induces the expression of GAL1 e1 AND Gal3p induces the expression of GAL2 e2 AND Gal4p induces the expression of GAL7)

OR

(Gal4p induces the expression of GAL7 AND Gal80p induces the expression of GAL7 AND Gal80p does not inhibit the activity of Gal4p

WHEN GAL3 is over-expressed)

Page 28: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

Customization of rules and rulesets can generate different evidence-based evaluations

Page 29: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

Reproducible eScience LOD for Hypothesis, Rules, Data and Evaluation

Page 30: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

Summary

• HyQue is a system that facilitates the formulation and evaluation of scientific hypotheses against formalized knowledge on the Semantic Web.

• This work focused on the development and incorporation of recursive SPIN rules to obtain and score events and multi-event hypotheses using OWL ontologies and RDF-based LOD.


Page 31: Evaluating scientific hypotheses using the SPARQL Inferencing Notation

Future Directions

• Collaborative, end user-centered environment to engineer, share, compare and evaluate hypotheses

• Investigate alternative scoring systems

• Structure knowledge beyond the GAL network

– EU/US Collaborations on disease-centered research hypotheses

– Applications for clinical decision support


Page 32: Evaluating scientific hypotheses using the SPARQL Inferencing Notation


Website: http://dumontierlab.com

Presentations: http://slideshare.com/micheldumontier