A presentation for the March 2012 Protege Short Course http://protege.stanford.edu/shortcourse/protege-owl/201203/index.html
Real World Applications of OWL
Michel Dumontier, Ph.D.
Associate Professor of Bioinformatics
Department of Biology, School of Computer Science, Institute of Biochemistry, Carleton University
Ottawa Institute of Systems Biology
Ottawa-Carleton Institute of Biomedical Engineering
Professeur Associé, Université Laval
Visiting Associate Professor, Stanford University
Protege Short Course::Dumontier:March 2012
Ontologies in Use
• Knowledge Capture (RightField)
• Formalization and Verification (SNOMED-CT)
• Consistency Checking (SBML Harvester)
• Classification (Phosphatases, Compounds)
• Semantic Annotation (ArrayExpress / Gene Expression Atlas, Semantic Assistant)
• Query Formulation (ArrayExpress / Gene Expression Atlas)
• Query Answering (KUPD)
• Search & co-occurrence (GoPubMed)
• Semantic Assistant
• Hypothesis Testing (HyQue)
• Disease Similarity and Model Organism prediction (phenomeBLAST)
• Function Prediction (GeneMANIA)
Knowledge Capture: RightField
K.Wolstencroft, S.Owen, M.Horridge, O.Krebs, W.Mueller, JL. Snoep, F.Preez, C.Goble RightField: Embedding ontology annotation in spreadsheets. Bioinformatics (2011), May 2011
Formalization: SNOMED-CT
• SNOMED-CT (Clinical Terms) ontology
• used in healthcare systems of more than 15 countries, including Australia, Canada, Denmark, Spain, Sweden and the UK
• also used by major US providers, e.g., Kaiser Permanente
• ontology provides common vocabulary for recording clinical data
• 395036 classes
SNOMED-CT
• Pattern-based knowledge capture
• Needs training and an information system to implement
SNOMED - verification
• Kaiser Permanente extending SNOMED to express, e.g.:
  – non-viral pneumonia (negation)
  – infectious pneumonia is caused by a virus or a bacterium (disjunction)
  – double pneumonia occurs in two lungs (cardinalities)
• This is easy in SNOMED-OWL
  – but reasoner failed to find expected subsumptions, e.g., that bacterial pneumonia is a kind of non-viral pneumonia
• Ontology highly under-constrained: need to add disjointness axioms (at least)
  – virus and bacterium must be disjoint
- Ian Horrocks OWL2 tutorial
SNOMED
• Adding disjointness led to surprising results
  – many classes become inconsistent, e.g., percutaneous embolization of hepatic artery using fluoroscopy guidance
• Cause of inconsistencies identified as class groin
  – groin asserted to be subclass of both abdomen and leg
  – abdomen and leg are disjoint
  – modelling of groin (and other similar “junction” regions) identified as incorrect
- Ian Horrocks OWL2 tutorial
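The groin diagnosis can be sketched in a few lines of Python: a class becomes unsatisfiable as soon as two of its superclasses are declared disjoint, and every subclass inherits the problem. This is only an illustration of the idea over a plain subclass hierarchy, not how an OWL reasoner works internally. The dictionary layout, the function names, `body_region` and the subclass `groin_skin` are assumptions introduced for the example.

```python
from itertools import combinations

# Asserted subclass links (child -> direct parents), following the slide.
SUBCLASS_OF = {
    "groin": {"abdomen", "leg"},
    "groin_skin": {"groin"},        # hypothetical class, for illustration only
    "abdomen": {"body_region"},     # "body_region" is also an assumption
    "leg": {"body_region"},
}

# Disjointness axioms added to the ontology.
DISJOINT = {frozenset({"abdomen", "leg"})}


def ancestors(cls):
    """Return every superclass reachable from cls (cls itself included)."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def unsatisfiable_classes():
    """Classes that end up below two classes declared disjoint."""
    bad = set()
    for cls in SUBCLASS_OF:
        for a, b in combinations(sorted(ancestors(cls)), 2):
            if frozenset({a, b}) in DISJOINT:
                bad.add(cls)
    return bad
```

With the data above, `unsatisfiable_classes()` flags `groin` and its hypothetical subclass, while `abdomen` and `leg` stay satisfiable.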
Consistency Checking: Formalization of SBML annotations into OWL ontologies
• Biomodels contains hundreds of quantitative models
• SBML is an XML-based format for specifying models and their parameters
• Models and their components are being semantically annotated
• Use the ontologies to validate the assertions
Integrating systems biology models and biomedical ontologies. Hoehndorf R, Dumontier M, Gennari JH, Wimalaratne S, de Bono B, Cook DL, Gkoutos GV. BMC Syst Biol. 2011 Aug 11;5:124.
Additional annotations are specified using the Resource Description Framework (RDF)

<species metaid="_525530" id="GLCi" compartment="cyto"
         initialConcentration="0.097652231064563">
  <annotation>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:dcterms="http://purl.org/dc/terms/"
             xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#"
             xmlns:bqbiol="http://biomodels.net/biology-qualifiers/"
             xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
      <rdf:Description rdf:about="#_525530">
        <bqbiol:is>
          <rdf:Bag>
            <rdf:li rdf:resource="urn:miriam:obo.chebi:CHEBI%3A4167"/>
            <rdf:li rdf:resource="urn:miriam:kegg.compound:C00031"/>
          </rdf:Bag>
        </bqbiol:is>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>

[Slide callouts: subject, predicate, object; the annotation element stores the RDF; implicit subject and XML attributes]

The intent is to express that the species represents a substance composed of glucose molecules. We also know from the SBML model that this substance is located in the cytosol with an (initial) concentration of 0.09765 M.
OWL Axiom: M SubClassOf: represents some MaterialEntity
Conversion rule: a Model M annotated with class C represents:
• If C is a SubClassOf MaterialEntity then M SubClassOf: represents some C
• If C is a SubClassOf Function then M SubClassOf: represents some (has-function some C)
• If C is a SubClassOf Process then M SubClassOf: represents some (has-function some (realized-by only C))
For each model annotation, we make a commitment to what it represents
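The conversion rule is a three-way case split on the upper-level type of the annotation class, which can be sketched as a small Python function that emits the corresponding axiom as a string. This is a minimal sketch, not the SBML Harvester implementation: the `UPPER_LEVEL` lookup table, its three example classes and the function name are assumptions made for illustration.

```python
# Toy upper-level typing of annotation classes. The three entries are
# illustrative assumptions, not taken from the actual ontologies.
UPPER_LEVEL = {
    "Glucose": "MaterialEntity",
    "CatalyticActivity": "Function",
    "Glycolysis": "Process",
}


def conversion_axiom(model, annotation):
    """Return the axiom (as text) for a model annotated with a class."""
    kind = UPPER_LEVEL[annotation]
    if kind == "MaterialEntity":
        filler = annotation
    elif kind == "Function":
        filler = f"(has-function some {annotation})"
    elif kind == "Process":
        filler = f"(has-function some (realized-by only {annotation}))"
    else:
        raise ValueError(f"no conversion rule for {kind}")
    return f"{model} SubClassOf: represents some {filler}"
```

For example, `conversion_axiom("M", "Glycolysis")` yields the nested expression from the third rule, so the commitment made for each annotation is explicit in the generated axiom.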
Model verification
After reasoning, we found 27 models to be inconsistent. Reasons:
1. Our representation: functions are sometimes found in the place of physical entities (e.g. entities that secrete insulin). Better to constrain with appropriate relations.
2. SBML abused: e.g. species used as a measure of time
3. Incorrect annotations: constraints in the ontologies themselves mean that the annotation is simply not possible
Finding inconsistencies with axiomatically enhanced ontologies
ATPase activity (GO:0004002) is a Catalytic activity that has Water and ATP as input, ADP and phosphate as output and is a part of an ATP catabolic process.
To this, we add:
• GO: ATP + Water are the only inputs (universal quantification)
• ChEBI: Water, ATP, alpha-D-glucose 6-phosphate are all different (disjointness)
• “ATP” is an input to the “ATPase” reaction, which is annotated with ATPase activity. The species “ATP”, however, is mis-annotated with Alpha-D-glucose 6-phosphate (CHEBI:17665), not with ATP.
• Unsatisfiable -> curation error in BIOMD0000000176 and BIOMD0000000177 models of anaerobic glycolysis in yeast.
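The way closure plus disjointness exposes the curation error can be sketched as a simple check in Python. It assumes every allowed input is also in the pairwise-disjoint set, so any other class from that set cannot be an input. The function name, the dictionary layout and the species id `H2O` are assumptions for illustration, and the check only approximates what a reasoner derives from the axioms.

```python
# Inputs allowed by the closed (universally quantified) GO definition.
ALLOWED_INPUTS = {"ATP", "Water"}

# ChEBI classes stated to be pairwise different (disjoint).
PAIRWISE_DISJOINT = {"Water", "ATP", "alpha-D-glucose 6-phosphate"}


def input_conflicts(species_annotations):
    """Return species whose annotation cannot be an input of the reaction.

    species_annotations maps a species id to its ChEBI annotation.
    A species conflicts when its class is disjoint from every allowed input.
    """
    conflicts = []
    for species, chebi_class in species_annotations.items():
        if chebi_class in ALLOWED_INPUTS:
            continue
        if chebi_class in PAIRWISE_DISJOINT:
            conflicts.append(species)
    return conflicts
```

Calling `input_conflicts({"ATP": "alpha-D-glucose 6-phosphate", "H2O": "Water"})` reports the species `ATP`, mirroring the mis-annotation described above.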
Classification: Phosphatases
• Bioinformaticians use tools to identify functional domains (e.g., InterProScan)
• Tools simply show the presence of domains - they do not classify proteins
• Experts classify proteins according to domain arrangements - the presence and number of each domain is important
PhosphaBase: an ontology-driven database resource for protein phosphatases. Wolstencroft KJ, Stevens R, Tabernero L, Brass A. Proteins. 2005 Feb 1;58(2):290-4.
Phosphatase Functional Domains
Defining Protein Phosphatases
• Necessary and sufficient conditions are stipulated using EquivalentClass axioms
• A protein phosphatase is exactly a protein that has exactly one transmembrane domain and at least one phosphatase catalytic domain
ProteinPhosphatase EquivalentTo: Protein
  AND hasDomain exactly 1 TransmembraneDomain
  AND hasDomain min 1 PhosphataseCatalyticDomain
More precise class expressions can be formulated for subtypes
Inclusion of universal quantifier now restricts the domains to only the types listed
R2A EquivalentTo: Protein
  AND hasDomain exactly 2 ProteinTyrosinePhosphataseDomain
  AND hasDomain exactly 1 TransmembraneDomain
  AND hasDomain exactly 4 FibronectinDomain
  AND hasDomain exactly 1 ImmunoglobulinDomain
  AND hasDomain exactly 1 MAMDomain
  AND hasDomain exactly 1 Cadherin-LikeDomain
  AND hasDomain only (ProteinTyrosinePhosphataseDomain OR TransmembraneDomain OR FibronectinDomain OR ImmunoglobulinDomain OR Cadherin-LikeDomain OR MAMDomain)
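The effect of exact cardinalities combined with the closing universal restriction can be sketched in Python by comparing domain counts. This is a closed-world approximation for illustration only, not the OWL semantics a reasoner applies. The domain names are normalized spellings of those in the class expression above, and `SH2Domain` in the usage note is a hypothetical extra domain.

```python
from collections import Counter

# Exact domain counts required by the R2A definition above.
R2A_DOMAINS = Counter({
    "ProteinTyrosinePhosphataseDomain": 2,
    "TransmembraneDomain": 1,
    "FibronectinDomain": 4,
    "ImmunoglobulinDomain": 1,
    "MAMDomain": 1,
    "Cadherin-LikeDomain": 1,
})


def is_r2a(domains):
    """True when the domain list matches the R2A counts and nothing else."""
    # Counter equality checks the exact cardinalities and, like the
    # 'only' closure axiom, rejects any domain type that is not listed.
    return Counter(domains) == R2A_DOMAINS
```

A protein with exactly the listed domains passes. Adding an unlisted domain such as `SH2Domain`, or dropping one fibronectin domain, makes `is_r2a` return `False`, which is what the closure and the cardinalities are meant to enforce.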
[Figure: ethanol, with its methyl group and hydroxyl group labelled]
Knowledge of functional groups is important in chemical synthesis, pharmaceutical design and lead optimization.
Functional groups describe chemical reactivity in terms of atoms and their connectivity, and exhibit characteristic chemical behavior when present in a compound.
Describing chemical functional groups in OWL-DL for the classification of chemical compounds. N Villanueva-Rosales, M Dumontier. 2007. OWLED, Innsbruck, Austria.
Describing Functional Groups in DL
HydroxylGroup: CarbonGroup that (hasSingleBondWith some (OxygenAtom that hasSingleBondWith some HydrogenAtom))
[Figure: R-OH, where R is the R group]
Fully Classified Ontology
35 functional groups (FG)
And, we define certain compounds
Alcohol: OrganicCompound that (hasPart some HydroxylGroup)
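The two definitions can be mimicked on a small molecular graph in Python: a hydroxyl group is a carbon single-bonded to an oxygen that is single-bonded to a hydrogen, and an alcohol is a compound containing one. The graph encoding and function names are assumptions for illustration. The sketch also assumes the compound is already known to be organic and omits most hydrogens.

```python
# Ethanol as a tiny molecular graph: atom id -> element, plus single bonds.
ETHANOL_ATOMS = {"C1": "C", "C2": "C", "O1": "O", "H1": "H"}  # other H omitted
ETHANOL_BONDS = {("C1", "C2"), ("C2", "O1"), ("O1", "H1")}


def neighbours(atom, bonds):
    """Atoms joined to `atom` by a single bond."""
    return {b for a, b in bonds if a == atom} | {a for a, b in bonds if b == atom}


def has_hydroxyl_group(atoms, bonds):
    """Carbon single-bonded to an oxygen that is single-bonded to a hydrogen."""
    for atom, element in atoms.items():
        if element != "C":
            continue
        for o in neighbours(atom, bonds):
            if atoms[o] == "O" and any(atoms[h] == "H" for h in neighbours(o, bonds)):
                return True
    return False


def is_alcohol(atoms, bonds):
    """Alcohol: a compound with at least one hydroxyl group as a part."""
    return has_hydroxyl_group(atoms, bonds)
```

`is_alcohol(ETHANOL_ATOMS, ETHANOL_BONDS)` is true, whereas an ether-like graph C-O-C has no O-H bond and is rejected.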
Organic Compound Ontology
28 organic compounds (OC)
Question Answering: Classes as self-contained queries
• Query PubChem, DrugBank and dbPedia
Querying Kidney and Urinary Knowledge Base and Ontology
[Figure: RDF graph combining the KUPO Ontology, Entrez Gene and the Higgins Dataset. Node labels: Gene X, Gene Y, GO:0054426, MA:00345, MA:000345, MA:00456, kupo:002444, kupo:004672, PT epithelial cell, DT epithelial cell, Proximal tubule, Distal tubule. Edge labels: go:biological_process, kupo:expressed_in, ro:part_of, rdfs:label.]
Query: Which genes involved in protein transport are expressed in proximal tubule epithelial cells?
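A query of this kind is a conjunction of two graph patterns: the gene is annotated with the process, and the gene is expressed in the cell type. The Python sketch below evaluates it over a handful of toy triples. All triples, the plain-text labels used in place of identifiers and the function names are assumptions made for illustration. They are not KUPKB content and do not reproduce the exact links in the figure.

```python
# Toy triples loosely modelled on the figure. Every link here is an
# illustrative assumption, not actual knowledge-base content.
TRIPLES = {
    ("GeneX", "go:biological_process", "protein transport"),
    ("GeneX", "kupo:expressed_in", "PT epithelial cell"),
    ("GeneY", "go:biological_process", "protein transport"),
    ("GeneY", "kupo:expressed_in", "DT epithelial cell"),
    ("PT epithelial cell", "ro:part_of", "Proximal tubule"),
    ("DT epithelial cell", "ro:part_of", "Distal tubule"),
}


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the graph."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def genes_for(process, cell_type):
    """Genes annotated with `process` and expressed in `cell_type`."""
    annotated = subjects("go:biological_process", process)
    expressed = subjects("kupo:expressed_in", cell_type)
    return annotated & expressed
```

With these toy triples, `genes_for("protein transport", "PT epithelial cell")` returns only `GeneX`, because `GeneY` is expressed in a different cell type.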
Semantic Annotation and Query
[Figure: ArrayExpress acquires data from AE/GEO (>250,000 assays, >10,000 experiments); after curation, the data are re-annotated and summarized in the ATLAS.]
Ontologically Modeling Sample Variables in Gene Expression Data
ontology-based data exploration
Query for Cell adhesion genes in all ‘organism parts’
‘View on EFO’
Ontology-based query expansion for ArrayExpress Archive @ www.ebi.ac.uk/arrayexpress
Search and Co-Occurrence
Semantic Assistant
Services relevant for the user's current task are offered directly within a desktop application. This approach relies on ontology-described semantic web services to provide external natural language processing (NLP) pipelines.
Leverage of OWL-DL axioms in a Contact Centre for Technical Product Support. Alex Kouznetsov, Bradley Shoebottom, René Witte, Christopher JO Baker. OWLED 2010.
Plug-in for Open Office Client
• HyQue helps construct and evaluate (automatically obtain support for) hypotheses using formalized background knowledge and data using the Semantic Web
• HyQue makes it possible to develop a reliability model around data based on our scientific expectations of corroborating evidence
Callahan A, Dumontier M, Shah NH. HyQue: evaluating hypotheses using Semantic Web technologies. J Biomed Semantics. 2011 May 17;2 Suppl 2:S3.
Callahan A, Dumontier M. Evaluating scientific hypotheses using the SPARQL Inferencing Notation. Extended Semantic Web Conference (ESWC 2012). Heraklion, Crete. May 27-31, 2012. Accepted.
Hypothesis
h1 (simple event-based expression):
  e1: Gal4p induces expression of GAL1
h2 (conjunctive hypothesis: must satisfy two expressions):
  e2: Gal3p induces expression of GAL2
  e3: AND Gal4p induces expression of GAL7
h3 (conjunctive hypothesis with conditional expression):
  e4: Gal4p induces expression of GAL7
  e5: AND Gal80p inhibits production of Gal4p when GAL3 is over-expressed
  e6: AND Gal80p induces expression of GAL7
HYQUE ARCHITECTURE
Rule-based assessment of evidence
• ‘induce’ rule (maximum score: 5):
  – Is event negated?
    • If yes, subtract 2
  – Is logical operator ‘induce’?
    • If yes, add 1; if no, subtract 1
  – Is agent of type ‘protein’ or ‘RNA’?
    • If yes, add 1; if of type ‘gene’, subtract 1
  – Is target of type ‘gene’?
    • If yes, add 1; if no, subtract 1
  – Does agent have known ‘transcription factor activity’?
    • If yes, add 1
  – Is event located in the ‘nucleus’?
    • If yes, add 1; if no, subtract 1
Ontology terms shown on the slide: GO:0010628, CHEBI:36080, SO:0000236, GO:0003700, GO:0005634
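The ‘induce’ rule is a small additive scoring scheme, which can be written directly as a Python function. This is a sketch of the rule as listed on the slide, not HyQue's actual implementation. The dictionary keys describing an event are hypothetical names chosen for the example, and plain strings stand in for the ontology terms.

```python
def score_induce_event(event):
    """Score one event against the 'induce' rule (maximum score: 5)."""
    score = 0
    if event["negated"]:
        score -= 2
    score += 1 if event["operator"] == "induce" else -1
    if event["agent_type"] in ("protein", "RNA"):
        score += 1
    elif event["agent_type"] == "gene":
        score -= 1
    score += 1 if event["target_type"] == "gene" else -1
    if event["agent_has_tf_activity"]:
        score += 1
    score += 1 if event["location"] == "nucleus" else -1
    return score


# Illustrative evidence for an event such as "Gal4p induces expression
# of GAL1"; the field values are assumed for the example.
best_case = {
    "negated": False,
    "operator": "induce",
    "agent_type": "protein",
    "target_type": "gene",
    "agent_has_tf_activity": True,
    "location": "nucleus",
}
```

An event that satisfies every condition, like `best_case`, reaches the maximum of 5. Each violated condition lowers the score, so weakly supported events end up clearly separated from well supported ones.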
Linked Open Results : from hypothesis to evidence
Literature-Based Enrichment Analysis
• Enrichment analysis on terms extracted using a target ontology for associated articles.
Enabling enrichment analysis with the Human Disease Ontology. Paea LePendu, Mark A. Musen, Nigam H. Shah. Journal of Biomedical Informatics. Volume 44, Supplement 1, December 2011, Pages S31–S38
Phenotype-based predictions
Phenotypes can be used to cluster similar diseases, identify potential model systems, predict potential disease-treating drugs or their adverse events, support drug repurposing, etc.
Robert Hoehndorf, Paul N. Schofield and Georgios V. Gkoutos. PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic Acids Research, 2011.
Linking PharmGKB to phenotype studies and animal models of disease for drug repurposing. Hoehndorf R, Oellrich A, Rebholz-Schuhmann D, Schofield PN, Gkoutos GV. Pac Symp Biocomput. 2012:388-99.
CK Chen, CJ Mungall, GV Gkoutos et al. MouseFinder: candidate disease genes from mouse phenotype data. Human Mutation 2012
Tetralogy of Fallot
OMIM
Human Phenotype Ontology
Phenotype ontologies should contain descriptions of morphological, behavioural, physiological and developmental characteristics
Compare Diseases based on their Phenotypes
Comparison using Weighted Jaccard – uses information content for a phenotype regarding genotype or disease
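One common reading of a weighted Jaccard measure is the summed information content (IC) of the shared phenotypes divided by the summed IC of all phenotypes in either set. The Python sketch below implements that reading. The exact weighting used in the cited work may differ, and the function and parameter names are assumptions.

```python
def weighted_jaccard(phenotypes_a, phenotypes_b, information_content):
    """IC-weighted Jaccard similarity of two phenotype sets.

    information_content maps each phenotype to a non-negative weight.
    """
    a, b = set(phenotypes_a), set(phenotypes_b)
    union_weight = sum(information_content[p] for p in a | b)
    if union_weight == 0:
        return 0.0  # nothing informative to compare
    shared_weight = sum(information_content[p] for p in a & b)
    return shared_weight / union_weight
```

Because rare phenotypes carry a higher information content, sharing one specific phenotype contributes more to the similarity than sharing a very common one. Identical sets score 1 and disjoint sets score 0.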
Inferring equivalent phenotypes by reasoning over OWL ontologies
human ‘overriding aorta [HP:0002623]’ EquivalentTo:
‘phenotype of’ some (‘has part’ some (‘aorta [FMA:3734]’ and ‘overlaps with’ some ‘membranous part of interventricular septum [FMA:7135]’))
mouse ‘overriding aorta [MP:0000273]’ EquivalentTo:
‘phenotype of’ some (‘has part’ some (‘aorta [MA:0000062]’ and ‘overlaps with’ some ‘membranous interventricular septum [MA:0002939]’))
Uberon super-anatomy ontology provides inter-species mappings
‘aorta [FMA:3734]’ EquivalentTo: ‘aorta [MA:0000062]’
‘membranous part of interventricular septum [FMA:7135]’ EquivalentTo: ‘membranous interventricular septum [MA:0002939]’
Thus, ‘overriding aorta [HP:0002623]’ EquivalentTo: ‘overriding aorta [MP:0000273]’
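The inference can be illustrated in Python by writing both class definitions as nested tuples and rewriting the mouse anatomy identifiers through the cross-species mapping. Syntactic substitution stands in for OWL reasoning here, which is a deliberate simplification. The tuple encoding and names are assumptions, and the identifiers are the ones used in the two class definitions above.

```python
# Cross-species mappings (mouse term -> human term).
MAPPING = {"MA:0000062": "FMA:3734", "MA:0002939": "FMA:7135"}

# Class expressions as nested tuples: (operator, arguments...).
HUMAN = ("some", "phenotype of",
         ("some", "has part",
          ("and", "FMA:3734", ("some", "overlaps with", "FMA:7135"))))
MOUSE = ("some", "phenotype of",
         ("some", "has part",
          ("and", "MA:0000062", ("some", "overlaps with", "MA:0002939"))))


def normalize(expr):
    """Rewrite every mapped identifier to its equivalent term."""
    if isinstance(expr, tuple):
        return tuple(normalize(part) for part in expr)
    return MAPPING.get(expr, expr)


def same_phenotype(expr_a, expr_b):
    """True when both expressions coincide after applying the mappings."""
    return normalize(expr_a) == normalize(expr_b)
```

Before normalization the two expressions differ. After the anatomy terms are mapped they are identical, which is the step that lets the human and mouse phenotype classes be treated as equivalent.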
Identifying potential mouse models for human diseases
Quantitative ROC analysis of predictions against curated models yields an AUC of 0.89
Prediction of Tetralogy of Fallot added by MGI
Conclusion
• OWL has come of age and can be used in an increasing number of scientific investigations and applications
• OWL applications cover knowledge capture, formalization, verification, classification, semantic annotation, query formulation, query answering, search, hypothesis testing and prediction