The same story as usual, but with a bit more context (why it is absolutely necessary to move science in this direction). Presented to the University of Potsdam, Germany, and the University of New Brunswick, Canada, in December 2012.
Web Science 2.0
Conducting in silico research in the Web: from hypothesis to publication
Mark Wilkinson
Isaac Peral Senior Researcher in Biological Informatics
Centro de Biotecnología y Genómica de Plantas, UPM, Madrid, Spain
Adjunct Professor of Medical Genetics, University of British Columbia
Vancouver, BC, Canada
Context
Multiple recent surveys of high-throughput biology
reveal that upwards of 50% of published studies
are not reproducible
- Baggerly, 2009
- Ioannidis, 2009
Context
“the most common errors are simple,
the most simple errors are common”
- Baggerly, 2009
Context
These errors pass peer review
The researcher is unaware of the error
The process that led to the error is not recorded
Therefore it cannot be detected during peer-review
Context
Discovery of such errors has resulted in retractions
and even shut-down clinical trials
Context
In March, 2012, the US Institute of Medicine said
“Enough is enough!”
Context
Institute of Medicine Recommendations
For Conduct of High-Throughput Research:
Evolution of Translational Omics: Lessons Learned and the Path Forward. The Institute of Medicine of the National Academies, Report Brief, March 2012.
1. Rigorously-described, -annotated, and -followed data management procedures
2. “Lock down” the computational analysis pipeline once it has been selected
3. Publish the workflow in a formal manner, together with the full starting and result datasets
Achieving these recommendations
requires integration of existing technologies
and invention of new ones
“While it took 2,300 years after the first report of angina for the condition to be commonly taught in medical curricula, modern discoveries are being disseminated at an increasingly rapid pace.”
The Healthcare Singularity and the Age of Semantic Medicine, Michael Gillam, et al, The Fourth Paradigm: Data-Intensive Scientific Discovery Tony Hey (Editor), 2009
Slide adapted with permission from Joanne Luciano, Presentation at Health Web Science Workshop 2012, Evanston IL, USA, June 22, 2012.
Context
The Healthcare Singularity and the Age of Semantic Medicine, Michael Gillam, et al, The Fourth Paradigm: Data-Intensive Scientific Discovery Tony Hey (Editor), 2009
Slide borrowed with permission from Joanne Luciano, Presentation at Health Web Science Workshop 2012, Evanston IL, USA, June 22, 2012.
“The Singularity”
The X-intercept is where, the moment a discovery is made, it is immediately put into practice
(not only medical practice, but any research endeavour...)
The technology required to achieve this does not yet exist
Scientific research would have to be conducted within a medium that
immediately interpreted and disseminated the results...
You Are Here
...in a form that immediately (actively!) affected the research of others...
You Are Here
...without requiring them to be aware of these new discoveries.
You Are Here
I’d like to show you how close we now are to this vision
and how we got there
Web Science 2.0
We wanted to duplicate a real, peer-reviewed, bioinformatics analysis
simply by building a model in the Web describing what the answer
(if one existed)
would look like
...the machine had to make every other decision
on its own
Brief Digression
“in” the Web??
How we use The Web today
By clicking here you cause this incredibly powerful computational tool called The Web to retrieve a chunk of text and images that
can only be understood by a human...
The Web is not a pigeon!
To achieve this vision
We must learn how to do research IN the Web
Not OVER the Web
Resume Speed
We wanted to duplicate a real, peer-reviewed, bioinformatics analysis
simply by building a model in the Web describing what the answer
(if one existed)
would look like
...the machine had to make every other decision
on its own
This is the study we chose:
Gordon, P.M.K., Soliman, M.A., Bose, P., Trinh, Q., Sensen, C.W., Riabowol, K.: Interspecies data mining to predict novel ING-protein interactions in human. BMC genomics. 9, 426 (2008).
Original Study Simplified
Using what is known about interactions in fly & yeast
predict new interactions with your human protein of interest
Given a protein P in Species X
Find proteins similar to P in Species Y
Retrieve interactors in Species Y
Sequence-compare Y-interactors with Species X genome
(1) Keep only those with homologue in X
Find proteins similar to P in Species Z
Retrieve interactors in Species Z
Sequence-compare Z-interactors with (1)
Putative interactors in Species X
“Pseudo-code” Abstracted Workflow
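The abstracted workflow above can be read as a handful of set operations. What follows is only a minimal Python sketch under stated assumptions: the tables `HOMOLOGS` and `INTERACTORS` and every protein identifier in them are invented toy data standing in for the sequence-comparison and interaction-lookup steps. It is not the pipeline used in the original study.

```python
# Toy stand-ins (hypothetical data) for sequence similarity and interaction databases.
# HOMOLOGS[(protein, target_species)] -> proteins in target_species similar to protein
HOMOLOGS = {
    ("P", "Y"): {"y1"},
    ("P", "Z"): {"z1"},
    ("y2", "X"): {"x2"},
    ("y3", "X"): {"x3"},
    ("z2", "X"): {"x2"},
    ("z4", "X"): {"x4"},
}
# INTERACTORS[protein] -> known interaction partners in the same species
INTERACTORS = {
    "y1": {"y2", "y3"},
    "z1": {"z2", "z4"},
}


def similar(protein, species):
    """Proteins in `species` that are similar to `protein` (toy lookup)."""
    return HOMOLOGS.get((protein, species), set())


def candidates_via(protein, model_species, home_species):
    """Interactors of the protein's homologues in a model species,
    mapped back to their homologues in the home species."""
    found = set()
    for homologue in similar(protein, model_species):
        for partner in INTERACTORS.get(homologue, set()):
            found |= similar(partner, home_species)
    return found


def putative_interactors(protein, home_species, model1, model2):
    """Keep only candidates supported by both model organisms."""
    via_first = candidates_via(protein, model1, home_species)   # set (1) in the pseudo-code
    via_second = candidates_via(protein, model2, home_species)
    return via_first & via_second


result = putative_interactors("P", "X", "Y", "Z")
print(sorted(result))
```

With this toy data only one candidate has supporting interactions in both model species, which is the "keep what both comparisons agree on" behaviour the pseudo-code describes.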
Modeling the answer...
OWL
Web Ontology Language (OWL) is the language approved by the W3C
for representing knowledge in the Web
Modeling the answer...
Note that every word in this diagram is, in reality, a URL (because it is OWL)
The model of the answer is published in The Web and borrows ideas from other models published in The Web
ProbableInteractor is homologous to
(Potential Interactor from ModelOrganism1…)
and
(Potential Interactor from ModelOrganism2…)
Probable Interactor is defined in OWL as a subclass of Potential Interactor that requires homologous pairs of interacting proteins to exist in both
comparator model organisms.
(Effectively, an intersection)
Modeling the answer...
Publish our OWL model of a Probable Interactor
in the Web
In a local data-file
provide the protein we are interested in
and the two species we wish to use in our comparison
taxon:9606 a i:OrganismOfInterest . # human
uniprot:Q9UK53 a i:ProteinOfInterest . # ING1
taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly
Running the Web Science Experiment
The tricky bit is...
In the abstract, the search for homology is
“generic” – ANY Protein, ANY model
system
But when the machine does the experiment, it
must use specific resources because the
answer requires information from two
declared species
taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly
PREFIX i: <http://sadiframework.org/ontologies/InteractingProteins.owl#>
SELECT ?protein
FROM <file:/local/workflow.input.n3>
WHERE {
  ?protein a i:ProbableInteractor .
}
This is the question we ask (the query language here is SPARQL):
The reference (URL) to our OWL model of the answer
Our system then derives (and executes) the following workflow automatically
These are different Web services!
...selected at run-time based on the same model
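One way to picture run-time selection is a lookup that maps the same abstract step to different concrete services depending on the declared species. The sketch below is illustrative only: the registry layout, the step names `findHomologues` and `getInteractors`, and the service names are all assumptions made up for this example, not SADI's actual registry format. Only the taxon IDs (4932 for yeast, 7227 for fly) come from the input file shown earlier.

```python
# Hypothetical registry: each entry says which abstract step a service performs
# and which species (NCBI taxon ID) its data covers. Names are invented.
REGISTRY = [
    {"name": "yeast-interactions", "step": "getInteractors", "taxon": 4932},
    {"name": "fly-interactions", "step": "getInteractors", "taxon": 7227},
    {"name": "generic-homology", "step": "findHomologues", "taxon": None},  # any species
]


def select_service(step, taxon):
    """Pick a service for an abstract step, preferring a species-specific one."""
    matches = [s for s in REGISTRY if s["step"] == step and s["taxon"] in (taxon, None)]
    if not matches:
        raise LookupError(f"no service for {step} in taxon {taxon}")
    # False sorts before True, so species-specific entries come first.
    matches.sort(key=lambda s: s["taxon"] is None)
    return matches[0]["name"]


def plan(model_organisms):
    """Build one concrete service list per model organism from the same abstract steps."""
    steps = ("findHomologues", "getInteractors")
    return {taxon: [select_service(step, taxon) for step in steps]
            for taxon in model_organisms}


workflow = plan([4932, 7227])
for taxon, services in workflow.items():
    print(taxon, services)
```

The abstract steps never change; only the lookup result does, which is the sense in which the same model yields different workflows in different contexts.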
There are four very cool things about what you just saw...
The system was able to create a workflow based on an OWL model (ontology)
The system was able to create a COMPUTATIONAL workflow
based on a BIOLOGICAL model
The workflow it created (i.e. the services chosen)
differed depending on context
Tool selection was guided by the encoded knowledge of domain experts
worldwide
We got the answer
“simply” by designing a model of the answer!
How did we do that?
A “Smart” Biomedical Resource Representation System
A Web application that answers SPARQL-DL queries
Query-answering Enhanced by SADI
Demo #1
Imagine a “virtual database”
all of the data from all databases
+ the result of every conceivable analysis
How can we query that database?
What is the phenotype of every allele of the Antirrhinum majus DEFICIENS gene?
SELECT ?allele ?image ?desc
WHERE {
  locus:DEF genetics:hasVariant ?allele .
  ?allele info:visualizedByImage ?image .
  ?image info:hasDescription ?desc
}
Note that there is no “FROM” clause! We don’t tell it where it should get the information. The machine has to figure that out by itself...
Enter that query into SHARE
Click “Submit”...
...and in a few seconds you get your answer.
The query results are live hyperlinks to the respective database or images
Neither SADI nor SHARE
know anything about
plant biology or genetics
What pathways does UniProt protein P47989 belong to?
PREFIX pred: <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX ont: <http://ontology.dumontierlab.com/>
PREFIX uniprot: <http://lsrn.org/UniProt:>
SELECT ?gene ?pathway
WHERE {
  uniprot:P47989 pred:isEncodedBy ?gene .
  ?gene ont:isParticipantIn ?pathway .
}
Note again that there is no “FROM” clause…
I have not told SHARE where to look for the answer; I am simply asking my question
Enter that query into SHARE
Two different providers of gene information (KEGG & NCBI) were found & accessed
Two different providers of pathway information (KEGG and GO) were found & accessed
The results are all links to the original data
Neither SADI nor SHARE
know anything about
proteins or biochemical pathways
Recap: what we just saw
We posed, and answered, fairly complex multi-database queries
WITHOUT A DATA WAREHOUSE
An example from the Clinical domain
Demo #2
Show me the latest Blood Urea Nitrogen (BUN) and Creatinine levels of patients who appear to be rejecting their transplants
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX patient: <http://sadiframework.org/ontologies/patients.owl#>
PREFIX l: <http://sadiframework.org/ontologies/predicates.owl#>
SELECT ?patient ?bun ?creat
FROM <http://sadiframework.org/ontologies/patients.rdf>
WHERE {
  ?patient rdf:type patient:LikelyRejecter .
  ?patient l:latestBUN ?bun .
  ?patient l:latestCreatinine ?creat .
}
Likely Rejecter:
A patient who has creatinine levels that are increasing over time
- Mark D Wilkinson’s definition
Likely Rejecter:
…but there is no “likely rejecter” column or table in our database…
only blood chemistry measurements at various time-points
Likely Rejecter:
So the data required to answer this question DOESN’T EXIST!
?
Enter that query into SHARE
Now…
Two “magical” events occur…
The machine decides
by itself
that it needs to do a Linear Regression analysis
on the blood creatinine measurements in order to answer your question
The machine decides
by itself
how and where that analysis can be done
and does it automatically!
http://www.impactlab.net/2009/03/22/improve-your-brain-power/
The SHARE system utilizes SADI to discover analytical services on the Web that do linear regression analysis
and sends the data to be analysed
VOILA!
Neither SADI nor SHARE
know anything about
blood chemistry, or mathematics
So how does the machine know what to do??
Ontologies
Ontologies explicitly define the kinds of things that (can) exist…
…and what those things are “like”
i.e. what properties they have (color, weight, shape, texture, temperature, “state”)
and what relationships they have to one another (inside-of, adjacent-to, part-of, binds-to, controls, inhibits,
degrades, etc.)
So we create... ontologies about biology
and health
We* publish them on the Web
* We… or anybody! Anybody can publish an ontology!
My definition of a Likely Rejecter is encoded in a machine-readable document written in the OWL Ontology language
Basically:
“the regression line over creatinine measurements should have an increasing slope”
Our ontology refers to other ontologies (possibly published by other people) to learn about what the properties of “regression models” are
e.g. that regression models have slopes and intercepts and that slopes and intercepts have decimal values
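The definition above boils down to "fit a line to the creatinine readings and check the sign of its slope". Here is a minimal Python sketch of that check, assuming ordinary least squares and invented (day, creatinine) readings. The function names and patient identifiers are hypothetical, and this is not the regression service SHARE actually discovers.

```python
def regression_slope(points):
    """Ordinary least-squares slope for a list of (x, y) pairs."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def is_likely_rejecter(measurements):
    """The slide's definition: the regression line has an increasing slope."""
    return regression_slope(measurements) > 0


# Invented (day, creatinine) readings for two hypothetical patients.
patients = {
    "patient1": [(0, 1.0), (7, 1.4), (14, 1.9)],   # rising
    "patient2": [(0, 1.2), (7, 1.1), (14, 1.0)],   # falling
}

likely = sorted(p for p, m in patients.items() if is_likely_rejecter(m))
print(likely)
```

Nothing in the input data says "rejecter"; the classification only exists once the slope has been computed, which is why the analysis has to be run to answer the query.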
SHARE examines the query
Looks on the Web for ontologies that describe the problem it is trying to solve,
and “reads” them
then uses that “knowledge” to figure out which data-sources and analytical tools it
needs to answer the query
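A rough way to picture this is a resolver that, for each clause of the query, looks up which services advertise the predicate in that clause and calls them on whatever has been bound so far. The Python sketch below is a toy under stated assumptions: the `SERVICES` table, the two lookup functions, and the identifiers `gene:G1` and `pathway:W1` are invented placeholders, and real SADI discovery works from OWL service descriptions rather than a hard-coded dictionary.

```python
# Hypothetical services, each advertising the predicate it can attach to its input.
# The returned identifiers are invented placeholders, not real database records.
def gene_lookup(protein):
    return {"uniprot:P47989": ["gene:G1"]}.get(protein, [])


def pathway_lookup(gene):
    return {"gene:G1": ["pathway:W1"]}.get(gene, [])


SERVICES = {
    "pred:isEncodedBy": [gene_lookup],
    "ont:isParticipantIn": [pathway_lookup],
}


def answer(patterns):
    """Resolve (subject, predicate, variable) patterns left to right.

    Returns a list of dicts mapping variable names to the values found.
    """
    solutions = [{}]
    for subject, predicate, variable in patterns:
        next_solutions = []
        for binding in solutions:
            # A bound variable is replaced by its value; a constant is used as-is.
            node = binding.get(subject, subject)
            for service in SERVICES.get(predicate, []):
                for value in service(node):
                    next_solutions.append({**binding, variable: value})
        solutions = next_solutions
    return solutions


query = [
    ("uniprot:P47989", "pred:isEncodedBy", "?gene"),
    ("?gene", "ont:isParticipantIn", "?pathway"),
]
rows = answer(query)
print(rows)
```

The query never names a data source; the predicates alone decide which services get called, which mirrors the missing FROM clause in the demo queries.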
The way SHARE “interprets” data varies depending on the context of the query
(i.e. which ontologies it reads – Mine? Yours?)
and on what part of the query it is trying to answer at any given moment
(which ontological concept is relevant to that clause)
Data exhibits “late binding”
Late binding:
“purpose and meaning” of the data is
not determined until the moment it is required
a.k.a. the “semantics” of the data
Benefit of late binding
Data is amenable to constant re-interpretation
Example?
Blood Creatinine measurements
were not dictated to be (only)
Blood Creatinine measurements
Example?
The data had the ‘qualities/properties’ that
allowed one machine to interpret
that they were Blood Creatinine measurements
(e.g. to determine which patients were rejecting)
Example?
But the data also had the ‘qualities/properties’ that
allowed another machine to interpret them as
Simple X/Y coordinate data
(e.g. the Linear Regression calculation tool)
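The two readings of the same data can be sketched as two small functions that each look only at the properties they care about. This is an illustration under assumptions: the record layout, field names, and values are invented, and in SHARE the "reading" is done by matching data against ontology classes rather than by hand-written functions.

```python
# One set of records, described only by generic properties (invented values).
records = [
    {"patient": "patient1", "analyte": "creatinine", "day": 0, "value": 1.0},
    {"patient": "patient1", "analyte": "creatinine", "day": 7, "value": 1.4},
    {"patient": "patient1", "analyte": "BUN", "day": 7, "value": 18.0},
]


def as_creatinine_series(rows, patient):
    """Clinical reading: creatinine rows for one patient, in time order."""
    picked = [r for r in rows
              if r["patient"] == patient and r["analyte"] == "creatinine"]
    return sorted(picked, key=lambda r: r["day"])


def as_xy_points(rows):
    """Numerical reading: any row with two numeric properties is just (x, y)."""
    return [(float(r["day"]), float(r["value"])) for r in rows]


series = as_creatinine_series(records, "patient1")
points = as_xy_points(series)
print(points)
```

The records themselves never change; what they "are" depends on which function is looking at them at that moment.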
Benefit of late binding
Data is amenable to constant re-interpretation
http://www.flickr.com/people/faernworks/
And that brings us to...
Gordon, P.M.K., Soliman, M.A., Bose, P., Trinh, Q., Sensen, C.W., Riabowol, K.: Interspecies data mining to predict novel ING-protein interactions in human. BMC genomics. 9, 426 (2008).
We built a model of the proposed answer
Our system converted the model into the experiment
The analytical tools chosen for that experiment changed depending on
context
even though the biological model driving their selection was the same
i.e.
The published model is re-usable
In different contexts... By different researchers
and because the model IS the experiment
the published EXPERIMENT is re-usable!!
simply point the same query at your own dataset...
The publication is an executable document!
Every component of the model
Every component of the input data
Every component of the output data
is a URL
Therefore the question, the experiment, and the answer, are immediately published IN the Web
The answer, and the knowledge derived from it, is immediately available to search engines
and moreover, can affect the outcome of other Web Science experiments
You Are Now Here!!!
Final thoughts
An experiment... based on a hypothesis
now modeled in OWL
Does this OWL Class represent the Hypothesis?
I think it does!
We modeled the answer... but the answer was hypothetical
Change the way we think of “hypotheses”
In Web Science 2.0
Model what the world would “look like” if your hypothesis were true
Then ask “is there any data that fits that model?”
Like the blind men examining an elephant
Seemingly different aspects of research, when viewed from the perspective of Web Science,
become the same “thing”
The Model
Our vision of Web Science 2.0
Hypothesis
Ontology
Query
Workflow
Result
These can be automatically derived through provenance information during workflow execution
Materials & Methods
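As a rough illustration of deriving Materials & Methods from provenance, the sketch below logs every workflow step as it runs and then prints the log as a numbered methods list. Everything here is an assumption made for the example: the `recorded` decorator, the step functions, and their placeholder return values are hypothetical, and a real system would more likely record provenance in a standard vocabulary than in a Python list.

```python
import functools

PROVENANCE = []  # ordered log of executed steps


def recorded(step):
    """Decorator that logs each call's name, inputs, and output."""
    @functools.wraps(step)
    def wrapper(*args):
        output = step(*args)
        PROVENANCE.append({"step": step.__name__, "inputs": args, "output": output})
        return output
    return wrapper


@recorded
def find_homologues(protein):
    return ["homologue-of-" + protein]  # placeholder result


@recorded
def get_interactors(homologues):
    return ["partner-of-" + h for h in homologues]  # placeholder result


# Run a two-step workflow; the log fills in as a side effect.
get_interactors(find_homologues("P"))


def methods_section(log):
    """Render the provenance log as numbered method steps."""
    return [f"{i}. {entry['step']}({', '.join(map(repr, entry['inputs']))})"
            for i, entry in enumerate(log, 1)]


methods = methods_section(PROVENANCE)
print("\n".join(methods))
```

Because the log is written by the run itself, the methods list cannot drift from what was actually executed.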
Please join us!
SADI and SHARE are Open-Source projects
http://sadiframework.org
My New Home!
Luke McCarthy: Lead Dev. Everything...
Benjamin VanderValk: SHARE & SADI & Experimental modeling & myHeath Button
Soroush Samadian: Cardiovascular data modeling and queries
University of British Columbia
Edward Kawas: SADI Service auto-generator
Ian Wood: Experimental modeling project
U of New Brunswick
Dr. Chris Baker
Alexandre Riazanov
Carleton University
Dr. Michel Dumontier
Marc-Alexandre Nolin
Leonid Chepelev
Steve Etlinger
Nichaella Kieth
Jose Cruz
C-BRASS Collaborators at other sites
Microsoft Research