Web Science 2.0: Conducting in silico research in the Web from hypothesis to publication. Mark Wilkinson, Isaac Peral Senior Researcher in Biological Informatics, Centro de Biotecnología y Genómica de Plantas, UPM, Madrid, Spain; Adjunct Professor of Medical Genetics, University of British Columbia, Vancouver, BC, Canada.

Web Science 2.0 - in silico science

DESCRIPTION

The same story as usual, but with a bit more context (why it is absolutely necessary to move science in this direction). Presented at the University of Potsdam, Germany, and the University of New Brunswick, Canada, in December 2012.

Page 1: Web Science 2.0 - in silico science

Web Science 2.0

Conducting in silico research in the Web from hypothesis to publication

Mark Wilkinson

Isaac Peral Senior Researcher in Biological Informatics, Centro de Biotecnología y Genómica de Plantas, UPM, Madrid, Spain

Adjunct Professor of Medical Genetics, University of British Columbia, Vancouver, BC, Canada

Page 2: Web Science 2.0 - in silico science

Context

Multiple recent surveys of high-throughput biology

reveal that upwards of 50% of published studies

are not reproducible

- Baggerly, 2009

- Ioannidis, 2009

Page 3: Web Science 2.0 - in silico science

Context

“the most common errors are simple,

the most simple errors are common”

- Baggerly, 2009

Page 4: Web Science 2.0 - in silico science

Context

These errors pass peer review

The researcher is unaware of the error

The process that led to the error is not recorded

Therefore it cannot be detected during peer-review

Page 5: Web Science 2.0 - in silico science

Context

Discovery of such errors has resulted in retractions

and even shut-down clinical trials

Page 6: Web Science 2.0 - in silico science

Context

In March, 2012, the US Institute of Medicine said

“Enough is enough!”

Page 7: Web Science 2.0 - in silico science

Context

Institute of Medicine Recommendations

For Conduct of High-Throughput Research:

Evolution of Translational Omics: Lessons Learned and the Path Forward. The Institute of Medicine of the National Academies, Report Brief, March 2012.

1. Rigorously-described, -annotated, and -followed data management procedures

2. “Lock down” the computational analysis pipeline once it has been selected

3. Publish the workflow in a formal manner, together with the full starting and result datasets

Page 8: Web Science 2.0 - in silico science

Achieving these recommendations

requires integration of existing technologies

and invention of new ones

Page 9: Web Science 2.0 - in silico science

“While it took 2,300 years after the first report of angina for the condition to be commonly taught in medical curricula, modern discoveries are being disseminated at an increasingly rapid pace.”

The Healthcare Singularity and the Age of Semantic Medicine, Michael Gillam, et al., The Fourth Paradigm: Data-Intensive Scientific Discovery, Tony Hey (Editor), 2009

Slide adapted with permission from Joanne Luciano, Presentation at Health Web Science Workshop 2012, Evanston IL, USA, June 22, 2012.

Context

Page 10: Web Science 2.0 - in silico science

The Healthcare Singularity and the Age of Semantic Medicine, Michael Gillam, et al., The Fourth Paradigm: Data-Intensive Scientific Discovery, Tony Hey (Editor), 2009

Slide borrowed with permission from Joanne Luciano, Presentation at Health Web Science Workshop 2012, Evanston IL, USA, June 22, 2012.

“The Singularity”

The X-intercept is where, the moment a discovery is made, it is immediately put into practice

(not only medical practice, but any research endeavour...)

Page 11: Web Science 2.0 - in silico science

The technology required to achieve this does not yet exist

Page 12: Web Science 2.0 - in silico science

Scientific research would have to be conducted within a medium that

immediately interpreted and disseminated the results...

You Are

Here

Page 13: Web Science 2.0 - in silico science

...in a form that immediately (actively!) affected the research of others...

You Are

Here

Page 14: Web Science 2.0 - in silico science

...without requiring them to be aware of these new discoveries.

You Are

Here

Page 15: Web Science 2.0 - in silico science

I’d like to show you how close we now are to this vision

and how we got there

Page 16: Web Science 2.0 - in silico science

Web Science 2.0

Page 17: Web Science 2.0 - in silico science

We wanted to duplicate a real, peer-reviewed, bioinformatics analysis

simply by building a model in the Web describing what the answer

(if one existed)

would look like

Page 18: Web Science 2.0 - in silico science

...the machine had to make every other decision

on its own

Page 19: Web Science 2.0 - in silico science

Brief Digression

“in” the Web??

Page 20: Web Science 2.0 - in silico science

How we use The Web today

Page 21: Web Science 2.0 - in silico science
Page 22: Web Science 2.0 - in silico science

By clicking here you cause this incredibly powerful computational tool called The Web to retrieve a chunk of text and images that

can only be understood by a human...

Page 23: Web Science 2.0 - in silico science

The Web is not a pigeon!

Page 24: Web Science 2.0 - in silico science

To achieve this vision

We must learn how to do research IN the Web

Not OVER the Web

Page 25: Web Science 2.0 - in silico science

Resume Speed

Page 26: Web Science 2.0 - in silico science

We wanted to duplicate a real, peer-reviewed, bioinformatics analysis

simply by building a model in the Web describing what the answer

(if one existed)

would look like

Page 27: Web Science 2.0 - in silico science

...the machine had to make every other decision

on its own

Page 28: Web Science 2.0 - in silico science

This is the study we chose:

Page 29: Web Science 2.0 - in silico science

Gordon, P.M.K., Soliman, M.A., Bose, P., Trinh, Q., Sensen, C.W., Riabowol, K.: Interspecies data mining to predict novel ING-protein interactions in human. BMC genomics. 9, 426 (2008).

Page 30: Web Science 2.0 - in silico science

Original Study Simplified

Using what is known about interactions in fly & yeast

predict new interactions with your human protein of interest

Page 31: Web Science 2.0 - in silico science

Given a protein P in Species X

Find proteins similar to P in Species Y

Retrieve interactors in Species Y

Sequence-compare Y-interactors with Species X genome

(1) Keep only those with homologue in X

Find proteins similar to P in Species Z

Retrieve interactors in Species Z

Sequence-compare Z-interactors with (1)

Putative interactors in Species X

“Pseudo-code” Abstracted Workflow
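The abstracted workflow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup tables, protein names and species labels are invented toy data, and the final step is simplified to a set intersection rather than a real sequence comparison.

```python
# Toy data: every identifier and relationship below is invented for illustration.
HOMOLOGS = {  # (protein, target species) -> similar proteins in that species
    ("P", "Y"): {"Py"},
    ("P", "Z"): {"Pz"},
    ("y1", "X"): {"x1"},
    ("y2", "X"): {"x2"},
    ("z1", "X"): {"x1"},
    ("z3", "X"): {"x3"},
}
INTERACTORS = {"Py": {"y1", "y2"}, "Pz": {"z1", "z3"}}


def homologs(protein, species):
    """Stand-in for a sequence-similarity search against one species."""
    return HOMOLOGS.get((protein, species), set())


def interactor_homologs(protein, model_species, target_species="X"):
    """Find proteins similar to `protein` in the model species, retrieve their
    interactors, and keep only the interactors that have a homologue in X."""
    kept = set()
    for similar in homologs(protein, model_species):
        for partner in INTERACTORS.get(similar, set()):
            kept |= homologs(partner, target_species)
    return kept


def putative_interactors(protein, species_1, species_2):
    """Keep candidates supported by both model species."""
    candidates = interactor_homologs(protein, species_1)  # step (1)
    supported = interactor_homologs(protein, species_2)
    return candidates & supported


print(sorted(putative_interactors("P", "Y", "Z")))  # prints ['x1']
```

In the toy data only `x1` is reachable through both model species, so it is the only putative interactor that survives.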

Page 32: Web Science 2.0 - in silico science

Modeling the answer...

OWL

Web Ontology Language (OWL) is the language approved by the W3C

for representing knowledge in the Web

Page 33: Web Science 2.0 - in silico science

Modeling the answer...

Note that every word in this diagram is, in reality, a URL (because it is OWL)

The model of the answer is published in The Web and borrows ideas from other models published in The Web

Page 34: Web Science 2.0 - in silico science

ProbableInteractor is homologous to (PotentialInteractor from ModelOrganism1 …)

and

(PotentialInteractor from ModelOrganism2 …)

Probable Interactor is defined in OWL as a subclass of Potential Interactor that requires homologous pairs of interacting proteins to exist in both

comparator model organisms.

(Effectively, an intersection)

Modeling the answer...
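The "intersection" reading of the class definition can be illustrated with a membership check over a handful of triples. This is a plain-Python sketch, not OWL reasoning; the triple store, the predicate name `homologousTo` and the class names are simplified stand-ins chosen for this example.

```python
# Hypothetical facts as (subject, predicate, object) triples.
TRIPLES = {
    ("x1", "homologousTo", "y1"),
    ("x1", "homologousTo", "z1"),
    ("x2", "homologousTo", "y2"),
    ("y1", "type", "PotentialInteractorFromModelOrganism1"),
    ("y2", "type", "PotentialInteractorFromModelOrganism1"),
    ("z1", "type", "PotentialInteractorFromModelOrganism2"),
}


def has_homolog_of_type(protein, cls):
    """True if `protein` is homologous to at least one member of class `cls`."""
    return any(
        s == protein and p == "homologousTo" and (o, "type", cls) in TRIPLES
        for s, p, o in TRIPLES
    )


def is_probable_interactor(protein):
    """Both restrictions must hold at once: this is the intersection."""
    return (
        has_homolog_of_type(protein, "PotentialInteractorFromModelOrganism1")
        and has_homolog_of_type(protein, "PotentialInteractorFromModelOrganism2")
    )


print(is_probable_interactor("x1"), is_probable_interactor("x2"))  # prints True False
```

Here `x2` has support from only one model organism, so it does not qualify.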

Page 35: Web Science 2.0 - in silico science

Publish our OWL model of a Probable Interactor

in the Web

Page 36: Web Science 2.0 - in silico science

In a local data-file

provide the protein we are interested in

and the two species we wish to use in our comparison

taxon:9606 a i:OrganismOfInterest . # human
uniprot:Q9UK53 a i:ProteinOfInterest . # ING1
taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly

Running the Web Science Experiment

Page 37: Web Science 2.0 - in silico science

The tricky bit is...

In the abstract, the search for homology is

“generic” – ANY Protein, ANY model

system

But when the machine does the experiment, it

must use specific resources because the

answer requires information from two

declared species

taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly

Page 38: Web Science 2.0 - in silico science

PREFIX i: <http://sadiframework.org/ontologies/InteractingProteins.owl#>

SELECT ?protein
FROM <file:/local/workflow.input.n3>
WHERE {
    ?protein a i:ProbableInteractor .
}

This is the question we ask (the query language here is SPARQL):

The reference (URL) to our OWL model of the answer

Page 39: Web Science 2.0 - in silico science

Our system then derives (and executes) the following workflow automatically

These are different Web services!

...selected at run-time based on the same model
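One way to picture run-time selection is a registry lookup keyed on what a step needs and on the species declared in the input file. The sketch below is an assumption-laden toy: the `Service` record, the registry contents and the service names are all hypothetical, and real service discovery is richer than a list filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str      # hypothetical service name
    provides: str  # the property the service attaches to its input
    taxon: str     # the species whose data it serves


# Hypothetical registry; only the taxon identifiers come from the input file.
REGISTRY = [
    Service("yeast-interactions", "hasInteractor", "taxon:4932"),
    Service("fly-interactions", "hasInteractor", "taxon:7227"),
    Service("yeast-homology-search", "hasHomolog", "taxon:4932"),
    Service("fly-homology-search", "hasHomolog", "taxon:7227"),
]


def select_services(needed_property, taxon):
    """The same abstract step resolves to different services per species."""
    return [
        s.name
        for s in REGISTRY
        if s.provides == needed_property and s.taxon == taxon
    ]


print(select_services("hasInteractor", "taxon:4932"))  # prints ['yeast-interactions']
print(select_services("hasInteractor", "taxon:7227"))  # prints ['fly-interactions']
```

The model asks for the same property both times; only the declared species changes which service is picked.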

Page 40: Web Science 2.0 - in silico science
Page 41: Web Science 2.0 - in silico science

There are four very cool things about what you just saw...

Page 42: Web Science 2.0 - in silico science

There are four very cool things about what you just saw...

The system was able to create a workflow based on an OWL model (ontology)

Page 43: Web Science 2.0 - in silico science

There are four very cool things about what you just saw...

The system was able to create a COMPUTATIONAL workflow

based on a BIOLOGICAL model

Page 44: Web Science 2.0 - in silico science

There are four very cool things about what you just saw...

The workflow it created (i.e. the services chosen)

differed depending on context

Page 45: Web Science 2.0 - in silico science

There are four very cool things about what you just saw...

Tool selection was guided by the encoded knowledge of domain experts

worldwide

Page 46: Web Science 2.0 - in silico science

We got the answer

“simply” by designing a model of the answer!

Page 47: Web Science 2.0 - in silico science

How did we do that?

Page 48: Web Science 2.0 - in silico science

A “Smart” Biomedical Resource Representation System

Page 49: Web Science 2.0 - in silico science

A Web application that answers SPARQL-DL queries

Query-answering Enhanced by SADI

Page 50: Web Science 2.0 - in silico science

Demo #1

Page 51: Web Science 2.0 - in silico science

Imagine a “virtual database”

all of the data from all databases

+ the result of

every conceivable analysis

Page 52: Web Science 2.0 - in silico science

How can we query that database?

Page 53: Web Science 2.0 - in silico science
Page 54: Web Science 2.0 - in silico science

What is the phenotype of every allele of the Antirrhinum majus DEFICIENS gene?

SELECT ?allele ?image ?desc
WHERE {
    locus:DEF genetics:hasVariant ?allele .
    ?allele info:visualizedByImage ?image .
    ?image info:hasDescription ?desc
}

Page 55: Web Science 2.0 - in silico science

What is the phenotype of every allele of the Antirrhinum majus DEFICIENS gene?

SELECT ?allele ?image ?desc
WHERE {
    locus:DEF genetics:hasVariant ?allele .
    ?allele info:visualizedByImage ?image .
    ?image info:hasDescription ?desc
}

Note that there is no “FROM” clause! We don’t tell it where it should get the information. The machine has to figure that out by itself...
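How a query without a FROM clause might still be answered can be sketched as a lookup from each predicate to whatever service provides it. The sketch assumes a simple chain query (each clause starts from the previous clause's result); the registry, the lambda "services" and all data values are placeholders invented for illustration.

```python
# Placeholder data standing in for remote resources; none of it is real.
VARIANTS = {"locus:DEF": ["allele:example-1", "allele:example-2"]}
IMAGES = {"allele:example-1": ["image:1"], "allele:example-2": ["image:2"]}
DESCRIPTIONS = {
    "image:1": ["placeholder description 1"],
    "image:2": ["placeholder description 2"],
}

# Hypothetical registry: predicate -> service that can supply that property.
SERVICES_BY_PREDICATE = {
    "genetics:hasVariant": lambda subject: VARIANTS.get(subject, []),
    "info:visualizedByImage": lambda subject: IMAGES.get(subject, []),
    "info:hasDescription": lambda subject: DESCRIPTIONS.get(subject, []),
}


def resolve(start, predicates):
    """Answer a chain of clauses by finding a provider for each predicate."""
    rows = [(start,)]
    for predicate in predicates:
        service = SERVICES_BY_PREDICATE.get(predicate)
        if service is None:
            raise LookupError(f"no registered service provides {predicate}")
        rows = [row + (value,) for row in rows for value in service(row[-1])]
    return rows


for row in resolve(
    "locus:DEF",
    ["genetics:hasVariant", "info:visualizedByImage", "info:hasDescription"],
):
    print(row)
```

The query names only the properties it wants; where each property comes from is decided by the registry lookup, not by the person asking.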

Page 56: Web Science 2.0 - in silico science

Enter that query into SHARE

Page 57: Web Science 2.0 - in silico science

Click “Submit”...

Page 58: Web Science 2.0 - in silico science

...and in a few seconds you get your answer.

Page 59: Web Science 2.0 - in silico science

The query results are live hyperlinks to the respective database or images

Page 60: Web Science 2.0 - in silico science

Neither SADI nor SHARE

knows anything about

plant biology or genetics

Page 61: Web Science 2.0 - in silico science

What pathways does UniProt protein P47989 belong to?

PREFIX pred: <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX ont: <http://ontology.dumontierlab.com/>
PREFIX uniprot: <http://lsrn.org/UniProt:>

SELECT ?gene ?pathway
WHERE {
    uniprot:P47989 pred:isEncodedBy ?gene .
    ?gene ont:isParticipantIn ?pathway .
}

Page 63: Web Science 2.0 - in silico science

What pathways does UniProt protein P47989 belong to?

PREFIX pred: <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX ont: <http://ontology.dumontierlab.com/>
PREFIX uniprot: <http://lsrn.org/UniProt:>

SELECT ?gene ?pathway
WHERE {
    uniprot:P47989 pred:isEncodedBy ?gene .
    ?gene ont:isParticipantIn ?pathway .
}

Note again that there is no “FROM” clause…

I have not told SHARE where to look for the answer; I am simply asking my question

Page 64: Web Science 2.0 - in silico science

Enter that query into SHARE

Page 65: Web Science 2.0 - in silico science
Page 66: Web Science 2.0 - in silico science
Page 67: Web Science 2.0 - in silico science

Two different providers of gene information (KEGG and NCBI) were found and accessed

Two different providers of pathway information (KEGG and GO) were found and accessed

Page 68: Web Science 2.0 - in silico science

The results are all links to the original data

Page 69: Web Science 2.0 - in silico science

Neither SADI nor SHARE

knows anything about

proteins or biochemical pathways

Page 70: Web Science 2.0 - in silico science

Recap: what we just saw

We posed and answered fairly complex multi-database queries

WITHOUT A DATA WAREHOUSE

Page 71: Web Science 2.0 - in silico science

An example from the Clinical domain

Demo #2

Page 72: Web Science 2.0 - in silico science

Show me the latest Blood Urea Nitrogen and Creatinine levels of patients who appear to be rejecting their transplants

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX patient: <http://sadiframework.org/ontologies/patients.owl#>
PREFIX l: <http://sadiframework.org/ontologies/predicates.owl#>

SELECT ?patient ?bun ?creat
FROM <http://sadiframework.org/ontologies/patients.rdf>
WHERE {
    ?patient rdf:type patient:LikelyRejecter .
    ?patient l:latestBUN ?bun .
    ?patient l:latestCreatinine ?creat .
}

Page 73: Web Science 2.0 - in silico science

Show me the latest Blood Urea Nitrogen (BUN) and Creatinine levels of patients who appear to be rejecting their transplants

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX patient: <http://sadiframework.org/ontologies/patients.owl#>
PREFIX l: <http://sadiframework.org/ontologies/predicates.owl#>

SELECT ?patient ?bun ?creat
FROM <http://sadiframework.org/ontologies/patients.rdf>
WHERE {
    ?patient rdf:type patient:LikelyRejecter .
    ?patient l:latestBUN ?bun .
    ?patient l:latestCreatinine ?creat .
}

Page 74: Web Science 2.0 - in silico science

Likely Rejecter:

A patient who has creatinine levels that are increasing over time

- Mark D. Wilkinson’s definition

Page 75: Web Science 2.0 - in silico science

Likely Rejecter:

…but there is no “likely rejecter” column or table in our database…

only blood chemistry measurements at various time-points

Page 76: Web Science 2.0 - in silico science

Likely Rejecter:

So the data required to answer this question DOESN’T EXIST!

Page 77: Web Science 2.0 - in silico science

?

Page 78: Web Science 2.0 - in silico science

Enter that query into SHARE

Page 79: Web Science 2.0 - in silico science

Now…

Two “magical” events occur…

Page 80: Web Science 2.0 - in silico science

The machine decides

by itself

that it needs to do a Linear Regression analysis

on the blood creatinine measurements in order to answer your question

Page 81: Web Science 2.0 - in silico science

The machine decides

by itself

how and where that analysis can be done

and does it automatically!

Page 82: Web Science 2.0 - in silico science

http://www.impactlab.net/2009/03/22/improve-your-brain-power/

Page 83: Web Science 2.0 - in silico science

The SHARE system utilizes SADI to discover analytical services on the Web that do linear regression analysis

and sends the data to be analysed

Page 84: Web Science 2.0 - in silico science

VOILA!

Page 85: Web Science 2.0 - in silico science

Neither SADI nor SHARE

knows anything about

blood chemistry, or mathematics

Page 86: Web Science 2.0 - in silico science

So how does the machine know what to do??

Page 87: Web Science 2.0 - in silico science

Ontologies

Page 88: Web Science 2.0 - in silico science

Ontologies explicitly define the kinds of things that (can) exist…

…and what those things are “like”

i.e. what properties they have (color, weight, shape, texture, temperature, “state”)

and what relationships they have to one another (inside-of, adjacent-to, part-of, binds-to, controls, inhibits,

degrades, etc.)

Page 89: Web Science 2.0 - in silico science

So we create ontologies about biology

and health

We* publish them on the Web

* We… or anybody! Anybody can publish an ontology!

Page 90: Web Science 2.0 - in silico science

My definition of a Likely Rejecter is encoded in a machine-readable document written in OWL, the Web Ontology Language

Basically:

“the regression line over creatinine measurements should have an increasing slope”
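The rule quoted above can be sketched with an ordinary least-squares fit. This is a minimal illustration assuming that "increasing slope" simply means a positive fitted slope; the patient identifiers and measurement values are made up, and the function names are not taken from SADI or SHARE.

```python
def fit_line(points):
    """Ordinary least-squares fit of y on x; returns (slope, intercept)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all measurements share one time-point")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def is_likely_rejecter(creatinine_series):
    """Assumed reading of the definition: the fitted slope is positive."""
    slope, _intercept = fit_line(creatinine_series)
    return slope > 0


# Invented (day, creatinine) series for two hypothetical patients.
PATIENTS = {
    "patient:A": [(0, 1.0), (7, 1.3), (14, 1.9)],
    "patient:B": [(0, 1.2), (7, 1.1), (14, 1.0)],
}

likely = [p for p, series in PATIENTS.items() if is_likely_rejecter(series)]
print(likely)  # prints ['patient:A']
```

Nothing in the stored data says "likely rejecter"; the class membership is computed from the measurements at the moment the question is asked.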

Page 91: Web Science 2.0 - in silico science

Our ontology refers to other ontologies (possibly published by other people) to learn about what the properties of “regression models” are

e.g. that regression models have slopes and intercepts and that slopes and intercepts have decimal values

Page 92: Web Science 2.0 - in silico science

SHARE examines the query

Looks on the Web for ontologies that describe the problem it is trying to solve,

and “reads” them

then uses that “knowledge” to figure out which data-sources and analytical tools it

needs to answer the query

Page 93: Web Science 2.0 - in silico science

The way SHARE “interprets” data varies depending on the context of the query

(i.e. which ontologies it reads – Mine? Yours?)

and on what part of the query it is trying to answer at any given moment

(which ontological concept is relevant to that clause)

Page 94: Web Science 2.0 - in silico science

Data exhibits “late binding”

Page 95: Web Science 2.0 - in silico science

Late binding:

“purpose and meaning” of the data is

not determined until the moment it is required

a.k.a. the “semantics” of the data

Page 96: Web Science 2.0 - in silico science

Benefit of late binding

Data is amenable to constant re-interpretation

Page 97: Web Science 2.0 - in silico science

Example?

Blood Creatinine measurements

were not dictated to be (only)

Blood Creatinine measurements

Page 98: Web Science 2.0 - in silico science

Example?

The data had the ‘qualities/properties’ that

allowed one machine to interpret

that they were Blood Creatinine measurements

(e.g. to determine which patients were rejecting)

Page 99: Web Science 2.0 - in silico science

Example?

But the data also had the ‘qualities/properties’ that

allowed another machine to interpret them as

Simple X/Y coordinate data

(e.g. the Linear Regression calculation tool)
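The two readings of the same data can be shown with one record and two consumers. This is only a sketch: the record layout, field names and values are hypothetical, chosen to show that neither interpretation is fixed in the data itself.

```python
# One stored record; the field names and values are invented for illustration.
measurement = {
    "patient": "patient:A",
    "analyte": "creatinine",
    "day": 7.0,
    "value": 1.3,
    "unit": "mg/dL",
}


def as_clinical_observation(record):
    """A clinical consumer reads the record as a blood chemistry result."""
    return (
        f"{record['patient']}: {record['analyte']} "
        f"{record['value']} {record['unit']} on day {record['day']}"
    )


def as_xy_point(record):
    """A regression tool reads the same record as a bare (x, y) coordinate."""
    return (record["day"], record["value"])


print(as_clinical_observation(measurement))
print(as_xy_point(measurement))  # prints (7.0, 1.3)
```

The record is never copied or converted in advance; each consumer decides what it means at the moment it uses it.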

Page 100: Web Science 2.0 - in silico science

Benefit of late binding

Data is amenable to constant re-interpretation

Page 101: Web Science 2.0 - in silico science

http://www.flickr.com/people/faernworks/

Page 102: Web Science 2.0 - in silico science

And that brings us to...

Page 103: Web Science 2.0 - in silico science

Gordon, P.M.K., Soliman, M.A., Bose, P., Trinh, Q., Sensen, C.W., Riabowol, K.: Interspecies data mining to predict novel ING-protein interactions in human. BMC genomics. 9, 426 (2008).

Page 104: Web Science 2.0 - in silico science

We built a model of the proposed answer

Page 105: Web Science 2.0 - in silico science

Our system converted the model into the experiment

Page 106: Web Science 2.0 - in silico science

The analytical tools chosen for that experiment changed depending on

context

even though the biological model driving their selection was the same

Page 107: Web Science 2.0 - in silico science

i.e.

The published model is re-usable

Page 108: Web Science 2.0 - in silico science

i.e.

The published model is re-usable

In different contexts... By different researchers

Page 109: Web Science 2.0 - in silico science

and because the model IS the experiment

the published EXPERIMENT is re-usable!!

simply point the same query at your own dataset...

Page 110: Web Science 2.0 - in silico science

The publication is an executable document!

Page 111: Web Science 2.0 - in silico science

Every component of the model

Every component of the input data

Every component of the output data

is a URL

Therefore the question, the experiment, and the answer, are immediately published IN the Web

Page 112: Web Science 2.0 - in silico science

Every component of the model

Every component of the input data

Every component of the output data

is a URL

The answer, and the knowledge derived from it, is immediately available to search engines

and moreover, can affect the outcome of other Web Science experiments

Page 113: Web Science 2.0 - in silico science
Page 114: Web Science 2.0 - in silico science

You Are Now Here!!!

Page 115: Web Science 2.0 - in silico science

Final thoughts

Page 116: Web Science 2.0 - in silico science

An experiment... based on a hypothesis

Page 117: Web Science 2.0 - in silico science

An experiment... based on a hypothesis

now modeled in OWL

Page 118: Web Science 2.0 - in silico science

Does this OWL Class represent the Hypothesis?

I think it does!

Page 119: Web Science 2.0 - in silico science

We modeled the answer... but the answer was hypothetical

Page 120: Web Science 2.0 - in silico science

Change the way we think of “hypotheses”

Page 121: Web Science 2.0 - in silico science

In Web Science 2.0

Model what the world would “look like” if your hypothesis were true

Then ask “is there any data that fits that model?”

Page 122: Web Science 2.0 - in silico science

Like the blind men examining an elephant

Seemingly different aspects of research, when viewed from the perspective of Web Science,

become the same “thing”

The Model

Page 123: Web Science 2.0 - in silico science

Our vision of Web Science 2.0

Hypothesis

Ontology

Query

Workflow

Result

These can be automatically derived through provenance information during workflow execution

Materials & Methods
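Deriving a methods record from provenance could look roughly like the sketch below, where each workflow step is wrapped so that its inputs and output are logged as it runs. The decorator, the step functions and their return values are hypothetical; this is not how SHARE records provenance, just one simple way to picture the idea.

```python
import functools

PROVENANCE = []  # one entry per executed workflow step


def recorded(step):
    """Wrap a step so that each call is appended to the provenance log."""
    @functools.wraps(step)
    def wrapper(*args, **kwargs):
        result = step(*args, **kwargs)
        PROVENANCE.append(
            {"step": step.__name__, "args": args, "kwargs": kwargs, "output": result}
        )
        return result
    return wrapper


@recorded
def find_homologs(protein):
    # Placeholder result; a real step would call a remote service.
    return [protein + "-homolog"]


@recorded
def find_interactors(proteins):
    return [p + "-partner" for p in proteins]


def methods_section():
    """Render the log as a numbered, human-readable list of steps."""
    return [
        f"{i}. {entry['step']}{entry['args']} -> {entry['output']}"
        for i, entry in enumerate(PROVENANCE, start=1)
    ]


find_interactors(find_homologs("P"))
for line in methods_section():
    print(line)
```

Because the log is written as a side effect of execution, the methods description cannot drift away from what was actually run.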

Page 124: Web Science 2.0 - in silico science
Page 125: Web Science 2.0 - in silico science

Please join us!

SADI and SHARE are Open-Source projects

http://sadiframework.org

Page 126: Web Science 2.0 - in silico science

My New Home!

Page 127: Web Science 2.0 - in silico science

Luke McCarthy – Lead Dev. Everything...

Benjamin VanderValk SHARE & SADI & Experimental modeling & myHeath Button

Soroush Samadian Cardiovascular data modeling and queries

University of British Columbia

Edward Kawas SADI Service auto-generator

Ian Wood Experimental modeling project

Page 128: Web Science 2.0 - in silico science

U of New Brunswick

Dr. Chris Baker
Alexandre Riazanov

Carleton University

Dr. Michel Dumontier
Marc-Alexandre Nolin
Leonid Chepelev
Steve Etlinger
Nichaella Kieth
Jose Cruz

C-BRASS Collaborators at other sites

Page 129: Web Science 2.0 - in silico science

Microsoft Research