Formal Structuring of Genomic Knowledge

Nigam Shah, Postdoctoral Fellow, SMI

nigam@stanford.edu

The ‘Understanding’ cycle

Formulate hypothesis

Store validated hypotheses

Design experiment to test hypothesis

Get best possible match with data

Evaluate for consistency with known information

Identify conflicts and suggest ‘corrections’

HyBrow assists in the tasks bound by the red outline

Walking along this cycle is “hard”

* The way much of biology works is by applying prior knowledge (‘what is known’) for interpreting datasets rather than the application of a set of axioms that will elicit knowledge. (Stevens et al, 2000)

* We need to explicitly articulate ‘what is known’… that’s a problem with the current information overload.

* If we explicitly articulate ‘what is known’ in an organizing framework, it serves as a reference for integrating new data with prior knowledge, and it increases our ability to fit the results into the “big picture”.

How can we make it easier?

If we design a framework for making statements, or sets of statements comprising a hypothesis, about biological processes, and systematically examine a wide variety of datasets to evaluate them, we can speed up the ‘understanding cycle’.

Events and Implicit claims

An hypothesis is a statement about relationships (among objects) within a biological system.

Protein P induces transcription of gene X

An ‘event’ is a relationship between two biological entities, which we call ‘agents’.

Implicit claims that can be tested:

1. P is a transcription factor.

2. P is a transcriptional activator.

3. P is localized to the nucleus.

4. P can bind to the promoter of gene X

[Figure: protein P at the promoter of gene X]
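The way a single event carries several testable implicit claims can be sketched in a few lines of Python. This is only an illustration, not HyBrow's implementation: the `IMPLICIT_CLAIMS` table and the `implicit_claims` function are hypothetical names, and the templates simply restate the four claims listed above.

```python
# Hypothetical lookup from a relationship to the claims it implies.
# The four templates restate the implicit claims listed on the slide.
IMPLICIT_CLAIMS = {
    "induces transcription of": [
        "{a} is a transcription factor.",
        "{a} is a transcriptional activator.",
        "{a} is localized to the nucleus.",
        "{a} can bind to the promoter of {b}.",
    ],
}


def implicit_claims(agent_a, relation, agent_b):
    """Expand one event into the list of implicit claims it makes."""
    templates = IMPLICIT_CLAIMS.get(relation, [])
    return [t.format(a=agent_a, b=agent_b) for t in templates]


claims = implicit_claims("P", "induces transcription of", "gene X")
for number, claim in enumerate(claims, start=1):
    print(f"{number}. {claim}")
```

Each returned claim can then be checked against data on its own, which is what makes the original statement testable.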

Components of a formal representation

Formal representation

Domain knowledge model (Ontology)

Conceptual framework

Establish a correspondence between the conceptual framework and the ontology

Domain information and knowledge structured into the knowledge model

Knowledgebase

Curated data = information. A large amount of information is created and stored by model organism databases.

Data = generated by researchers. Not always accessible or available in a Model Organism Database (except sequence and microarray data).

Database

The conceptual framework

The terminal symbols – which cannot be further decomposed in a grammar – are supplied by the hypothesis ontology.

This grammar, together with the hypothesis ontology, allows us to represent hypotheses in a formal language.

Event → Subject.Verb.Object
Event → Subject.Verb.Object.Context
Event → Subject.Verb.Object.Context.AssocCond
Subject → (Actor | Context | Event)
Verb → (Physical | Biochemical | Logical)
Object → (Actor | Context | Event)
Actor → (Gene | Protein | Complex …)
Context → (Physical | Genetic | Temporal)
AssocCond → (Presence of | absence of).Agent

We have specified methods to evaluate formal language hypotheses for internal consistency & agreement with existing knowledge.
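One way to picture the productions above is as a small set of Python data classes. This is a sketch under stated assumptions, not the HyBrow data model: all class and field names are hypothetical, the actor kind is left as free text because the source elides part of the Actor list, and the Agent inside AssocCond is assumed to be an Actor.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Kinds taken from the productions; both lists are complete in the source.
VERB_KINDS = {"Physical", "Biochemical", "Logical"}
CONTEXT_KINDS = {"Physical", "Genetic", "Temporal"}


@dataclass(frozen=True)
class Actor:
    kind: str  # e.g. "Gene", "Protein", "Complex"; not validated (list is elided)
    name: str


@dataclass(frozen=True)
class Context:
    kind: str
    value: str

    def __post_init__(self):
        if self.kind not in CONTEXT_KINDS:
            raise ValueError(f"unknown context kind: {self.kind}")


@dataclass(frozen=True)
class Verb:
    kind: str
    name: str

    def __post_init__(self):
        if self.kind not in VERB_KINDS:
            raise ValueError(f"unknown verb kind: {self.kind}")


@dataclass(frozen=True)
class AssocCond:
    condition: str  # "presence of" or "absence of"
    agent: Actor    # assumption: the Agent is an Actor


@dataclass(frozen=True)
class Event:
    subject: Union[Actor, Context, "Event"]
    verb: Verb
    obj: Union[Actor, Context, "Event"]
    context: Optional[Context] = None
    assoc_cond: Optional[AssocCond] = None

    def __post_init__(self):
        # The grammar only allows AssocCond after a Context.
        if self.assoc_cond is not None and self.context is None:
            raise ValueError("AssocCond requires a Context")


p = Actor("Protein", "P")
x = Actor("Gene", "X")
verb = Verb("Biochemical", "induces transcription of")
event = Event(p, verb, x, Context("Physical", "nucleus"))
print(event.subject.name, event.verb.name, event.obj.name)
```

Because Subject and Object may themselves be events, the same `Event` class can be nested to describe an event acting on another event.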

The conceptual framework

Consistency of an hypothesis with prior knowledge is evaluated by applying constraints and rules.

A constraint is a statement specifying the evidence that contradicts or supports an event.

A protein must be in the nucleus to bind to a promoter.

A rule comprises the ‘steps’ for deciding whether a constraint is satisfied or violated.

Binds_to_promoter [P, g]:

Annotation constraints

* If the cellular location of P is not the nucleus, give a penalty.

* If the biological process is not transcription, give a penalty.
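The two annotation constraints for Binds_to_promoter could be applied along the following lines. This is a minimal sketch: the function name, the annotation dictionary layout, and the penalty size of 1.0 are all assumptions (the source does not say how large a penalty is), and the toy annotations are made up for illustration.

```python
def binds_to_promoter_penalty(annotations, protein, penalty=1.0):
    """Apply the two annotation constraints to `protein`.

    Returns (total_penalty, reasons). The gene g is not needed by these
    two constraints, so it is not a parameter here.
    """
    record = annotations.get(protein, {})
    total = 0.0
    reasons = []
    if record.get("cellular_location") != "nucleus":
        total += penalty
        reasons.append(f"{protein}: cellular location is not nucleus")
    if record.get("biological_process") != "transcription":
        total += penalty
        reasons.append(f"{protein}: biological process is not transcription")
    return total, reasons


# Made-up toy annotations, not real data.
annotations = {
    "P": {"cellular_location": "nucleus", "biological_process": "transcription"},
    "Q": {"cellular_location": "cytoplasm", "biological_process": "transcription"},
}

for name in ("P", "Q"):
    print(name, binds_to_promoter_penalty(annotations, name))
```

Returning the reasons alongside the score keeps each penalty traceable to the constraint that caused it.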

Hypothesis Ontology

Expressive enough to describe the galactose system at a coarse level of detail.

It is compatible with other ontology efforts, e.g. GO, so that GO annotations can be used directly in HyBrow.

We have also developed a grammar to write hypotheses using events from this ontology.

Grammar for a hypothesis

The grammar is presented in Backus-Naur Form syntax¹

hypothesis : eventstream ;

eventstream : event
            | event STREAM_OP event
            | eventstream LOGIC_OP eventstream
            | eventstream STREAM_OP event
            | LPAREN eventstream RPAREN ;

event : EVENT_NAME
      | EVENT_NAME EQUALS event
      | AGENT AGENT_OP AGENT SYNTAX_SUGAR PHYS_CONT
      | AGENT AGENT_OP AGENT SYNTAX_SUGAR PHYS_CONT SYNTAX_SUGAR AGENT ASSOC_OP
      | AGENT AGENT_OP AGENT SYNTAX_SUGAR PHYS_CONT SYNTAX_SUGAR PERT_CONT
      | AGENT AGENT_OP AGENT SYNTAX_SUGAR PHYS_CONT SYNTAX_SUGAR PERT_CONT SYNTAX_SUGAR AGENT ASSOC_OP ;

¹ BNF is one of the most commonly used notations for specifying the syntax of programming languages, command sets, formal grammars and the like. It is widely used for language descriptions but seldom documented anywhere, so that it must usually be learned by using it.

A hypothesis consists of at least one event stream.

An event stream is a sequence of one or more events or event streams with logical joints (or operators) between them.

An event has exactly one agent_a, exactly one agent_b and exactly one operator (i.e. a relationship between the two agents). It also has a physical location that denotes ‘where’ the event happened, the genetic context of the organism and associated experimental perturbations when the event happened.

A logical joint is the conjunction between two event streams.
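To make the grammar concrete, here is a small recognizer for it, written as a sketch rather than as HyBrow's parser. The source does not give the lexical content of the terminals, so the input is assumed to be a list of token-type names (for example "EVENT_NAME" or "LOGIC_OP"). The left-recursive eventstream rule is rewritten as a loop, and all function names are hypothetical.

```python
class ParseError(ValueError):
    """Raised when the token sequence does not match the grammar."""


def _peek(tokens, i):
    return tokens[i] if i < len(tokens) else None


def _expect(tokens, i, kind):
    if _peek(tokens, i) != kind:
        raise ParseError(f"expected {kind} at position {i}")
    return i + 1


def _event(tokens, i):
    if _peek(tokens, i) == "EVENT_NAME":
        i += 1
        if _peek(tokens, i) == "EQUALS":  # EVENT_NAME EQUALS event
            return _event(tokens, i + 1)
        return i
    # Mandatory core: AGENT AGENT_OP AGENT SYNTAX_SUGAR PHYS_CONT
    for kind in ("AGENT", "AGENT_OP", "AGENT", "SYNTAX_SUGAR", "PHYS_CONT"):
        i = _expect(tokens, i, kind)
    # Optional perturbation context: SYNTAX_SUGAR PERT_CONT
    if _peek(tokens, i) == "SYNTAX_SUGAR" and _peek(tokens, i + 1) == "PERT_CONT":
        i += 2
    # Optional associated condition: SYNTAX_SUGAR AGENT ASSOC_OP
    if _peek(tokens, i) == "SYNTAX_SUGAR" and _peek(tokens, i + 1) == "AGENT":
        i = _expect(tokens, i + 2, "ASSOC_OP")
    return i


def _primary(tokens, i):
    if _peek(tokens, i) == "LPAREN":
        i = _eventstream(tokens, i + 1)
        return _expect(tokens, i, "RPAREN")
    return _event(tokens, i)


def _eventstream(tokens, i):
    i = _primary(tokens, i)
    while _peek(tokens, i) in ("STREAM_OP", "LOGIC_OP"):
        if tokens[i] == "STREAM_OP":
            i = _event(tokens, i + 1)    # STREAM_OP must be followed by an event
        else:
            i = _primary(tokens, i + 1)  # LOGIC_OP joins two event streams
    return i


def accepts(tokens):
    """Return True if the token-type sequence is a valid hypothesis."""
    try:
        return _eventstream(tokens, 0) == len(tokens)
    except ParseError:
        return False


print(accepts(["EVENT_NAME", "LOGIC_OP", "EVENT_NAME"]))
print(accepts(["AGENT", "AGENT_OP", "AGENT"]))
```

The recognizer only answers whether a token sequence is well formed; building an event tree for later evaluation would need the same functions to return nodes instead of positions.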

Constraints

A constraint is a statement specifying the evidence that supports or contradicts an event.

Types of constraints: Ontology, Data, Existence, Temporal

Example: X binds to promoter of Y

* Ontology: X must be a protein or complex; Y must be a gene.

* Data: X must be annotated to be localized to the nucleus. The promoter of Y must have a binding site for X.

* Existence: The gene for X must be present.
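The typed constraints for "X binds to promoter of Y" can be sketched as checks that each report support or conflict with an explanation. Everything named here is hypothetical: the `Finding` class, the `check_binds_to_promoter` function, and the layout and contents of the toy knowledgebase `kb` are assumptions for illustration only. Temporal constraints are left out because the slide gives no example of one.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    constraint_type: str  # "Ontology", "Data" or "Existence"
    satisfied: bool       # True = support, False = conflict
    explanation: str


def check_binds_to_promoter(x, y, kb):
    """Evaluate the ontology, data and existence constraints for one event."""
    x_type = kb["entity_type"].get(x)
    y_type = kb["entity_type"].get(y)
    location = kb["location"].get(x)
    sites = kb["binding_sites"].get(y, set())
    present = kb["gene_present"].get(x, False)
    return [
        Finding("Ontology", x_type in ("protein", "complex") and y_type == "gene",
                f"{x} is typed as {x_type}; {y} is typed as {y_type}"),
        Finding("Data", location == "nucleus",
                f"{x} is annotated to location {location}"),
        Finding("Data", x in sites,
                f"promoter of {y} has binding sites for {sorted(sites)}"),
        Finding("Existence", present,
                f"gene for {x} present: {present}"),
    ]


# Made-up toy knowledgebase: the gene for X is marked absent on purpose.
kb = {
    "entity_type": {"X": "protein", "Y": "gene"},
    "location": {"X": "nucleus"},
    "binding_sites": {"Y": {"X"}},
    "gene_present": {"X": False},
}

findings = check_binds_to_promoter("X", "Y", kb)
supports = sum(f.satisfied for f in findings)
conflicts = len(findings) - supports
print(f"support: {supports}, conflict: {conflicts}")
for f in findings:
    print(f.constraint_type, "OK" if f.satisfied else "CONFLICT", "-", f.explanation)
```

Keeping the constraint type and explanation on every finding makes it straightforward to tally support and conflict and to show the reason for each.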

Rules

A rule decides whether a constraint is satisfied or violated.

The first layer of rules enforces the constraints to decide support or conflict based on the data we have.

A second layer of rules checks the logical structure of the hypothesis.

The knowledgebase

Data sources for the HyBrow KB: Proteomics, Sequence, Literature, Microarray

Microarray

Gene    wt-gal   wt+gal   gal1+gal   gal2+gal
GAL1    -2.892   -0.087   -1.993     -0.001
GAL2    -0.822   -0.181    0.188     -0.59
GAL3    -0.307   -0.05    -0.133     -0.268
GAL4     0.688   -0.096    0.085      0.329
PGM2     0.143   -0.11    -0.43       0.13
LAP3    -0.108    0.013    0.377     -0.124
GAL7    -2.606   -0.013    0.147      0.176
GAL9    -2.427   -0.062    0.072     -0.105
GAL80   -0.508   -0.037   -0.072     -0.286

Processed data

Inferences from data

Proteomics

protein_name   ratio    method
gal1p           1.143   ICAT
gal10p          1.067   ICAT
gal2p           0.858   ICAT
gal7p           1.122   ICAT
gal5p           0.269   ICAT
gcy1p           0.144   ICAT
acc1p          -0.035   ICAT
tup1p           0.173   ICAT

MSMS

Literature

“GAL4 and the negative regulator GAL80 are constitutively expressed at low levels. Elevated GAL4 levels produce enough GAL4p to occupy the structural gene UASg elements.

In galactose, GAL4p can activate structural gene expression via the relaxation of the inhibitory function of GAL80 in the promoter-bound constitutive GAL4/GAL80 complex via the binding of GAL3…”

User interfaces

Hypothesis described in Natural Language

Biological process described in a formal language

Evaluating an hypothesis

Inference rules

Event Handler

Justification routines

Neighboring events generator

Hypothesis parser and ranking rules

Result formatter

Visual Widget

Hypothesis file

Browser

User

Database

Screen shot of the output

A list of events in the submitted hypothesis

A plot of the counts of support and conflicts

An explanation for each support / conflict with a link to the data source

HyBrow: take home

The minimum requirement for a formal representation:

* Ability to represent data, information and knowledge

* A language to express your “thought experiment” (your model, hypothesis, theory, theorem etc.)

* A reasoning framework to evaluate the outcome/validity/accuracy of your thought experiment

We should not aim to use all the data and come up with ONE model that explains everything.

* It is much better to propose a model and examine whether your data supports or contradicts it.

A clinical example

Autism is a developmental disability characterized by “severe and pervasive impairment in several areas of development.”

Nutrigenomics is gathering a lot of attention in autism treatment.

DAN! (Defeat Autism Now!) researchers sometimes refer to this as “biomedical treatment”.

Tests for deciding the optimal nutrigenomics therapy are costly and hard to interpret

Excerpt from a parent’s email

…right now, that is a manual process to relate the genetic (mutation info...) and any microbial inputs to a biochemical pathway diagram and relate the mutations to specific supplement or enzyme therapies. It costs > $1000 and 6-8 months for someone to manually interpret the results.

I was wondering if it would be helpful to develop a model to contain the static/known information and some dynamic models to help answer some interesting questions relevant to the person's data.

This might make it possible to develop tools for a physician or motivated individual to use nutrigenomic information.

Credits and acknowledgements

Stephen Racunas: Co-developer of HyBrow

Funding: NIH

Organon

An Organon is an instrument for the proper conduct and representation of scientific research.

The first Organon was written by the Ancient Greek philosopher Aristotle in the 4th century B.C., and included his works on logic and the theory of science.[1]

The second great Organon, the Novum Organum (1620) of Francis Bacon, was written as an update, extension and correction of the Aristotelian Organon in light of the success and experimental methods of post-Galilean modern natural science almost 2000 years later.[2]

[1] The works known as Aristotle’s Organon can be found in The Complete Works of Aristotle, Two Volumes (Jonathan Barnes ed.). Princeton: Princeton University Press, 1984.

[2] Bacon, F. Novum Organum (Urbach, P. and Gibson, J. transl. and eds.). Chicago: Open Court, 1994.