Formal Structuring of Genomic Knowledge
Nigam Shah, Postdoctoral Fellow, SMI
The ‘Understanding’ cycle
Formulate hypothesis
Store validated hypotheses
Design experiment to test hypothesis
Get best possible match with data
Evaluate for consistency with known information
Identify conflicts and suggest ‘corrections’
HyBrow assists in the tasks bound by the red outline
Walking along this cycle is “hard”
* The way much of biology works is by applying prior knowledge (‘what is known’) for interpreting datasets rather than the application of a set of axioms that will elicit knowledge. (Stevens et al, 2000)
* We need to explicitly articulate ‘what is known’… that’s a problem with the current information overload.
* If we explicitly articulate ‘what is known’ in an organizing framework, it serves as a reference for integrating new data with prior knowledge and increases our ability to fit the results into the “big picture”.
How can we make it easier?
If we design a framework for making statements (or sets of statements comprising a hypothesis) about biological processes, and systematically examine a wide variety of datasets to evaluate them, we can speed up the ‘understanding cycle’.
Events and Implicit claims
An hypothesis is a statement about relationships (among objects) within a biological system.
Protein P induces transcription of gene X
An ‘event’ is a relationship between two biological entities, which we call ‘agents’.
Implicit claims that can be tested:
1. P is a transcription factor.
2. P is a transcriptional activator.
3. P is localized to the nucleus.
4. P can bind to the promoter of gene X
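The implicit claims listed above can be enumerated mechanically once each kind of event is associated with the claims it entails. The following is a minimal sketch, not HyBrow code: the lookup table `IMPLICIT_CLAIMS`, the event name `induces_transcription` and the function `implicit_claims` are hypothetical names introduced for illustration.

```python
# Hypothetical mapping from an event type to the implicit claims it entails.
# The templates mirror the four claims listed above; all names are illustrative.
IMPLICIT_CLAIMS = {
    "induces_transcription": [
        "{a} is a transcription factor.",
        "{a} is a transcriptional activator.",
        "{a} is localized to the nucleus.",
        "{a} can bind to the promoter of gene {b}.",
    ],
}


def implicit_claims(agent_a, verb, agent_b):
    """Return the testable claims implied by the event (agent_a, verb, agent_b)."""
    templates = IMPLICIT_CLAIMS.get(verb, [])
    return [t.format(a=agent_a, b=agent_b) for t in templates]


# "Protein P induces transcription of gene X"
claims = implicit_claims("P", "induces_transcription", "X")
for claim in claims:
    print(claim)
```

Each returned claim is a statement that can be checked separately against annotations or data.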
[Figure: protein P, promoter, gene X]
Components of a formal representation
Formal representation
Domain knowledge model (Ontology)
Conceptual framework
Establish a correspondence between the conceptual framework and the ontology
Domain information and knowledge structured into the knowledge model
Knowledgebase
Curated data = information. Large amount of information is created & stored by model organism databases
Database
Data = generated by researchers. Not always accessible or available in a Model Organism Database (except sequence and microarray data)
The conceptual framework
The terminal symbols – which cannot be further decomposed in a grammar – are supplied by the hypothesis ontology.
This grammar, together with the hypothesis ontology, allows us to represent hypotheses in a formal language.
Event → Subject.Verb.Object
Event → Subject.Verb.Object.Context
Event → Subject.Verb.Object.Context.AssocCond
Subject → (Actor | Context | Event)
Verb → (Physical | Biochemical | Logical)
Object → (Actor | Context | Event)
Actor → (Gene | Protein | Complex …)
Context → (Physical | Genetic | Temporal)
AssocCond → (Presence of | absence of).Agent
We have specified methods to evaluate formal language hypotheses for internal consistency & agreement with existing knowledge.
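One way to picture the grammar's productions is as a small set of record types. The sketch below is an assumption about how such a structure could look, not the HyBrow implementation: the class and field names are invented here, the `Agent` of an associated condition is modeled as an `Actor`, and the Gal4p/GAL1 event is only an illustrative instance.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Actor:
    kind: str   # e.g. "Gene", "Protein", "Complex"
    name: str


@dataclass(frozen=True)
class Context:
    kind: str   # "Physical", "Genetic" or "Temporal"
    value: str


@dataclass(frozen=True)
class Verb:
    kind: str   # "Physical", "Biochemical" or "Logical"
    name: str


@dataclass(frozen=True)
class AssocCond:
    present: bool   # True for "presence of", False for "absence of"
    agent: Actor    # assumption: the Agent is represented as an Actor


@dataclass(frozen=True)
class Event:
    subject: Union[Actor, Context, "Event"]
    verb: Verb
    obj: Union[Actor, Context, "Event"]
    context: Optional[Context] = None
    assoc_cond: Optional[AssocCond] = None

    def __post_init__(self):
        # The productions only allow AssocCond after a Context.
        if self.assoc_cond is not None and self.context is None:
            raise ValueError("AssocCond is only allowed after a Context")


# Illustrative event: Subject.Verb.Object.Context
binding = Event(
    Actor("Protein", "Gal4p"),
    Verb("Physical", "binds_to_promoter"),
    Actor("Gene", "GAL1"),
    context=Context("Physical", "nucleus"),
)
print(binding)
```

Because `Subject` and `Object` may themselves be events, the `Event` type is recursive, which lets one event act on another.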
The conceptual framework
Consistency of an hypothesis with prior knowledge is evaluated by applying constraints and rules.
A constraint is a statement specifying the evidence that contradicts or supports an event.
A protein must be in the nucleus to bind to a promoter.
A rule comprises the ‘steps’ for deciding whether a constraint is satisfied or violated.
Binds_to_promoter [P, g]:
Annotation constraints
* if cellular location of P is not nucleus, give a penalty.
* if biological process is not transcription, give a penalty.
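The two annotation constraints for Binds_to_promoter can be sketched as a scoring function. This is an illustration under stated assumptions rather than the actual rule: the `ANNOTATIONS` store, the field names, the unit penalty per violation, and the choice to treat a missing annotation as a violation are all assumptions made here.

```python
# Toy annotation store; the entries are illustrative, not curated data.
ANNOTATIONS = {
    "P": {"cellular_location": "nucleus", "biological_process": "transcription"},
    "Q": {"cellular_location": "cytoplasm", "biological_process": "transcription"},
}


def binds_to_promoter_penalty(protein, annotations=ANNOTATIONS):
    """Count violated annotation constraints for Binds_to_promoter[protein, g].

    Assumption: each violated constraint adds a penalty of 1, and a missing
    annotation counts as a violation.
    """
    record = annotations.get(protein, {})
    penalty = 0
    if record.get("cellular_location") != "nucleus":
        penalty += 1
    if record.get("biological_process") != "transcription":
        penalty += 1
    return penalty


for name in ("P", "Q", "R"):
    print(name, binds_to_promoter_penalty(name))
```

A real system would likely weight penalties by evidence quality; the flat count above only shows the shape of the rule.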
Hypothesis Ontology
Expressive enough to describe the galactose system at a coarse level of detail.
It is compatible with other ontology efforts, e.g. GO, so that GO annotations can be used directly in HyBrow.
We have also developed a grammar to write hypotheses using events from this ontology.
The grammar is presented in the Backus-Naur Form syntax (note 1):

    hypothesis  : eventstream
                ;

    eventstream : event
                | event STREAM_OP event
                | eventstream LOGIC_OP eventstream
                | eventstream STREAM_OP event
                | LPAREN eventstream RPAREN
                ;

    event       : EVENT_NAME
                | EVENT_NAME EQUALS event
                | AGENT AGENT_OP AGENT SYNTAX_SUGAR PHYS_CONT
                | AGENT AGENT_OP AGENT SYNTAX_SUGAR PHYS_CONT SYNTAX_SUGAR AGENT ASSOC_OP
                | AGENT AGENT_OP AGENT SYNTAX_SUGAR PHYS_CONT SYNTAX_SUGAR PERT_CONT
                | AGENT AGENT_OP AGENT SYNTAX_SUGAR PHYS_CONT SYNTAX_SUGAR PERT_CONT SYNTAX_SUGAR AGENT ASSOC_OP
                ;

1 BNF is one of the most commonly used notations for specifying the syntax of programming languages, command sets, formal grammars and the like. It is widely used for language descriptions but seldom documented anywhere, so that it must usually be learned by using it.
Grammar for a hypothesis
A hypothesis consists of at least one event stream.
An event stream is a sequence of one or more events or event streams with logical joints (or operators) between them.
An event has exactly one agent_a, exactly one agent_b and exactly one operator (i.e. a relationship between the two agents). It also has a physical location that denotes ‘where’ the event happened, the genetic context of the organism and associated experimental perturbations when the event happened.
A logical joint is the conjunction between two event streams.
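The description above (a hypothesis holds event streams, a stream holds events or nested streams connected by logical joints) maps naturally onto a recursive data structure. The sketch below is a simplified assumption, not the HyBrow parser: `SimpleEvent`, `EventStream`, the field names, the joint labels "and"/"or", and the example agents and operators are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class SimpleEvent:
    agent_a: str
    operator: str            # the relationship between the two agents
    agent_b: str
    physical_location: str = ""
    genetic_context: str = ""
    perturbation: str = ""


@dataclass
class EventStream:
    parts: List[Union[SimpleEvent, "EventStream"]]
    joints: List[str]        # one logical joint between each pair of consecutive parts

    def __post_init__(self):
        if not self.parts:
            raise ValueError("an event stream needs at least one event")
        if len(self.joints) != len(self.parts) - 1:
            raise ValueError("need exactly one joint between consecutive parts")

    def events(self):
        """Flatten nested streams into a list of events, left to right."""
        out = []
        for part in self.parts:
            out.extend(part.events() if isinstance(part, EventStream) else [part])
        return out


e1 = SimpleEvent("Gal4p", "binds_to_promoter", "GAL1", physical_location="nucleus")
e2 = SimpleEvent("Gal80p", "binds", "Gal4p", physical_location="nucleus")
e3 = SimpleEvent("Gal3p", "binds", "Gal80p", physical_location="cytoplasm")

inner = EventStream([e2, e3], ["or"])
hypothesis = EventStream([e1, inner], ["and"])
print(len(hypothesis.events()))
```

Flattening the stream gives the list of events that would each be checked against the constraints.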
Constraints
A constraint is a statement specifying the evidence that supports or contradicts an event.
Types of constraints: Ontology, Data, Existence, Temporal
Example event: X binds to promoter of Y
Ontology: X must be a protein or complex; Y must be a gene.
Data: X must be annotated to be localized to the nucleus. The promoter of Y must have a binding site for X.
Existence: The gene for X must be present.
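The ontology, data and existence constraints for "X binds to promoter of Y" can be checked against a small knowledge base as follows. This is a sketch under assumptions: the layout of `KB`, its keys, and the function `check_binds_to_promoter` are hypothetical, and no temporal constraint is included because the example above does not give one.

```python
# Toy knowledge base; every entry is illustrative, not HyBrow data.
KB = {
    "types": {"X": "protein", "Y": "gene"},
    "localization": {"X": "nucleus"},
    "binding_sites": {"Y": {"X"}},   # the promoter of Y has a binding site for X
    "genes_present": {"X", "Y"},     # the gene for X is present
}


def check_binds_to_promoter(x, y, kb):
    """Return 'support' or 'conflict' per constraint type for 'x binds to promoter of y'."""
    ontology_ok = (kb["types"].get(x) in {"protein", "complex"}
                   and kb["types"].get(y) == "gene")
    data_ok = (kb["localization"].get(x) == "nucleus"
               and x in kb["binding_sites"].get(y, set()))
    existence_ok = x in kb["genes_present"]
    checks = {"ontology": ontology_ok, "data": data_ok, "existence": existence_ok}
    return {name: ("support" if ok else "conflict") for name, ok in checks.items()}


result = check_binds_to_promoter("X", "Y", KB)
print(result)
```

Keeping the verdict per constraint type, rather than a single pass/fail, makes it possible to explain which kind of evidence contradicts an event.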
Rules
A rule decides whether a constraint is satisfied or violated.
The first layer of rules enforces the constraints to decide support or conflict based on the data we have.
A second layer of rules checks the logical structure of the hypothesis.
The knowledgebase
Proteomics
Sequence
Literature
Microarray
HyBrow KB

         wt-gal   wt+gal   gal1+gal   gal2+gal
GAL1     -2.892   -0.087   -1.993     -0.001
GAL2     -0.822   -0.181    0.188     -0.59
GAL3     -0.307   -0.05    -0.133     -0.268
GAL4      0.688   -0.096    0.085      0.329
PGM2      0.143   -0.11    -0.43       0.13
LAP3     -0.108    0.013    0.377     -0.124
GAL7     -2.606   -0.013    0.147      0.176
GAL9     -2.427   -0.062    0.072     -0.105
GAL80    -0.508   -0.037   -0.072     -0.286

Processed data
Inferences from data

protein_name   ratio    method
gal1p           1.143   ICAT
gal10p          1.067   ICAT
gal2p           0.858   ICAT
gal7p           1.122   ICAT
gal5p           0.269   ICAT
gcy1p           0.144   ICAT
acc1p          -0.035   ICAT
tup1p           0.173   ICAT

MSMS

“GAL4 and the negative regulator GAL80 are constitutively expressed at low levels. Elevated GAL4 levels produce enough GAL4p to occupy the structural gene UASg elements.
In galactose, GAL4p can activate structural gene expression via the relaxation of the inhibitory function of GAL80 in the promoter-bound constitutive GAL4/GAL80 complex via the binding of GAL3…”
User interfaces
Hypothesis described in Natural Language
Biological process described in a formal language
Evaluating an hypothesis
Inference rules
Event Handler
Justification routines
Neighboring events generator
Hypothesis parser and ranking rules
Result formatter
Visual Widget
Hypothesis file
Browser
User
Database
Screen shot of the output
A list of events in the submitted hypothesis
A plot of the counts of support and conflicts
An explanation for each support / conflict with a link to the data source
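The counts of support and conflicts shown in the output could be tallied per event along these lines. The record format (event, verdict, data source) and the sample records are hypothetical and only illustrate the bookkeeping, not the tool's real output.

```python
from collections import Counter

# Hypothetical evaluation records: (event, verdict, data source).
records = [
    ("e1", "support", "literature"),
    ("e1", "conflict", "microarray"),
    ("e2", "support", "annotation"),
    ("e2", "support", "literature"),
]


def tally(records):
    """Count 'support' and 'conflict' verdicts for each event."""
    counts = {}
    for event, verdict, _source in records:
        counts.setdefault(event, Counter())[verdict] += 1
    return counts


counts = tally(records)
for event, c in sorted(counts.items()):
    print(event, "support:", c["support"], "conflict:", c["conflict"])
```

Keeping the data source in each record is what allows every support or conflict to be linked back to its evidence.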
HyBrow: take home
The minimum requirement for a formal representation:
* Ability to represent data, information and knowledge
* A language to express your “thought experiment” (your model, hypothesis, theory, theorem etc.)
* A reasoning framework to evaluate the outcome/validity/accuracy of your thought experiment
We should not aim to use all the data and come up with ONE model that explains everything. It is much better to propose a model and examine if your data supports/contradicts it.
A clinical example
Autism is a developmental disability characterized by “severe and pervasive impairment in several areas of development.”
Nutrigenomics is gathering a lot of attention in autism treatment. DAN! (Defeat Autism Now!) researchers sometimes refer to this as “biomedical treatment”.
Tests for deciding the optimal nutrigenomics therapy are costly and hard to interpret
Excerpt from a parent’s email
…right now, that is a manual process to relate the genetic (mutation info...) and any microbial inputs to a biochemical pathway diagram and relate the mutations to specific supplement or enzyme therapies. It costs > $1000 and 6-8 months for someone to manually interpret the results.
I was wondering if it would be helpful to develop a model to contain the static/known information and some dynamic models to help answer some interesting questions relevant to the person's data.
This might make it possible to develop tools for a physician or motivated individual to use nutrigenomic information.
Credits and acknowledgements
Stephen Racunas: Co-developer of HyBrow
Funding: NIH
Organon
An Organon is an instrument for the proper conduct and representation of scientific research. The first Organon was written by the Ancient Greek philosopher Aristotle in the 4th Century B.C., and included his works on logic and the theory of science.[1]
The second great Organon, the Novum Organum (1620) of Francis Bacon, was written as an update, extension and correction of the Aristotelian Organon in light of the success and experimental methods of post-Galilean modern natural science almost 2000 years later.[2]
[1] The works known as Aristotle’s Organon can be found in The Complete Works of Aristotle, Two Volumes (Jonathan Barnes ed.). Princeton: Princeton University Press, 1984.
[2] Bacon, F. Novum Organum (Urbach, P. and Gibson, J. transl. and eds.). Chicago: Open Court, 1994.