Comparing three models of scientific discourse annotation
for enhanced knowledge extraction
Anita de Waard, Maria Liakata, Paul Thompson, Raheel Nawaz and Sophia Ananiadou
Accessing the knowledge in papers
- Papers are ‘Stories that persuade with data’
- So how is this persuasion done? Three ways of annotating key rhetorical moves:
  - Discourse segment types (de Waard, Elsevier/Utrecht)
  - Zones of conceptualisation using Core Scientific Concepts (Liakata, Aberystwyth/EBI)
  - Metaknowledge annotation of BioEvents (Thompson, Ananiadou et al, NaCTeM/Manchester)
- Comparison of 3 methods on full-text paper
- What are overlaps/differences? Can we combine?
“Scientific articles are stories...”

| The Story of Goldilocks and the Three Bears | Story Grammar (element) | Story Grammar (section) | Paper (element) | The AXH Domain of Ataxin-1 Mediates Neurodegeneration through Its Interaction with Gfi-1/Senseless Proteins |
|---|---|---|---|---|
| Once upon a time | Time | Setting | Background | The mechanisms mediating SCA1 pathogenesis are still not fully understood, but some general principles have emerged. |
| a little girl named Goldilocks | Characters | Setting | Objects of study | the Drosophila Atx-1 homolog (dAtx-1) which lacks a polyQ tract, |
| She went for a walk in the forest. Pretty soon, she came upon a house. | Location | Setting | Experimental setup | studied and compared in vivo effects and interactions to those of the human protein |
| She knocked and, when no one answered, | Goal | Theme | Research goal | Gain insight into how Atx-1's function contributes to SCA1 pathogenesis. How these interactions might contribute to the disease process and how they might cause toxicity in only a subset of neurons in SCA1 is not fully understood. |
| she walked right in. | Attempt | Theme | Hypothesis | Atx-1 may play a role in the regulation of gene expression |
| At the table in the kitchen, there were three bowls of porridge. | Name | Episode 1 | Name | dAtx-1 and hAtx-1 Induce Similar Phenotypes When Overexpressed in Flies |
| Goldilocks was hungry. | Subgoal | Episode 1 | Subgoal | test the function of the AXH domain |
| She tasted the porridge from the first bowl. | Attempt | Episode 1 | Method | overexpressed dAtx-1 in flies using the GAL4/UAS system (Brand and Perrimon, 1993) and compared its effects to those of hAtx-1. |
| This porridge is too hot! she exclaimed. | Outcome | Episode 1 | Results | Although at 2 days after eclosion, overexpression of either Atx-1 does not show obvious morphological changes in the photoreceptor cells |
| So, she tasted the porridge from the second bowl. | Activity | Episode 1 | Data | (data not shown), |
| This porridge is too cold, she said | Outcome | Episode 1 | Results | both genotypes show many large holes and loss of cell integrity at 28 days |
| So, she tasted the last bowl of porridge. | Activity | Episode 1 | Data | (Figures 1B-1D). |
| Ahhh, this porridge is just right, she said happily and | Outcome | Episode 1 | Results | Overexpression of dAtx-1 using the GMR-GAL4 driver also induces eye abnormalities. The external structures of the eyes that overexpress dAtx-1 show disorganized ommatidia and loss of interommatidial bristles |
| she ate it all up. | | Episode 1 | Data | (Figure 1F), |
“...that persuade (reviewers/readers)…”

| Aristotle | Quintilian | Quintilian (description) | Scientific Paper |
|---|---|---|---|
| prooimion | Introduction / exordium | The introduction of a speech, where one announces the subject and purpose of the discourse, and where one usually employs the persuasive appeal to ethos in order to establish credibility with the audience. | Introduction: positioning |
| prothesis | Statement of Facts / narratio | The speaker here provides a narrative account of what has happened and generally explains the nature of the case. | Introduction: research question |
| | Summary / propositio | The propositio provides a brief summary of what one is about to speak on, or concisely puts forth the charges or accusation. | Summary of contents |
| pistis | Proof / confirmatio | The main body of the speech where one offers logical arguments as proof. The appeal to logos is emphasized here. | Results |
| | Refutation / refutatio | As the name connotes, this section of a speech was devoted to answering the counterarguments of one's opponent. | Related Work |
| epilogos | peroratio | Following the refutatio and concluding the classical oration, the peroratio conventionally employed appeals through pathos, and often included a summing up. | Discussion: summary, implications. |
“... with data.”
Both seminomas and the EC component of nonseminomas share features with ES cells. To exclude that the detection of miR-371-3 merely reflects its expression pattern in ES cells, we tested by RPA miR-302a-d, another ES cells-specific miRNA cluster (Suh et al, 2004). In many of the miR-371-3 expressing seminomas and nonseminomas, miR-302a-d was undetectable (Figs S7 and S8), suggesting that miR-371-3 expression is a selective event during tumorigenesis.
Annotate: fine-grained models of argumentation
Method 1: Discourse Segment Types
The passage, split into discourse segments:
- Both seminomas and the EC component of nonseminomas share features with ES cells.
- To exclude that
- the detection of miR-371-3 merely reflects its expression pattern in ES cells,
- we tested by RPA miR-302a-d, another ES cells-specific miRNA cluster (Suh et al, 2004).
- In many of the miR-371-3 expressing seminomas and nonseminomas, miR-302a-d was undetectable (Figs S7 and S8),
- suggesting that
- miR-371-3 expression is a selective event during tumorigenesis.
Segment types: Fact, Hypothesis, Method, Result, Implication, Goal, Reg-Implication
Realms of discourse: Conceptual knowledge, Experimental evidence
Segment types point to realms of discourse:
- Concepts, models, ‘facts’: present tense
  - (1) Both seminomas and the EC component of nonseminomas share features with ES cells. [Fact]
  - (2) b. the detection of miR-371-3 merely reflects its expression pattern in ES cells, [Problem]
  - (3) c. miR-371-3 expression is a selective event during tumorigenesis. [Implication]
- Transitions: present tense
  - (2) a. To exclude that [Goal]
  - (3) b. suggesting that [Regulatory-Implication]
- Experiment: past tense
  - (2) c. we tested by RPA miR-302a-d, another ES cells-specific miRNA cluster (Suh et al, 2004). [Method]
  - (3) a. In many of the miR-371-3 expressing seminomas and nonseminomas, miR-302a-d was undetectable (Figs S7 and S8), [Result]
Method 2: Annotate with Core Scientific Concepts (CoreSC) Annotation Scheme
A three-layer, ontology-motivated scheme for sentence annotation, which views a paper as the human-readable representation of a scientific investigation [Liakata et al 2010], with 45-page guidelines [Liakata & Soldatova 2008]
- 1st layer: Core Scientific Concepts (CoreSCs): Hypothesis, Motivation, Goal, Object, Background, Method, Experiment, Model, Observation, Result, Conclusion
- 2nd layer: Properties of CoreSCs: Novelty (New/Old) and Advantage (Advantage/Disadvantage)
- 3rd layer: Concept Identifiers, linking together sentences that refer to the same instance of a CoreSC
CoreSC Annotation Scheme (layers 1&2)

| Category | Description |
|---|---|
| Hypothesis | A statement not yet confirmed rather than a fact |
| Motivation | The reasons behind an investigation |
| Background | Background knowledge & previous work |
| Goal | A target state of the investigation |
| Object-New | A main product or theme of the investigation |
| Object-New-Advantage | Advantage of an object |
| Object-New-Disadvantage | Disadvantage of an object |
| Method-New | Means by which the goals of the investigation are achieved |
| Method-New-Advantage | Advantage of a Method |
| Method-New-Disadvantage | Disadvantage of a Method |
| Method-Old | A method pertaining to previous work |
| Method-Old-Disadvantage | Disadvantage of method in previous work |
| Method-Old-Advantage | Advantage of method in previous work |
| Experiment | An experimental method |
| Model | Statement about theoretical model, method or framework |
| Observation | Data/phenomena recorded in an investigation |
| Result | Factual statements about the results of an investigation |
| Conclusion | Statements inferred from observations and results |
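To make the three layers concrete, here is a minimal Python sketch of one way a CoreSC sentence annotation could be held in memory. The class name `CoreSCAnnotation`, its field names, and the helper `group_by_concept_id` are illustrative assumptions, not part of the CoreSC tooling; only the 18 category labels are taken from the scheme above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# The 18 layer 1&2 labels exactly as listed in the scheme above.
CORESC_LABELS = {
    "Hypothesis", "Motivation", "Background", "Goal",
    "Object-New", "Object-New-Advantage", "Object-New-Disadvantage",
    "Method-New", "Method-New-Advantage", "Method-New-Disadvantage",
    "Method-Old", "Method-Old-Disadvantage", "Method-Old-Advantage",
    "Experiment", "Model", "Observation", "Result", "Conclusion",
}


@dataclass(frozen=True)
class CoreSCAnnotation:
    """Hypothetical container for one annotated sentence."""
    sentence: str
    concept: str                      # layer 1: the Core Scientific Concept
    novelty: Optional[str] = None     # layer 2: "New" or "Old"
    advantage: Optional[str] = None   # layer 2: "Advantage" or "Disadvantage"
    concept_id: Optional[str] = None  # layer 3: shared by co-referring sentences

    def label(self) -> str:
        # Join the layers that are set, e.g. "Method-New-Advantage".
        parts = [self.concept, self.novelty, self.advantage]
        return "-".join(p for p in parts if p)

    def __post_init__(self):
        # Reject combinations that are not among the 18 listed labels.
        if self.label() not in CORESC_LABELS:
            raise ValueError(f"not a CoreSC label: {self.label()}")


def group_by_concept_id(annotations):
    """Layer 3: collect sentences that refer to the same concept instance."""
    groups = defaultdict(list)
    for ann in annotations:
        if ann.concept_id is not None:
            groups[ann.concept_id].append(ann.sentence)
    return dict(groups)
```

Validating against the full label list, rather than against the layers separately, keeps combinations such as Object-Old out, since the scheme lists Old only for methods.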
CoreSC Annotation tool:
Method 3: Bio-event Annotation
- A dynamic biological relationship involving one or more participants
We found that Y activates the expression of X
ID: E1
TRIGGER: expression
TYPE: GENE_EXPRESSION
THEME: X : gene
CAUSE: none (empty)
ID: E2
TRIGGER: activates
TYPE: POSITIVE_REGULATION
THEME: E1 : event
CAUSE: Y : protein
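The nesting in this example, where E2 takes the event E1 as its theme, can be sketched as a small recursive data structure. This is an illustrative Python sketch only: the class names `Entity` and `BioEvent` and the helper `describe` are assumptions made here, not the format used by the annotation tools.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Entity:
    """A named participant such as a gene or protein (hypothetical class)."""
    text: str
    kind: str


@dataclass(frozen=True)
class BioEvent:
    """An event centred on a trigger word; participants may be events too."""
    id: str
    trigger: str
    type: str
    theme: Union[Entity, "BioEvent"]
    cause: Optional[Union[Entity, "BioEvent"]] = None


def describe(node) -> str:
    """Render an entity or a (possibly nested) event as a compact string."""
    if isinstance(node, Entity):
        return node.text
    args = [f"theme={describe(node.theme)}"]
    if node.cause is not None:
        args.append(f"cause={describe(node.cause)}")
    return f"{node.type}({', '.join(args)})"


# "We found that Y activates the expression of X"
e1 = BioEvent("E1", "expression", "GENE_EXPRESSION", theme=Entity("X", "gene"))
e2 = BioEvent("E2", "activates", "POSITIVE_REGULATION",
              theme=e1, cause=Entity("Y", "protein"))

print(describe(e2))
# prints: POSITIVE_REGULATION(theme=GENE_EXPRESSION(theme=X), cause=Y)
```

Letting `theme` hold either an entity or another event is what allows a regulation event to point at the expression event it regulates.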
Meta-Knowledge annotation scheme for BioEvents
Bio-Event (centred on an event trigger):
- Class / Type (grounded to an event ontology)
- Participants: Theme(s), Actor(s)
- Knowledge Type: Investigation, Observation, Analysis, General
- Manner: High, Low, Neutral
- Certainty Level: L3, L2, L1
- Polarity: Negative, Positive
- Source: Other, Current
Currently being applied to the entire GENIA event corpus (1000 MEDLINE abstracts)
BioEvent/MetaKnowledge Annotation
S3 = These results suggest that Y has no effect on expression of X

| Event | Knowledge Type | Certainty Level | Lexical Polarity | Manner | Source |
|---|---|---|---|---|---|
| E1 | General | L3 | Positive | Neutral | Current |
| E2 | Analysis | L2 | Negative | Neutral | Current |
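As a rough illustration of how these meta-knowledge values might be consumed downstream, the Python sketch below stores the two annotations for S3 and flags events that are hedged or negated. The names `MetaKnowledge`, `DIMENSIONS` and `hedged_or_negated` are hypothetical, and the filter assumes that L3 is the highest certainty level, which is consistent with the example but is not stated on the slide.

```python
from dataclasses import dataclass

# Allowed values per dimension, as listed in the scheme above.
DIMENSIONS = {
    "knowledge_type": {"Investigation", "Observation", "Analysis", "General"},
    "certainty_level": {"L3", "L2", "L1"},
    "polarity": {"Positive", "Negative"},
    "manner": {"High", "Low", "Neutral"},
    "source": {"Current", "Other"},
}


@dataclass(frozen=True)
class MetaKnowledge:
    """Hypothetical record of the five meta-knowledge dimensions of one event."""
    event_id: str
    knowledge_type: str
    certainty_level: str
    polarity: str
    manner: str
    source: str

    def __post_init__(self):
        # Check every dimension against its allowed values.
        for name, allowed in DIMENSIONS.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")


# S3 = "These results suggest that Y has no effect on expression of X"
s3 = [
    MetaKnowledge("E1", "General", "L3", "Positive", "Neutral", "Current"),
    MetaKnowledge("E2", "Analysis", "L2", "Negative", "Neutral", "Current"),
]


def hedged_or_negated(annotations):
    """Return ids of events that are below L3 certainty or negated.

    Assumption: L3 is the highest certainty level.
    """
    return [a.event_id for a in annotations
            if a.certainty_level != "L3" or a.polarity == "Negative"]


print(hedged_or_negated(s3))  # prints: ['E2']
```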
Comparing 3 annotation systems

| Name | Purpose | Granularity | Manual/Automated |
|---|---|---|---|
| CoreSC | Identify main components of scientific investigation for machine learning | Sentence | Manual corpus, automated annotation tools |
| MetaKnowledge/BioEvents | Enhance information extraction for biomedical texts to enable metadiscourse annotation | Events (intra-sentential): can be several per sentence, or one spanning more than one sentence | Manual corpus, working on automated |
| Discourse Segment Types | Identify mechanisms of conveying (epistemic) knowledge in scientific discourse | Clause | Manual |
3 Annotation Systems on the same paper:
CoreSC:
<annotationART atype="GSC" type="Res" conceptID="Res24" novelty="None" advantage="None">Here we show that BOB.1/OBF.1 regulates Btk gene expression.</annotationART>
BioEvent/MetaKnowledge:
<sentence id="S6">Here we show that <term id="T13" sem="Protein_family_or_group"><gene-or-gene-product id="G9">BOB.1</gene-or-gene-product>/<gene-or-gene-product id="G10">OBF.1</gene-or-gene-product></term> regulates <term id="T14" sem="Biological_process"><term id="T15" sem="DNA_domain_or_region"><gene-or-gene-product id="G11">Btk</gene-or-gene-product> gene</term> expression</term>. </sentence>
<event KT="Gen-Other" CL="L3" Manner="Neutral" Polarity="Positive" Source="Current" id="E16"><type class="Gene_expression"/><theme idref="G11"/><clue>Here we show that BOB.1/OBF.1 regulates Btk gene <clueType>expression</clueType>. </clue></event>
<event KT="Analysis" CL="L3" Manner="Neutral" Polarity="Positive" Source="Current" id="E17"><type class="Regulation"/><theme idref="E16"/><cause idref="T13"/><clue>Here we <clueKT>show</clueKT> that BOB.1/OBF.1 <clueType>regulates</clueType> Btk gene expression. </clue></event>
Discourse Segments:
<segment segID ="286" section = "D" segtype = "RegImplication">Here we show that</segment><segment segID ="287" section = "D" segtype = "Implication">BOB.1/OBF.1 regulates Btk gene expression.</segment>
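As a rough illustration of how the meta-knowledge attributes on the two events above could be read programmatically, the sketch below parses them with Python's standard library and combines Knowledge Type, Certainty Level and Source into labels of the form used later in this comparison (e.g. Analysis_L3_Current). The wrapping `<events>` element, the helper names and the hyphen-to-underscore convention are assumptions made for the example, not part of the annotation scheme; the clue elements are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# The two events from the slide, wrapped in a hypothetical <events> root
# so that they form a single well-formed document.
EVENTS_XML = """<events>
<event KT="Gen-Other" CL="L3" Manner="Neutral" Polarity="Positive" Source="Current" id="E16">
  <type class="Gene_expression"/><theme idref="G11"/>
</event>
<event KT="Analysis" CL="L3" Manner="Neutral" Polarity="Positive" Source="Current" id="E17">
  <type class="Regulation"/><theme idref="E16"/><cause idref="T13"/>
</event>
</events>"""


def meta_label(event):
    """Join Knowledge Type, Certainty Level and Source into one label.

    Assumption: hyphens in KT become underscores (Gen-Other -> Gen_Other).
    """
    kt = event.get("KT").replace("-", "_")
    return f"{kt}_{event.get('CL')}_{event.get('Source')}"


def summarise(xml_text):
    """Map each event id to (event type, combined meta-knowledge label)."""
    root = ET.fromstring(xml_text)
    return {
        ev.get("id"): (ev.find("type").get("class"), meta_label(ev))
        for ev in root.iter("event")
    }


summary = summarise(EVENTS_XML)
for event_id, (event_type, label) in summary.items():
    print(event_id, event_type, label)
```

Under these assumptions, the Regulation event E17 is labelled Analysis_L3_Current, while the embedded Gene_expression event E16 is labelled Gen_Other_L3_Current.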
CoreSC vs Event Meta-knowledge
- Meta-knowledge event annotation can help to provide a more fine-grained analysis of CoreSC Background.
- Certainty Level and Source can help to refine Results and Conclusions
- More straightforward mappings occur between other categories, e.g. most sentences of the Motivation category contain only events of type Investigation.
- Categories such as Goal and Object are catered for by CoreSCs but not covered by the meta-knowledge scheme.
- Observation_L3_Current can be refined into CoreSC Obs, Res, Con and Hyp
CoreSC vs Segments
- In most cases there is a natural mapping between the two schemes:
  - CoreSC Observation maps to Result; Res maps to Result and Implication.
  - CoreSC Conclusion maps to Implication and Hypothesis.
  - Implication consists of CoreSC Conclusion and Result.
  - Fact is CoreSC Background and Conclusion.
  - Hypothesis is CoreSC Hypothesis and Conclusion.
  - Problem is CoreSC Motivation.
- Most of CoreSC Bac maps to Fact and the Other categories, which refine it.
- CoreSC refines Method and Result Segments
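The mappings listed above can be written down as a small set of (CoreSC, segment type) pairs and queried in either direction. This is only a sketch: the pairs are the union of the statements on this slide, and the names `MAPPING_PAIRS`, `segments_for` and `corescs_for` are hypothetical.

```python
# (CoreSC category, discourse segment type) pairs read off the slide.
MAPPING_PAIRS = {
    ("Observation", "Result"),
    ("Result", "Result"),
    ("Result", "Implication"),
    ("Conclusion", "Implication"),
    ("Conclusion", "Hypothesis"),
    ("Conclusion", "Fact"),
    ("Background", "Fact"),
    ("Hypothesis", "Hypothesis"),
    ("Motivation", "Problem"),
}


def segments_for(coresc):
    """Segment types that a given CoreSC category can map to."""
    return {seg for c, seg in MAPPING_PAIRS if c == coresc}


def corescs_for(segment):
    """CoreSC categories that a given segment type can correspond to."""
    return {c for c, seg in MAPPING_PAIRS if seg == segment}


print(sorted(segments_for("Conclusion")))
print(sorted(corescs_for("Result")))
```

Keeping the relation as pairs rather than a one-way dictionary reflects that the mapping is many-to-many: a single CoreSC category can correspond to several segment types and vice versa.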
Segments vs Event Meta-knowledge
- Schemes can be complementary to each other
- Segment types can refine the interpretation of Analysis events into Hypothesis, Implication or Result.
- Certainty level can help determine the confidence ascribed to the segments
- Likewise, meta-knowledge can help to distinguish Result segments that correspond either to analyses of results or experimental observations.
Conclusions (in detail):
Common categories across the three schemes:
- (CoreSC Observation, Observation_L3_Current, Result)
- (CoreSC Hypothesis, Analysis_L2_Current, Hypothesis)
- (CoreSC Motivation, Investigation_L3_Current, Problem)
Categories that need refining from the three schemes:
- CoreSC: Background, Conclusion
- Metaknowledge: Gen_Other_L3_Current, Observation_L3_Current
- Segments: Method and Result
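One way to make the three common categories usable in practice is an alignment table that translates a label from one scheme into another. The sketch below assumes only the three triples listed above; the `Alignment` type, the field names and the `translate` helper are hypothetical names chosen for the example.

```python
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    coresc: str
    metaknowledge: str
    segment: str


# The three categories reported as common to all schemes.
COMMON = [
    Alignment("Observation", "Observation_L3_Current", "Result"),
    Alignment("Hypothesis", "Analysis_L2_Current", "Hypothesis"),
    Alignment("Motivation", "Investigation_L3_Current", "Problem"),
]


def translate(label: str, source: str, target: str) -> Optional[str]:
    """Translate a label between schemes; None if no common category exists."""
    for row in COMMON:
        if getattr(row, source) == label:
            return getattr(row, target)
    return None


print(translate("Problem", "segment", "coresc"))
print(translate("Analysis_L2_Current", "metaknowledge", "segment"))
```

Labels outside the common set, such as CoreSC Background, return `None` here, which matches the observation that those categories still need refining before they can be aligned.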
The three schemes have different strengths and offer annotation at different levels:
- CoreSC: complements the other two schemes with finer-grained Methods, Objectives and Results.
- Metaknowledge: Certainty levels and Source can help to refine the interpretation of certain CoreSC and segment types.
- Segments: Refinement of Background; signals for modality cues
Conclusions (general)
- Very small example, shows differences can be overcome. Each has advantages:
- Clause-level is most precise for identifying core claims
- Knowledge type/Certainty level are important refinements
- CoreSC refines methods and results and shows most promise for automated recognition
- So we need to work together!
- Plan to join forces; work on joint corpus
- Other work to add: KEfED, SWAN, ScholOnto
- Together develop a ‘claim identifier’ (not a fact extractor) + standards for modality/evidence scales and types
- Work together towards claim-evidence network representation! (cf also Hypotheses, Evidence and Relationships)
The goal of the Workshop on “Models of Scientific Discourse Annotation” is to compare and contrast the motivation behind efforts in the discourse annotation of scientific text and the techniques and principles applied in the various approaches, and to discuss ways in which they can complement each other and collaborate to form standards for an optimal method of annotating appropriate levels of discourse, with enhanced accuracy and usefulness.
We wish to compare, contrast and evaluate different scientific discourse annotation schemes and tools, in order to answer questions such as:
- What motivates a certain level, method, viewpoint for annotating scientific text?
- What is the annotation level for a unit of argumentation: an event, a sentence, a segment? What are advantages and disadvantages of all three?
- How easily can different schemes be applied to texts? Are they easily trainable?
- Which schemes are the most portable? Can they be applied to both full papers and abstracts? Can they be applied to texts in different domains?
- How granular should annotation schemes be? What are the advantages/disadvantages of fine and coarse grained annotation categories?
- Can different schemes complement each other to provide different levels of information? Can different schemes be combined to give better results?
- How can we compare annotations, how do we decide which features, approaches, techniques work best?
- How do we exchange and evaluate each other’s annotations?
- How applicable are these efforts towards improved methods of publishing or summarizing science?
http://msda2011.wordpress.com/
Models of Scientific Discourse Annotation, Portland, OR, June 25
CoreSC References
Liakata, M. and Teufel, S. and Siddharthan, A. and Batchelor, C. 2010. Corpora for the conceptualisation and zoning of scientific papers. Proceedings of 7th International Conference on Language Resources and Evaluation, Malta.
Guo, Y. and Korhonen, A. and Liakata, M. and Silins, I. and Sun, L. and Stenius, U. 2010. Identifying the Information Structure of Scientific Abstracts: An Investigation of Three Different Schemes. Proceedings of BioNLP 2010, Uppsala, Sweden.
Liakata, M. and Q, Claire and Soldatova, L. 2009. Semantic Annotation of Papers: Interface & Enrichment Tool (SAPIENT). Proceedings of BioNLP-09, Boulder, Colorado.
Liakata M. and Soldatova L.N. 2008. Guidelines for the annotation of General Scientific Concepts. Aberystwyth University, JISC Project Report, http://ie-repository.jisc.ac.uk/88/.
Soldatova L.N and Liakata M. 2007. An ontology methodology and CISP - the proposed Core Information about Scientific Papers. JISC Project Report, http://ie-repository.jisc.ac.uk/137/.
Meta-Annotation References
Ananiadou, S., Thompson, P. and Nawaz, R. (2010). Improving Search Through Event-based Biomedical Text Mining. In Proceedings of First International Workshop on Automated Motif Discovery in Cultural Heritage and Scientific Communication Texts (AMICUS 2010).
Nawaz, R., Thompson, P., McNaught, J. and Ananiadou, S. (2010). Meta-Knowledge Annotation of Bio-Events. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), pp. 2498-2505
Nawaz, R., Thompson, P. and Ananiadou, S. (2010). Evaluating a Meta-Knowledge Annotation Scheme for Bio-Events. In Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, pp. 69-77
Nawaz, R., Thompson, P. and Ananiadou, S. (2010). Event Interpretation: A Step towards Event-Centred Text Mining. In Proceedings of the First International Workshop on Automated Motif Discovery in Cultural Heritage and Scientific Communication Texts (AMICUS 2010).
Discourse Segment References
de Waard, A. (2010d). The Story of Science: A syntagmatic/paradigmatic analysis of scientific text. Proceedings of the AMICUS Workshop, Vienna, Austria, October 2010.
de Waard, A., and Pander Maat, H. (2010). A Classification of Research Verbs to Facilitate Discourse Segment Identification in Biological Text. Proceedings of the Interdisciplinary Workshop on Verbs. The Identification and Representation of Verb Features, Pisa, Italy, November 4-5 2010.
de Waard, A. (2010c). The Future of the Journal? Integrating research data with scientific discourse. Logos vol. 21, issues 1-2, January 2011.
de Waard, A. (2010b). From Proteins to Fairytales: Directions in Semantic Publishing. IEEE Intelligent Systems 25(2): 83-88 (2010).
de Waard, A. (2010a). Realm Traversal In Biological Discourse: From Model To Experiment and back again. Workshop on Multidisciplinary Perspectives on Signalling Text Organisation (MAD 2010), March 17-20, 2010, Moissac, France.
de Waard, A. (2009b). Categorizing Epistemic Segment Types in Biology Research Articles. Workshop on Linguistic and Psycholinguistic Approaches to Text Structuring (LPTS 2009), September 21-23 2009. To be published as a chapter in Linguistic and Psycholinguistic Approaches to Text Structuring, Laure Sarda, Shirley Carter Thomas & Benjamin Fagard (eds), John Benjamins (planned for 2010).
de Waard, A., Simon Buckingham Shum, Annamaria Carusi, Jack Park, Matthias Samwald and Ágnes Sándor. (2009). Hypotheses, Evidence and Relationships: The HypER Approach for Representing Scientific Knowledge Claims. Proceedings of the Workshop on Semantic Web Applications in Scientific Discourse (SWASD 2009), co-located with the 8th International Semantic Web Conference (ISWC-2009).
de Waard, A., Buitelaar, P., & Eigner, T. (2009). Identifying the Epistemic Value of Discourse Segments in Biology Texts. In: Proceedings of the Eighth International Conference on Computational Semantics, Tilburg, The Netherlands, Jan. 7-9 2009.