Categorizing Epistemic Segment Types in Biology Research Articles
Anita de Waard
Elsevier Labs, Amsterdam
UiL-OTS, Utrecht University
Thursday, September 17, 2009
Introduction
Why Study Biological Discourse?
- There is too much of it!
- Text mining and ‘fact extraction’ techniques are gaining ground to tame this tangle
- Emerging area of biological natural language processing (BioNLP): subfield of computational linguistics
- Main focus: identifying biological entities (genes, proteins, drugs) and their relationships
Example state of the art: MEDIE
Previous studies have implicated miR-34a as a tumor suppressor gene whose transcription is activated by p53.
Alteration of nm23, P53, and S100A4 expression may contribute to the development of gastric
Without some idea of the status of the sentence, it cannot be interpreted!
How can linguistics help?
Underlying model of text mining systems:
- Scientific paper is ‘statement of pertinent facts’
- So: finding entities and relationships will give you a summary of the knowledge within the paper
- However, information extracted this way is not very useful....
Proposed approach: treat the scientific paper as a persuasive text, a specific genre with its own characteristics and allowed persuasive techniques:
- ‘these results suggest’ (depersonification)
- ‘as fig. 2a shows’ (evidence is in the data)
- ‘oncogenes produce a stress response [Serrano, 2003]’
References and data form a “folded array of successive defense lines, behind which scientists ensconce themselves” [Latour, 1988]
Modality Dropping
- Fact creation occurs through social acceptance: “[Y]ou can transform .. fiction into fact just by adding or subtracting references” [Latour, 1988]
- When references are cited the modality is dropped:
- A: ‘these results suggest/demonstrate/imply that’ X
- B: ‘A et al. have shown that X [A, 2009]’
- C: ‘X [2009]’
- D: ‘Since X, we investigated the possibility that Y’
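As a rough illustration of stages A to D above, the sketch below assigns a sentence to one stage from its surface form. The regular expressions and the function name `modality_stage` are assumptions made for this example, not rules from the analysis; real text would need a much richer pattern set.

```python
import re

# Illustrative patterns only (assumptions, not the author's rules).
# Order matters: B is checked before C because a stage-B sentence
# also ends in a bracketed citation.
STAGES = [
    ("A", re.compile(r"\bthese (?:results|data|observations) "
                     r"(?:suggest|demonstrate|imply)\b", re.I)),
    ("B", re.compile(r"\bet al\.? (?:have|has) (?:shown|demonstrated|reported) that\b", re.I)),
    ("D", re.compile(r"^\s*since\b", re.I)),
    ("C", re.compile(r"\[[^\]]*\d{4}\]\s*\.?\s*$")),
]


def modality_stage(sentence):
    """Return 'A', 'B', 'C' or 'D' for the first matching stage, else None."""
    for label, pattern in STAGES:
        if pattern.search(sentence):
            return label
    return None


if __name__ == "__main__":
    for s in ["These results suggest that X",
              "A et al. have shown that X [A, 2009]",
              "X [2009]",
              "Since X, we investigated the possibility that Y"]:
        print(modality_stage(s), "|", s)
```

The stage is read off the sentence alone; tracing one claim X across several papers would additionally require matching the content of X between sentences, which this sketch does not attempt.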
Overall Research Questions
I. (How) can we add epistemic value to results from a text mining system?
II. How is a scientific fact created, as it moves from a hedged claim to a fact through successive citations?
III. Can we identify a rhetorically successful text (and help authors create them)?
Present work:
Perform discourse analysis on a few selected texts in biology:
1. Parse text into discourse segments (EDUs) containing a single rhetorical move (if possible...)
2. Determine categories or types of discourse segments that have similar rhetorical/pragmatic properties
3. Look at a number of linguistic characteristics and see if these segments share those characteristics.
Present research questions:
i. Can these segments indeed be grouped by linguistic characteristics (verb tense, verb registry, metadiscourse markers)?
ii. Does this offer a useful version of the structure of a paper?
iii. Is this useful for enabling automated epistemic markup?
iv. Can this help us to trace evolution of a hypothesis?
Methods
Method
1. Parse text into Discourse Segments (EDUs) according to syntactic criteria
2. Define set of semantic segment types
3. Identify semantic type for each segment
4. Specify linguistic and structural properties for each segment
5. Identify correlations between semantic type and structural/syntactic properties
6. Trace a hypothesis through the process of fact creation
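Steps 3 to 5 can be pictured as annotating each segment with a type and some properties, then counting how often the two co-occur. The sketch below is a minimal version under stated assumptions: the `Segment` record, the `cross_tabulate` helper, and the tense labels attached to the three example segments are illustrative and were added for this example; they are not the annotation scheme or data used in the study.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One annotated discourse segment (hypothetical record layout)."""
    text: str
    segment_type: str  # e.g. "Fact", "Result", "Implication"
    tense: str         # e.g. "present", "past"


def cross_tabulate(segments):
    """Count co-occurrences of (segment type, verb tense)."""
    return Counter((s.segment_type, s.tense) for s in segments)


# Toy annotation: texts are examples from the slides, tense labels are ours.
EXAMPLE_SEGMENTS = [
    Segment("mature miR-373 is a homolog of miR-372", "Fact", "present"),
    Segment("all constructs yielded high expression levels of mature miRNAs",
            "Result", "past"),
    Segment("our procedure is sensitive enough to detect mild growth differences",
            "Implication", "present"),
]

if __name__ == "__main__":
    for (seg_type, tense), n in sorted(cross_tabulate(EXAMPLE_SEGMENTS).items()):
        print(f"{seg_type:12s} {tense:8s} {n}")
```

With a realistically sized annotated corpus, the same table could feed a standard association test; three segments are only enough to show the shape of the data.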
Segmentation Criteria
Goal: ‘one new thought per segment’:
Figure 4A shows that following RASV12 stimulation, p53 was stabilized and activated, and its target gene, p21cip1, was induced in all cases, indicating an intact p53 pathway in these cells.
a. Figure 4A shows that
b. following RASV12 stimulation,
c. p53 was stabilized and activated,
d. and its target gene, p21cip1, was induced in all cases,
e. indicating an intact p53 pathway in these cells.
Segmentation Criteria (summary)
Finite/Non-finite | Grammatical role | Segment? | Example
Finite/Non-finite | Subject | N | The extent to which miRNAs specifically affect metastasis
Finite/Non-finite | Direct object | Y | these miRNAs are potential novel oncogenes
Nonfinite | Phrase-level adjunct (restrictive and non-restrictive) | N | spanning a given miRNA genomic region
Nonfinite | Clause-level adjunct | Y | by cloning eight miR-Vec plasmids
Finite | Non-restrictive phrase-level adjunct | Y | which is only active when tamoxifen is added (De Vita et al, 2005) […]
Finite | Restrictive phrase-level adjunct | N | that we examined
Finite | Clause-level adjunct | Y | which correlates with the reported ES-cell expression pattern of the miR-371-3 cluster (Suh et al, 2004)
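The summary table can be read as a small decision rule. The sketch below encodes it as a function, assuming that the grammatical role, finiteness, and restrictiveness of a constituent have already been determined by a parser or annotator; the function name `is_segment` and the role labels are hypothetical.

```python
def is_segment(role, finite, restrictive=False):
    """Decide whether a constituent starts its own discourse segment.

    role:        "subject", "direct_object", "phrase_adjunct" or "clause_adjunct"
    finite:      True for a finite clause, False for a nonfinite one
    restrictive: only meaningful for phrase-level adjuncts
    """
    if role == "subject":
        return False          # never split off, finite or not
    if role == "direct_object":
        return True           # always split off, finite or not
    if role == "clause_adjunct":
        return True           # split off, finite or not
    if role == "phrase_adjunct":
        # Only finite, non-restrictive phrase-level adjuncts are split off.
        return finite and not restrictive
    raise ValueError(f"unknown grammatical role: {role!r}")


if __name__ == "__main__":
    # "by cloning eight miR-Vec plasmids": nonfinite clause-level adjunct
    print(is_segment("clause_adjunct", finite=False))
    # "that we examined": finite restrictive phrase-level adjunct
    print(is_segment("phrase_adjunct", finite=True, restrictive=True))
```

The hard part in practice is producing the three inputs reliably; the rule itself is a direct transcription of the rows above.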
Basic Segment Types
Segment | Description | Example
Fact | a known fact, generally without explicit citation | mature miR-373 is a homolog of miR-372
Hypothesis | a proposed idea, not supported by evidence | This could for instance be a result of high mdm2 levels
Problem | unresolved, contradictory, or unclear issue | However, further investigation is required to demonstrate the exact mechanism of LATS2 action
Goal | research goal | To identify novel functions of miRNAs,
Method | experimental method | Using fluorescence microscopy and luciferase assays,
Result | a restatement of the outcome of an experiment | all constructs yielded high expression levels of mature miRNAs
Implication | an interpretation of the results, in light of earlier hypotheses and facts | our procedure is sensitive enough to detect mild growth differences
Two Types of Derived Segment Types
‘Other-segments’, related to (referenced) other work:
- other-result: ‘they are also found in the FCX and other cortical structures [Sokoloff et al., 1990]’
- other-goal: ‘the role of D3 receptors in the control of motivation and affect has been intensively studied [Heidbreder et al., 2005]’
- other-implication: ‘D1 or, more likely, D5, receptors have been implicated in mechanisms underlying long-term spatial memory [Hersi et al., 1995]’
Regulatory segments, acting as matrix sentences framing other segments:
- reg-hypothesis: ‘we hypothesized that’
- reg-implication: ‘These observations suggest that’
- intratextual: ‘Fig 4 shows that’
- intertextual: ‘reviewed in (Serrano, 1997)’
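Because regulatory segments act as matrix clauses, one simple way to picture them is as a frame that can be peeled off the front of a sentence, leaving the framed segment behind. The sketch below does this for three of the regulatory types listed above. The patterns and the name `split_regulatory` are assumptions for illustration only, built from the example phrases on this slide.

```python
import re

# Illustrative frame patterns (assumptions), each capturing the frame
# and the remainder of the sentence.
REGULATORY_FRAMES = [
    ("reg-hypothesis",
     re.compile(r"^(we hypothesi[sz]ed that)\s*(.*)$", re.I)),
    ("reg-implication",
     re.compile(r"^(these (?:observations|results|data) suggest that)\s*(.*)$", re.I)),
    ("intratextual",
     re.compile(r"^(fig(?:ure|\.)?\s*\w+ shows that)\s*(.*)$", re.I)),
]


def split_regulatory(sentence):
    """Return (frame_type, frame_text, framed_text).

    frame_type is None and frame_text is empty when no frame is found.
    """
    stripped = sentence.strip()
    for label, pattern in REGULATORY_FRAMES:
        match = pattern.match(stripped)
        if match:
            return label, match.group(1), match.group(2)
    return None, "", stripped


if __name__ == "__main__":
    print(split_regulatory("These observations suggest that X"))
    print(split_regulatory("Fig 4 shows that X"))
```

Intertextual frames such as ‘reviewed in (Serrano, 1997)’ usually trail the framed segment rather than precede it, so they would need a separate, end-anchored pattern.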
My categories vs. Latour (1979)
Linguistic and structural properties
1. Position in text
- Section of the paper (Introduction, Results, Discussion)
- Beginning/middle/end of section
- First/second/third part of sentence
2. Verb:
- Tense, aspect, voice
- Verb class (idiosyncratic)
- Lexicon
3. Metadiscourse markers [Hyland, 2003]:
- Connectives
- Endophorics, Evidentials
- Hedges, Boosters
- Person markers
Verb class
Two types of entities interact in biology texts:
- Thing:
- Thing -> Increase, die, etc
- Thing-thing: affect, stimulate etc.
- People:
- People -> Thing:
- Examine (Goal)
- Operate (Method)
- Observe (Result)
- Implicate (Implication)
- People - people: Report
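The People -> Thing classes above pair each verb class with a segment type, which suggests a two-step lookup: verb to class, then class to segment type. The sketch below shows that shape. The class-to-type mapping follows the slide; the four-verb lexicon and the name `segment_type_for` are hypothetical and stand in for the (idiosyncratic) verb lexicon, which is not reproduced here.

```python
# Verb class -> segment type, as listed for People -> Thing verbs.
VERB_CLASS_TO_SEGMENT = {
    "examine": "Goal",
    "operate": "Method",
    "observe": "Result",
    "implicate": "Implication",
}

# Hypothetical mini-lexicon (verb lemma -> verb class), for illustration only.
VERB_LEXICON = {
    "investigate": "examine",
    "clone": "operate",
    "observe": "observe",
    "suggest": "implicate",
}


def segment_type_for(verb_lemma):
    """Return the segment type suggested by a verb lemma, or None if unknown."""
    verb_class = VERB_LEXICON.get(verb_lemma.lower())
    if verb_class is None:
        return None
    return VERB_CLASS_TO_SEGMENT.get(verb_class)


if __name__ == "__main__":
    for lemma in ["investigate", "clone", "suggest", "increase"]:
        print(lemma, "->", segment_type_for(lemma))
```

A verb class is only one cue among several; in the approach described here it would be combined with tense, position, and metadiscourse markers rather than used alone.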
Results
Two texts
1. Voorhoeve, 2006: Cell
- Cell biology text, written by group in Amsterdam
- Dealing with microRNAs - hot topic
- 290 citations in Google Scholar: successful paper!
2. Louiseau, 2008: European Neuropsychopharmacology
- Text on schizophrenia
- Prompted by interest from Pharma company
- Adjacent subfield of biology (neuropharmacology)
Segment vs. Section
Segment vs. Verb Type
Segment vs. verb tense
Segments vs. markers
Segment Order
Discussion
Interpretation: 3 Realms of Science:
Experimental realm
Data realm
Conceptual realm
(1) Oncogene-induced senescence is characterized by the appearance of cells with a flat morphology that express senescence associated (SA)-β-Galactosidase.
(4b) transduction with either miR-Vec-371&2 or miR-Vec-373 prevents RASV12-induced growth arrest in primary human cells.
(2b) control RASV12-arrested cells showed relatively high abundance of flat cells expressing SA-β-Galactosidase
(3b) very few cells showed senescent morphology when transduced with either miR-Vec-371&2, miR-Vec-373, or control p53kd.
(4a) Altogether, these data show that
(3a) Consistent with the cell growth assay,
(Figures)
(2a) Indeed,
(2c) (Figures 2G and 2H).
Tense 1: Concepts vs. Experiment
(1) Oncogene-induced senescence is characterized by the appearance of cells with a flat morphology that express senescence associated (SA)-β-Galactosidase.
(4b) transduction with either miR-Vec-371&2 or miR-Vec-373 prevents RASV12-induced growth arrest in primary human cells.
(2b) control RASV12-arrested cells showed relatively high abundance of flat cells expressing SA-β-Galactosidase
(3b) very few cells showed senescent morphology when transduced with either miR-Vec-371&2, miR-Vec-373, or control p53kd.
(4a) Altogether, these data show that
(3a) Consistent with the cell growth assay,
(Figures)
(2a) Indeed,
(2c) (Figures 2G and 2H).
Experimental realm (personal, past)
Data realm (nonverbal)
Concept realm
Tense 2: Referral
Introduction
Current work (= Results section)
After current work: past
Before current work: present
Discussion
present past future
other papers
own paper
Other Work
After other work: past
Tense 1 + 2 = 3:
Reading time
Experiential
Conceptual
past present future
Claim, fact
Experiment
Discourse Fact-ory
(Diagram labels)
Realms:
- hypothetical realm: (might, would)
- realm of activity: (to test, to see)
- realm of models: present
- realm of experience: past
Segments: problem, hypothesis, goal, method, result, implication, fact
Connecting phrases: ‘to’, ‘we’, ‘resulting in’, ‘suggests that’
Views: own view, shared view
Sections: introduction, results, discussion
Citation and fact creation:
(Diagram labels: Concepts: KnownFact, KnownFact. Experiment 1: Goal, Method, Data, Result. Experiment 2: Goal, Method, Data, Result.)
Hypothesis
To investigate the possibility that miR-372 and miR-373 suppress the expression of LATS2, we...
Implication
Therefore, these results point to LATS2 as a mediator of the miR-372 and miR-373 effects on cell proliferation and tumorigenicity,
Fact
two miRNAs, miRNA-372 and-373, function as potential novel oncogenes in testicular germ cell tumors by inhibition of LATS2 expression, which suggests that Lats2 is an important tumor suppressor (Voorhoeve et al., 2006).
Raver-Shapira et al., JMolCell 2007
miR-372 and miR-373 target the Lats2 tumor suppressor (Voorhoeve et al., 2006)
Yabuta, JBioChem 2007
Voorhoeve, 2006
Answers to current research questions:
i. Can these segments indeed be identified?
✓ yes, adequate evidence, probably ok segments:
‣ need more annotators!
ii. Does this offer a useful version of the structure of a paper?
✓ yes, offers insight, and a possible model
‣ need to validate whether this structure holds over more papers and different subcategories
iii. Is this useful for enabling automated epistemic markup?
✓ first efforts seem promising: simple markers (‘suggest’ verbs, connectives, etc.) already help
‣ ongoing research! (Sandor, XRCE; Buitelaar, DERI)
iv. Can this help us to trace the evolution of a hypothesis?
✓ anecdotal: promising
‣ need to scale up!
Where are we on overall research questions?
I. (How) can we add epistemic value to results from a text mining system?
‣ Segment types help - need to expand + verify
II. How is a scientific fact created, as it moves from a hedged claim to a fact through successive citations?
‣ Model is developing, also spurt of other work!
III. Can we identify a rhetorically successful text (and help authors create them)?
‣ Not addressed yet - verb tense, hedging seem important.
Work on (biological) scientific discourse
- Is a growing field of interest!
- Several projects are developing that go ‘beyond the facts’
- Epistemic modality is becoming a term bioinformaticians are exploring
- Room for people who know about discourse analysis!