Categorizing Epistemic Segment Types in Biology Research Articles Anita de Waard Elsevier Labs, Amsterdam UiL-OTS, Utrecht University 1 Thursday, September 17, 2009

Epistemics


Page 1: Epistemics

Categorizing Epistemic Segment Types in Biology Research Articles

Anita de Waard

Elsevier Labs, Amsterdam

UiL-OTS, Utrecht University

Thursday, September 17, 2009

Page 2: Epistemics

Introduction


Page 3: Epistemics

Why Study Biological Discourse?

- There is too much of it!

- Text mining and ‘fact extraction’ techniques are gaining ground to tame this tangle

- Emerging area of biological natural language processing (BioNLP): subfield of computational linguistics

- Main focus: identifying biological entities (genes, proteins, drugs) and their relationships


Page 4: Epistemics

Example state of the art: MEDIE

Previous studies have implicated miR-34a as a tumor suppressor gene whose transcription is activated by p53.

Alteration of nm23, P53, and S100A4 expression may contribute to the development of gastric

without some idea of the status of the sentence, it cannot be interpreted!


Page 5: Epistemics

How can linguistics help?

Underlying model of text mining systems:

- Scientific paper is ‘statement of pertinent facts’

- So: finding entities and relationships will give you a summary of the knowledge within the paper

- However, information extracted this way is not very useful....

Proposed approach: treat scientific paper as a persuasive text: specific genre, with genre characteristics and allowed persuasive techniques:

- ‘these results suggest’ (depersonification)

- ‘as fig. 2a shows’ (evidence is in the data)

- ‘oncogenes produce a stress response [Serrano, 2003]’

References and data form a “folded array of successive defense lines, behind which scientists ensconce themselves” [Latour, 1988]


Page 6: Epistemics

Modality Dropping

- Fact creation occurs through social acceptance: “[Y]ou can transform .. fiction into fact just by adding or subtracting references” [Latour, 1988]

- When references are cited the modality is dropped:

- A: ‘these results suggest/demonstrate/imply that’ X

- B: ‘A et al. have shown that X [A, 2009]’

- C: ‘X [2009]’

- D: ‘Since X, we investigated the possibility that Y’
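The four stages A to D can be illustrated with a toy classifier. This is only a sketch: the regular expressions and the function name `modality_stage` are assumptions made for illustration, not part of the analysis presented in these slides, and real text would need far richer patterns.

```python
import re

# Hypothetical surface patterns for stages A-D (assumptions for illustration only).
# Order matters: the first matching pattern wins.
STAGE_PATTERNS = [
    ("A", re.compile(r"\bthese (results|data|observations) (suggest|demonstrate|imply)\b", re.I)),
    ("B", re.compile(r"\bet al\.? (have|has) shown that\b", re.I)),
    ("D", re.compile(r"^since\b", re.I)),
    ("C", re.compile(r"\[[^\]]*\d{4}\]\s*$")),
]


def modality_stage(segment: str) -> str:
    """Return the label of the first stage whose pattern matches, else 'unknown'."""
    text = segment.strip()
    for stage, pattern in STAGE_PATTERNS:
        if pattern.search(text):
            return stage
    return "unknown"


if __name__ == "__main__":
    for example in [
        "These results suggest that X",
        "A et al. have shown that X [A, 2009]",
        "X [2009]",
        "Since X, we investigated the possibility that Y",
    ]:
        print(modality_stage(example), "<-", example)
```

Checking stage B before stage C keeps a cited attribution ("A et al. have shown that ...") from being mistaken for a bare claim followed by a reference.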


Page 7: Epistemics

Overall Research Questions

I. (How) can we add epistemic value to results from a text mining system?

II. How is a scientific fact created, as it moves from a hedged claim to a fact through successive citations?

III. Can we identify a rhetorically successful text (and help authors create them)?


Page 8: Epistemics

Present work:

Perform discourse analysis on a few selected texts in biology:

1. Parse text into discourse segments (edu’s) containing a single rhetorical move (if possible...)

2. Determine categories or types of discourse segments that have similar rhetorical/pragmatic properties

3. Look at a number of linguistic characteristics and see if these segments share those characteristics.


Page 9: Epistemics

Present research questions:

i. Can these segments indeed be grouped by linguistic characteristics (verb tense, verb registry, metadiscourse markers)?

ii. Does this offer a useful version of the structure of a paper?

iii. Is this useful for enabling automated epistemic markup?

iv. Can this help us to trace evolution of a hypothesis?


Page 10: Epistemics

Methods


Page 11: Epistemics

Method

1. Parse text into Discourse Segments (EDUs) according to syntactic criteria

2. Define set of semantic segment types

3. Identify semantic type for each segment

4. Specify linguistic and structural properties for each segment

5. Identify correlations between semantic type and structural/syntactic properties

6. Trace a hypothesis through the process of fact creation
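Step 5 amounts to cross-tabulating segment types against a linguistic property. A minimal sketch, assuming annotations are stored as (segment type, property value) pairs; the sample annotations and the function names below are hypothetical and are not data from the study.

```python
from collections import Counter

# Hypothetical annotations as (segment type, verb tense) pairs; NOT data from the study.
annotations = [
    ("Fact", "present"),
    ("Result", "past"),
    ("Result", "past"),
    ("Method", "past"),
    ("Implication", "present"),
    ("Fact", "present"),
]


def cross_tabulate(pairs):
    """Count how often each (segment type, property value) pair occurs."""
    return Counter(pairs)


def conditional_share(pairs, segment_type, value):
    """Share of segments of the given type that carry the given property value."""
    table = cross_tabulate(pairs)
    total = sum(n for (seg, _), n in table.items() if seg == segment_type)
    return table[(segment_type, value)] / total if total else 0.0


if __name__ == "__main__":
    print(cross_tabulate(annotations))
    print(conditional_share(annotations, "Result", "past"))
```

The same two functions work for any property listed in step 4 (section, verb class, metadiscourse marker), since only the second element of each pair changes.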


Page 12: Epistemics

Segmentation Criteria

Goal: ‘one new thought per segment’:

Figure 4A shows that following RASV12 stimulation, p53 was stabilized and activated, and its target gene, p21cip1, was induced in all cases, indicating an intact p53 pathway in these cells.

a. Figure 4a shows that
b. following RASV12 stimulation
c. p53 was stabilized and activated
d. and the target gene, p21cip1, was induced in all cases,
e. indicating an intact p53 pathway in these cells.


Page 13: Epistemics

Segmentation Criteria (summary)

Finite/Non-finite | Grammatical role | Segment? | Example

Finite/Non-finite | Subject | N | The extent to which miRNAs specifically affect metastasis

Finite/Non-finite | Direct Object | Y | these miRNAs are potential novel oncogenes

Nonfinite | Phrase-level adjunct (restrictive and non-restrictive) | N | spanning a given miRNA genomic region

Nonfinite | Clause-level adjunct | Y | by cloning eight miR-Vec plasmids

Finite | Non-restrictive Phrase-level adjunct | Y | which is only active when tamoxifen is added (De Vita et al, 2005) […]

Finite | Restrictive Phrase-level adjunct | N | that we examined

Finite | Clause-level adjunct | Y | which correlates with the reported ES-cell expression pattern of the miR-371-3 cluster (Suh et al, 2004)
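The summary table can be read as a lookup from (finiteness, grammatical role) to a yes/no segmentation decision. The sketch below transcribes the seven rows; the dictionary layout, the `"any"` key for rows marked Finite/Non-finite, and the function name `is_segment` are assumptions made for illustration. Finiteness and grammatical role are taken as given, since identifying them requires a syntactic parse that is outside this sketch.

```python
# Segmentation decisions transcribed from the summary table.
# Keys are (finiteness, grammatical role); "any" stands for rows marked Finite/Non-finite.
SEGMENT_RULES = {
    ("any", "subject"): False,
    ("any", "direct object"): True,
    ("nonfinite", "phrase-level adjunct"): False,  # restrictive and non-restrictive
    ("nonfinite", "clause-level adjunct"): True,
    ("finite", "non-restrictive phrase-level adjunct"): True,
    ("finite", "restrictive phrase-level adjunct"): False,
    ("finite", "clause-level adjunct"): True,
}


def is_segment(finiteness: str, role: str) -> bool:
    """Look up whether a clause with this finiteness and role forms its own segment."""
    role_key = role.strip().lower()
    for form in (finiteness.strip().lower(), "any"):
        decision = SEGMENT_RULES.get((form, role_key))
        if decision is not None:
            return decision
    raise KeyError(f"no rule for {finiteness!r} / {role!r}")


if __name__ == "__main__":
    print(is_segment("Nonfinite", "Clause-level adjunct"))  # by cloning eight miR-Vec plasmids
    print(is_segment("Finite", "Restrictive Phrase-level adjunct"))  # that we examined
```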


Page 14: Epistemics

Basic Segment Types

Segment | Description | Example

Fact | a known fact, generally without explicit citation | mature miR-373 is a homolog of miR-372

Hypothesis | a proposed idea, not supported by evidence | This could for instance be a result of high mdm2 levels

Problem | unresolved, contradictory, or unclear issue | However, further investigation is required to demonstrate the exact mechanism of LATS2 action

Goal | research goal | To identify novel functions of miRNAs,

Method | experimental method | Using fluorescence microscopy and luciferase assays,

Result | a restatement of the outcome of an experiment | all constructs yielded high expression levels of mature miRNAs

Implication | an interpretation of the results, in light of earlier hypotheses and facts | our procedure is sensitive enough to detect mild growth differences
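For annotation tooling, the seven basic types could be stored as an enumeration, with each annotated segment as a small record. This is a hypothetical representation: the class names `SegmentType` and `Segment` are assumptions, while the descriptions and example texts are copied from the table.

```python
from dataclasses import dataclass
from enum import Enum


class SegmentType(Enum):
    """The seven basic segment types, with the descriptions given in the table."""
    FACT = "a known fact, generally without explicit citation"
    HYPOTHESIS = "a proposed idea, not supported by evidence"
    PROBLEM = "unresolved, contradictory, or unclear issue"
    GOAL = "research goal"
    METHOD = "experimental method"
    RESULT = "a restatement of the outcome of an experiment"
    IMPLICATION = "an interpretation of the results, in light of earlier hypotheses and facts"


@dataclass(frozen=True)
class Segment:
    """One discourse segment with its assigned type (hypothetical record layout)."""
    text: str
    segment_type: SegmentType


examples = [
    Segment("mature miR-373 is a homolog of miR-372", SegmentType.FACT),
    Segment("To identify novel functions of miRNAs,", SegmentType.GOAL),
    Segment("all constructs yielded high expression levels of mature miRNAs", SegmentType.RESULT),
]

if __name__ == "__main__":
    for seg in examples:
        print(f"{seg.segment_type.name}: {seg.text}")
```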


Page 15: Epistemics

Two Types of Derived Segment Types

‘Other-segments’, related to (referenced) other work:

- other-result: ‘they are also found in the FCX and other cortical structures [Sokoloff et al., 1990]’

- other-goal: ‘the role of D3 receptors in the control of motivation and affect has been intensively studied [Heidbreder et al., 2005]’

- other-implication: ‘D1 or, more likely, D5, receptors have been implicated in mechanisms underlying long-term spatial memory [Hersi et al., 1995]’

Regulatory segments, acting as matrix sentences framing other segments:

- reg-hypothesis: ‘we hypothesized that’

- reg-implication: ‘These observations suggest that’

- intratextual: ‘Fig 4 shows that’

- intertextual: ‘reviewed in (Serrano, 1997)’


Page 16: Epistemics

My categories vs. Latour (1979)


Page 17: Epistemics

Linguistic and structural properties

1. Position in text

- Section of the paper (Introduction, Results, Discussion)

- Beginning/middle/end of section

- First/second/third part of sentence

2. Verb:

- Tense, aspect, voice

- Verb class (idiosyncratic)

- Lexicon

3. Metadiscourse markers [Hyland, 2003]:

- Connectives

- Endophorics, Evidentials

- Hedges, Boosters

- Person markers


Page 18: Epistemics

Verb class

Two types of entities interact in biology texts:

- Thing:

- Thing -> Increase, die, etc

- Thing-thing: affect, stimulate etc.

- People:

- People -> Thing:

- Examine (Goal)

- Operate (Method)

- Observe (Result)

- Implicate (Implication)

- People - people: Report
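The People -> Thing verb classes each point to a segment type, which suggests a simple two-step lookup: verb to class, class to segment type. The sketch below is an illustration only. The class-to-type pairs come from the slide; the individual verbs assigned to each class in `VERB_LEXICON` are assumptions, as is the function name.

```python
# Verb class -> segment type, as paired on the slide.
CLASS_TO_SEGMENT = {
    "examine": "Goal",
    "operate": "Method",
    "observe": "Result",
    "implicate": "Implication",
}

# Hypothetical lexicon: which verbs belong to which class is an assumption,
# except "increase", "die", "affect" and "stimulate", which the slide lists.
VERB_LEXICON = {
    "increase": "thing",
    "die": "thing",
    "affect": "thing-thing",
    "stimulate": "thing-thing",
    "examine": "examine",
    "investigate": "examine",
    "clone": "operate",
    "observe": "observe",
    "detect": "observe",
    "suggest": "implicate",
    "imply": "implicate",
    "report": "report",
}


def suggest_segment_type(verb):
    """Return the segment type suggested by the verb's class, or None if there is none."""
    verb_class = VERB_LEXICON.get(verb.strip().lower())
    return CLASS_TO_SEGMENT.get(verb_class)


if __name__ == "__main__":
    for verb in ["investigate", "clone", "detect", "suggest", "stimulate"]:
        print(verb, "->", suggest_segment_type(verb))
```

Verbs in the Thing, Thing-thing, and People - people classes return `None` here, because the slide does not tie them to a single segment type.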


Page 19: Epistemics

Results


Page 20: Epistemics

Two texts

1. Voorhoeve, 2006: Cell

- Cell biology text, written by group in Amsterdam

- Dealing with microRNAs - hot topic

- 290 citations in Google Scholar: successful paper!

2. Louiseau, 2008: European Neuropsychopharmacology

- Text on schizophrenia

- Prompted by interest from Pharma company

- Adjacent subfield of biology (neuropharmacology)


Page 21: Epistemics

Segment vs. Section


Page 22: Epistemics

Segment vs. Verb Type


Page 23: Epistemics

Segment vs. verb tense


Page 24: Epistemics

Segments vs. markers


Page 25: Epistemics

Segment Order


Page 26: Epistemics

Discussion


Page 27: Epistemics

Interpretation: 3 Realms of Science:

Experimental realm

Data realm

Conceptual realm

(1) Oncogene-induced senescence is characterized by the appearance of cells with a flat morphology that express senescence associated (SA)-β-Galactosidase.

(4b) transduction with either miR-Vec-371&2 or miR-Vec-373 prevents RASV12-induced growth arrest in primary human cells.

(2b) control RASV12-arrested cells showed relatively high abundance of flat cells expressing SA-β-Galactosidase

(3b) very few cells showed senescent morphology when transduced with either miR-Vec-371&2, miR-Vec-373, or control p53kd.

(4a) Altogether, these data show that

(3a) Consistent with the cell growth assay,

(Figures)

(2a) Indeed,

(2c) (Figures 2G and 2H).


Page 28: Epistemics

Tense 1: Concepts vs. Experiment

(1) Oncogene-induced senescence is characterized by the appearance of cells with a flat morphology that express senescence associated (SA)-β-Galactosidase.

(4b) transduction with either miR-Vec-371&2 or miR-Vec-373 prevents RASV12-induced growth arrest in primary human cells.

(2b) control RASV12-arrested cells showed relatively high abundance of flat cells expressing SA-β-Galactosidase

(3b) very few cells showed senescent morphology when transduced with either miR-Vec-371&2, miR-Vec-373, or control p53kd.

(4a) Altogether, these data show that

(3a) Consistent with the cell growth assay,

(Figures)

(2a) Indeed,

Experimental realm (personal, past)

Data realm (nonverbal)

(2c) (Figures 2G and 2H).

Concept realm


Page 29: Epistemics

Tense 2: Referral

Introduction

Current work (= Results section)

After current work: past

Before current work: present

Discussion

present past future

other papers

own paper

Other Work

After other work: past


Page 30: Epistemics

Tense 1+ 2 = 3:

Reading time

Experiential

Conceptual

past present future

Claim, fact

Experiment


Page 31: Epistemics

Discourse Fact-ory (diagram; labels as recovered from the figure)

- hypothetical realm: (might, would)

- realm of activity: (to test, to see)

- realm of models: present

- realm of experience: past

Segment labels: goal, method, result, implication, hypothesis, fact, problem

Connecting phrases: to, we, resulting in, suggests that

Views: Own view, Shared view

Sections: introduction, results, discussion


Page 32: Epistemics

Concepts: Known Fact, Known Fact

Experiment 1: Goal, Method, Data, Result

Experiment 2: Goal, Method, Data, Result

Citation and fact creation:

Hypothesis

To investigate the possibility that miR-372 and miR-373 suppress the expression of LATS2, we...

Implication

Therefore, these results point to LATS2 as a mediator of the miR-372 and miR-373 effects on cell proliferation and tumorigenicity,

Fact

two miRNAs, miRNA-372 and -373, function as potential novel oncogenes in testicular germ cell tumors by inhibition of LATS2 expression, which suggests that Lats2 is an important tumor suppressor (Voorhoeve et al., 2006).

Raver-Shapira et al., JMolCell 2007

miR-372 and miR-373 target the Lats2 tumor suppressor (Voorhoeve et al., 2006)

Yabuta, JBioChem 2007

Voorhoeve, 2006


Page 33: Epistemics

Answers to current research questions:

i. Can these segments indeed be identified?

✓ yes, adequate evidence, probably ok segments:

‣ need more annotators!

ii. Does this offer a useful version of the structure of a paper?

✓ yes, offers insight, and a possible model

‣ need to validate whether this structure holds over more papers, different subcategories

iii. Is this useful for enabling automated epistemic markup?

✓ first efforts seem promising: simple markers (‘suggest’ verbs, connectives, etc.) already help

‣ ongoing research! (Sandor, XRCE; Buitelaar, DERI)

iv. Can this help us to trace the evolution of a hypothesis?

✓ anecdotal: promising

‣ need to scale up!


Page 34: Epistemics

Where are we on overall research questions?

I. (How) can we add epistemic value to results from a text mining system?

‣ Segment types help - need to expand + verify

II. How is a scientific fact created, as it moves from a hedged claim to a fact through successive citations?

‣ Model is developing, also spurt of other work!

III. Can we identify a rhetorically successful text (and help authors create them)?

‣ Not addressed yet - verb tense, hedging seem important.


Page 35: Epistemics

Work on (biological) scientific discourse

- Is a growing field of interest!

- Several projects are developing that go ‘beyond the facts’

- Epistemic modality is becoming a term bioinformaticians are exploring

- Room for people who know about discourse analysis!
