
UNIVERSITY OF CALIFORNIA

Santa Barbara

Computing Representations of the Structure of Written Discourse

A Dissertation submitted in partial satisfaction of the requirements for the degree of

Doctor of Philosophy

in

Linguistics

by

Simon Henderson Corston-Oliver

Committee in charge:

Professor Susanna Cumming, Chairperson

Dr. William Dolan

Professor Carol Genetti

Professor Sandra Thompson

March 1998


The dissertation of Simon Corston-Oliver is approved

Committee Chairperson

June 12, 1998


31 March 1998

Copyright by

Simon Corston-Oliver

1998


To Mo


Simon Henderson Corston-Oliver

Curriculum vitae

6 May, 2023

Date and place of birth

22 March 1969, Christchurch, New Zealand.

Education

1998 Doctor of Philosophy, Linguistics, University of California, Santa Barbara, U.S.A. Thesis: “Computing Representations of the Structure of Written Discourse.”

1993-1994 Education abroad. Study of Mandarin Chinese at Beijing Daxue (Peking University), People’s Republic of China.

1993 Master of Arts with First Class Honors, Linguistics, Auckland University, New Zealand. Thesis: ‘Ergativity in Roviana.’

1991 Bachelor of Arts, Linguistics, Auckland University, New Zealand.


Awards

1994-1996 Special Regents’ Fellowship, University of California, Santa Barbara, U.S.A.

1994 Junior Fellowship, Interdisciplinary Humanities Center, University of California, Santa Barbara, U.S.A.

1993 New Zealand - China Exchange Program Scholarship for study at Beijing Daxue (Peking University), People's Republic of China. Ministry of External Relations and Trade, New Zealand Government.

1993 Auckland University Graduate Scholarship, Auckland University, New Zealand.

1993 Department research grant, Department of Linguistics, Auckland University, New Zealand, for research on Roviana.

1992, 1993 Auckland University Graduate Scholarship, Auckland University, New Zealand.

1992 Senior Scholar in Linguistics, Auckland University, New Zealand.

1990 Annual Prize in Linguistics, Auckland University, New Zealand.

Publications

Corston, Simon H. 1993. ‘On the interactive nature of spontaneous oral narrative.’ Te Reo 36:69-97.

Corston, Simon H. 1993. Ergativity in Roviana. M.A. Thesis. Auckland University, New Zealand.

Corston, Simon H. 1996. Ergativity in Roviana, Solomon Islands. Pacific Linguistics, Series B-113. Australian National University Press: Canberra.

Corston-Oliver, Simon H. To appear. ‘Beyond string matching and cue phrases: Improving efficiency and coverage in discourse analysis.’ Proceedings of the AAAI Spring Symposium on Intelligent Text Summarization, March 23-25, 1998.

Corston-Oliver, Simon H. To appear. ‘Roviana.’ In Crowley, Terry, John Lynch and Malcolm Ross (eds.) Oceanic Languages. Edinburgh: Edinburgh University Press.

Corston-Oliver, Simon H. To appear. ‘The marking of core arguments and the inversion of the Nominal Hierarchy in Roviana’. In Proceedings of the Conference on Preferred Argument Structure: The Next Generation. Kumpf, Lorraine E. and John W. Dubois (eds.)


Selected professional experience

1996- Computational linguist, Microsoft Research, Redmond, WA, U.S.A.

1995-1996 Computer laboratory technician, Linguistics Department, University of California, Santa Barbara, U.S.A.

1994-1995 Computer programmer, Corpus of Spoken American English, University of California, Santa Barbara, U.S.A.

1990-1994 Computer programmer, Lockie Computing, Auckland, New Zealand.

1993 Teaching assistant to Professor Frank Lichtenberk, Department of Linguistics, Auckland University, Auckland, New Zealand.


ABSTRACT

Computing Representations of the Structure of Written Discourse

Simon Corston-Oliver

RASTA (Rhetorical Structure Theory Analyzer), a discourse analysis component within the Microsoft English Grammar, efficiently computes representations of the structure of written discourse using cue phrases and additional information available in syntactic and logical form analyses of a text. RASTA heuristically scores the rhetorical relations that it hypothesizes, using those scores to guide it in producing more plausible discourse representations before less plausible ones. The heuristic scores also provide a genre-independent method for evaluating competing discourse analyses: the best discourse analyses are those constructed from the strongest hypotheses.

This dissertation describes in detail a set of linguistic cues that can be identified in a text as evidence of discourse relations, and gives complete and explicit algorithms for identifying the terminal nodes of a discourse analysis and for efficiently combining those terminal nodes to form hierarchical representations of discourse structure.


TABLE OF CONTENTS

1. Introduction

2. Data

3. Rhetorical Structure Theory
    3.1 Introduction
    3.2 Overview
    3.3 Conditions on the structure of trees
    3.4 Formalizing the relations
    3.5 The set of relations
    3.6 Underspecified Rhetorical Structure Theory
    3.7 Schemas
    3.8 Conclusion

4. Previous Work on Computing Discourse Representations
    4.1 Introduction
    4.2 Rhetorical Structure Theory
        4.2.1 Mann and Thompson (1986, 1988)
        4.2.2 Sumita et al. (1992) and Ono et al. (1994)
        4.2.3 Kurohashi and Nagao (1994)
        4.2.4 Fukumoto and Tsujii (1994)
        4.2.5 Wu and Lytinen (1990)
        4.2.6 Marcu (1996, 1997a)
    4.3 PISA
    4.4 Hobbs (1979)
    4.5 The Linguistic Discourse Model (LDM)
    4.6 Conclusion

5. The Microsoft English Grammar
    5.1 Introduction
    5.2 Lexicon
    5.3 Syntax
        5.3.1 Sketch
        5.3.2 Portrait
    5.4 Logical Form
    5.5 Word Sense Disambiguation
    5.6 Discourse
    5.7 MINDNET
    5.8 Conclusion

6. Cues to Discourse Structure
    6.1 Introduction
    6.2 Correlations between clausal status and rhetorical status
    6.3 The role of anaphora, deixis and referential continuity
    6.4 Heuristic scores
    6.5 Necessary criteria and cues
    6.6 Dependence on a set of relations
    6.7 Cues to the relations
        6.7.1 ASYMMETRICCONTRAST
        6.7.2 CAUSE
        6.7.3 CIRCUMSTANCE
        6.7.4 CONCESSION
        6.7.5 CONDITION
        6.7.6 CONTRAST
        6.7.7 ELABORATION
        6.7.8 JOINT
        6.7.9 LIST
        6.7.10 MEANS
        6.7.11 PURPOSE
        6.7.12 RESULT
        6.7.13 SEQUENCE

7. Constructing Trees
    7.1 Introduction
    7.2 The need for an improved algorithm
    7.3 Identify terminal nodes
    7.4 Posit hypotheses
    7.5 Construct trees
        7.5.1 Promotion sets
        7.5.2 Group mutually exclusive hypotheses
        7.5.3 Produce and rank binary-branching trees
        7.5.4 Produce n-ary branching trees
        7.5.5 Learning the heuristic scores
    7.6 Worked example

8. RASTA’s contributions to the field
    8.1 Introduction
    8.2 Identifying rhetorical relations
    8.3 Representations of knowledge
    8.4 Constructing and evaluating trees
    8.5 Genre

9. Potential Applications for RASTA
    9.1 Introduction
    9.2 Text summarization
    9.3 The creation of semantic networks
    9.4 Information retrieval
    9.5 Quantitative analysis of discourse patterns

10. Conclusion


TABLE OF FIGURES

Figure 1 Waterloo, Battle of
Figure 2 Pseudepigrapha
Figure 3 Trafalgar, Battle of
Figure 4 Definition of the VOLITIONAL CAUSE relation
Figure 5 Taxonomy of discourse relations
Figure 6 Echidna
Figure 7 RST schemas (Mann and Thompson 1988:247)
Figure 8 Alternative discourse structures
Figure 9 Prince Edward Island
Figure 10 Alternative formulations of the same propositional content
Figure 11 Syntactic sketch produced by MEG
Figure 12 Underlying data structure for the sketch
Figure 13 Syntactic portrait produced by MEG
Figure 14 Logical form produced by MEG
Figure 15 Labels used in the logical form
Figure 16 Resolution of reflexive pronoun
Figure 17 Resolution of personal pronoun
Figure 18 Data structure underlying the node drive1
Figure 19 Echidna
Figure 20 The Subordinate Clause Condition
Figure 21 Necessary criteria for the ASYMMETRICCONTRAST relation
Figure 22 Cue to the ASYMMETRICCONTRAST relation
Figure 23 Aardwolf
Figure 24 Bossuet, Jacques Bénigne
Figure 25 Argon
Figure 26 Textiles
Figure 27 Cues to the CAUSE relation
Figure 28 Syrdarya
Figure 29 Pregnancy and childbirth
Figure 30 Necessary criteria for the CAUSE relation when the Subordinate Clause Condition is not satisfied
Figure 31 Cues to the CAUSE relation
Figure 32 Species and speciation
Figure 33 Segregation in the United States
Figure 34 Soil management
Figure 35 Cues to the CIRCUMSTANCE relation
Figure 36 Abiathar
Figure 37 Africa
Figure 38 Trafalgar, Battle of
Figure 39 Acuff, Roy
Figure 40 Cue to the CONCESSION relation
Figure 41 Renaissance Art and Literature
Figure 42 Aardvark
Figure 43 Adventists
Figure 44 Abolitionists
Figure 45 Cue to the CONDITION relation
Figure 46 Prince Edward Island
Figure 47 Prince Edward Island: Syntactic analysis
Figure 48 Pregnancy and Childbirth
Figure 49 Necessary criteria for the CONTRAST relation work-around
Figure 50 Cue to the CONTRAST relation work-around
Figure 51 Textiles
Figure 52 Necessary criteria for the CONTRAST relation
Figure 53 Cues for the CONTRAST relation
Figure 54 Abbess
Figure 55 Primus, Pearl
Figure 56 Aardwolf
Figure 57 Abrasives
Figure 58 Necessary criteria for the ELABORATION relation
Figure 59 Cues to the ELABORATION relation
Figure 60 Aardwolf
Figure 61 Stem
Figure 62 Necessary criteria for the JOINT relation
Figure 63 Religion
Figure 64 Pregnancy and childbirth
Figure 65 Necessary criteria for the LIST relation
Figure 66 Cues to the LIST relation
Figure 67 Psychotherapy
Figure 68 Echidna
Figure 69 Cue to the MEANS relation
Figure 70 Pre-Columbian Art and Architecture
Figure 71 Cues to the PURPOSE relation
Figure 72 Ransome, Arthur Michell
Figure 73 Cues to the RESULT relation
Figure 74 Misparse of a detached participial clause
Figure 75 Waterloo, Battle of
Figure 76 Ramsey, Norman Foster
Figure 77 God
Figure 78 Necessary criteria for the RESULT relation when the Subordinate Clause Condition is not satisfied
Figure 79 Cues for the RESULT relation when the Subordinate Clause Condition is not satisfied
Figure 80 Speech and Speech Disorders
Figure 81 Propane
Figure 82 Necessary criteria for the SEQUENCE relation
Figure 83 Logical form illustrating negative polarity
Figure 84 Acquired Immune Deficiency Syndrome
Figure 85 Moissan, Ferdinand-Frederic-Henri
Figure 86 Abacha, Sani
Figure 87 Cues for the SEQUENCE relation

Figure 89 World War II
Figure 90 Compare dates

Figure 92 Criteria for an RST terminal node
Figure 93 Aardvark
Figure 94 Data structure of a hypothesized symmetrical rhetorical relation
Figure 95 Data structure of a hypothesized asymmetric rhetorical relation
Figure 96 Binary-branching tree for Abd-ar-Rahman excerpt
Figure 97 Binary-branching tree for Aardwolf excerpt
Figure 98 Data structure of an underspecified asymmetric rhetorical relation
Figure 99 Pseudo-code for constructing RST trees
Figure 100 Corresponding binary and n-ary branching symmetric RST trees
Figure 101 Corresponding binary and n-ary branching asymmetric RST trees
Figure 102 Corresponding binary and n-ary branching complex RST trees
Figure 103 Pseudo-code for the function BinaryToNaryTree
Figure 104 Rankings of RST trees
Figure 105 Aardwolf
Figure 106 Analysis of the first sentence
Figure 107 Analysis of the second sentence
Figure 108 Analysis of the third sentence
Figure 109 Analysis of the fourth sentence
Figure 110 Bags for the excerpt
Figure 111 Hypothesized relations for the excerpt
Figure 112 Terminal nodes and initial projections
Figure 113 Contents of RSTNODES after applying hypothesis 4

Figure 116 Contents of RSTNODES after further processing
Figure 117 First complete RST tree for Aardwolf excerpt
Figure 118 Aardwolf
Figure 119 Hypertext view of Abd-ar-Rahman text
Figure 120 Conjunctivitis
Figure 121 Hypertext view of conjunctivitis text
Figure 122 Hypertext view of conjunctivitis text


1. Introduction

This dissertation describes a system for computing representations of the structure of written discourse. This system, RASTA (Rhetorical Structure Theory Analyzer), takes as its input a written representation of a text and produces as its output a representation of the structure of that text in the form of an n-ary branching tree of the kind used within Rhetorical Structure Theory (henceforth RST) (Mann and Thompson 1986, 1988).

Sanders and van Wijk (1996:91) note that “Existing models for text structure analysis tend to rely heavily on analysts’ intuitions and world knowledge, and they are hardly formulated explicitly enough to be applied in an objective and reliable way”. Computers are of course notorious for their lack of linguistic intuition. If a computer is to identify discourse structure, we therefore require maximally explicit algorithms. These algorithms ought also to be efficient if a computational discourse analyzer is to have any utility. This dissertation therefore addresses two distinct problems:

1. How can we automatically identify linguistic cues to discourse structure?

2. How can a discourse module efficiently construct plausible representations of discourse structure on the basis of those cues?

Due to the emphasis on natural language generation in computational work on RST, neither of these issues has received much attention in the field of computational linguistics.

It has been widely assumed, or even asserted, that reasoning beyond textual form is needed to compute a representation of the structure of a text (see sections 4 and 8.2). In contrast, the development of RASTA has been guided by a functionalist approach to analyzing language. Writers employ linguistic resources (morphology, the lexicon, syntax) to realize their communicative goals. As these linguistic resources are employed, a text is molded, taking on a specific form from which it is possible to infer the writer’s communicative goals. The first stage of RASTA’s operation addresses the first problem, “How can we automatically identify linguistic cues to discourse structure?”, from this functionalist perspective. RASTA examines the syntactic analysis and logical form analysis of a text, considering such cues to discourse structure as cue phrases, tense, aspect, polarity and referential continuity of noun phrases. On the basis of these cues, RASTA posits discourse relations between clauses, associating with each relation a heuristic score that reflects a relative confidence in the plausibility of that relation. When RASTA has finished hypothesizing discourse relations, it commences the second stage of its analysis, a stage that addresses the second problem, “How can a discourse module efficiently construct plausible representations of discourse structure on the basis of those cues?”
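As an illustration only, the first stage can be pictured as a mapping from the cues observed for a pair of clauses to a set of scored hypotheses. The sketch below is not RASTA's implementation. The `Hypothesis` record, the cue names, the numeric weights and the decision to sum the weights of co-occurring cues are all assumptions made for the example; the cues and scores that RASTA actually uses are the subject of chapter 6.

```python
from dataclasses import dataclass

# Hypothetical cue weights, invented purely for illustration.
# Keys pair a relation name with the name of a cue that supports it.
CUE_WEIGHTS = {
    ("CONTRAST", "cue_phrase:but"): 10,
    ("CONTRAST", "polarity_differs"): 5,
    ("SEQUENCE", "cue_phrase:then"): 10,
    ("SEQUENCE", "same_tense"): 5,
}


@dataclass(frozen=True)
class Hypothesis:
    relation: str  # name of the hypothesized discourse relation
    left: int      # index of the first clause
    right: int     # index of the second clause
    score: int     # heuristic score (assumed here to be a sum of cue weights)


def posit(left, right, observed_cues):
    """Return scored hypotheses for a clause pair, strongest first."""
    totals = {}
    for (relation, cue), weight in CUE_WEIGHTS.items():
        if cue in observed_cues:
            totals[relation] = totals.get(relation, 0) + weight
    hypotheses = [Hypothesis(r, left, right, s) for r, s in totals.items()]
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)
```

A call such as `posit(0, 1, {"cue_phrase:but", "polarity_differs", "same_tense"})` yields a CONTRAST hypothesis ranked above a SEQUENCE hypothesis, because two of the invented cues support CONTRAST and only one supports SEQUENCE.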

During the second stage, RASTA assembles well-formed RST trees that are compatible with the posited discourse relations. RASTA applies the posited discourse relations with high heuristic scores before those with lower heuristic scores in a bottom-up manner, grouping contiguous clauses into a hierarchical representation. Because RASTA is guided by the heuristic scores, it rapidly converges on the best discourse analyses for a text.
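The score-ordered, bottom-up grouping of contiguous clauses can be sketched as follows. This is a deliberately simplified illustration under stated assumptions, not the algorithm of chapter 7: the `Node` type, the `build_forest` function and the tuple layout of a hypothesis are hypothetical, the sketch builds a single binary-branching analysis greedily, and it ignores promotion sets, mutually exclusive hypotheses and the conversion to n-ary branching trees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    relation: str    # relation joining the two children
    children: tuple  # (left subtree, right subtree); a leaf is a clause index
    start: int       # index of the first clause covered
    end: int         # index of the last clause covered


def span(tree):
    """Return the (first, last) clause indices covered by a leaf or a Node."""
    return (tree, tree) if isinstance(tree, int) else (tree.start, tree.end)


def build_forest(n_clauses, hypotheses):
    """Greedily merge adjacent subtrees, strongest hypothesis first.

    Each hypothesis is a tuple (score, relation, a, b) with a < b, meaning
    that `relation` is posited between clause a and clause b.
    """
    forest = list(range(n_clauses))  # initially one leaf per clause
    for score, relation, a, b in sorted(hypotheses, key=lambda h: h[0], reverse=True):
        for i in range(len(forest) - 1):
            left, right = forest[i], forest[i + 1]
            (l_start, l_end), (r_start, r_end) = span(left), span(right)
            # Apply the hypothesis only to contiguous subtrees containing a and b.
            if l_start <= a <= l_end and r_start <= b <= r_end:
                forest[i:i + 2] = [Node(relation, (left, right), l_start, r_end)]
                break
    return forest
```

For three clauses with hypotheses `(20, "ELABORATION", 1, 2)` and `(10, "CONTRAST", 0, 1)`, the stronger ELABORATION hypothesis groups clauses 1 and 2 first, and the CONTRAST hypothesis then joins clause 0 to that group, leaving a single tree that spans the text.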

Human readers, no doubt, employ knowledge outside of a text to aid in its interpretation, drawing on such factors as world knowledge, genre conventions and plausible inferences. Rather than attempting to model such extrinsic knowledge and thereby mimic the current understanding of the mental processes of human readers, RASTA proceeds under the assumption that the text itself contains sufficient clues to enable a computer to compute a feasible representation of its discourse structure, and therefore posits discourse relations solely on the basis of a linguistic analysis of the text. Although I do not wish to draw unwanted mentalist inferences on the basis of what is computationally feasible, it would not be surprising if it were to turn out that some of the superficial cues to discourse structure employed by RASTA were also employed by human readers. The psychological reality of the cues employed by RASTA is, however, a matter for separate experimental investigation.

Despite the emphasis on computational considerations in this dissertation, I hope that the results of this study will be accessible and of interest to researchers in discourse who do not possess a computational bent.

Representations of the structure of a text are by no means an end in themselves. Rather, such representations are expected to prove useful for future work on information retrieval and on the automatic acquisition of knowledge. Moreover, a reliable automated means of identifying discourse structure opens the way to large-scale empirical analyses of discourse. For example, it would be possible to consider, in a given genre, what linguistic devices are most commonly used to realize particular textual relations.


This dissertation has the following structure. Following a brief description of the data analyzed (chapter 2), it proceeds to a description of RST and the modifications made to the standard theory to fit it to the task at hand (chapter 3), then to a survey of previous work on computing representations of discourse structure (chapter 4). Chapter 5 presents a brief overview of the Microsoft English Grammar (MEG), within which RASTA is a component. Chapters 6 and 7 describe how RASTA identifies cues to discourse structure using the resources available within MEG and then constructs n-ary branching RST trees on the basis of the cues that it identifies. Chapter 8 brings together the algorithms described in chapters 6 and 7 and clarifies RASTA’s contribution to the field of computational discourse processing. Finally, chapters 9 and 10 conclude the dissertation, pointing to future research directions.


2. Data

For the present study, the data is limited to the text of the articles in Encarta 96 (Microsoft Corporation 1995, henceforth simply Encarta), an electronic multimedia encyclopedia of broad coverage, aimed at a general non-specialist audience. These articles form a corpus of a little over ten million words, in approximately 576,000 sentences.

Part of the ongoing research in the Microsoft Natural Language Processing Research Group concerns the acquisition of knowledge from natural language texts. An extensive semantic network, MINDNET (section 5.7), consisting of 120,000 head words and approximately seven million labeled arcs connecting lexical senses of those head words, has been constructed automatically by parsing dictionary definitions (Dolan 1995; Dolan et al. 1993; Richardson 1997; Richardson et al. 1993; Vanderwende 1995a, 1995b). These dictionary definitions are typically noun phrases or single clauses. While the same techniques used to acquire information from dictionary definitions could reasonably be applied to individual sentences in free text, knowledge of text structure is sure to improve the extraction of information.

The content of Encarta is of a high caliber, with many articles having been

contributed by recognized authorities in their fields. The content is therefore

worth acquiring into a semantic network. Furthermore, the content is non-

controversial. The articles tend to represent views and information whose

interpretation is widely accepted. This is an advantage for the task of

automatically acquiring knowledge, since the philosophically (and

computationally) difficult task of integrating and resolving conflicting

information can be avoided.

The text in Encarta has several pragmatic advantages from the point of

view of computational analysis. The most important advantage is that the articles

are well-edited: sentences are generally free of spelling or grammatical errors,

since an in-house style guide is used to ensure a high degree of consistency in

punctuation, lexical usage, and syntax. Although a computational system for

broad-coverage analysis (such as the Microsoft English Grammar described in

chapter 5) ought to be able to cope with occasional errors in text, a computational

syntactic analysis can be expected to achieve a high degree of accuracy if the text

is edited.

Although the articles are written to conform to an in-house style guide, the

discourse structure of the text exhibits great variety. The diversity of the

discourse structure has many causes. Many authors outside of the editorial team

(often specialists in a given field) have contributed articles—the article on

Language, for example, was contributed by Bernard Comrie and the article on

Native American Languages was contributed by Lyle Campbell. The diverse

subject matter of the articles in Encarta also motivates the diverse discourse

structure. For example, there are descriptions of physical objects, accounts of

historical battles, and explanations of religious views. Even within a single

article, however, there can be considerable complexity in the discourse structure.

Facts are not merely listed, they are presented in a coherent manner.

In conclusion, the text of Encarta, although edited to conform to a style

guide, exhibits great variety. The articles in Encarta are intended to be read by

non-specialists, and take the form of coherent texts. In section 8.5, I consider how

the research presented in this dissertation might need to be extended to apply to

other genres.

3. Rhetorical Structure Theory

3.1 Introduction

RST was developed during the 1980s by researchers in natural language

generation, many of whom were then involved with projects at the Information

Sciences Institute in Southern California. Since much of my research is informed

by the theoretical approach taken within RST, it is first necessary to outline the

theory, criticisms of it, and the modifications which I have made to adapt it to my

purposes.

3.2 Overview

RST (Mann and Thompson 1986, 1988) models the discourse structure of

a text by means of a hierarchical tree diagram. The terminal nodes of an RST tree

are propositions encoded in text. (Although RST analysts usually take care to

distinguish contiguous stretches of text, termed text spans, from the propositions

expressed in the text, in the discussion below I will simply refer to text spans.)

Non-terminal nodes represent contiguous text spans, whose daughter spans are

joined by discourse relations. These discourse relations are of two kinds:

symmetric and asymmetric.

A symmetric relation involves two or more text spans, each of which is

equally important in realizing the writer’s goals. By convention, each of these text

spans is labeled a nucleus. Figure 1 illustrates one kind of symmetric relation, the

SEQUENCE relation.1 Straight lines are used to represent the connection from the

child nodes of a symmetric relation to their parent node.

1 Unless otherwise indicated, all examples in this dissertation are taken from Encarta.

1. Napoleon met defeat in 1814 by a coalition of major powers,

notably Prussia, Russia, Great Britain, and Austria.

2. Napoleon was then deposed

3. and exiled to the island of Elba

4. and Louis XVIII was made ruler of France.

Figure 1 Waterloo, Battle of

An asymmetric relation involves exactly two text spans. One text span, the

nucleus, is more important in realizing the writer’s goals. The other text span, the

satellite, is in a dependency relation to the nucleus, modifying it in ways specified

in the definition of the particular relation. Figure 2 illustrates one kind of

asymmetric relation, the ELABORATION relation. A labeled arc is used to represent

the connection between the satellite and the nucleus. The arrowhead on the arc

points to the nucleus.

1. In most cases, Pseudepigrapha are modeled on canonical

books of a particular genre.

2. For example, Judith is inspired by the historical books of the Old

Testament.

Figure 2 Pseudepigrapha
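
The two kinds of relation can be made concrete with a small data structure. The Python below is an illustrative sketch only and is not drawn from RASTA; the class and field names (Span, Relation, nuclei, satellite) are assumptions introduced here. It encodes the constraints stated above: a symmetric relation has two or more nuclei and no satellite, while an asymmetric relation has exactly one nucleus and one satellite.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Span:
    """A terminal node: a contiguous stretch of text."""
    text: str


@dataclass
class Relation:
    """A non-terminal node whose daughters are joined by a discourse relation."""
    label: str
    nuclei: List["Node"]
    satellite: Optional["Node"] = None

    def __post_init__(self) -> None:
        if self.satellite is None:
            # Symmetric: two or more equally important nuclei.
            if len(self.nuclei) < 2:
                raise ValueError("a symmetric relation needs two or more nuclei")
        elif len(self.nuclei) != 1:
            # Asymmetric: exactly two spans, one nucleus and one satellite.
            raise ValueError("an asymmetric relation has exactly one nucleus")

    @property
    def symmetric(self) -> bool:
        return self.satellite is None


Node = Union[Span, Relation]

# Figure 1: a symmetric SEQUENCE relation over four nuclei.
sequence = Relation("SEQUENCE", [
    Span("Napoleon met defeat in 1814 by a coalition of major powers, "
         "notably Prussia, Russia, Great Britain, and Austria."),
    Span("Napoleon was then deposed"),
    Span("and exiled to the island of Elba"),
    Span("and Louis XVIII was made ruler of France."),
])

# Figure 2: an asymmetric ELABORATION relation; clause 2 is the satellite.
elaboration = Relation(
    "ELABORATION",
    nuclei=[Span("In most cases, Pseudepigrapha are modeled on canonical "
                 "books of a particular genre.")],
    satellite=Span("For example, Judith is inspired by the historical "
                   "books of the Old Testament."),
)
```

Because a nucleus or satellite may itself be a Relation, the same structure serves for adjacent clauses and for arbitrarily large text spans.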

Although units as large as paragraphs, sections or chapters may be used as

terminal nodes for a coarse-grained analysis, the terminal nodes of an RST tree are

usually clauses with “independent functional integrity” (Mann and Thompson

1988:248). Restrictive relative clauses, which by definition serve to modify a

head noun and are therefore not directly in significant discourse relations to other

clauses, do not qualify as minimal textual spans under this criterion. (I have also

chosen to disregard non-restrictive relative clauses on the grounds that they also

serve to modify a head noun.) Similarly, clausal subjects and complements do not

qualify as terminal nodes.

A nucleus or a satellite may be a tree with internal complexity. RST thus

claims that the same structural representation can be used for the relationship

between two adjacent clauses, or for the relationship between any two arbitrarily

large text spans. Figure 3 illustrates a plausible RST tree to represent the structure

of a brief excerpt from Encarta 96. In Figure 3, a RESULT relation connects two

text spans—the span consisting of clauses 1 and 2 and the span consisting of

clauses 3 and 4—where each text span has internal structure, represented as an

RST subtree. (See Mann and Thompson (1988) for definitions of the relations

employed here.)

1. Nelson, however, surprised his adversary

2. by ordering his ships into two groups, each of which assaulted

and cut through the French fleet at right angles, demolishing the

battle line;2

3. this bold strategy created confusion,

4. giving the British fleet an advantage.

Figure 3 Trafalgar, Battle of

2 A few relative clauses in Encarta contain “mini-discourses”. In this example, there is

a SEQUENCE relation between assaulted and cut through the French fleet at right angles, with a

RESULT relation between this SEQUENCE and the clause demolishing the battle line. To avoid

excessive granularity, I do not construct RST analyses within relative clauses, although in

principle the same techniques could be used to construct representations for these mini-

discourses.

3.3 Conditions on the structure of trees

Four criteria determine the well-formedness of an RST tree (Mann and

Thompson 1988):

1. Completeness: a single tree covers the entire text.

2. Connectedness: each text span in the text, with the exception of the text span

which covers the entire text, is a node in the tree.

3. Uniqueness: text spans have a single parent.

4. Adjacency: only adjacent text spans can be grouped together to form larger

text spans.
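
These four conditions can be checked mechanically. The following is a minimal sketch under stated assumptions, not a description of any existing system: terminal text spans are numbered 0, 1, 2, ... in text order, a tree is written as nested Python lists of those numbers, and the function names are hypothetical. With that encoding every node has a single parent by construction, so the check reduces to reading the leaves from left to right.

```python
from typing import List, Union

# A tree is either a terminal unit (its index in the text) or a list of subtrees.
Tree = Union[int, List["Tree"]]


def leaves(tree: Tree) -> List[int]:
    """Return the terminal unit indices of a tree in left-to-right order."""
    if isinstance(tree, int):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def is_well_formed(tree: Tree, unit_count: int) -> bool:
    """Check a tree over the terminal units 0 .. unit_count - 1."""
    found = leaves(tree)
    # Completeness and connectedness: every unit appears in the one tree.
    # Uniqueness: no unit appears twice, so no unit has two parents.
    # Adjacency: the leaves read 0, 1, 2, ... from left to right, so every
    # subtree covers a contiguous stretch and groups only neighboring spans.
    return found == list(range(unit_count))
```

For a four-clause text, `[[0, 1], [2, 3]]` passes, whereas `[[0, 2], [1, 3]]` fails because it groups non-adjacent spans.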

3.4 Formalizing the relations

Four parameters are used in describing RST relations (Mann and

Thompson 1988:245):

1. Constraints on the nucleus

2. Constraints on the satellite

3. Constraints on the combination of nucleus and satellite

4. The effect

Figure 4 gives the definition of the VOLITIONAL CAUSE relation (Mann

and Thompson 1988:274-275). N stands for nucleus, S for Satellite, W for

Writer, and R for Reader.

Relation name: VOLITIONAL CAUSE

Constraints on N: presents a volitional action or else a situation that could

have arisen from a volitional action

Constraints on S: none

Constraints on the N+S combination:

S presents a situation that could have caused the agent of

the volitional action in N to perform that action;

without the presentation of S, R might not regard the action

as motivated or know the particular motivation;

N is more central to W’s purposes in putting forth the N-S

combination than is S.

The effect: R recognizes the situation presented in S as a cause for the

volitional action presented in N

Locus of the effect: N and S

Figure 4 Definition of the VOLITIONAL CAUSE relation
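
A definition of this kind is naturally stored as a record with one field per parameter. The sketch below is illustrative only; the class and field names are assumptions made here, and the field values simply restate Figure 4. Note that the record contains no field for linguistic form: nothing in the definition says how the relation is signaled in text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RelationDefinition:
    """One RST relation definition; N = nucleus, S = satellite, W = writer, R = reader."""
    name: str
    constraints_on_nucleus: str
    constraints_on_satellite: str
    constraints_on_combination: Tuple[str, ...]
    effect: str
    locus_of_effect: str


VOLITIONAL_CAUSE = RelationDefinition(
    name="VOLITIONAL CAUSE",
    constraints_on_nucleus=(
        "presents a volitional action or else a situation that could "
        "have arisen from a volitional action"
    ),
    constraints_on_satellite="none",
    constraints_on_combination=(
        "S presents a situation that could have caused the agent of "
        "the volitional action in N to perform that action",
        "without the presentation of S, R might not regard the action "
        "as motivated or know the particular motivation",
        "N is more central to W's purposes in putting forth the N-S "
        "combination than is S",
    ),
    effect=(
        "R recognizes the situation presented in S as a cause for the "
        "volitional action presented in N"
    ),
    locus_of_effect="N and S",
)
```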

As the epistemic modal phrases could have and might not make clear, the

definition of an RST relation allows for subjective evaluation on the part of the

analyst. The analyst proposes a judgment of the plausibility of suggesting that the

writer intended a certain effect (Mann and Thompson 1988:245). This

subjectivity is mitigated by the fact that different analysts tend to agree in their

analyses, or at least to be able to see the validity of one another’s analyses (Mann

and Thompson 1988:265).

It is important to note that the description of an RST relation does not

include a description of the linguistic forms employed to realize the relation.

Indeed, Mann and Thompson (1986:68, 70-72) note that “relational propositions

arise in a text independently of any specific signals of their existence”. Mann and

Thompson (1986:71-72) even go so far as to suggest that a search for subtle

correlates of discourse relations is futile, a claim to which I return in section 4.2.

3.5 The set of relations

Although there is widespread acceptance by advocates of RST and

advocates of other theories of discourse (among them, Ballard et al. 1971; Grimes

1975; Halliday and Hasan 1976; Longacre 1976; Hobbs 1979) that relations of

the type proposed by RST are useful for describing the structure of discourse,

several questions arise:

1. How many relations are there?

2. How do we justify a particular set of relations?

3. How are the relations organized?

In answer to the question “How many relations are there?”, Hovy (1990)

identifies a total of approximately 350 relations which have been posited in the

linguistics, philosophy, and artificial intelligence literature. Within RST, for

example, Mann and Thompson (1986) propose fifteen relations, Mann and

Thompson (1988) propose twenty-three, and Fox (1987) proposes thirteen. It is

not the case, however, that Mann and Thompson (1988), in which twenty-three

relations are proposed, contains a superset of the relations in Mann and

Thompson (1986) or Fox (1987).

Hovy distinguishes a Parsimonious Position, advocated by Grosz and

Sidner (1986) in their work on Centering and Focusing, which posits two very

basic relations, Dominance and Satisfaction-Precedence. These two relations are

claimed to be sufficient for describing speaker intentions in discourse. Indeed,

Grosz and Sidner (1986) claim that it is futile to try to identify a larger finite set

of relations, since closer inspection always reveals increasingly subtle semantic

nuances.

Although two broad relations might be sufficient to describe the

intentional structure of discourse, they have been found to be insufficient for the

computational generation of natural language (McKeown 1985; Hovy 1988,

1990). This inadequacy motivates what Hovy labels the Profligate Position,

whose adherents claim that some tens of relations are needed to adequately

describe the structure of discourse.

One such profligate position is that advocated by Mann and Thompson

(1988), who suggest classifying some relations as primarily concerned with

subject matter and others as primarily presentational, although they decline to

impose a single taxonomy on the set of relations which they posit. Others,

however, have attempted to devise taxonomies of discourse relations. Hovy

(1990), for example, proposes a taxonomy which he claims subsumes the

approximately 350 discourse relations posited in the literature he surveys. Hovy’s

three-way top-level branches Elaboration, Enhancement and Extension are taken

from the expansion types of complex clauses within Systemic Functional

Grammar (Halliday 1985), a linguistic theory which has had considerable

influence on RST. Hovy (1990:133) cites the inclusiveness of cue words and

phrases as evidence of the correctness of the taxonomy. Cue words and phrases

associated with a node can be felicitously used in realizing relations occurring as

daughters of that node, but cue words and phrases cannot necessarily be

felicitously used with relations occurring as sister or parent nodes. For example,

the conjunction then is associated with the SEQUENCE relation, and can be used

for its daughter relations, as examples (1) and (2) (Hovy 1990:133) show. (The

grammaticality judgments given below are Hovy’s.)

(1) SEQTEMPORAL: First you play the long note, then the short ones.

(2) SEQSPATIAL: On the blue wall I have a red picture, then a blue

one.

The cue words after and beside, however, are limited to the

SEQTEMPORAL and SEQSPATIAL relations respectively, as examples (3) and (4)

(Hovy 1990:133) show.

(3) SEQTEMPORAL: After/*Beside the long note you play the short

ones.

(4) SEQSPATIAL: Beside/*After the red picture is the blue one.
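
The inheritance pattern in examples (1) through (4) can be sketched as a lookup over a taxonomy fragment. The code below is an illustration only: it covers just the three relations and three cue words mentioned above, and the table and function names are assumptions, not part of Hovy's proposal. A cue attached to a node is available to that node's daughters, but a daughter's cue is not available to its parent or sisters.

```python
from typing import Dict, Optional, Set

# Fragment of the taxonomy: each relation maps to its parent (None at the top).
PARENT: Dict[str, Optional[str]] = {
    "SEQUENCE": None,
    "SEQTEMPORAL": "SEQUENCE",
    "SEQSPATIAL": "SEQUENCE",
}

# Cue words associated directly with each node.
CUES: Dict[str, Set[str]] = {
    "SEQUENCE": {"then"},
    "SEQTEMPORAL": {"after"},
    "SEQSPATIAL": {"beside"},
}


def licensed_cues(relation: str) -> Set[str]:
    """Collect the cues of a relation and of every node above it."""
    cues: Set[str] = set()
    node: Optional[str] = relation
    while node is not None:
        cues |= CUES.get(node, set())
        node = PARENT[node]
    return cues
```

Here `licensed_cues("SEQTEMPORAL")` contains then and after but not beside, matching the judgments in example (3).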

Maier and Hovy (1991) reject the taxonomy of Hovy (1990) on the

grounds that it fails to recognize the communicative differences between the

various relations. Instead, they propose a three-way top-level distinction based on

the three meta-functions of language within Systemic Functional Grammar:

Ideational: reflecting facts about the world

Interpersonal: involving the reader’s attitudes towards the propositional

content

Textual: purely for presentational purposes

Maier and Hovy’s taxonomy is an elaboration of the subject matter versus

presentational distinction in Mann and Thompson (1988), with Ideational

corresponding to Mann and Thompson’s subject matter relations and Textual

corresponding to Mann and Thompson’s presentational relations.

Like Maier and Hovy (1991), Wu and Lytinen (1990) propose a three-way

classification of RST relations for persuasive texts such as advertisements. They

classify the relations according to a semantic analysis into three speech actions:

clarify, make adequate, and remind.

The motivation for Wu and Lytinen’s (1990) taxonomy is unclear. Indeed,

the primary justification for the set of relations in various works on RST is that the

relations posited are descriptively adequate. Mann and Thompson (1988:259), for

example, cite the thousands of clauses that they have successfully analyzed from a

range of genres as evidence for the efficacy of RST. Others (Sanders 1992;

Sanders et al. 1992, 1993; Knott and Dale 1995; Sanders and van Wijk 1996) find

descriptive adequacy to be an unsatisfying primary justification for a set of

rhetorical relations, and instead prefer to view relations as psychological

constructs. Knott and Dale (1995), for example, note that “descriptive adequacy”

is only meaningful if there is a clear purpose for which the descriptions must be

adequate. Moreover, they observe that the sets of relations posited by researchers

within RST are not as diverse as might be expected if descriptive adequacy were

the only criterion, suggesting that analysts are relying on their intuitions in

formulating plausible sets of relations.

Sanders, Spooren and Noordman (1992, 1993) propose a set of cognitive

primitives that can be combined to yield various classes of discourse relations.

Finer distinctions could be made by adding parameters. The primitives concern

the causal nature of the relation, whether the relation is coherent on semantic or

pragmatic grounds, whether a relation has a basic order, with the antecedent on

the left, or a non-basic order, and whether the polarity of the relation is positive

or negative (involving a violation of expectations). Sanders, Spooren and

Noordman tested the validity of their relations by having analysts apply the

relations to texts, and by having non-linguists decide among cue-phrases for texts.

Knott and Dale (1995) criticize the basis of some of Sanders et al.’s

parameters, especially the notion of basic order. Knott and Dale also criticize the

coarse-grained classes of relations described by Sanders et al.’s parameters, and

the notion that adding more parameters would continue to yield neatly divided

relations. Despite these criticisms, Knott and Dale still prefer to view relations of

the type posited in RST as psychologically valid. Whereas Hovy (1990) proposes

using the generalizability of cue words and phrases simply as a test of the validity

of a taxonomy of discourse relations, Knott and Dale use this test as a means to

construct a taxonomy of discourse relations from the ground up, noting that

“linguistic devices (in particular, cue phrases) can be taken as evidence for

relations, provided these are conceived as constructs which people actually use

when creating and interpreting text” (Knott and Dale 1995:46). They comment

that

“Studying the means available for marking relations in a given

language should be able to tell us about the relations which people

actually make use of. The methodology might be described in

Hallidayan terms, as using the cohesive devices a language affords

as evidence for a psychological theory of text coherence.” (Knott

and Dale 1995:45, original emphasis)

Knott and Dale propose a method for isolating cue phrases and a method for

testing the generalizability of those cue phrases, and then construct a taxonomy of

discourse relations. They find analogues of the original RST relations SEQUENCE,

CONTRAST, CIRCUMSTANCE, CAUSE and RESULT. Interestingly, they find no basis

for the distinction in RST between VOLITIONAL-RESULT and NON-VOLITIONAL-

RESULT, no cue phrases associated with EVALUATION or BACKGROUND, and no

single phrase associated with ELABORATION.

In section 3.6 I outline and motivate the set of relations used in the present

study, and suggest how the relations might be organized and applied within what

I will term an “underspecified” view of RST.

3.6 Underspecified Rhetorical Structure Theory

Vander Linden (1993:6) observes that “Instructional text tends to have a

fairly simple intentional structure, and a more complex rhetorical one”. Similarly,

Maier and Hovy observe that

“In most cases, we believe, ideational and textual relations are

subordinated to interpersonal ones (that is, they structure a

discourse that is motivated by and fulfills an interpersonally

related communicative function). … In general, an interpersonal

text plan is pursued until, at some point, it bottoms out into a call

for the presentation of information…in the extreme case it is even

possible that a whole text is governed by a single DESCRIBE, as

with encyclopedia entries.” (Maier and Hovy 1991:6, emphasis

added)

This is certainly true of articles in Encarta 96. Articles are structured almost

exclusively in terms of ideational and textual relations subordinated to a speech

act like DESCRIBE or EXPLAIN.

Of the original set of RST relations (Mann and Thompson 1986, 1988) the

following interpersonal relations do not appear to be needed at all for an adequate

analysis of articles in Encarta 96: ANTITHESIS, ENABLEMENT, EVALUATION,

INTERPRETATION, MOTIVATION, and SOLUTIONHOOD. In fact the only

interpersonal relation which is needed for an analysis of articles in Encarta 96 is

CONCESSION.

One criticism of RST is that a text can simultaneously have both an

intentional and an informational representation (Ford 1986; Moore and Pollack

1992), but that these intentional and informational representations will not

necessarily have the same structure. RST is not able to represent this possible

mismatch between intentional and informational representations, since it requires

that an analyst choose one relation to relate two text spans, necessitating a choice

between either an intentional relation or an informational relation. For the articles

in Encarta 96, since ideational and textual relations predominate, this criticism is

of less concern.

At least the following thirteen relations appear to be needed for the

analysis of Encarta 96 articles: ASYMMETRICCONTRAST, CAUSE, CIRCUMSTANCE,

CONCESSION, CONDITION, CONTRAST, ELABORATION, JOINT, LIST, MEANS,

PURPOSE, RESULT, SEQUENCE. As per Knott and Dale (1995), I do not distinguish

VOLITIONAL RESULT from NON-VOLITIONAL RESULT. I also do not distinguish

VOLITIONAL CAUSE from NON-VOLITIONAL CAUSE. These thirteen relations are a

relatively uncontroversial common subset of all the rhetorical relations that have

been proposed (Hovy 1990 and the references therein). Not only are these

relations uncontroversial, they are also ones that can reliably be identified by

automatic means (chapter 6). The approach and insights outlined in this

dissertation would not be nullified should a different set of relations be used. For

example, if the CAUSE relation were broken down into VOLITIONAL CAUSE and

NON-VOLITIONAL CAUSE (as per Mann and Thompson 1988) and if linguistic

cues could be found which reliably identified each of these two relations, then the

remaining architecture outlined below would remain unchanged. In particular, the

algorithm that constructs RST trees on the basis of a set of hypothesized discourse

relations (chapter 7) is not sensitive to the peculiar attributes of specific relations,

and would therefore still operate if a different set of relations were used.

With the exception of the interpersonal relation CONCESSION, these

thirteen relations cannot be usefully distinguished by appeal to the three meta-

functions of language used by Maier and Hovy (1991). Instead, I propose the

following simple taxonomy which makes a two-way top-level distinction between

symmetric and asymmetric relations:

Figure 5 Taxonomy of discourse relations
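
The two-way distinction can also be stated as a simple lookup. The sketch below is an illustration rather than RASTA's code: the set and function names are assumptions, and the assignment of each of the thirteen relations to a class follows the discussion of schemas in section 3.7 together with common RST usage. In particular, placing LIST among the symmetric relations is an assumption of this sketch.

```python
# Relations whose daughters are all nuclei (assumed grouping, see lead-in).
SYMMETRIC = frozenset({"CONTRAST", "JOINT", "LIST", "SEQUENCE"})

# Relations that join one nucleus and one satellite (assumed grouping).
ASYMMETRIC = frozenset({
    "ASYMMETRICCONTRAST", "CAUSE", "CIRCUMSTANCE", "CONCESSION",
    "CONDITION", "ELABORATION", "MEANS", "PURPOSE", "RESULT",
})


def is_symmetric(label: str) -> bool:
    """Return True for a symmetric relation, False for an asymmetric one."""
    if label in SYMMETRIC:
        return True
    if label in ASYMMETRIC:
        return False
    raise KeyError(f"unknown relation: {label}")
```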

In view of the discussion in section 3.5 concerning debates in the literature about

sets of relations and taxonomies of those relations, some discussion of the top-

level distinction between symmetric and asymmetric relations is in order.

Mann and Thompson (1988) suggest that the distinction between a

nucleus and a satellite reflects differences in the organization of text. The nucleus

is “more deserving of response” (Mann and Thompson 1988:270), while the

satellite gains its significance only in relation to a nucleus. In an asymmetric

relation, the decision to encode something as a nucleus reflects the relative

importance of the proposition in expressing the writer’s goals, as opposed to the

ancillary status of satellite material. In a symmetric relation, equal importance is

attached to the propositions expressed in all the daughter nodes. In this sense, the

nodes in a symmetric relation can all be considered to be nucleic.

In Encarta 96, it is usually clear whether a text span stands as a nucleus or

a satellite to another text span, or whether in fact no direct discourse relation

holds between the two spans. It is occasionally less clear exactly what relation

ought to be posited. Marcu (1997a) makes a similar observation concerning the

RST trees constructed by two analysts for five small texts. The analysts tended to

agree about which nodes were nuclei and which were satellites, even if they

differed in the labels they assigned to the relationships linking those nodes. This

suggests that the task of constructing discourse representations can be broken

down into two components: identifying whether a symmetric or asymmetric

relation holds, and labeling that relation. For a computational system, it is

desirable to avoid the construction of great numbers of trees which have the same

shape but differ only in their labeling. Rather than construct many trees with the

same structure, RASTA represents these alternatives as a list of labels on a given

node. The use of a list of labels is intended to represent the indeterminacy among

the various labels, not to suggest that all of the labels apply simultaneously. In

this sense, the trees are underspecified: the overall structure of the tree is given,

but labeling can vary from determinate, i.e. a single label is the most plausible, to

less determinate, i.e. multiple labels are plausible. There is a third, albeit rare,

possibility: no label is plausible, the implications of which I consider now.

In rare instances, it is not clear that any label is appropriate, although a

symmetric versus asymmetric distinction can be made. In these cases, RASTA

labels the relation with a question mark, as illustrated in Figure 6.

1. The legs have powerful claws,

2. adapting the animal for rapid digging into hard ground.

Figure 6 Echidna
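
The underspecified representation can be sketched as a node that carries a list of candidate labels instead of a single label. The Python below is a minimal illustration under stated assumptions; the class, method and function names are hypothetical and are not RASTA's own. The list records alternatives among which the analysis is indeterminate, not labels that hold simultaneously, and an empty list is displayed as a question mark.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class UnderspecifiedNode:
    """A relation node whose structure is fixed but whose label may not be."""
    symmetric: bool
    labels: List[str] = field(default_factory=list)

    def determinacy(self) -> str:
        if not self.labels:
            return "unlabeled"
        if len(self.labels) == 1:
            return "determinate"
        return "less determinate"

    def display_label(self) -> str:
        # Several labels are shown as alternatives; no label is shown as "?".
        return " | ".join(self.labels) if self.labels else "?"


def trees_represented(nodes: Iterable[UnderspecifiedNode]) -> int:
    """Count the fully labeled trees that one underspecified tree stands for."""
    count = 1
    for node in nodes:
        count *= max(1, len(node.labels))
    return count
```

A tree with two nodes carrying two and three candidate labels respectively stands for six fully labeled trees of identical shape, which is the saving that motivates the representation.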

This then raises the question of why a simple distinction between symmetric and

asymmetric relations is not sufficient. That is, why does RASTA attempt a finer-

grained analysis? The attempt to identify meaningful labels for relations wherever

possible is motivated by the uses to which the output of a computational system

like RASTA might be put (chapter 9). Text summarization, information retrieval,

and the extraction of information from written text would all benefit from

meaningfully labeled relations. For example, to locate a section of text within a

document which might answer the question “Why…” it is useful to distinguish a

CAUSE relation.

Much of the emphasis in the literature on constructing elaborate

taxonomies is motivated by issues to do with the computational generation of

natural language, in which text planners make ever finer decisions concerning the

organization of material, until they terminate in decisions about specific

grammatical encoding. In contrast to this, discriminating among a relatively small

number of discourse relations can be achieved by simply attempting to recognize

each discourse relation, or at least recognizing whether there is a symmetric or

asymmetric relation. This process can be carried out without reference to

elaborate taxonomies.

3.7 Schemas

RST relations are organized into schemas. Mann and Thompson

(1988:247) give the following five schemas:

Figure 7 RST schemas (Mann and Thompson 1988:247)

Schema (1) represents what I term the asymmetric relations. Schemas (2),

(3) and (5) correspond to what I term the symmetric relations; the CONTRAST

relation is conversive, whereas the JOINT and SEQUENCE relations are not. The

JOINT relation does not posit a contentful relationship between its daughter nodes,

and so lacks the arcs connecting those nodes. Finally, schema (4) is of a type not

found in Encarta 96. Fox (1987) describes another schema which she calls an

Issue, consisting of a nucleus and several satellites. It must be emphasized that

Fox is consistent with Mann and Thompson in viewing these schemas as a

structural classification of the classes of relations, rather than as a notation for

representing recurrent combinations of relations in discourse. Neither Fox nor

Mann and Thompson suggest (to give a hypothetical example) that a

CIRCUMSTANCE relation is more likely to hold between a simple text span and a

text span with the internal structure of a SEQUENCE than between a simple text

span and a text span which is the rightmost node in a SEQUENCE relation, i.e. that

example (1) in Figure 8 is more likely than example (2):

Figure 8 Alternative discourse structures

Sumita et al. (1992) propose restrictions on thinking flow. These are linear

sequences of relations which are held to be indicative of well-formed RST trees,

and which are used to constrain the number of RST trees which are constructed

for a text. The sequences of relations which they propose appear to be hand-

crafted, based on the intuitions of linguists. Although Sumita et al. claim

improvements in their system resulting from the application of these thinking

flow restrictions, it is not clear that the restrictions are empirically well motivated

(see section 4.2).

It ought to be possible to identify recurrent configurations of RST relations

for a given genre. For example, it may turn out to be the case that for

encyclopedia articles about historical battles, there is frequently a SEQUENCE

relation between several clauses, with a RESULT relation modifying the last of the

nuclei in the SEQUENCE. A set of configurations could be used to constrain the

discourse structures created by RASTA. Since the identification of such recurrent

configurations requires analyses of a great many texts, I suggest this as a

promising avenue for future research given a reliable automated means of

computing discourse representations (chapter 9). Since RASTA is able to reliably

compute representations of discourse structure without reference to such macro-

structures, the main benefit of schemas would be to improve the efficiency of

RASTA by constraining the range of structures that it might consider in arriving at

the preferred analysis.

3.8 Conclusion

For a limited domain, namely articles in Encarta, a set of thirteen

rhetorical relations suffices. For the task of constructing plausible representations

of discourse structure, it is not necessary to devise an elaborate taxonomy for

those relations. Rather, a simple distinction between symmetric and asymmetric

relations suffices.

The theoretically problematic issue of the occasional uncertainty

concerning the appropriate label to apply to a relation motivates an underspecified

representation. This underspecified representation is computationally attractive

because it allows a condensed representation of multiple trees having the same

structure but differing in labeling. The identification of recurrent configurations

for specific genres might lead to novel ways to constrain the search for a

preferred RST analysis for a text, but is not essential for RASTA to reliably

construct RST representations.

4. Previous Work on Computing Discourse

Representations

4.1 Introduction

In the literature, there are few descriptions of algorithms for computing

discourse representations, and still fewer descriptions of implementations of such

algorithms. In the following sections I briefly review this work.

4.2 Rhetorical Structure Theory

4.2.1 Mann and Thompson (1986, 1988)

Mann and Thompson (1986, 1988) recognize that rhetorical relations are

often signaled by cue words and phrases, but emphasize that rhetorical relations

can still be discerned even in the absence of such cues. From this it follows that

for written text in general any attempt to construct a representation of discourse

solely on the basis of cue words and phrases is doomed to failure. Despite this

pessimistic prognosis, various researchers have attempted to model rhetorical

structure, sometimes solely on the basis of such superficial cues.

4.2.2 Sumita et al. (1992) and Ono et al. (1994)

Researchers at Toshiba Corporation (Sumita et al. 1992; Ono et al. 1994)

analyzed Japanese and English texts and attempted to construct representations of

discourse structure based on referential continuity (determined by simple lexical

repetition) and cue words and phrases. Their analysis appears to be based on

extremely simple pattern matching of strings, rather than a full syntactic analysis.

A flat structure is produced, representing the relations between adjacent

sentences. Hierarchical structure is then constructed over this flat representation

according to constraints on thinking flow, defined as plausible sequences of

relations. Sumita et al. give an example of one such thinking flow restriction:

“Consider the sequence [P <EG> Q <SR> R], where P, Q, R are

arbitrary (blocks of) sentences. The premise of R is obviously not

only Q but both P and Q. Since the argument in P is considered to

close locally, the two should be grouped into a block.” (Sumita et

al. 1992:1134, EG = exemplification, SR = serial connection)

As noted in section 3.7, these thinking flow restrictions apparently result from the

intuitions of linguists, rather than being deduced empirically. In addition to these

thinking flow restrictions, a set of template strings is used to evaluate discourse

structures. An example of such a template string is “…? …? The reason is, …”

(Sumita et al. 1992:1135), a shorthand notation which is interpreted as a string of

characters ending in a question mark, followed by another string of characters

ending in a question mark, followed by the string “The reason is,”. Again, these

templates are apparently based on the intuitions of linguists, rather than being

deduced from empirical analysis of texts. Although Sumita et al. (1992) compare

the output of their system to human analyses they do not describe the specific

contribution of the thinking flow restrictions and template strings in guiding their

system to plausible analyses of text structure.
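
The template string described above can be approximated with a regular expression. The following is a minimal sketch in Python, not the Toshiba implementation: the pattern and the function name matches_template are assumptions introduced purely for illustration.

```python
import re

# Hypothetical rendering of the template "...? ...? The reason is, ...":
# a string ending in "?", another string ending in "?", then "The reason is,".
TEMPLATE = re.compile(r"[^?]*\?\s*[^?]*\?\s*The reason is,")


def matches_template(text: str) -> bool:
    """Return True if the text contains the question-question-reason pattern."""
    return TEMPLATE.search(text) is not None
```

A text with two questions followed by "The reason is," satisfies the template, while a text with no questions does not. Nothing in such a pattern inspects syntactic structure, which is the sense in which the analysis is superficial.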

The data for the Toshiba research are newspaper articles and short

academic articles. Ono et al. (1994) observe that the academic articles contain

many more cue phrases than the newspaper articles, which enables them to more

accurately construct representations of discourse structure, and therefore to

construct qualitatively better summaries based on those discourse representations.

4.2.3 Kurohashi and Nagao (1994)

In a similar vein to the research at Toshiba, Kurohashi and Nagao (1994)

create discourse structures which (to judge by their illustrations) are similar to the

structures posited by RST. They create these structures by examining cue words,

topic-chains identified by lexical repetition, and metrics that measure the

similarity of two sentences (apparently determined by word repetition, thesaural

relations and patterns of sequences of parts of speech). Kurohashi and Nagao base

their research on an odd data-set: translations into Japanese of articles in an

English language popular science magazine. They give few details of their

system, how they measure similarity, how they construct hierarchical

representations once they have identified cues to discourse structure, or even

exactly what those cues are.

4.2.4 Fukumoto and Tsujii (1994)

Fukumoto and Tsujii (1994) sketch a formalism for constraining the

selection of one of four interpersonal relations (BACKGROUND, ENABLEMENT,

EVIDENCE, MOTIVATION) according to the tense, aspect and modality of the

clauses between which a relation is being posited. Unfortunately, the formalism

involves subjective evaluation of such things as whether the outcome of a

situation is “good” or “bad” (Fukumoto and Tsujii 1994:1182). Fukumoto and

Tsujii’s examples of the application of this formalism to a text appear to be

based on a hand-analysis. Although some aspects of the identification of RST

relations are made explicit, the formalism does not appear to be sufficiently

explicit for a computational implementation. Finally, Fukumoto and Tsujii do

not present a general method for constructing RST representations for a text once

relations have been identified by employing their formalism.

4.2.5 Wu and Lytinen (1990)

Wu and Lytinen (1990) briefly describe the BUYER system, which deduces

coherence relations from a propositional representation of an advertisement.

Although details of their system are sketchy (their description of the control flow

of their system contains such steps as “Decide implicational or semantical

relations and coherence relations.”, Wu and Lytinen 1990:508), it does not appear

to contain explicit procedures for dealing with multi-nucleic relations, any way to

decide among alternative possible coherence relations, nor any way to evaluate

alternative trees that might be constructed.

4.2.6 Marcu (1996, 1997a)

Marcu (1996) provides a first-order formalization of RST trees, along with

an algorithm for constructing all the RST trees compatible with a set of

hypothesized rhetorical relations for a text. Marcu employs the notion of

nuclearity in developing his algorithm for constructing RST trees. As Marcu

observes, two adjacent text spans can be related by an RST relation if and only if

that relation holds between the nuclei of the two text spans; satellites of the text

spans do not enter into the determination of this relationship. RST trees can thus

be assembled from the bottom up by joining text spans whose nuclei have been

posited to be potentially in some rhetorical relationship. Given a set of rhetorical

relations that might hold between pairs of RST terminal nodes, Marcu’s algorithm

will produce all of the valid RST trees which are compatible with the relations

posited.
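
The bottom-up assembly described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Marcu's formalization: it assumes that every relation is mononuclear, that each span has exactly one nucleus unit, and that hypothesized relations are supplied as a dictionary keyed by (nucleus unit, satellite unit). The names Tree, Relations and build are hypothetical.

```python
from typing import Dict, List, Tuple, Union

# A tree is either a leaf (the index of a terminal node) or a tuple
# (relation, nucleus_subtree, satellite_subtree, order), where order records
# whether the nucleus precedes ("NS") or follows ("SN") the satellite.
Tree = Union[int, Tuple[str, "Tree", "Tree", str]]

# Hypothesized relations, keyed by (nucleus unit, satellite unit).
Relations = Dict[Tuple[int, int], str]


def build(i: int, j: int, rels: Relations) -> List[Tuple[Tree, int]]:
    """Enumerate (tree, nucleus unit) pairs covering terminal nodes i..j."""
    if i == j:
        return [(i, i)]
    results: List[Tuple[Tree, int]] = []
    for k in range(i, j):  # split into adjacent spans i..k and k+1..j
        for left, left_nuc in build(i, k, rels):
            for right, right_nuc in build(k + 1, j, rels):
                # Join two spans only if a relation was posited between their nuclei.
                if (left_nuc, right_nuc) in rels:
                    name = rels[(left_nuc, right_nuc)]
                    results.append(((name, left, right, "NS"), left_nuc))
                if (right_nuc, left_nuc) in rels:
                    name = rels[(right_nuc, left_nuc)]
                    results.append(((name, right, left, "SN"), right_nuc))
    return results


# Invented relations over three terminal nodes, for illustration only.
rels = {(0, 1): "ELABORATION", (0, 2): "EVIDENCE", (1, 2): "CIRCUMSTANCE"}
trees = build(0, 2, rels)
```

Even with three terminal nodes and three hypothesized relations, more than one tree covers the text, so some further criterion is needed to choose among them.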

Marcu’s algorithm suffers from combinatorial explosion–as the number of

relations increases, the number of possible RST trees increases exponentially.

Marcu first produces all possible combinations of nodes according to the relations

posited and then filters ill-formed trees.
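
The scale of the problem can be illustrated by counting tree shapes alone. The sketch below counts the distinct binary bracketings over a sequence of adjacent units, which is given by the Catalan numbers. It ignores relation labels and nuclearity, so it is an illustration of the growth rate under that simplifying assumption rather than a measurement of Marcu's system; the function name is hypothetical.

```python
from math import comb


def binary_tree_shapes(n_leaves: int) -> int:
    """Number of distinct binary bracketings over n_leaves adjacent units.

    This is the Catalan number C(n_leaves - 1).
    """
    n = n_leaves - 1
    return comb(2 * n, n) // (n + 1)


for n_leaves in (2, 5, 10, 20):
    print(n_leaves, binary_tree_shapes(n_leaves))
```

A generate-and-filter strategy must in the worst case consider every one of these shapes before discarding the ill-formed ones.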

Marcu (1996) leaves two questions unanswered. First, on what basis might

a computational system posit the relations that the algorithm then uses as the basis

for constructing RST trees? Second, what criteria should a computational system

use for evaluating alternative well-formed RST trees in an effort to determine

which trees might be more plausible? Marcu (1997a) attempts to answer both of

these questions.

Marcu (1997a) identifies cue phrases that are compatible with various

rhetorical relations, distinguishing rhetorical uses of those phrases from sentential

uses. Marcu identifies these cue phrases in a text by means of a shallow analysis,

essentially pattern matching based on regular-expressions. In the process of

identifying the cue phrases, Marcu also identifies clause boundaries. On the basis

of the cue phrases, Marcu’s algorithm posits rhetorical relations between the

clauses identified. These rhetorical relations are then used to assemble RST

representations as per Marcu (1996). Finally, Marcu’s algorithm evaluates the

RST trees constructed according to a metric that favors trees that skew to the right

(see below, this section).

Although Marcu’s algorithm for constructing RST representations

represents a considerable advance, it is not without its problems. Some of these

problems, discussed below, result from an over-reliance on cue phrases and the

use of pattern matching to identify cue phrases and terminal nodes. These

problems are perhaps true of other similar methods described in the literature (in

particular, Ono et al. 1994 and Sumita et al. 1992), but Marcu gives the clearest

description of these techniques and provides an examination of their efficacy.

Marcu’s method for evaluating trees is perhaps not sufficiently genre-

independent, while the algorithm for constructing trees suffers from

combinatorial explosion—as the number of hypothesized relations increases, the

number of well-formed trees produced increases exponentially.

An overreliance on cue phrases as evidence for discourse structure makes

it difficult to ensure that a computational discourse analyzer will be able to

construct a representation that completely covers the text (Mann and Thompson’s

criterion 1, section 3.3). As Mann and Thompson (1986, 1988) note (see section

4.2.1), rhetorical relations can be discerned even in the absence of cue phrases. If

most clauses contained cue phrases that could be used as evidence for discourse

structure, then a computational discourse analyzer might still be able to achieve

analyses of large fragments of a text by relying exclusively on cue phrase

identification. Redeker (1990) examines transcripts of oral retellings of films, and

finds that approximately 50% of all tensed clauses contain cue phrases that

function as discourse markers. Marcu (1997a:97) interprets this percentage as

“sufficiently large to enable the derivation of rich rhetorical structures for texts.”

Leaving aside the differences that might exist between oral and written texts, a

more pessimistic evaluation might lead to concern about the clauses that do not

contain cue phrases. Even in academic discourse, a genre where we might expect

a high density of cue phrases, it is clearly not the case that every clause contains a

cue phrase that explicitly indicates its discourse relation to other clauses. Marcu’s

algorithm contains no criteria for positing rhetorical relations in the absence of

cue phrases. (Of course, cue phrases, when present, are a compelling form of

evidence for identifying discourse structure. RASTA also identifies cue phrases,

but uses additional cues in the identification of discourse relations, as described in

chapter 6).

Identifying cue phrases by means of regular expressions yields a fairly

high degree of accuracy. Using the terminology of information retrieval, Marcu

measures recall (the proportion of things judged by humans to be cue phrases that

were also identified by his algorithm) at 80.8% for the 275 cue phrases identified

manually in a test corpus, and precision (the proportion of cue phrases identified by

his algorithm that a human actually judged to be cue phrases) at 89.5%. The value

of 89.5% for precision can be attributed to the incorrect identification of cue

phrases as having a discourse function when in fact they had a sentential function,

such as coordinating two noun phrases.
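
For readers unfamiliar with the two measures, the arithmetic can be sketched as follows. The counts used here are hypothetical and were chosen only to make the calculation easy to follow; they are not Marcu's figures.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall) as fractions between 0 and 1."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Hypothetical counts: 80 cue phrases correctly identified, 20 sentential uses
# wrongly identified as cue phrases, and 20 genuine cue phrases missed.
precision, recall = precision_recall(true_positives=80, false_positives=20, false_negatives=20)
```

Sentential uses that are wrongly treated as discourse uses count as false positives and lower precision; cue phrases that the procedure misses count as false negatives and lower recall.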

Although pattern-matching is generally computationally inexpensive, it

has two problems. The first problem concerns the compositionality of cue

phrases. In Encarta, there are some sequences of words that function in certain

contexts as cue phrases equivalent to single lexical items, but in other contexts as

phrases whose internal structure is important. In the following example, the

phrase as long as ought to be treated as having internal syntactic structure.

…their observed light would have been traveling practically

as long as the age of the universe. (Quasar)

In contrast, Figure 9 illustrates the same sequence of words as long as acting as a

subordinating cue phrase. The MEG system, in the course of performing a

syntactic analysis, correctly distinguishes the compositional analysis above from

the analysis in Figure 9, in which the cue phrase acts as a single lexical item (see

section 6.7.5). A pattern matching approach like the one described by Marcu

would have difficulty dealing with such cases.

1. The premier and cabinet remain in power

2. as long as they have the support of a majority in the provincial legislature.

Figure 9 Prince Edward Island
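
The difficulty can be made concrete with a small sketch. The pattern below is a hypothetical example of the kind of expression a shallow analysis might use; it is not taken from Marcu's system.

```python
import re

# A naive cue-phrase pattern of the kind a shallow analysis might use
# (hypothetical; not taken from Marcu's system).
AS_LONG_AS = re.compile(r"\bas long as\b", re.IGNORECASE)

compositional = ("their observed light would have been traveling practically "
                 "as long as the age of the universe.")
subordinating = ("The premier and cabinet remain in power as long as they have "
                 "the support of a majority in the provincial legislature.")

# Both sentences match, although only the second uses a cue phrase.
hits = [bool(AS_LONG_AS.search(s)) for s in (compositional, subordinating)]
```

Because the string is identical in both sentences, the pattern alone cannot separate the compositional use from the subordinating use; some analysis of the surrounding syntax is needed.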

A second problem with an approach based on pattern matching concerns

the identification of terminal nodes for a discourse analysis. Many of the terminal

nodes in Marcu’s (1997a) diagrams, for example, are not clauses, and would not

be treated as terminal nodes in a conventional RST analysis. For example, for the

first sentence of an excerpt from Scientific American, Marcu’s (1997a:101)

procedure selects the following three terminal nodes (the second node is offset in

the original sentence by em dashes):

1. With its distant orbit—(2)—and slim atmospheric blanket,

2. 50 percent farther from the sun than the Earth

3. Mars experiences frigid weather conditions.

(Marcu 1997a:101)

Marcu’s nodes (1) and (2) would certainly not be selected as clauses with

“independent functional integrity” (Mann and Thompson 1988:248). Marcu does

not discuss the fact that some of the nodes identified by his algorithm differ in

kind from those conventionally identified in RST. Clearly, however, Marcu’s

regular-expression based approach to identifying clause boundaries is not without

its problems.

Marcu (1997a:99) claims an average recall for the clause identification

procedure of 81.3% (i.e. 81.3% of the clauses identified by humans were also

correctly identified by his procedure), noting that it was particularly difficult to

distinguish sentential versus non-sentential uses of the conjunction and. In

Marcu’s data, the missed discourse uses of and tend to correspond to SEQUENCE

and JOINT relations. Missing these uses therefore tends not to materially affect the

analysis, but rather to lead to an RST analysis of a coarser granularity. The

precision of the clause identification procedure Marcu gives as 90.3%, although it

is not clear whether he counts the unusual terminal nodes mentioned above as

adding to or subtracting from the precision.

Concerning the evaluation of alternative RST analyses, Marcu (1997a)

claims that right-branching structures ought to be preferred because they reflect

basic organizational properties of text. In fact, the success of this metric reflects

the genre of Marcu’s three test excerpts. Two of the test excerpts are from

magazines, which are widely known to have a concatenative structure, as Marcu

(1997a:100) himself observes. The third text is a brief narrative, whose right-

branching structure is perhaps a reflection of iconic principles of organization

(Haiman 1980). In a narrative, the linear order of foreground clauses matches the

temporal sequence of events (Labov 1972, Polanyi 1982). Narratives can thus be

said to “unfold” in a right-branching manner.
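
To make the notion of a tree that "skews to the right" concrete, the following sketch scores unlabeled binary trees by comparing the depth of the right and left subtrees at every internal node. This is an illustrative score of my own choosing and should not be read as Marcu's metric; the names depth and right_skew are hypothetical.

```python
from typing import Tuple, Union

# A tree is a leaf (a string) or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def depth(tree: Tree) -> int:
    """Depth of a tree, with leaves at depth 0."""
    if isinstance(tree, str):
        return 0
    left, right = tree
    return 1 + max(depth(left), depth(right))


def right_skew(tree: Tree) -> int:
    """Sum, over all internal nodes, of depth(right) - depth(left)."""
    if isinstance(tree, str):
        return 0
    left, right = tree
    return right_skew(left) + right_skew(right) + depth(right) - depth(left)


right_branching = ("a", ("b", ("c", "d")))
left_branching = ((("a", "b"), "c"), "d")
balanced = (("a", "b"), ("c", "d"))
```

Under a score of this kind a purely right-branching tree receives the highest value, so a ranking based on it will reward exactly the concatenative and narrative organization found in the test excerpts.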

Finally, combinatorial explosion is still a significant problem in the more

complete algorithm described in Marcu (1997a). As the number of hypothesized

relations increases, the number of well-formed trees compatible with those relations

increases exponentially. The final output of Marcu’s algorithm is a list of trees

ranked according to his metric. In order to obtain these rankings, however,

Marcu’s algorithm might have to produce great numbers of dispreferred trees.

The production of these dispreferred trees is essentially wasted computation.

4.3 PISA

Sanders and van Wijk (1996), whose research focuses on the mental

representation and processing of texts, side with those who believe that coherence

relations in discourse ought to be cognitively motivated (Sanders 1992; Sanders et

al. 1992, 1993; Knott and Dale 1995). In order to construct representations of

texts, Sanders and van Wijk want a theory of discourse relations that is

sufficiently explicit to allow a representation to be constructed directly from a

text, without extensive reference to real world knowledge or to the intuitions of

the analyst. They note that:

“Rhetorical Structure Theory approaches text structure in a rather

static way. An analysis always starts from an inspection of the

entire text. The analysis does not proceed in a fixed order; it can be

applied bottom-up (from relations between clauses to the level of

text), top-down (from text to clause level) or following both routes

(Mann et al., 1992)… Rhetorical Structure Theory lacks a

procedure.” (Sanders and van Wijk 1996:94)

It is clearly not the case that readers or hearers suspend their analysis of a

text until they have inspected it in its entirety. Therefore, Sanders and van Wijk

want a procedure which is able to incrementally analyze a text, integrating

successive utterances into an emerging representation of discourse structure. They

develop an algorithm which they call PISA, Procedure for Incremental Structure

Analysis.

Within PISA, the text is first parsed and tagged. Analysis then proceeds

one text segment at a time, asking the following four questions:

“1. What segment features underlie its connection to the text?

2. To which other segment does the segment connect?

3. What is the hierarchical position of this connection?

4. What is the relational meaning of this connection?”

(Sanders and van Wijk 1996:99)

Sanders and van Wijk claim that their algorithm is able to proceed on the

basis of superficial linguistic evidence, and that it does not require explicit

reference to real-world knowledge. We would therefore expect a computational

implementation of their procedure to be a straightforward affair. Although

Sanders and van Wijk have not implemented this procedure themselves, they

mention (Sanders and van Wijk 1996:endnote 3) the work of Els van der Pool,

who has implemented PISA in Common Lisp. Unfortunately, no details of this

implementation are given.

Sanders et al. (1992) propose a set of cognitive primitives that can be

combined to yield a taxonomy of discourse relations. Sanders et al. distinguish

four primitives for use as parameters in defining discourse relations: whether the

relationship between propositions is causal or additive (conjunctive); whether the

source of coherence is semantic or pragmatic; whether the order of segments is

“basic” or “non-basic”; whether the relation is positive (e.g. in English typically

signaled by such conjunctions as and or because) or negative (e.g. in English

typically signaled by such conjunctions as but or although). By varying each

parameter, Sanders et al. derive a set of twelve discourse relations, including

CAUSE-CONSEQUENCE, CLAIM-ARGUMENT and CONCESSION. I join with Knott

and Dale (1995) in doubting that discourse relations can or even ought to be

parameterized in such a neat manner. However, the system described in this

dissertation is in philosophical agreement with two of the guiding principles of

PISA: that a discourse processing module can be based on superficial linguistic

evidence and that it does not need to make explicit reference to real-world

knowledge.
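
The parameterization proposed by Sanders et al. (1992), described above, can be sketched by enumerating the value combinations of the four primitives. The key and value names below are shorthand introduced for this sketch. Four binary parameters yield sixteen combinations, whereas Sanders et al. arrive at twelve relations, so not every combination corresponds to a distinct relation; the sketch makes no attempt to say which combinations those are.

```python
from itertools import product

# The four primitives as described above; the key and value names are
# shorthand introduced for this sketch.
PRIMITIVES = {
    "basic_operation": ("causal", "additive"),
    "source_of_coherence": ("semantic", "pragmatic"),
    "order": ("basic", "non-basic"),
    "polarity": ("positive", "negative"),
}

# Every assignment of one value to each primitive.
combinations = [dict(zip(PRIMITIVES, values)) for values in product(*PRIMITIVES.values())]
```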

4.4 Hobbs (1979)

Hobbs (1979) outlines a model for inferring coherence relations on the

basis of predicate calculus-like representations of the propositional content of

utterances. The length of chains of inference required to process a text correlates

inversely with the coherence of a text, i.e. the more work needed to understand a

text, the less coherent it is.

Hobbs represents superset relations, common world knowledge, and

lexical decomposition by means of axioms, representing “those things a speaker

of English generally knows and can expect his listener to know” (Hobbs

1979:71). Unfortunately, Hobbs does not implement his model, and does not give

principled ways in which such axioms could be acquired and maintained, nor

ways in which linguistic form might constrain the reasoning process. The efficacy

of his proposals is therefore difficult to evaluate.

Predicate calculus-like representations of the propositional content of a

text are insufficient for an analysis within the framework of RST. As noted in

section 3.4, an RST analyst proposes a judgment of the plausibility of suggesting

that the writer intended a certain effect (Mann and Thompson 1988:245). Of

course, the writer may choose to represent the same propositional content in

different forms in order to achieve a desired effect, as the constructed examples in

Figure 10 illustrate. Figure 10(a) reflects the decision to report an event, he ate

dinner, with a secondary event, After John went home, merely serving to provide

a temporal setting for the main proposition. This decision is reflected in an

asymmetric relation, the CIRCUMSTANCE relation. Figure 10(b) reflects a decision

to report these two events in a narrative sequence, reflected in the use of the

symmetric relation SEQUENCE.

(a)

1. After John went home

2. he ate dinner.

(b)

1. John went home

2. and then he ate dinner.

Figure 10 Alternative formulations of the same propositional content

Although the text of the two examples presented in Figure 10 would

receive the same predicate-calculus representation, we do not want to give these

two mini-texts the same RST analysis. Clearly the predicate-calculus

representation is insufficient, and must be augmented by a consideration of the

form in which the author chose to express this content.
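
The point can be restated as a small data-structure sketch. The predicate names and the MiniText class are invented for illustration and do not reproduce Hobbs's notation: the two mini-texts share one set of propositions and differ only in the relation that their surface form motivates.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class MiniText:
    propositions: FrozenSet[Tuple[str, ...]]  # content only, order-free
    rst_relation: str                         # relation motivated by the surface form


# Invented predicates standing in for a predicate-calculus representation.
content = frozenset({("go_home", "John"), ("eat_dinner", "John"),
                     ("before", "go_home", "eat_dinner")})

text_a = MiniText(content, "CIRCUMSTANCE")  # "After John went home he ate dinner."
text_b = MiniText(content, "SEQUENCE")      # "John went home and then he ate dinner."
```

Comparing the propositions field alone would treat the two texts as identical; only the additional field, which depends on linguistic form, tells them apart.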

4.5 The Linguistic Discourse Model (LDM)

Polanyi (1988) proposes a procedure for constructing representations of

discourse structure in a bottom-up, left-to-right fashion. Within the emerging

discourse representation, some nodes are open (i.e. possible attachment points,

available for expansion), while others are closed. A new discourse segment is

compared to each of the available attachment points, and a “semantic

congruence” is computed, to decide to which node the new discourse segment

ought to be attached.

Within the Linguistic Discourse Model there are macro-discourse

structures such as jokes, plans, lists and casual conversations. Each of these has

a formal description of its constituent structure and interpretations. These

macro structures organize the information into a speech event. Speech events in

turn are organized into Interactions. The Linguistic Discourse Model, with its

different types of organization at different levels is thus unlike RST, which has

a single kind of organization for all relations from the clause to the entire

discourse.

Integrating new discourse segments into the emerging discourse

representation involves decisions about whether to subordinate or coordinate

the current discourse segment to an attachment point. These decisions are based

on real world knowledge and inferential processes, the nature and extent of

which are not specified. This unconstrained appeal to real-world knowledge

and inferential processes is a serious impediment to a computational

implementation of the Linguistic Discourse Model. The LDM does not appear

to have been implemented within a computational system.

4.6 Conclusion

From the research described in the preceding sections, three strands

emerge. The first strand (Knott and Dale 1995; Kurohashi and Nagao 1994;

Marcu 1997a; Ono et al. 1994; Sanders 1992; Sanders et al. 1992, 1993;

Sanders and van Wijk 1996; Sumita et al. 1992) concerns the identification of

discourse relations by fairly superficial means—typically simple pattern

matching to identify cue phrases.

The second strand (Fukumoto and Tsujii 1994; Hobbs 1979),

diametrically opposed to the first strand, eschews any examination of the form

of a text in favor of more abstract representations, even augmenting linguistic

representations with axiomatic representations of world knowledge.

The third strand concerns programmatic descriptions of how

computational discourse analysis might proceed (Polanyi 1988; Wu and

Lytinen 1990). The broad strokes of the design of a computational discourse

analyzer are described, but no specific details are given for such essential steps

as the actual identification of discourse relations.

RASTA is most closely aligned with the first of these strands, since it

hypothesizes discourse relations on the basis of the form of a text, without

reference to additional modeling of world knowledge. Unlike previous work

within this strand, RASTA goes beyond identifying cue phrases by means of

simple pattern matching and considers other evidence from a linguistic analysis

of a text, such as tense, aspect, polarity and referential continuity of noun

phrases.

The cues identified by RASTA are discussed at length in chapter 6. The

identification of those cues is dependent on an in-depth linguistic analysis of a

text. RASTA relies on various components of the Microsoft English Grammar

for this linguistic analysis. The Microsoft English Grammar, a mature rule-

based parser with a broad coverage of the English language, and with

considerable resources available to a discourse processing module, is briefly

described in chapter 5.

5. The Microsoft English Grammar

5.1 Introduction

A brief discussion of the Microsoft English Grammar (MEG) is

necessary to provide sufficient context for understanding the role of RASTA

within a computational linguistic system and to demonstrate that there is

sufficient scaffolding to support the task of identifying discourse structure. The

work of the author concerns the discourse component (the focus of this

dissertation), and two facets of the “logical form”: anaphora resolution and

some aspects of the handling of ellipsis. All other aspects of the system

described here are the work of the other members of the Natural Language

Processing Group at Microsoft Research. More complete descriptions of

aspects of the MEG system can be found in Dolan et al. (1993), Pentheroudakis

and Vanderwende (1993), Richardson et al. (1993), Dolan (1995),

Vanderwende (1995a, 1995b) and Richardson (1997). The philosophically

similar PEG system is described in various papers in Jensen et al. (1993).

MEG is a research environment for computational linguistics that runs

under the Microsoft Windows 95 and Windows NT operating systems on

conventional personal computers. The MEG system itself is written in a

combination of the C programming language and a proprietary programming

language called G.³ Systems-level functions are written in C, while the

grammar and portions of the run-time system are written in G. The G

programming language is conceptually an amalgam of C (from which it

particularly derives its syntax and control structures) and Lisp (from which it

particularly derives the notion of the list as a basic data-type), and provides a

formalism to enable linguists to express linguistic rules.

MEG contains a broad-coverage domain-independent grammar of

English capable of processing sentences in a fraction of a second on a

conventional Pentium-based personal computer. Work is currently in progress

at Microsoft Research to develop systems comparable to MEG for the analysis

of Chinese, French, German, Japanese, Korean and Spanish.

The MEG system has a serial architecture, with components

corresponding to the lexicon, syntax, “logical form” and discourse. The

following sections describe these various components.

3 Exactly what G stands for is the subject of speculation, puns, and general confusion.

5.2 Lexicon

The lexical component tokenizes the input string, identifying word

boundaries, separating out clitics, identifying multi-word expressions such as in

order to, and analyzing factoids (minor phrasal constituents such as proper

names or numbers written out in full). The lexical component also contains a

finite-state processor that analyzes or generates derivational and inflectional

morphology (Pentheroudakis and Vanderwende 1993), attaching syntactic and

semantic features to words by a combination of rule-based analysis and

dictionary lookup.

5.3 Syntax

A syntactic analysis follows the lexicon component. The syntactic

analysis consists of two phases, referred to as sketch and portrait.

5.3.1 Sketch

During the sketch phase, constituents are assembled in a bottom-up

fashion to form a syntactic parse with a fairly conventional constituent

structure. The grammatical rules that perform the parsing use only the

information provided by the lexical component. During the sketch phase, rules

only have access to very local structure, and are frequently unable to resolve

syntactic dependencies, such as prepositional phrase attachment. The sketch

therefore defaults to a right attachment for ambiguous syntactic dependencies,

and notes other possible attachment points, rather than producing a multitude of

trees. The output of the sketch component is thus a “packed” parse that

indicates uncertain syntactic dependencies, leaving them for subsequent

components to resolve.

Figure 11 illustrates the sketch produced for the sentence I ate a fish

with a fork, a hoary chestnut of computational linguistics (Jensen and Binot

1987:252⁴). Heads of constituents are indicated with asterisks. In Figure 11, the

prepositional phrase with a fork has defaulted to a right attachment, subordinate

to the NP. The ?1 notation indicates another possible attachment point for the

prepositional phrase as a sister of the NP.

Figure 11 Syntactic sketch produced by MEG

4 The analysis produced by MEG differs from that of the philosophically similar PEG

system described by Jensen and Binot (1987) only in the label of the final period–in the PEG

system this character would have been labelled PUNC (i.e. punctuation) rather than CHAR (i.e.

character).

Figure 11 is only a visualization of a rich underlying data structure. This

data structure consists of a list of attributes and their values, where those

attributes may have complex structures as their values. Figure 12 illustrates

some of the attributes and their values in the data structure used to represent the

root node of the tree given in Figure 11.⁵ From this data structure, it is clear

that the example has been parsed as a declarative sentence (the Segtype

attribute has the value SENT and the Nodetype attribute has the value DECL),

with the subject I and the object a fish with a fork.⁶ Both the subject and object

are themselves complex data structures. The lexical component analyzed the

main verb ate as a morphological variant of the base form eat, and in the

process provided information like the morphological feature Past present in the

attribute called Bits. This sentence has as its head the verb ate, with material

that precedes the head (Prmods) and material that follows the head (Psmods).

As Jensen (1993:31) notes concerning the similar analyses produced by the PEG

system: “PEG’s trees, with their heads and modifiers, have the flavor of a

dependency grammar.”

5 Some attributes have been omitted simply for the sake of brevity.

6 This snapshot of the data structure was taken before MEG had resolved the

prepositional phrase attachment of with a fork. Subsequent processing determines that the

object is a fish.

Segtype   SENT
Nodetype  DECL
Nodename  DECL1
Ft-Lt     0-8
String    " I ate a fish with a fork ."
Rules     (Sent VPwNPl VPwNPr1 VERBtoVP)
Constits  (BEGIN1 VP1 CHAR1)
Lex       "ate"
Lemma     "eat"
Bits      Pers1 Sing Past Closed L9 X9 I0 T1 Loc_sr
Prob      0.25645
Prmods    NP1 "I"
Head      VERB1 "ate"
Psmods    NP2 "a fish with a fork"
          CHAR1 "."
Subject   NP1 "I"
FrstV     VERB1 "ate"
Object    NP2 "a fish with a fork"
Predicat  VP2 "ate a fish with a fork"
Topic     NP1 "I"

Figure 12 Underlying data structure for the sketch
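
For readers who find it easier to think in terms of a general-purpose programming language, a record of this kind can be approximated with nested dictionaries. The attribute names follow Figure 12, but only a subset is shown, and the nesting and the helper function are assumptions made for this sketch; MEG itself is written in C and G, and this is not its internal representation.

```python
# Attribute names follow Figure 12; the nesting and the helper below are
# assumptions made for this sketch, not MEG's internal representation.
np1 = {"Nodename": "NP1", "String": "I"}
np2 = {"Nodename": "NP2", "String": "a fish with a fork"}
verb1 = {"Nodename": "VERB1", "String": "ate", "Lemma": "eat"}

decl1 = {
    "Segtype": "SENT",
    "Nodetype": "DECL",
    "Nodename": "DECL1",
    "Lemma": "eat",
    "Bits": {"Pers1", "Sing", "Past"},
    "Prmods": [np1],
    "Head": verb1,
    "Psmods": [np2],
    "Subject": np1,   # values may themselves be records
    "Object": np2,
}


def is_past_declarative(record: dict) -> bool:
    """Check two of the properties discussed in the text."""
    return record["Nodetype"] == "DECL" and "Past" in record["Bits"]
```

Because the value of Subject is the same record that appears in Prmods, a later component can follow either attribute and reach the same constituent.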

For strings of words for which MEG cannot construct a plausible

syntactic analysis, it defaults to a fitted parse, i.e. the grammar assembles the

possible constituents into a simple branching structure. For the text of Encarta,

fitted parses are extremely uncommon.

5.3.2 Portrait

The portrait phase of the syntactic component of MEG refines the

syntactic analysis produced during the sketch phase by resolving ambiguous

syntactic dependencies using two strategies: syntactic reattachment and

semantic reattachment.

During syntactic reattachment, MEG performs a top-down traversal of

the syntactic tree, inspecting structural configurations to resolve syntactic

dependencies. As noted in section 5.3.1, the sketch phase operates in a bottom-

up fashion, and therefore has access to only limited context. Syntactic

reattachment, proceeding in a top-down fashion, has access to a much wider

context.

During semantic reattachment, MEG consults a semantic network,

MINDNET (section 5.7), to determine which of several possible syntactic

dependencies is semantically the most likely (Jensen and Binot 1987, Dolan et

al. 1993, Vanderwende 1995b). In the example I ate a fish with a fork,

semantic reattachment considers the various senses of the preposition with.

MEG looks in MINDNET to ascertain whether any relationships exist between eat

and fork or between fish and fork that are compatible with any of the senses of

the preposition. For this example, MEG finds a sense of fork compatible with

the instrument reading of the preposition with: a fork is a utensil used for

eating. MEG therefore resolves the syntactic dependency as illustrated in Figure

13, making the prepositional phrase a sister of the noun phrase. In the process

of resolving this attachment, MEG has implicitly performed word sense

disambiguation for three words: with, eat and fork. During the construction of

the logical form (section 5.4), MEG will use the sense information for the

preposition with to label a relationship as INSTR.

Figure 13 Syntactic portrait produced by MEG
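The lookup performed during semantic reattachment can be sketched roughly as follows. This is a toy illustration only, not MEG's implementation: the set TOY_NETWORK, the relation name used_for, the table PREPOSITION_SENSES and the function attach_pp are all hypothetical stand-ins, since the text does not specify how MINDNET is queried.

```python
# Toy stand-in for a semantic network: (word, relation, word) triples.
# The single entry encodes "a fork is a utensil used for eating".
TOY_NETWORK = {("fork", "used_for", "eat")}

# Hypothetical mapping from a sense of a preposition to the kind of
# network relation that would support that sense.
PREPOSITION_SENSES = {"with": {"INSTR": "used_for"}}


def attach_pp(verb, noun, prep, prep_object):
    """Return (attachment_site, sense) for a prepositional phrase.

    The network is searched for a relation between the prepositional
    object and the verb, then between the prepositional object and the
    noun. If nothing is found, the default attachment to the noun is kept.
    """
    for sense, relation in PREPOSITION_SENSES.get(prep, {}).items():
        for site, word in (("verb", verb), ("noun", noun)):
            if (prep_object, relation, word) in TOY_NETWORK:
                return site, sense
    return "noun", None


site, sense = attach_pp("eat", "fish", "with", "fork")
```

In this sketch the sense returned alongside the attachment site plays the role of the sense information that later labels the relationship as INSTR.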

5.4 Logical Form

The logical form component analyzes the syntactic portrait to produce a

graph structure. The logical form represents a normalized view of the predicate

structure of a text, with marked syntactic alternants being noted. For example,

active and passive structures receive the same structural representation in the

logical form, but the logical form derived from the portrait of a passive

sentence is annotated with a feature PASS. Figure 14 illustrates the logical form

derived from the syntactic portrait in Figure 13.

Figure 14 Logical form produced by MEG

Function words such as indefinite articles do not occur in the logical

form, but are instead represented by annotations on nodes. In Figure 14, the

indefinite article a in a fish is represented by the feature +Indef.

As Figure 14 shows, the logical form consists of nodes in labelled

relationships. The label INSTR results from the disambiguation of the

preposition with (section 5.3.2). A subset of the labels used in the logical form

is given in Figure 15. (The labels Dsub and Dobj represent a historical legacy

rather than a present-day commitment to earlier models of Transformational

Grammar. As the logical form matures, our intention is to replace labels like

Dsub with more semantic descriptions like Agent or Experiencer.)

In addition to the labels given in Figure 15, individual prepositions may

appear in the logical form as labels if those prepositions were not

disambiguated during the portrait phase of syntactic analysis. In Figure 16, for

example, the preposition in occurs as the label of a relation in the logical form.

Label Interpretation

Dsub “Deep subject”. (a) The subject of an active clause. (b) The agent of a passive or unaccusative construction.

Dobj “Deep object”. (a) The object of an active clause. (b) The subject of an unaccusative construction.

TmeAt A temporal relation. This same label is used for points in time as well as durations.

Instr Instrument.

Manr Manner.

LocAt Location.

Goal A spatial goal.

Figure 15 Labels used in the logical form

During the construction of the logical form, MEG resolves anaphoric

references and ellipsis, using heuristics that examine features assigned by the

lexicon component and by examining structural configurations in the syntactic

portrait. Figure 16 illustrates the resolution of the reflexive pronoun himself,

which MEG has identified as coreferential with the subject John.

Figure 16 Resolution of reflexive pronoun

In Figure 17, MEG has resolved the coreferential relationship between

John and the pronoun he. Of course, this example has an interpretation in which

he is not coreferential with John. MEG indicates this alternative possibility by

annotating the pronoun he with the feature +FindRef, an instruction to

subsequent stages of processing to consider possible coreferents outside of this

sentence.

Figure 17 Resolution of personal pronoun

As was the case with syntax trees (section 5.3.1), these illustrations of

logical forms are merely visualizations of a rich underlying data structure.

Figure 18 illustrates the data structure underlying the node drive1 in Figure 17.

The SynNode attribute contains a link back to the corresponding syntactic

constituent in the portrait tree. Since the logical form and the portrait tree are

linked in this fashion, RASTA is able to examine either the abstract logical form,

which has something of the flavor of a predicate calculus representation, or the

syntactic analysis.

Nodename    drive1
Rules       LF_PrpCnjs LF_TmeAt LF_Dsub1 SynToSem1
Constits    drive1 drive1 drive1 DECL1
Bits        L9 Mov
SynNode     " After John left work, he drove to the store."
Pred        drive
Dsub        John1
TmeAt       leave1
PrpCnjs     store1
PrpCnjLem   after

Figure 18 Data structure underlying the node drive1
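A node of this kind can be approximated with a small record type. The sketch below is an assumption-laden illustration rather than MEG's actual data structure: the class LFNode, its field names and the method relate are hypothetical, and only the attribute values (Pred, Dsub, TmeAt, PrpCnjs, SynNode) are taken from Figure 18.

```python
from dataclasses import dataclass, field


@dataclass
class LFNode:
    """Hypothetical logical form node with labelled outgoing relations."""
    name: str
    pred: str
    syn_node: str = ""                              # link back to the portrait constituent
    features: set = field(default_factory=set)      # e.g. {"Indef"} or {"FindRef"}
    relations: dict = field(default_factory=dict)   # label -> target node

    def relate(self, label, target):
        """Add a labelled relation and return self so calls can be chained."""
        self.relations[label] = target
        return self


john1 = LFNode("John1", "John")
leave1 = LFNode("leave1", "leave")
store1 = LFNode("store1", "store")
drive1 = LFNode("drive1", "drive",
                syn_node="After John left work, he drove to the store.")
drive1.relate("Dsub", john1).relate("TmeAt", leave1).relate("PrpCnjs", store1)
```

Keeping syn_node on every node is what would let a later component move between the abstract logical form and the syntactic analysis.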

5.5 Word Sense Disambiguation

The component that follows the logical form performs word sense

disambiguation. This component examines the syntactic analysis and consults

MINDNET to identify the most likely senses of words in the logical form. The

end result of this analysis is a logical form in which nodes are annotated with

sense information.

The word sense disambiguation component applies optionally. For the

discourse research conducted to date, this component has not been applied.

5.6 Discourse

Finally, the discourse module, RASTA, attempts to identify rhetorical

relations based on an examination of the syntactic portrait and the logical form.

Having identified those rhetorical relations, RASTA then constructs

representations of discourse structure. Since the operation of RASTA is the topic

of this dissertation, it will not be described further here.

5.7 MINDNET

MINDNET is a large semantic network, which has been constructed

automatically (Dolan et al. 1993; Richardson et al. 1993; Dolan 1995;

Vanderwende 1995a, 1995b; Richardson 1997) by extracting semantic

information from two dictionaries: Longman Dictionary of Contemporary

English (Proctor 1978) and The American Heritage Dictionary (Houghton

Mifflin 1992). Currently, MINDNET consists of 120,000 head words, and

approximately seven million labeled arcs connecting lexical senses of those

head words.

MINDNET is not strictly speaking a component within the serial

architecture described in section 5.1. Rather, MINDNET is a resource consulted

by two components of MEG, the portrait phase of the syntactic component and

word sense disambiguation. RASTA does not explicitly consult MINDNET during

discourse processing. See section 8.3.

5.8 Conclusion

MEG provides a broad-coverage grammar of English that yields an in-

depth analysis of the syntactic structure of a text and a representation of aspects

of its propositional structure and semantics. MEG is thus an excellent

framework within which to conduct research on methods for automatically

constructing representations of the structure of written discourse. The following

chapters describe in detail exactly how RASTA operates within this framework.

6. Cues to Discourse Structure

6.1 Introduction

RASTA successfully identifies rhetorical relations by considering

evidence from a linguistic analysis of the text. Section 6.7 lists the cues used to

identify each relation, with illustrative examples. Lest the reader become mired

in the detail of individual relations, several larger issues raised by this list of

cues are discussed first. The clausal status of terminal nodes—whether they are

in a hypotactic or paratactic relationship—is a useful criterion for making the

coarse determination of whether an asymmetric or symmetric relation is most

likely (section 6.2). Similarly, anaphora and deixis (section 6.3) play a crucial

role in making this determination.

Finally, the enumeration of the cues used to identify discourse relations

is preceded by discussion of the architecture of the identification process—the

manner in which heuristic scores are associated with cues (section 6.4) and the

distinction between the necessary criteria and cues (section 6.5)—and a

consideration (section 6.6) of the dependence of the work presented here on the

particular set of thirteen relations that RASTA employs.

In identifying cues to discourse structure, it is important to emphasize

that I am not proposing an exhaustive list of all the linguistic correlates of each

of the rhetorical relations. Rather, the cues given below, comprising a relatively

small set of approximately fifty members, are ones that have proven to be

sufficient for distinguishing among the thirteen rhetorical relations employed in

this study.

6.2 Correlations between clausal status and rhetorical status

Following the Hallidayan tradition, Matthiessen and Thompson (1988)

distinguish between clause embedding (covering restrictive relative clauses, as

well as subject and object complements) and clause combining. Within clause

combining, they distinguish parataxis (including coordination, apposition and

quoting) and hypotaxis (including non-restrictive relative clauses, reported

speech, and other subordination of one clause to another). Observing the strong

analogy between the rhetorical organization of texts and the grammatical

organization of clauses, Matthiessen and Thompson propose that hypotactic

clause combining represents the grammaticization of asymmetric RST relations,

with the matrix clause corresponding to the nucleus of the RST relation and the

subordinate clause corresponding to the satellite. This proposal motivates the

most important discriminator of rhetorical relations employed by RASTA.

Hypotactic clause combining⁷, identified by the syntactic analysis performed by

MEG (section 5.3), always suggests an asymmetric RST relation in which the

matrix clause is posited to be the nucleus and the subordinate clause to be the

satellite. In cases that do not involve hypotactic combinations, for example in

considering the relationship between the main clauses of two sentences, either a

symmetric or an asymmetric rhetorical relationship may hold.

In rare cases, this correlation between clausal status and rhetorical status

is the only clue to discourse structure that RASTA is able to identify, i.e. having

correctly identified a hypotactic relationship, RASTA is unable to identify a

specific corresponding asymmetric rhetorical relation. In such cases, RASTA

proposes an asymmetric relationship which it then labels with a question mark,

as illustrated in Figure 19. Clause2 is clearly a satellite of Clause1. However, it

is not quite clear exactly what RST relation holds. The PURPOSE or RESULT

relations are weak candidates, but certainly not inviting enough to warrant a

commitment to either.

7 Non-restrictive relative clauses, one kind of hypotactic clause combining, are not

treated as base level textual units by RASTA (section 3.2).

1. The legs have powerful claws,

2. adapting the animal for rapid digging into hard ground.

Figure 19 Echidna

6.3 The role of anaphora, deixis and referential continuity

Anaphoric references and deixis, two strongly cohesive devices

(Halliday and Hasan 1976), are frequently examined by RASTA during the

identification of discourse relations. Often, it is sufficient to identify the form

of a referring expression. Pronouns and demonstratives, for example, are

frequently positively correlated with the satellite of an asymmetric relation (see

for example criterion 4, Figure 27; criterion 4, Figure 52; cue H6, Figure 53;

inter alia), especially when they occur as syntactic subjects or as modifiers of

subjects, and negatively correlated with the co-nucleus of a symmetric relation

(see for example criterion 6 for the JOINT relation, Figure 62). In other cases,

the form of a referring expression is insufficient, and RASTA must consider

referential continuity. The MEG system resolves pronominal anaphoric

references during the construction of the logical form. Although MEG is

sometimes able to identify a single antecedent for a pronoun, it often proposes a

list of plausible antecedents. In determining subject continuity, the most

important kind of referential continuity for identifying discourse relations,

RASTA considers whether the subject of one clause is one of the possible

antecedents of the subject of another clause. For a pronominal subject, RASTA

examines the list of proposed antecedents. For a subject modified by a

possessive pronoun, RASTA considers the proposed antecedents of the

possessive pronoun. For lexical subjects, RASTA considers simply whether the

head of the subject noun phrase of one clause is identical to the head of the

subject noun phrase of the other clause.⁸
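The test for subject continuity described above might be sketched as follows. The class Subject and the function subject_continuity are hypothetical names introduced for illustration; the sketch assumes that anaphora resolution has already attached a list of proposed antecedents (as base forms) to pronominal and possessive subjects.

```python
from dataclasses import dataclass, field


@dataclass
class Subject:
    head: str                    # base form of the head of the subject noun phrase
    kind: str = "lexical"        # "pronoun", "possessive", or "lexical"
    antecedents: list = field(default_factory=list)  # proposed antecedents, if any


def subject_continuity(earlier: Subject, later: Subject) -> bool:
    """Return True if the two subjects plausibly refer to the same entity."""
    if later.kind in ("pronoun", "possessive"):
        # Pronominal subject, or subject modified by a possessive pronoun:
        # consult the list of proposed antecedents.
        return earlier.head in later.antecedents
    # Lexical subjects: compare the heads of the two noun phrases.
    return earlier.head == later.head
```

For example, a clause whose subject is it, with proposed antecedents aardwolf and hyena, would count as continuous with an earlier clause whose subject head is aardwolf.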

8 MEG does not currently perform anaphora resolution for lexical noun phrases,

although we intend to develop a module for performing resolution for such noun phrases. Any

anaphoric resolution of lexical noun phrases performed by MEG would of course then become

available to RASTA for consideration. In Encarta, identity of the heads of subjects is a

remarkably effective method for determining subject continuity for two lexical noun phrases.

6.4 Heuristic scores

Intuitively, some cues to discourse structure are more compelling than

others. To reflect this intuition, RASTA assigns numerical heuristic scores (with

values ranging from five to 35) to each cue. The heuristic score for a

hypothesized discourse relation is equal to the sum of the heuristic scores of

each of the pieces of evidence that lead to positing that relation. In practice,

positing a discourse relation relies on observing the convergence of a number

of cues (compare to Litman and Passonneau 1995, who identify segment

boundaries in spoken discourse by observing multiple simultaneous signals).

Although I do not wish to claim that the cues and heuristics that prove

useful for computing representations of discourse structure are also

psychologically valid, to date the heuristic scores which have proven useful in

computing discourse representations have been in accord with my intuitions as

a linguist. For example, explicit cue words and phrases provide strong support

for hypothesizing a particular discourse relation, but syntactic structural cues

provide weaker evidence.

The fact that the heuristic scores accord well with linguistic intuitions is

theoretically satisfying. However, the scores are primarily motivated by

considerations of computational efficiency during the construction of RST trees

(chapter 7). Having posited discourse relations between terminal nodes, RASTA

proceeds to construct RST trees in a bottom-up manner (section 7.5). RASTA

applies the hypothesized relations with the highest heuristic scores first, thereby

converging on the best analysis of a text first. Less plausible analyses can be

produced by allowing RASTA to apply the hypothesized relations with lower

heuristic scores.
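The summing and ordering of heuristic scores can be illustrated with a short sketch. The function rank_hypotheses and the tuple layout are assumptions made for this example, and the cue names and scores in the sample data are invented; only the principle (sum the scores of the contributing cues, then try the highest-scoring hypotheses first) comes from the text.

```python
from collections import defaultdict


def rank_hypotheses(evidence):
    """Sum cue scores per hypothesized relation and sort best-first.

    evidence is an iterable of
    (relation, nucleus, satellite, cue_name, score) tuples.
    """
    totals = defaultdict(int)
    for relation, nucleus, satellite, _cue_name, score in evidence:
        totals[(relation, nucleus, satellite)] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Invented sample evidence: two cues converge on CAUSE, one supports ELABORATION.
evidence = [
    ("CAUSE", "c1", "c2", "cue_x", 25),
    ("ELABORATION", "c1", "c2", "cue_z", 20),
    ("CAUSE", "c1", "c2", "cue_y", 10),
]
ranked = rank_hypotheses(evidence)
```

A tree-building procedure that consumes ranked from the front would meet the strongest hypotheses first, which is the efficiency argument made above.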

In section 7.5.5, I measure the effectiveness of the set of heuristic scores

given here and discuss the potential role of machine learning algorithms in

determining an optimal set of values.

6.5 Necessary criteria and cues

The process of hypothesizing discourse relations involves tension

between two competing concerns. On the one hand, it is desirable to postulate

all possible discourse relations that might hold between two terminal nodes, in

order to ensure that the preferred RST analysis is always in the set of analyses

produced by RASTA. On the other hand, considerations of computational

efficiency lead us to desire a small set of relations, since as the number of

possible discourse relations increases, the number of possible discourse trees to

be considered increases exponentially; the smaller the set of hypothesized

relations, the more quickly the algorithm for constructing RST trees (section

7.5) can test all possibilities.

RASTA resolves this tension by distinguishing two kinds of evidence.

The first kind of evidence is the set of necessary criteria—the conditions that

simply must be met before RASTA is even willing to “consider” a given

discourse relation. The second kind of evidence is the set of cues that are only

applied if the necessary criteria are satisfied. Coordination by means of the

conjunction and, for example, correlates with the SEQUENCE relation

(Figure 87, section 6.7.13), but only weakly. If we were to posit a SEQUENCE

relation every time we observed the conjunction and, we would posit a great

many spurious relations. However, RASTA only tests this cue if an extensive set

of necessary criteria for the SEQUENCE relation have been satisfied (Figure 82,

section 6.7.13).
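The two-stage test can be sketched as a gate followed by a sum. Everything named here is hypothetical: score_relation, the dictionary keys, and in particular the score of 5 attached to the conjunction and, which is chosen only for illustration and is not taken from the figures cited above.

```python
def score_relation(pair, necessary_criteria, cues):
    """Return None if the relation is not even considered, else the cue total.

    necessary_criteria: list of predicates over the clause pair.
    cues: list of (score, predicate) tuples, tested only if all criteria hold.
    """
    if not all(criterion(pair) for criterion in necessary_criteria):
        return None
    return sum(score for score, cue in cues if cue(pair))


# Placeholder criterion and cue for a SEQUENCE-like relation.
criteria = [lambda pair: pair["clause1_precedes_clause2"]]
cues = [(5, lambda pair: pair["conjunction"] == "and")]

considered = score_relation(
    {"clause1_precedes_clause2": True, "conjunction": "and"}, criteria, cues)
rejected = score_relation(
    {"clause1_precedes_clause2": False, "conjunction": "and"}, criteria, cues)
```

Returning None rather than zero keeps "not considered" distinct from "considered but unsupported", which mirrors the distinction between necessary criteria and cues.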

6.6 Dependence on a set of relations

As noted in section 3.6, the thirteen relations employed in this study are

a relatively uncontroversial subset of the rhetorical relations that have been

proposed in the literature on discourse relations. They are also relations that can

be identified in text with a high degree of reliability. Should future research

motivate a different set of relations, an approach similar to the one described

here could no doubt be used to identify them. For example, syntactic analyses

could be examined to identify cues that correlate with those relations. Cues

would correlate with the relations to varying degrees, so heuristic scoring

(section 6.4) would still be useful. Finally, the algorithm for constructing trees

on the basis of a set of hypothesized relations (chapter 7) is not dependent on

specific rhetorical relations, and so would not require any modification.

6.7 Cues to the relations

In this section I present the cues used in RASTA to identify rhetorical

relations in Encarta. RASTA examines all pairs of clauses from the total set of

RST terminal nodes. For each pair of clauses, RASTA tests the conditions in two

orders, i.e. for two clauses a and b, RASTA tests the cues with clause a as the

first clause (labeled Clause1 below) and clause b as the second clause (labeled

Clause2 below) and then with clause b as the first clause and clause a as the

second clause.
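Testing every pair of terminal nodes in both orders amounts to enumerating ordered pairs. A minimal sketch, in which the function name ordered_pairs is an assumption:

```python
from itertools import permutations


def ordered_pairs(terminal_nodes):
    """Return every (Clause1, Clause2) assignment of two distinct nodes."""
    return list(permutations(terminal_nodes, 2))


pairs = ordered_pairs(["a", "b", "c"])
```

For n terminal nodes this yields n(n-1) ordered pairs, so both ("a", "b") and ("b", "a") are tested.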

For the sake of brevity, in discussing some relations below I make

reference to the “Subordinate Clause Condition”. The Subordinate Clause

Condition is satisfied if the conditions given in Figure 20 are met.

1. Clause1 is a main clause.

2. If Clause2 is a subordinate clause then it must be subordinate to

Clause1.

Figure 20 The Subordinate Clause Condition
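The two conditions of Figure 20 can be expressed as a small predicate. The class Clause and its field subordinate_to are hypothetical; the sketch assumes that the syntactic analysis records, for each subordinate clause, the identifier of its matrix clause.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Clause:
    ident: str
    subordinate_to: Optional[str] = None   # ident of the matrix clause, None for a main clause


def subordinate_clause_condition(clause1: Clause, clause2: Clause) -> bool:
    # Condition 1: Clause1 is a main clause.
    if clause1.subordinate_to is not None:
        return False
    # Condition 2: if Clause2 is subordinate, it must be subordinate to Clause1.
    if clause2.subordinate_to is not None:
        return clause2.subordinate_to == clause1.ident
    return True
```

Note that, as stated in Figure 20, the condition is also satisfied when both clauses are main clauses.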

6.7.1 ASYMMETRICCONTRAST

The ASYMMETRICCONTRAST relation involves a contrast between two

constituents that are not of equivalent rhetorical status in the text. (Compare

this to the CONTRAST relation discussed in section 6.7.6, which consists of two

nuclei with equivalent rhetorical status in the text.) The following extended

excerpt from Encarta illustrates the different rhetorical statuses of the two

constituents of an ASYMMETRICCONTRAST relation. The Aardwolf article

discusses aardwolves at length, and in the final sentence (indicated in bold

type) contrasts their forefeet with those of hyenas. Mention of the anatomy of

hyenas in this context is however rhetorically subordinate to the main goal of

the passage, namely describing aardwolves.

“The aardwolf is classified as Proteles cristatus. It is usually

placed in the hyena family, Hyaenidae. Some experts, however,

place the aardwolf in a separate family, Protelidae, because of

certain anatomical differences between the aardwolf and the

hyena. For example, the aardwolf has five toes on its forefeet,

whereas the hyena has four.” (Aardwolf)

All instances of the ASYMMETRICCONTRAST relation observed in

Encarta involve the conjunction whereas. However, it is not the case that the

presence of the conjunction whereas in Encarta always correlates with this

relation, as the following excerpt illustrates:

“Only twisting is required to process filament fiber into yarn,

but staple fibers must be carded to combine the fibers into a

continuous ropelike form, combed to straighten the long fibers,

and drawn out into continuous strands, which are then twisted to

the desired degree. In general, the amount of twist given the

yarns determines various characteristics. Light twisting yields

soft-surfaced fabrics, whereas hard-twisted yarns produce

hard-surfaced fabrics, which provide resistance to

abrasion…” (Textiles)

Clearly, in this example, textiles, the topic under discussion, are not

being contrasted with something else. Rather, two different methods for

producing textiles are being contrasted. The final sentence of this excerpt could

be considered an ELABORATION of the sentence In general, the amount of twist

given the yarns determines various characteristics.

All instances of the ASYMMETRICCONTRAST relation observed in

Encarta to date share two characteristics. First, the subject of the nucleus of the

relation refers to the local discourse topic, i.e. the topic of the sub-section of the

Encarta article in which the nucleus occurs. Of course, an identification of a

local discourse topic presupposes an existing sophisticated discourse analysis.

One simple technique that has proven extremely effective in Encarta is to insist

that the head of the subject of the nucleus have the same base form as the head

of the title noun phrase of the section within which the nucleus occurs. In the

extended excerpt above, for example, the subject of the nucleus, aardwolf, has

the same base form as the title of the article within which the excerpt occurs.

Second, the satellite contains the conjunction whereas. Although the

conjunction whereas is present in all observed instances of the

ASYMMETRICCONTRAST relation, I do not include it in the necessary criteria.

Since it is likely that other cues will be discovered, as with the other relations

discussed below, I am reluctant to include the identification of any one

conjunction in the necessary criteria of a relation. Figure 21 gives the necessary

criteria for the ASYMMETRICCONTRAST relation.

1. Clause1 is syntactically subordinate to Clause2.

2. The head of the subject of Clause2 has the same base form as the

head of the title of the section within which Clause2 occurs.

Figure 21 Necessary criteria for the ASYMMETRICCONTRAST relation

If the necessary criteria given in Figure 21 are satisfied, RASTA tests cue

H20, given in Figure 22.

Cue                                                       Heuristic score   Cue name⁹

Clause1 contains the subordinating conjunction whereas.   30                H20

Figure 22 Cue to the ASYMMETRICCONTRAST relation
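Taken together, the necessary criteria of Figure 21 and cue H20 might be sketched as below. The dictionary keys (id, subordinate_to, subject_head, section_title_head, conjunction) are assumptions about how a clause could be represented; the score of 30 for H20 is the value given in Figure 22.

```python
def asymmetric_contrast_score(clause1, clause2):
    """Return the heuristic score for ASYMMETRICCONTRAST, or 0 if not posited."""
    # Criterion 1: Clause1 is syntactically subordinate to Clause2.
    if clause1.get("subordinate_to") != clause2["id"]:
        return 0
    # Criterion 2: the head of the subject of Clause2 has the same base form
    # as the head of the title of the section in which Clause2 occurs.
    if clause2["subject_head"] != clause2["section_title_head"]:
        return 0
    # Cue H20: Clause1 contains the subordinating conjunction whereas.
    return 30 if clause1.get("conjunction") == "whereas" else 0


aardwolf = asymmetric_contrast_score(
    {"id": "c3", "subordinate_to": "c2", "conjunction": "whereas"},
    {"id": "c2", "subject_head": "aardwolf", "section_title_head": "aardwolf"},
)
textiles = asymmetric_contrast_score(
    {"id": "c3", "subordinate_to": "c2", "conjunction": "whereas"},
    {"id": "c2", "subject_head": "twisting", "section_title_head": "textile"},
)
```

The Textiles case fails criterion 2, so the whereas cue is never tested for it.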

9 Note that the names of the cues are arbitrary. By convention, the names start with

the letter H (for heuristic) and are followed by a number and optionally a letter. The attentive

reader might notice that some possible names (e.g. H19) do not occur in this dissertation. This

represents historical accident (an old cue with that label has been deemed unnecessary in the

system) rather than oversight.

The satellite of an ASYMMETRICCONTRAST relation may follow the

nucleus, as illustrated in Figure 23, or it may precede the nucleus, as illustrated

in Figure 24 and Figure 25.

1. Some experts, however, place the aardwolf in a separate

family, Protelidae, because of certain anatomical differences

between the aardwolf and the hyena.

2. For example, the aardwolf has five toes on its forefeet,

3. whereas the hyena has four.

Figure 23 Aardwolf

1. [W]hereas Fénelon supported quietism,

2. Bossuet considered it heresy.

Figure 24 Bossuet, Jacques Bénigne

1. Whereas pure neon gives a red light,

2. Argon tubes require a lower voltage.

Figure 25 Argon

Clauses introduced by the conjunction whereas are invariably parsed as

syntactically subordinate by MEG. What then is to be done concerning the

Textiles excerpt above? The preferred analysis for the relevant section of this

excerpt is given in Figure 26.

1. In general, the amount of twist given the yarns determines

various characteristics.

2. Light twisting yields soft-surfaced fabrics,

3. whereas hard-twisted yarns produce hard-surfaced fabrics…

Figure 26 Textiles

Is a symmetric relation to be proposed between one clause and another clause,

where the latter is syntactically subordinate to the former, in violation of the

general principle outlined in section 6.2 that syntactically subordinate clauses

are always to be treated as rhetorically dependent on their matrix clauses? Since

the MEG system operates in a serial fashion, with morphology preceding

syntactic analysis and so on, it is occasionally necessary for a subsequent level

of processing that has access to additional resources to modify an earlier

analysis. As noted in chapter 5, for example, the first stage of syntactic analysis

defaults to a right-branching structure for prepositional phrase attachment, but

subsequent processing can revise the analysis based on reasoning with

MINDNET. Similarly, information available from discourse processing, namely

that the head of the subject is not the same as the head of the title of the section

in which the clause occurs, could be used to modify the syntactic analysis,

treating cases like the Textiles excerpt illustrated in Figure 26 as syntactically

coordinate constructions, analogous to the analysis of two clauses coordinated

by a CONTRAST conjunction such as but.

6.7.2 CAUSE

RASTA distinguishes two related asymmetric relations: CAUSE and

RESULT. CAUSE relations are those in which a cause is expressed in the satellite,

and the result in the nucleus (see for example the definition of VOLITIONAL

CAUSE in section 3.4), whereas RESULT relations are those in which the result is

expressed in the satellite, and the cause in the nucleus. As noted above (section

3.6), I have collapsed the two relations VOLITIONAL CAUSE and NON-

VOLITIONAL CAUSE defined by Mann and Thompson (1988) into a single

relation CAUSE, and the two relations VOLITIONAL RESULT and NON-

VOLITIONAL RESULT defined by Mann and Thompson (1988) into a single

relation RESULT (section 6.7.12).

RASTA uses different cues for the CAUSE relation depending on whether

the Subordinate Clause Condition is satisfied. These cues are presented

separately below.

Criteria for the CAUSE relation when the Subordinate Clause

Condition is satisfied

The Subordinate Clause Condition defines the necessary criteria that

must be satisfied before the cues to the CAUSE relation given in Figure 27 are

tested. As noted in section 6.7.1, even if only a single cue has been identified to

date, I am reluctant to include it as one of the necessary criteria. It is likely that

additional research will uncover other cues to the CAUSE relation that ought to

be applied when the Subordinate Clause Condition is satisfied.

Cue                                                Heuristic score   Cue name

Clause2 is dominated by or contains a cue phrase   25                H17
compatible with the CAUSE relation (because
due_to_the_fact_that since…)

Figure 27 Cues to the CAUSE relation


The satellite of a CAUSE relation may precede or follow the nucleus.

Figure 28 illustrates a simple case. A CAUSE relation is hypothesized between

clauses 1 and 2 on the basis of the cue phrase due to the fact that, identified by

cue H17.

1. [D]ue to the fact that inflows from the Amu Darya have also

drastically diminished in recent decades,

2. the volume of the Aral Sea dropped by about 76 percent

between 1960 and 1995.

Figure 28 Syrdarya

The definition of cue H17 says that “Clause2 is dominated by or

contains a cue phrase compatible with the CAUSE relation.” This circumlocution is

motivated by constraints on the well-formedness of RST trees. Figure 29

illustrates a case where two clauses are dominated by the conjunction because:

clause 2, …many women are aware, and clause 3, and concerned that…, are in

a JOINT relation. The women’s concerns are given as the cause of the increasing

popularity of natural childbirth. As described in section 7.5.1, RASTA will only

posit a CAUSE relation between the JOINT node and clause 1 if it can find

evidence of a CAUSE relation between Clause 1 and each of the co-nuclei of the

JOINT node. Thus RASTA must examine the syntactic and logical form analyses

produced by MEG to determine that the conjunction because has scope over

both clause 2 and clause 3. Clause 2 is syntactically dependent on clause 1; the

dominating conjunction because indicates the appropriate rhetorical label for

the relationship. Clause 3 is also syntactically dependent on clause 1, and is

dominated by the same conjunction, because.

1. Natural (unmedicated) childbirth, however, is becoming more

popular,

2. in part because many women are aware

3. and concerned that the anesthesia and medication given to

them is rapidly transported across the placenta to the unborn

baby.

Figure 29 Pregnancy and childbirth
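The requirement that evidence be found for each co-nucleus of the JOINT node can be stated compactly. In this sketch the function relation_with_joint and the representation of hypothesized relations as (relation, nucleus, satellite) triples are assumptions made for illustration.

```python
def relation_with_joint(relation, nucleus, co_nuclei, hypothesized):
    """True only if the relation is hypothesized between the nucleus
    and every co-nucleus of the JOINT node."""
    return bool(co_nuclei) and all(
        (relation, nucleus, member) in hypothesized for member in co_nuclei
    )


# Figure 29: because has scope over both clause 2 and clause 3.
hypothesized = {
    ("CAUSE", "clause1", "clause2"),
    ("CAUSE", "clause1", "clause3"),
}
joint_ok = relation_with_joint(
    "CAUSE", "clause1", ["clause2", "clause3"], hypothesized)
```

If the CAUSE hypothesis for clause 3 were missing, the check would fail, which is why cue H17 is worded to cover clauses dominated by the cue phrase as well as those containing it.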

Necessary criteria for the CAUSE relation when the Subordinate

Clause Condition is not satisfied

Figure 30 gives the necessary criteria for the CAUSE relation when the

Subordinate Clause Condition is not satisfied.

1. Clause1 precedes Clause2.

2. Clause1 is not syntactically subordinate to Clause2.

3. Clause2 is not syntactically subordinate to Clause1.

4. Either the syntactic subject or Dsub of Clause2 is a

demonstrative pronoun or is modified by a demonstrative; or

the Dsub of Clause1 and the Dsub of Clause2 are distinct

constituents (i.e. neither one is gapped) and the same lexical

item occurs as the head of the Dsub of Clause1 and the head

of the Dsub of Clause2 and that lexical item is not a pronoun;

or Clause1 and Clause2 are coordinated by a semi-colon.

Figure 30 Necessary criteria for the CAUSE relation when the Subordinate

Clause Condition is not satisfied

Criterion 1 is not to be interpreted as specifying that the relative

ordering of the nucleus and satellite is important. Rather, since RASTA

examines all possible pairs of terminal nodes (chapter 7), it is simpler to

formulate the conditions for the CAUSE relation to apply when Clause1 is the

nucleus and Clause2 is the satellite. For example, in the absence of this criterion

cue H29a in Figure 31 would have to say “Clause2 is passive and has the lexical

item cause as its head or Clause1 is passive and has the lexical item cause as its

head”. Clearly, this would be a rather verbose formulation. In cases where the

relative order of constituents really is important, this fact is stipulated (for

example, cue H22, a cue to the RESULT relation given in section 6.7.12 or

criterion 1 for the SEQUENCE relation, given in section 6.7.13). In all other

cases, the relative order of the nucleus and the satellite is assumed to be

unimportant.

Criteria 2 and 3 are merely intended to check that the clauses being

examined are not in a syntactic dependency relationship.

Criterion 4 is somewhat more complex. Any one of the disjuncts of this

criterion must be true before RASTA will consider additional cues to the CAUSE

relation when the Subordinate Clause Condition is not satisfied. The first part of

criterion 4, “Either the syntactic subject or Dsub of Clause2 is a demonstrative

pronoun or is modified by a demonstrative” is intended merely to identify the

strong correlation between deixis and rhetorical structure (section 6.3). As

noted in section 6.5, the distinction between necessary criteria and cues,

although it correlates well with linguistic judgments, is primarily motivated by

considerations of computational efficiency. For the CAUSE relation, the

correlation between deixis and rhetorical structure is so strong as to warrant

inclusion as one part of the necessary criteria for the identification of the

relation.

The second part of criterion 4, “or the Dsub of Clause1 and the Dsub of

Clause2 are distinct constituents (i.e. neither one is gapped) and the same lexical

item occurs as the head of the Dsub of Clause1 and the head of the Dsub of

Clause2 and that lexical item is not a pronoun”, is intended to identify patterns

of referential continuity that correlate highly with asymmetric relations. When

the Dsub nodes of two clauses in the logical form (the abstraction of the

subject of an active sentence or the agent of a passive construction, section 5.4)

contain the same pronoun, it is likely that there is referential continuity. RASTA

could verify this by examining the evidence of the anaphora resolution

component of MEG that resolves anaphoric references for pronouns (section

5.4). However, this proves unnecessary. An examination of Encarta texts

suggests that the pattern in which the two Dsubs contain the same pronoun is

negatively correlated with asymmetric relations.

The final part of criterion 4, “or Clause1 and Clause2 are coordinated by

a semi-colon” identifies the final possible necessary criterion for the CAUSE

relation. Of course, other relations are compatible with clauses coordinated by a

semi-colon (for example, the LIST relation). However, at least one of the three

possibilities given in criterion 4 must be satisfied before RASTA should consider

a CAUSE relation.
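The three alternatives of criterion 4 can be summarized in a short sketch. This is only an illustration under stated assumptions: the `Clause` record, its field names and the function `satisfies_criterion_4` are hypothetical stand-ins for information that RASTA reads from the MEG analysis, not RASTA's actual data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Clause:
    # Hypothetical fields; in RASTA this information comes from the MEG analysis.
    subject_is_demonstrative: bool = False  # subject or Dsub is, or is modified by, a demonstrative
    dsub_head: Optional[str] = None         # base form of the head of the Dsub
    dsub_is_pronoun: bool = False
    dsub_is_gapped: bool = False


def satisfies_criterion_4(clause1: Clause, clause2: Clause, semicolon: bool) -> bool:
    """Return True if at least one of the three alternatives of criterion 4 holds."""
    # First alternative: only the subject of Clause2 is inspected.
    demonstrative_subject = clause2.subject_is_demonstrative
    # Second alternative: distinct (ungapped) Dsubs with the same non-pronominal head.
    shared_lexical_dsub = (
        not clause1.dsub_is_gapped
        and not clause2.dsub_is_gapped
        and clause1.dsub_head is not None
        and clause1.dsub_head == clause2.dsub_head
        and not clause1.dsub_is_pronoun
        and not clause2.dsub_is_pronoun
    )
    # Third alternative: the two clauses are coordinated by a semi-colon.
    return demonstrative_subject or shared_lexical_dsub or semicolon
```

Under these assumptions, two clauses whose Dsubs are both headed by the noun isolation satisfy the criterion, whereas two clauses whose Dsubs are both the pronoun it do not, unless one of the other two alternatives holds.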

Cues for the CAUSE relation when the Subordinate Clause

Condition is not satisfied

If the necessary criteria for the CAUSE relation are satisfied, RASTA tests

the two cues given in Figure 31.

Cue | Heuristic score | Cue name

Clause2 is passive and has the lexical item cause as its head. | 10 | H29a

The head of Clause2 contains the phrase result from, with the verb possibly being inflected. | 10 | H29b

Figure 31 Cues to the CAUSE relation

Both cue H29a and cue H29b make reference to specific lexical items,

lexical items whose inherent semantics pertain to causality. In Figure 32, the

necessary criteria for the CAUSE relation are satisfied: neither clause is

syntactically dependent on the other (criteria 2 and 3) and the Dsub of both

clauses, isolation, is the same lexical item and is not a pronoun (criterion 4).

Cue H29b applies to identify the phrase “result from”.

1. In the third step, intrinsic isolation, some form of isolation

evolves among the populations.

2. Such isolation may result from preferences during courtship

or from genetic incompatibility.

Figure 32 Species and speciation

As the definition of cue H29b notes, the verb result may be inflected.

This is illustrated in Figure 33. Although it would not be too onerous to

enumerate the possible variants of the verb (result/s/ed/ing), the morphological

analysis performed by MEG during the course of parsing the input text makes it

unnecessary to enumerate all possibilities. MEG correctly identifies the base

form of resulted as result.

1. At the end of the 20th century, de facto segregation remained

a problem in many places in the United States.

2. De facto segregation has resulted from residential housing

patterns.

Figure 33 Segregation in the United States

Figure 34 illustrates another instance of the CAUSE relation. The

necessary criteria are satisfied (neither clause is syntactically dependent on the

other and the syntactic subject of Clause 2 is modified by the demonstrative

such), and cue H29a correctly identifies clause 2 as a passive clause whose head

is the verb cause.

1. The mechanical loss of fertile topsoil is one of the gravest

problems of agriculture.

2. Such loss is almost always caused by erosion resulting from

the action of water or wind.

Figure 34 Soil management
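The reliance of cues H29a and H29b on base forms rather than on enumerated surface variants can be illustrated with a minimal sketch. The lookup table `BASE_FORMS` is a toy stand-in for the morphological analysis that MEG performs, and the function names and the simplified test for a following from are assumptions made purely for illustration.

```python
# Toy stand-in for MEG's morphological analysis: inflected form -> base form.
BASE_FORMS = {
    "results": "result", "resulted": "result", "resulting": "result",
    "causes": "cause", "caused": "cause", "causing": "cause",
}


def base_form(word: str) -> str:
    """Return the base form of a word, or the word itself if it is not listed."""
    word = word.lower()
    return BASE_FORMS.get(word, word)


def cue_h29a(head_verb: str, is_passive: bool) -> int:
    """Clause2 is passive and has the lexical item 'cause' as its head."""
    return 10 if is_passive and base_form(head_verb) == "cause" else 0


def cue_h29b(head_verb: str, following_word: str) -> int:
    """The head of Clause2 contains 'result from', the verb possibly inflected.

    Simplification: 'from' is assumed to be the word right after the verb.
    """
    if base_form(head_verb) == "result" and following_word.lower() == "from":
        return 10
    return 0
```

With this sketch, `cue_h29b("resulted", "from")` and `cue_h29b("result", "from")` both yield the score of 10, because the comparison is made on the base form.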

6.7.3 CIRCUMSTANCE

The CIRCUMSTANCE relation is an asymmetric relation. Mann and

Thompson define a CIRCUMSTANCE relation as one in which the satellite “sets a

framework in the subject matter within which [the reader] is intended to

interpret the situation presented in [the nucleus]” (Mann and Thompson

1988:272). To date, the CIRCUMSTANCE relation has only been encountered in

Encarta within a single sentence. If the Subordinate Clause Condition is

satisfied, then the cues given in Figure 35 are tested.

Cue | Heuristic score | Cue name

Clause2 is dominated by or contains a circumstance conjunction (after before while…). | 20 | H12

Clause2 is a detached –ing participial clause and the head of Clause2 precedes the head of Clause1. | 5 | H13

Figure 35 Cues to the CIRCUMSTANCE relation

Figure 36 and Figure 37 illustrate the application of cue H12. In Figure

36, clause 1 is in a CIRCUMSTANCE relation to both clause 2 and clause 3.

Clause 1 can therefore be said to be in a CIRCUMSTANCE relation to the text

span covering clause 2 and clause 3.

1. After the revolt was crushed,

2. Solomon stripped Abiathar of priestly office

3. and banished him from Jerusalem.

Figure 36 Abiathar

In Figure 37, clause 2 follows the main clause. The Subordinate Clause

Condition is still satisfied, however, and cue H12 still detects the conjunction

after and identifies the CIRCUMSTANCE relation.

1. In April 1994 fighting erupted between Rwanda's two main

ethnic groups, the Hutu and Tutsi,

2. after the presidents of both Rwanda and Burundi were killed

in a suspicious plane crash.

Figure 37 Africa

Figure 38 illustrates the application of heuristic H13: clause 1 is a

detached preposed –ing participial clause.

1. Leaving port on October 19 and 20,

2. Villeneuve’s fleet was intercepted by Nelson’s fleet on the

morning of October 21.

Figure 38 Trafalgar, Battle of

When heuristic H13 applies in Encarta, it is almost always the case that

heuristic H12 also applies. Figure 39 illustrates a case in which cues H12 and

H13 both apply: clause 1 is both a preposed detached –ing participial clause

and contains the conjunction while.

1. While recuperating from sunstroke, which put an end to a

potential baseball career with the New York Yankees,

2. Acuff developed a serious interest in music.

Figure 39 Acuff, Roy

Cue H13 specifies that “Clause2 is a detached –ing participial clause and

the head of Clause2 precedes the head of Clause1.” Detached –ing participial

clauses in which the head follows the head of the matrix clause tend to be in a

RESULT relation in Encarta, rather than a CIRCUMSTANCE relation. Cue H22 in

section 6.7.12 identifies a RESULT relation in such cases.
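The interaction of cues H12 and H13, including the ordering requirement of H13, can be sketched as follows. The function and parameter names are hypothetical, head positions are represented simply as word indices, and the conjunction set contains only the three members named in Figure 35, whose list is open-ended. The sketch reports which cues fire and leaves aside how RASTA combines their scores.

```python
CIRCUMSTANCE_CONJUNCTIONS = {"after", "before", "while"}  # open-ended in the source


def circumstance_cues(conjunctions, detached_ing, clause2_head_index, clause1_head_index):
    """Return the CIRCUMSTANCE cues that fire, mapped to their heuristic scores.

    conjunctions: conjunctions that Clause2 contains or is dominated by.
    detached_ing: True if Clause2 is a detached -ing participial clause.
    The two indices give the word positions of the clause heads in the sentence.
    """
    fired = {}
    if CIRCUMSTANCE_CONJUNCTIONS & set(conjunctions):
        fired["H12"] = 20
    # H13 requires the participial clause to be preposed.
    if detached_ing and clause2_head_index < clause1_head_index:
        fired["H13"] = 5
    return fired
```

A preposed participial clause introduced by while, as in Figure 39, fires both cues, whereas a postposed detached –ing participial clause without a conjunction fires neither and is left to the cues for other relations.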

6.7.4 CONCESSION

The CONCESSION relation is an asymmetric relation. The writer

acknowledges an apparent incompatibility between the situations presented in

the nucleus and the satellite, but regards the situations as compatible. Mann and

Thompson (1988:254-5) note that “recognizing the compatibility between the

situations presented in [the nucleus] and [the satellite] increases [the reader’s]

positive regard for the situation presented in [the nucleus].”

To date, the CONCESSION relation has only been identified in Encarta

within single sentences. If the Subordinate Clause Condition is satisfied, then

the cue given in Figure 40 is tested.

Cue | Heuristic score | Cue name

Clause2 contains a CONCESSION conjunction (although even_though without…) | 10 | H11

Figure 40 Cue to the CONCESSION relation

Figure 41 illustrates the application of cue H11 to identify the

subordinate clause without ever having been exposed to the Italian High

Renaissance as the satellite in a CONCESSION relation.

1. Grünewald seems to have achieved a form of Mannerism

2. without ever having been exposed to the Italian High

Renaissance.

Figure 41 Renaissance Art and Literature

In Encarta, the satellite in a CONCESSION relation is frequently an

elliptical clause, as illustrated in Figure 42 and Figure 43. The meaning of the

CONCESSION relation is clearly illustrated in Figure 42. Ordinarily, being timid

(clause 1) and being prepared to fight (clause 2) might be considered by a

reader to be incompatible situations. The writer, aware perhaps of this apparent

incompatibility, nonetheless wishes to present as fact the aardvark’s

preparedness to fight (clause 2) in some circumstances, namely when it cannot

flee (clause 3).

1. Although timid,

2. the aardvark will fight

3. when it cannot flee.

Figure 42 Aardvark

1. Although organized into regional and central groups

2. each church governs itself independently.

Figure 43 Adventists

The satellite of a CONCESSION relation is frequently an elliptical clause,

but can also be a full clause, as illustrated in Figure 44.

1. Although the Quakers had long opposed slavery,

2. abolitionism as an organized force began in England in the

1780s…

Figure 44 Abolitionists

Encarta also contains a great many prepositional phrases introduced by

the preposition despite. These phrases are not clauses with “independent

functional integrity” (Mann and Thompson 1988:248), the essential criterion

for identifying terminal nodes for an RST tree (section 3.2). In contrast to an

approach based on superficial pattern-matching (section 4.2.6), RASTA is able

to examine a complex syntactic analysis in order to correctly identify these

prepositional phrases and so does not treat them as terminal nodes in an RST

tree. The following two excerpts illustrate the use of despite to introduce

prepositional phrases.

Mobile for a while became an important shipbuilding city

despite the shallowness of Mobile Bay. (Alabama)

About 75 percent of eligible voters participated, despite

threats from the outlawed Islamic Salvation Front to

kill anyone who voted. (Algeria)

The phrases introduced by despite in these two examples might be considered to entail

propositions of a kind that ought to be modeled in RST. The phrase despite the

shallowness of Mobile Bay entails the proposition “Mobile Bay is shallow” and

the phrase despite threats from the outlawed Islamic Salvation Front to kill

anyone who voted entails the proposition “the outlawed Islamic Salvation Front

threatened to kill anyone who voted.” Is it appropriate to insist that a terminal

node in an RST analysis ought to correspond to a clause with “independent

functional integrity” (Mann and Thompson 1988:248; see section 3.2)? As

noted in chapter 1, the development of RASTA has been guided by a

functionalist perspective on language: writers manipulate linguistic form to

achieve their communicative objectives. We ought therefore to attach

significance to the fact that the writers of these two excerpts chose to use a

phrasal formulation rather than to express the same propositional content by

means of clauses with independent functional integrity. Alternatively, we could

regard the fact that RASTA does not model such phrases as nothing more than a

restriction on the granularity of the analysis performed.

6.7.5 CONDITION

The CONDITION relation is an asymmetric relation. Mann and

Thompson (1988:276) note that in a CONDITION relation “realization of the

situation presented in [the nucleus] depends on realization of that presented in

[the satellite].” To date, the CONDITION relation has only been identified in

Encarta within a sentence. If the Subordinate Clause Condition is satisfied,

then the cue given in Figure 45 is tested.

Cue | Heuristic score | Cue name

Clause2 contains a condition conjunction (as_long_as if unless…) | 10 | H21

Figure 45 Cue to the CONDITION relation

Figure 46 illustrates the CONDITION relation identified by the presence

of the subordinating conjunctive phrase as long as.


1. The premier and cabinet remain in power

2. as long as they have the support of a majority in the provincial

legislature.

Figure 46 Prince Edward Island

As noted earlier (section 4.2.6), the expression as long as should

sometimes be treated as a phrase with internal syntactic structure and at other

times as a single unit, a phrase that functions as a subordinating conjunction. As

Figure 47 shows, the syntactic analysis performed by MEG correctly identifies

the expression as long as as a subordinating conjunction in this example.

Figure 47 Prince Edward Island: Syntactic analysis10

Figure 48 illustrates the CONDITION relation identified by the presence

of the subordinating conjunction if, by far the most common manner in which

the CONDITION relation is signaled in Encarta.

10 The attachment of PP5, in the provincial legislature, is ambiguous, but is not

relevant to the determination of the discourse relation between the two clauses.

1. If not promptly treated by surgical means,

2. ectopic pregnancy can result in massive internal bleeding and

possibly death.

Figure 48 Pregnancy and Childbirth
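Because the syntactic analysis delivers multi-word conjunctions such as as long as as single units, the conjunction-based cues for subordinate clauses amount to a lookup on the conjunction that the parser reports. The table below is a hedged sketch: it lists only the conjunctions named in Figures 35, 40 and 45 (each list is open-ended in the original), and the underscore-joined keys and the function name are assumptions made for illustration.

```python
# Conjunction -> (relation, cue name, heuristic score), from Figures 35, 40 and 45.
CONJUNCTION_CUES = {
    "after": ("CIRCUMSTANCE", "H12", 20),
    "before": ("CIRCUMSTANCE", "H12", 20),
    "while": ("CIRCUMSTANCE", "H12", 20),
    "although": ("CONCESSION", "H11", 10),
    "even_though": ("CONCESSION", "H11", 10),
    "without": ("CONCESSION", "H11", 10),
    "as_long_as": ("CONDITION", "H21", 10),
    "if": ("CONDITION", "H21", 10),
    "unless": ("CONDITION", "H21", 10),
}


def conjunction_cue(conjunction, subordinate_clause_condition):
    """Look up the cue signaled by a conjunction, as reported by the parser.

    Returns None if the Subordinate Clause Condition is not satisfied or the
    conjunction is not in the (deliberately incomplete) table.
    """
    if not subordinate_clause_condition:
        return None
    return CONJUNCTION_CUES.get(conjunction)
```

The point of the design is that the lookup is performed on the unit `as_long_as` only when the parser has analyzed the expression as a subordinating conjunction; the individual word long never reaches the table.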

6.7.6 CONTRAST

The CONTRAST relation is a symmetric relation. Mann and Thompson

(1988:278) note that the situations presented in the nuclei are “(a)

comprehended as the same in many respects (b) comprehended as differing in a

few respects and (c) compared with respect to one or more of these

differences.” RASTA distinguishes two different relations that satisfy these

criteria. In the ASYMMETRICCONTRAST relation (section 6.7.1), two

propositions are contrasted, but the propositions are not of equal rhetorical

status in the text. Rather, the proposition contained in the nucleus is more

central in realizing the writer’s goals. In the CONTRAST relation, the

propositions that are related are of equal rhetorical status.

RASTA employs two distinct sets of criteria in identifying the CONTRAST

relation. The first set involves a work-around (mentioned in section 6.7.1) to

compensate for the fact that during a sentence-by-sentence syntactic analysis of

a text, MEG is not able to decide whether a clause introduced by the

conjunction whereas ought to be analyzed as syntactically subordinate or

coordinate. Figure 49 gives the necessary criteria for this work-around.

1. Clause1 is syntactically subordinate to Clause2.

2. The head of the syntactic subject of Clause2 does not have the same

base form as the head of the title of the section within which Clause2

occurs.

Figure 49 Necessary criteria for the CONTRAST relation work-around

If the necessary criteria given in Figure 49 are satisfied, RASTA tests cue

H20, given in Figure 50. (As noted in section 6.7.1, it is not desirable to include

specific conjunctions in the set of necessary criteria for the identification of a

relation.) The necessary criteria given in Figure 49, combined with cue H20,

are intended to correctly identify a CONTRAST relation, as illustrated in Figure

51. Examples of the ASYMMETRICCONTRAST relation which these necessary

criteria together with cue H20 correctly do not select are given in section 6.7.1.

Cue | Heuristic score | Cue name

Clause1 contains the subordinating conjunction whereas. | 30 | H20

Figure 50 Cue to the CONTRAST relation work-around

1. In general, the amount of twist given the yarns determines

various characteristics.

2. Light twisting yields soft-surfaced fabrics,

3. whereas hard-twisted yarns produce hard-surfaced fabrics…

Figure 51 Textiles

Having dealt with the special case of a work-around for a misparse, let

us now consider the more general criteria and cues for the CONTRAST relation.

Figure 52 gives the necessary criteria for the CONTRAST relation.


2. Clause1 is not syntactically subordinate to Clause2.

3. Clause2 is not syntactically subordinate to Clause1.

4. The subject of Clause2 is not a demonstrative pronoun, nor is it

modified by a demonstrative.

Figure 52 Necessary criteria for the CONTRAST relation

The criteria given in Figure 52 do not hold any surprises. With the

exception of the cases dealt with by the work-around, we would not expect the

co-nuclei of the CONTRAST relation to be in a relationship involving syntactic

dependency (criteria 2 and 3), in line with the observations about the

correlation between clausal status and rhetorical status (section 6.2). Similarly,

criterion 4, “The subject of Clause2 is not a demonstrative pronoun, nor is it

modified by a demonstrative” is intended to exclude cases in which an

asymmetric relation is more plausible, given the interplay of deixis and

rhetorical structure discussed in section 6.3. (More general considerations of

anaphoric reference and referential continuity have not proven necessary to

distinguish the CONTRAST relation from other relations.) If the necessary

criteria given in Figure 52 for the CONTRAST relation are satisfied, then the

cues given in Figure 53 are tested.

Cue | Heuristic score | Cue name

Clause2 is dominated by or contains a CONTRAST conjunction (but however…). If Clause2 is in a coordinate structure, then it must be coordinated with Clause1. | 25 | H4

Cue H4 is satisfied and the head verbs of Clause1 and Clause2 have the same base form. | 10 | H39

Clause1 and Clause2 differ in polarity (i.e. one clause is positive and the other negative). | 5 | H5

The syntactic subject of Clause1 is the pronoun some or has the modifier some and the subject of Clause2 is the pronoun other or has the modifier other. | 30 | H6

Figure 53 Cues for the CONTRAST relation

Figure 54 illustrates a simple example of the CONTRAST relation. RASTA

correctly posited a CONTRAST relation on the basis of cues H4 (identifying the

conjunction but) and H5 (identifying the different polarity of the two clauses).

1. An abbess has administrative jurisdiction equivalent to that of

the abbot of a monastery

2. but does not exercise the rights and duties of the priesthood.

Figure 54 Abbess
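The cues in Figure 53 can be sketched as a function that reports which cues fire for a pair of clauses. The `Clause` record and its fields are hypothetical, the conjunction set contains only the two members named in the figure, and the coordination constraint on cue H4 is omitted for brevity. How the scores of the fired cues are combined is left outside the sketch.

```python
from dataclasses import dataclass, field

CONTRAST_CONJUNCTIONS = {"but", "however"}  # open-ended in the source


@dataclass
class Clause:
    head_verb: str                                   # base form of the head verb
    negative: bool = False                           # polarity of the clause
    conjunctions: set = field(default_factory=set)   # contained in or dominating the clause
    subject_marker: str = ""                         # "some" or "other", if present


def contrast_cues(clause1: Clause, clause2: Clause) -> dict:
    """Return the CONTRAST cues that fire, mapped to their heuristic scores."""
    fired = {}
    if CONTRAST_CONJUNCTIONS & clause2.conjunctions:
        fired["H4"] = 25
        # H39 presupposes H4.
        if clause1.head_verb == clause2.head_verb:
            fired["H39"] = 10
    if clause1.negative != clause2.negative:
        fired["H5"] = 5
    if clause1.subject_marker == "some" and clause2.subject_marker == "other":
        fired["H6"] = 30
    return fired
```

For the two clauses of Figure 54, with head verbs have and exercise, the conjunction but and differing polarity, the sketch reports cues H4 and H5, matching the analysis described above.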

The circumlocution “is dominated by or contains a CONTRAST

conjunction” in the definition of cue H4 is motivated by constraints on the

well-formedness of RST trees. Figure 55 illustrates a case where two clauses are

dominated by the conjunction but: clause 2, she became involved with a dance

group, and clause 3, and, after rapid progress, won a scholarship with the New

Dance Group, are in a SEQUENCE relation. This turn of events is contrasted

with Pearl Primus’s intention in clause 1 to become a doctor. As described in

section 7.5.1, RASTA will only posit a CONTRAST relation between the

SEQUENCE node and clause 1 if it can find evidence of a CONTRAST relation

between Clause 1 and each of the co-nuclei of the SEQUENCE node.

1. Primus planned to become a doctor,

2. but she became involved with a dance group

3. and, after rapid progress, won a scholarship with the New

Dance Group.

Figure 55 Primus, Pearl
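The requirement that evidence be found for each co-nucleus explains why cue H4 refers to domination as well as containment. The following sketch makes the point under simplifying assumptions: each co-nucleus is represented only by the conjunctions it contains and the conjunctions that dominate it, and the function names are hypothetical. The actual procedure is the one described in section 7.5.1.

```python
CONTRAST_CONJUNCTIONS = {"but", "however"}  # open-ended in the source


def has_contrast_conjunction(own, dominating):
    """H4-style test: the clause contains, or is dominated by, a CONTRAST conjunction."""
    return bool(CONTRAST_CONJUNCTIONS & (set(own) | set(dominating)))


def contrast_with_multinuclear_node(co_nuclei):
    """Return True only if every co-nucleus offers evidence of CONTRAST.

    co_nuclei: list of (own_conjunctions, dominating_conjunctions) pairs,
    one pair per co-nucleus of the multinuclear node.
    """
    return bool(co_nuclei) and all(
        has_contrast_conjunction(own, dominating) for own, dominating in co_nuclei
    )
```

In Figure 55, clause 3 contains only the conjunction and, but it is dominated by but. If H4 tested containment alone, clause 3 would supply no evidence and the CONTRAST relation with the SEQUENCE node could not be posited.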

Figure 56 gives RASTA’s analysis of another text involving the

CONTRAST relation. For this text, RASTA posited a contrast relation between

clauses 2 and 3 on the basis of cues H4 (the conjunction however is identified)

and H39 (the base form of the main verb in each clause is place). It is

interesting as a human analyst to note several pieces of evidence in this text

which further lend support to this analysis but which were not examined by

RASTA. For example, the adverb usually in line two primes a comparison

between a usual analysis and an unusual one. Similarly, the repetition of the

word family and the occurrence of the affix -idae in the Linnaean taxonomic

labels Hyaenidae and Protelidae confirm that the CONTRAST relation signaled

by the conjunction however holds between clause 2 and clause 3, and not

between clause 1 and clause 3. Although these additional pieces of evidence are

compelling, they were not necessary for RASTA to correctly identify the relation

here. Most encouraging is the fact that we do not have to encode an

understanding of the Linnaean taxonomic system of classification in order to

process such texts.

1. The aardwolf is classified as Proteles cristatus.

2. It is usually placed in the hyena family, Hyaenidae.




4. For example, the aardwolf has five toes on its forefeet,

5. whereas the hyena has four.

Figure 56 Aardwolf

The some…other construction identified by cue H6 is extremely

common in Encarta. Figure 57 illustrates the analysis of one example of this


construction.

1. Abrasives are usually very hard substances.

2. Some are used in the form of fine powders;

3. others break in such a way as to form sharp cutting edges

4. and are used in larger pieces.

Figure 57 Abrasives

6.7.7 ELABORATION

In an ELABORATION relation, an asymmetric relation, the satellite

provides additional information for the situation presented in the nucleus

(Mann and Thompson 1988:273). The ELABORATION relation is pervasive in

Encarta. A particularly common discourse structure in Encarta occurs in the

first paragraph of an article. In this common structure, the first sentence of the

first paragraph defines the title noun phrase. The remainder of the first

paragraph consists of one or more text spans in an ELABORATION relation to the

first sentence. This pattern is illustrated in Figure 57 (section 6.7.6).

Interestingly, the ELABORATION relation has only been encountered

between main clauses. No instances of an ELABORATION relation have been

encountered between a main clause and a clause that is syntactically

dependent on that main clause. Figure 58 gives the necessary criteria for the

ELABORATION relation.


2. Clause1 is not subordinate to Clause2.

3. Clause2 is not subordinate to Clause1.

Figure 58 Necessary criteria for the ELABORATION relation

If the necessary criteria given in Figure 58 are satisfied, then the cues

given in Figure 59 are tested.


Cue | Heuristic score | Cue name

Clause1 is the main clause of a sentence (sentence_i) and Clause2 is the main clause of a sentence (sentence_j) and sentence_i immediately precedes sentence_j and (a) Clause2 contains an elaboration conjunction (also for_example) or (b) Clause2 is in a coordinate structure whose parent contains an elaboration conjunction. | 35 | H24

Cue H24 applies and Clause1 is the main clause of the first sentence in the excerpt. | 15 | H26

Clause2 contains a predicate nominal whose head is in the set {portion component member type kind example instance} or Clause2 contains a predicate whose head verb is in the set {include consist}. | 35 | H41

Clause1 and Clause2 are not coordinated and (a) Clause1 and Clause2 exhibit subject continuity or (b) Clause1 is passive and the head of the Dobj of Clause1 and the head of the Dobj of Clause2 have the same base form or (c) Clause2 contains an elaboration conjunction. | 10 | H25

Cue H25 applies and Clause2 contains a habitual adverb (sometimes usually…) | 17 | H25a

Cue H25 applies and the syntactic subject of Clause2 is the pronoun some or contains the modifier some. | 10 | H38

Figure 59 Cues to the ELABORATION relation

Figure 60 illustrates the application of several cues to the ELABORATION

relation. An ELABORATION relation is posited between Clause1 and Clause2 on

the basis of subject continuity (cue H25) and the habitual adverb usually (cue

H25a). In considering the relationship between Clause1 and Clause3, RASTA

observed that (a) Clause1 is passive and that the base form of the Dobj of

Clause1 is the same as the base form of the Dobj of Clause3 (cue H25) and (b)

the subject of Clause3 has the modifier some (cue H38)11. Finally, RASTA

identifies an ELABORATION relation between Clause3 and Clause4 because

Clause4 contains the connective for example and immediately follows Clause3

(cue H24).

11 This cue is really intended to identify instances in which the noun phrase

containing the word some denotes the same class of entities as the antecedent of that noun

phrase. The application of cue H38 in this case should therefore be considered serendipitous.



3. Some experts, however, place the aardwolf in a separate


between the aardwolf and the hyena

4. For example, the aardwolf has five toes on its forefeet…

Figure 60 Aardwolf

Figure 61 illustrates the application of cue H41. The ELABORATION

relation between Clause1 and Clause2 is identified by means of the head verb

include in Clause2. Similarly, the predicate nominal a portion of an

underground stem in Clause3 acts as a cue to the ELABORATION relation

between Clause2 and Clause3.

1. A stem is a portion of a plant.

2. Subterranean stems include the rhizomes of the iris and the

runners of the strawberry;

3. the potato is a portion of an underground stem.

Figure 61 Stem
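Several of the ELABORATION cues are defined in terms of other cues: H25a and H38 can only fire when H25 applies, whereas H41 is independent. The sketch below illustrates this dependency structure. It is a simplification under stated assumptions: the outcome of H25 is passed in as a flag rather than computed, cues H24 and H26 are omitted, the set of habitual adverbs contains only the two named in Figure 59, and all names are hypothetical.

```python
H41_NOMINAL_HEADS = {"portion", "component", "member", "type", "kind", "example", "instance"}
H41_VERBS = {"include", "consist"}
HABITUAL_ADVERBS = {"sometimes", "usually"}  # open-ended in the source


def elaboration_cues(h25_applies, adverbs=(), subject_has_some=False,
                     predicate_nominal_head=None, head_verb=None):
    """Return a subset of the ELABORATION cues that fire for Clause2.

    h25_applies: whether cue H25 has been found to apply to the clause pair.
    The remaining arguments describe Clause2 (base forms are assumed).
    """
    fired = {}
    # H41 does not depend on any other cue.
    if predicate_nominal_head in H41_NOMINAL_HEADS or head_verb in H41_VERBS:
        fired["H41"] = 35
    # H25a and H38 are tested only when H25 applies.
    if h25_applies:
        fired["H25"] = 10
        if HABITUAL_ADVERBS & set(adverbs):
            fired["H25a"] = 17
        if subject_has_some:
            fired["H38"] = 10
    return fired
```

For a clause containing usually where H25 applies, the sketch reports H25 and H25a; for a clause whose head verb is include, it reports H41 regardless of H25.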

6.7.8 JOINT

The JOINT relation is a symmetric relation. It is posited when a

symmetric relation seems to hold between two clauses and no other symmetric

relations seem plausible. If all the conditions given in Figure 62 are satisfied,

then RASTA posits the JOINT relation, giving it a heuristic score of 5. Since the

JOINT relation is a default relation, there are no additional cues that are tested if

the necessary criteria apply.

1. RASTA cannot identify any other symmetric relation between

Clause1 and Clause2.


3. Clause1 is not subordinate to Clause2.

4. Clause2 is not subordinate to Clause1.

5. Clause1 and Clause2 are the same kind of constituent

(declarative, interrogative, etc).

6. The subject of Clause2 is not a demonstrative pronoun, nor is

it modified by a demonstrative.

7. If Clause1 has a pronominal subject then Clause2 must also

have a pronominal subject.

Figure 62 Necessary criteria for the JOINT relation

To satisfy criterion 1, “RASTA cannot identify any other symmetric

relation between Clause1 and Clause2”, RASTA checks for the JOINT relation only

after it has tested the other symmetric relations (CONTRAST, LIST and

SEQUENCE).
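The ordering that satisfies criterion 1 can be sketched as follows. The testers for the other symmetric relations and the check of the remaining JOINT criteria are passed in as functions, since their details are given elsewhere. These parameter names, and the convention that a score of 0 means that a relation was not identified, are assumptions made for illustration.

```python
JOINT_SCORE = 5  # heuristic score given to the default JOINT relation


def symmetric_hypotheses(clause1, clause2, testers, joint_criteria):
    """Test CONTRAST, LIST and SEQUENCE first; fall back to JOINT only if none is found.

    testers: dict mapping a relation name to a function (clause1, clause2) -> score,
             where a score of 0 means the relation was not identified.
    joint_criteria: function (clause1, clause2) -> bool for the remaining JOINT criteria.
    """
    found = {}
    for relation in ("CONTRAST", "LIST", "SEQUENCE"):
        score = testers[relation](clause1, clause2)
        if score > 0:
            found[relation] = score
    # Criterion 1: JOINT is considered only when no other symmetric relation was found.
    if not found and joint_criteria(clause1, clause2):
        found["JOINT"] = JOINT_SCORE
    return found
```

Because JOINT is tested last, no separate check is needed to confirm that the clauses lack evidence of the other symmetric relations: any such evidence would already have produced a hypothesis.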

Criteria 3 and 4 use the correlations between syntactic structure and

rhetorical structure (section 6.2) to determine whether a symmetric relation is

possible. Similarly, criteria 6 and 7 use patterns of anaphora, deixis and


referential continuity (section 6.3) to further ensure that a symmetric relation is

possible.

If the two clauses being considered are not of the same clause type

(Criterion 5), then it is likely that a more contentful relation exists than the

JOINT relation. Figure 63 illustrates a case in which a JOINT relation ought

not to be hypothesized. The alternation of a declarative clause (clause 1) and an

interrogative clause (clause 2) here suggests an asymmetric relation. Clause 2

specifies one of the many important questions that arose concerning religion.

Since it specifies additional detail, clause 2 is most likely in an ELABORATION

relation to clause 1.

1. Many other important questions about the nature of religion

were addressed during this period:

2. Can religion be divided into so-called primitive and higher

types?…

Figure 63 Religion

Figure 29 in section 6.7.2, reproduced below as Figure 64, illustrates the

JOINT relation between clauses 2 and 3. No other relation seems plausible

between clauses 2 and 3, so criterion 1 for the JOINT relation is satisfied.

Neither clause is subordinate to the other (criteria 3 and 4); both clauses are of

the same type, namely declarative (criterion 5); the subject of clause 3 is not a

demonstrative, nor is it modified by a demonstrative (criterion 6); clause 2 does

not have a pronominal subject, so criterion 7 is satisfied.

1. Natural (unmedicated) childbirth, however, is becoming more

popular,

2. in part because many women are aware

3. and concerned that the anesthesia and medication given to

them is rapidly transported across the placenta to the unborn

baby.

Figure 64 Pregnancy and childbirth

6.7.9 LIST

The LIST relation is a symmetric relation. Often, clauses in a LIST

relation are also amenable to a SEQUENCE interpretation. In general, a

SEQUENCE relation ought to be preferred over a LIST relation if there is any

evidence that the author might have preferred a SEQUENCE interpretation. For

example, explicit indication of temporal sequencing favors a SEQUENCE

relation. Figure 65 gives the necessary criteria for the LIST relation.




4. The subject of Clause2 is not a demonstrative pronoun, nor is

it modified by a demonstrative.

5. Clause1 and Clause2 agree in polarity.

6. There is no alternation in which the syntactic subject of Clause1

is the pronoun some or has the modifier some and the subject

of Clause2 is the pronoun other or has the modifier other.

7. If the syntactic subject of Clause2 is a pronoun, then the

syntactic subject of Clause1 must be the same pronoun.

8. Clause2 is not dominated by and does not contain

conjunctions compatible with the CONTRAST, ASYMMETRICCONTRAST

or ELABORATION relations.

Figure 65 Necessary criteria for the LIST relation

Criteria 2 and 3 use the correlations between syntactic structure and

rhetorical structure (section 6.2) to determine whether a symmetric relation is

likely. Similarly, criteria 6 and 7 use patterns of anaphora, deixis and

referential continuity (section 6.3) to further ensure that a symmetric relation is

possible. Criterion 5, “Clause1 and Clause2 agree in polarity”, is intended to

distinguish the LIST relation from the CONTRAST relation (section 6.7.6).

Similarly, criterion 8, “Clause2 is not dominated by and does not contain

conjunctions compatible with the CONTRAST, ASYMMETRICCONTRAST or

ELABORATION relations” is intended to distinguish the LIST relation from other

relations.

If the necessary criteria given in Figure 65 are satisfied, the additional

cues given in Figure 66 are tested.

Cue | Heuristic score | Cue name

Clause1 and Clause2 both contain enumeration conjunctions (first second third…) | 15 | H7

Clause1 is passive or contains an attributive predicate and Clause2 is passive or contains an attributive predicate. | 10 | H8

Clause2 is in a coordinate construction and the coordinating conjunction is a LIST conjunction (also and…) | 10 | H9

Clause1 and Clause2 both contain a Dobj and the heads of those Dobjs have the same base form. | 5 | H10

Figure 66 Cues to the LIST relation

Figure 67 illustrates the application of cues H7 and H8 in the correct

identification of a LIST relationship. Cue H7 identifies the enumeration

conjunctions first and second. Cue H8 identifies the passive voice of both

clauses.


1. Psychotherapy differs in two ways from the informal help one

person gives another.

2. First, it is conducted by a psychotherapist who is specially

trained and licensed or otherwise culturally sanctioned.

3. Second, psychotherapy is guided by theories about the

sources of distress and the methods needed to alleviate it.

Figure 67 Psychotherapy

Cue H8 identifies sequences of clauses that are either passive or have

attributive predicates. Attributive predicates are identified using relatively

simple criteria: the main verb is be or the main verb is have with a direct object

that denotes an attribute, e.g. a body part. Figure 68 illustrates a sequence of

clauses typical of the description of animals in Encarta. The first three clauses

in Figure 68 contain attributive predicates: be plus a length and have plus body

parts. Clause 5 is parsed by MEG as a passive sentence, and identified by

RASTA as another nucleus in a LIST relation with clauses 1 through 3. Figure 68

thus demonstrates that RASTA is able to correctly construct discourse

representations, even given an occasional misparse by MEG.12 Were clause 5

parsed as containing a main verb be followed by a past participle, cue H8

would still succeed in identifying a LIST relation.

1. The short-nosed echidna found in Australia is about 35 to 53

cm long …,

2. and has a broad body mounted upon short, strong legs.

3. The legs have powerful claws,

4. adapting the animal for rapid digging into hard ground.

5. The back is covered with stiff spines…

Figure 68 Echidna

12 The goal of ongoing development of MEG is of course to eliminate bad parses like

these.
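The test for attributive predicates that underlies cue H8 can be sketched in a few lines. The set `ATTRIBUTE_NOUNS` is a hypothetical stand-in for whatever lexical information identifies a direct object as denoting an attribute, and the dictionary keys used to describe a clause are likewise assumptions.

```python
ATTRIBUTE_NOUNS = {"body", "claw", "leg"}  # hypothetical list of attribute-denoting nouns


def is_attributive(main_verb, dobj_head=None):
    """The main verb is 'be', or it is 'have' with a direct object denoting an attribute."""
    return main_verb == "be" or (main_verb == "have" and dobj_head in ATTRIBUTE_NOUNS)


def cue_h8(clause1, clause2):
    """Score cue H8 for two clauses.

    Each clause is a dict with the keys 'verb' (base form of the main verb),
    'dobj' (base form of the head of the direct object, or None) and 'passive'.
    """
    def qualifies(clause):
        return clause["passive"] or is_attributive(clause["verb"], clause["dobj"])

    return 10 if qualifies(clause1) and qualifies(clause2) else 0
```

Under these assumptions a clause with be plus a length, a clause with have plus a body part and a passive clause all qualify, so any pair of them fires H8. A pair that includes the participial clause headed by adapt does not.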

6.7.10 MEANS

The MEANS relation is an asymmetric relation, in which the satellite

presents the means by which the situation in the nucleus has come about. To

date, the MEANS relation has only been identified in Encarta within single

sentences. If the Subordinate Clause Condition is satisfied, then the cue given

in Figure 69 is tested.

Cue | Heuristic score | Cue name

Clause2 contains a MEANS conjunction (by…) | 20 | H44

Figure 69 Cue to the MEANS relation

Figure 70 illustrates the application of cue H44 to identify the MEANS

relation.

1. Various residential complexes of clay and stone were built

2. by piling rooms and terraces onto one another.

Figure 70 Pre-Columbian Art and Architecture

6.7.11 PURPOSE

The PURPOSE relation is an asymmetric relation. Mann and Thompson

(1988:276) note that “[the reader] recognizes that the activity in [the nucleus] is

initiated in order to realize [the satellite].” If the Subordinate Clause Condition

is satisfied, then the cues given in Figure 71 are tested.

Cue | Heuristic score | Cue name

Clause2 is an infinitival clause. | 5 | H15

Clause2 or one of the ancestors of Clause2 contains a purpose conjunction (in_order_to so_that). | 10 | H16

Figure 71 Cues to the PURPOSE relation

Figure 72 illustrates a nesting of PURPOSE relations. Cue H15 identifies

the PURPOSE relation between Clause1 and Clause2 on the basis of the infinitival

clause to learn the language. Cues H15 and H16 both identify the PURPOSE

relation between Clause2 and Clause3 on the basis of the infinitival clause in

order to translate fairytales and the connective in order to.

1. Ransome left alone for Russia in 1913

2. to learn the language

3. in order to translate fairytales.

Figure 72 Ransome, Arthur Michell

6.7.12 RESULT

As noted in section 6.7.2, RST distinguishes between CAUSE and

RESULT relations. In a RESULT relation, the result is expressed in the satellite,

and the cause in the nucleus. As noted above (section 3.6), I have collapsed the

two relations VOLITIONAL RESULT and NON-VOLITIONAL RESULT originally

proposed by Mann and Thompson (1988) into a single relation, RESULT.

RST does not impose ordering constraints on the constituents of an

asymmetric relation (Mann and Thompson 1988:248). In Encarta, however, the

Satellite in a RESULT relation always follows the nucleus.

RASTA uses different cues for the RESULT relation depending on

whether the Subordinate Clause Condition is satisfied. These cues are presented

separately below.

Criteria for the RESULT relation when the Subordinate Clause

Condition is satisfied

The Subordinate Clause Condition defines the necessary criteria that

must be satisfied before the cues to the RESULT relation given in Figure 73 are

tested.

Cue | Heuristic score | Cue name

The head of Clause2 follows the head of Clause1; and Clause2 is a detached –ing participial clause; and if Clause2 is subordinate to a NP, then the parent of that NP must be Clause1. | 15 | H22

Clause2 follows Clause1 and Clause2 contains a result conjunction (as_a_result consequently so…) | 35 | H23

Figure 73 Cues to the RESULT relation

The last part of the conditions of cue H22, “if Clause2 is subordinate to

a NP then the parent of that NP must be Clause1” is intended to resolve a

common misparse in which a detached participial clause is incorrectly

subordinated to a NP, as illustrated in Figure 74.



Figure 74 Misparse of a detached participial clause

As Figure 75 shows, the use of cue H22 enables RASTA to hypothesize a

plausible RESULT relation for the misparsed excerpt in Figure 74.

1. This bold strategy gave them an advantage,

2. creating confusion.


Figure 76 illustrates the application of cue H23. The phrase as a result

is correctly identified as a cue to the RESULT relation.

1. Ramsey used two separate magnetic fields;

2. as a result, he achieved vastly increased accuracy in the

measurements.

Figure 76 Ramsey, Norman Foster

In Figure 77, the phrase as a consequence is correctly identified by cue

H23 as a cue to the RESULT relation.

1. Islam arose as a powerful reaction against the ancient pagan

cults of Arabia,

2. and as a consequence it is the most starkly monotheistic of

the three biblically rooted religions.

Figure 77 God

Necessary criteria for the RESULT relation when the Subordinate Clause

Condition is not satisfied

Figure 78 gives the necessary criteria for the RESULT relation when the

Subordinate Clause Condition is not satisfied.




4. Either the subject or Dsub of Clause2 is a demonstrative

pronoun or is modified by a demonstrative; or the Dsub of

Clause1 and the Dsub of Clause2 are distinct constituents (i.e.

neither one is gapped) and the same lexical item occurs as

the head of the Dsub of Clause1 and the Dsub of Clause2 and

that lexical item is not a pronoun; or Clause1 and Clause2 are

coordinated by a semi-colon.

Figure 78 Necessary criteria for the RESULT relation when the Subordinate Clause Condition is not satisfied


Criteria 2 and 3 are intended to isolate main clauses. Criterion 4

requires some indication that an asymmetric relation is motivated: either



coordination by means of the semi-colon, or patterns of anaphora, deixis and

referential continuity that correlate strongly with asymmetric relations (section

6.3).

Cues for the RESULT relation when the Subordinate Clause

Condition is not satisfied


If the criteria given in Figure 78 are satisfied and neither an

ELABORATION relation (section 6.7.7) nor a CAUSE relation (section 6.7.2) has

been identified, then it is reasonable to posit a RESULT relation. The RESULT

relation is given an initial score of 5, and the cues given in Figure 79 are tested.

Cue | Heuristic score | Cue name

Clause2 contains a result conjunction (consequently…) | 10 | H32

Clause2 contains the phrase result in, with the verb possibly being inflected. | 10 | H33

Clause2 is not passive, and the predicate of Clause2 has as its head a verb that entails a result (cause make…) | 5 | H34

Figure 79 Cues for the RESULT relation when the Subordinate Clause Condition is not satisfied


The phrase result in, identified in its various inflected forms by cue

H33, is extremely common in Encarta. Figure 80 illustrates one example.

1. The most frequent cause, however, is chronic abuse of the

vocal apparatus, either by overuse or by improper production

of the voice;

2. this may result in such pathological changes as growths on or

thickening and swelling of the vocal cords.

Figure 80 Speech and Speech Disorders
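Cue H33 is applied by RASTA to MEG's analysis of the clause, and the details of that test are not given here. Purely as a surface-level illustration, the inflected forms of result in can be approximated with a regular expression over the clause text. The function names and the pattern below are assumptions made for this sketch; only the scores (an initial score of 5 and an increment of 10 for H33) come from the text and Figure 79, and the necessary criteria of Figure 78 are assumed to have been satisfied already.

```python
import re

# Hypothetical surface approximation of cue H33: the verb "result" in any
# inflected form, immediately followed by "in". RASTA consults the parse,
# so a string match like this cannot tell the verb from the noun.
RESULT_IN = re.compile(r"\bresult(?:s|ed|ing)?\s+in\b", re.IGNORECASE)


def cue_h33_fires(clause2: str) -> bool:
    """Return True if the clause text contains an inflected form of 'result in'."""
    return RESULT_IN.search(clause2) is not None


def result_score(clause2: str, initial: int = 5, h33: int = 10) -> int:
    """Initial RESULT score plus the H33 increment when the cue fires."""
    return initial + (h33 if cue_h33_fires(clause2) else 0)
```

Applied to clause 2 of Figure 80, this sketch yields the initial score plus the H33 increment. A string match of this kind would also fire on the noun result followed by in, which is one reason a parse-based test is preferable.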

For cue H34, the requirement that the clause not be passive ensures that

the RESULT relation is correctly distinguished from the CAUSE relation

identified by cue H29a (section 6.7.2). Figure 81 illustrates the application of

cue H34.

1. Propane forms a solid hydrate at low temperatures,

2. and this causes great inconvenience

3. when a blockage occurs in a natural-gas line.

Figure 81 Propane

6.7.13 SEQUENCE

As Mann and Thompson (1987:74) note, the SEQUENCE relation is

unique among RST relations in that the order of its constituents is important.

Mann and Thompson also note that

“Temporal succession is not the only type of succession for

which the Sequence relation might be appropriate. Others could

include descriptions of a group of cars according to size or cost,



colors of the rainbow, who lives in rows of apartments, etc.”

(1987:74)

In Encarta, however, all instances of the SEQUENCE relation encountered to

date have involved temporal succession. The SEQUENCE relation is used in

Encarta to express a narrative sequence of events. It is therefore not surprising

that many of the criteria proposed below for identifying the SEQUENCE relation

resemble those proposed in the linguistics literature for identifying narrative

clauses (for example, Labov 1972, Reinhart 1984).

Necessary criteria for the SEQUENCE relation

Figure 82 gives the necessary criteria for the SEQUENCE relation.






4. The subject of Clause2 is not a demonstrative pronoun, nor is

it modified by a demonstrative.

5. Neither Clause1 nor Clause2 has progressive aspect (marked

by the -ing verbal suffix).

6. If either Clause1 or Clause2 has negative polarity, then it

must also have an explicit indication of time.

7. Neither Clause1 nor Clause2 is a Wh question.

8. Neither Clause1 nor Clause2 has an attributive predicate.

9. The event expressed in Clause2 does not temporally precede

the event in Clause1; nor does the event expressed in Clause2

occur within the time span covered by the event expressed in

Clause1.

10. Clause1 and Clause2 match in tense and aspect.

11. Clause2 must not be immediately governed by a contrast

conjunction.

Figure 82 Necessary criteria for the SEQUENCE relation



If the necessary criteria given in Figure 82 are satisfied, it is reasonable

to posit a SEQUENCE relation between two clauses. The necessary criteria are

sufficiently stringent that an initial heuristic score of 20 is associated with this

hypothesized relation. A few of the necessary criteria for the SEQUENCE relation

merit special discussion.

Criteria 2 and 3 are intended to bar situations in which one clause is

syntactically dependent on another.

Criterion 4, “The subject of Clause2 is not a demonstrative pronoun,

nor is it modified by a demonstrative”, is intended to block cases in which the

correlations between deixis and discourse structure (section 6.3) would make an

asymmetric relation more likely than the symmetric SEQUENCE relation. For

example, in the following excerpt, a SEQUENCE relation is dispreferred in the

face of a more plausible RESULT relation.

He made a study of the famous Adams family of

Massachusetts, to which he was not related; this study

resulted in “The Adams Family”… (Adams, James

Truslow).

As noted above (this section), the SEQUENCE relation is used in Encarta

to express a narrative sequence of events. Criterion 5, “Neither Clause1 nor

Clause2 has progressive aspect (marked by the –ing verbal suffix)”, is intended

to preclude clauses which are not eventive, as in the following example:

Abbott was willing to admit a number of manufactured

goods from the United States duty-free. (Abbott, Sir John

Joseph Caldwell)

For the most part, clauses with negative polarity do not express events

and therefore cannot enter into the SEQUENCE relation. One notable exception

to this generalization is clauses with negative polarity which also contain an

explicit indication of time (Criterion 6), as illustrated in Figure 84 and Figure

85. (The sentences with negative polarity and an explicit indication of time are

in bold type.)

Figure 83 gives the logical form for clause 2 in Figure 84. RASTA takes

note of the +NEG feature on the main verb make, but does not rule out the

possibility of a SEQUENCE relation between clauses 2 and 3 since there is an

explicit indication of time in the form of the TMEAT attribute in the logical

form, annotated with the features +DATE +YEAR.

Figure 83 Logical form illustrating negative polarity

1. Although AIDS has been tracked since 1981,

2. the identification of HIV as the causative agent was not

made until 1983.

3. In 1985 the first blood test for HIV, developed by the research

group led by Robert Gallo, was approved for use in blood

banks.

Figure 84 Acquired Immune Deficiency Syndrome

1. Born in Paris, Moissan did not begin his formal academic

training

2. until he was in his early 20s…

3. In 1886 Moissan was appointed professor in toxicology at the

École Supérieure de Pharmacie…

4. In 1889 he became professor of inorganic chemistry at this

same institution

5. and the following year succeeded to the chair of inorganic

chemistry at the Faculté des Sciences.

Figure 85 Moissan, Ferdinand-Frederic-Henri

The negative clauses in Figure 84 and Figure 85 entail events which are in a

SEQUENCE relation with other events. The presence of an explicit indication of

time within a negative clause appears to be sufficient to identify this

entailment. Prepositional phrases and subordinate clauses introduced by until or



before are the most common means of explicitly indicating time for clauses

with negative polarity in Encarta.

Neither Wh questions (Criterion 7) nor attributive predicates (Criterion

8; see section 1 concerning attributive predicates) report events. They therefore

cannot participate in SEQUENCE relations. Changes in state, unlike attributive

predicates, can however participate in SEQUENCE relations. Clause 2 in Figure

86, and [Abacha] became a captain in the army in 1967, illustrates a change of

state.

1. Born in Kano, in northern Nigeria, Abacha graduated from the

Nigerian Military Training College in Zaria in 1963,

2. and became a captain in the army in 1967.

Figure 86 Abacha, Sani

Criteria 1 and 9 together constitute the traditional minimal definition of

a narrative (Labov 1972; Reinhart 1984): a narrative sequence is one in which a

series of tensed clauses report a sequence of events, with the linear order of the

clauses expressing the events matching the real-world temporal order of those

events. Criterion 9 is illustrated in Figure 89 below.

The last necessary condition for the SEQUENCE relation is Criterion 11,

“Clause2 must not be immediately governed by a contrast conjunction”. This

criterion is needed to ensure that in a handful of cases a more plausible

CONTRAST relation is selected over a possible SEQUENCE relation, as in the

following example:

At first Buthelezi opposed this system but then decided to

work within it. (Buthelezi, Mangosuthu Gatsha).

In this example, there are several cues suggesting temporal sequence: the

phrase At first, the conjunction then and the eventive clauses. The use of a

strongly cohesive device, the conjunction but, compatible with the CONTRAST

relation, favors an interpretation in which Buthelezi’s position at different times

is being contrasted rather than an interpretation in which events are merely

being cast as temporally ordered.



Additional cues for the SEQUENCE relation

Provided that the necessary criteria are satisfied, the heuristic score

associated with the hypothesized relation may be incremented if any of the

additional cues given in Figure 87 obtain.

Cue | Heuristic score | Cue name

Clause2 contains a SEQUENCE conjunction (and later then…) | 10 | H2

Clause1 and Clause2 are coordinated | 5 | H2b

There is an explicit indication that the event expressed by Clause1 temporally precedes the event expressed by Clause2. | 5 | H3

Figure 87 Cues for the SEQUENCE relation

The presence of a SEQUENCE conjunction (for example, and, later, then)

is not a necessary criterion, although it is a cue which receives a higher

heuristic score than any other single non-necessary cue for the SEQUENCE

relation. Figure 88 illustrates the application of cue H2 to identify the

SEQUENCE conjunction then in clause 2 and the instances of and in clauses 3

and 4. Cue H2b, “Clause1 and Clause2 are coordinated”, also identifies the

instances of and in clauses 3 and 4. This double identification is not redundant,

however. Since RASTA constructs RST trees from the bottom up in a binary-

branching manner (chapter 7), this double identification causes the cohesive

bond between clauses 2, 3 and 4 to be very strong indeed. By assigning a

greater heuristic score to reflect this strongly cohesive bond, RASTA ensures

that during the construction of RST trees, better analyses will be produced

earlier.

1. Napoleon met defeat in 1814 by a coalition of major powers,

notably Prussia, Russia, Great Britain, and Austria.

2. Napoleon was then deposed

3. and exiled to the island of Elba

4. and Louis XVIII was made ruler of France.
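The scoring described above can be sketched as follows. The initial score of 20 and the cue scores are taken from the text and Figure 87; the feature bundle and function name are hypothetical, and the sketch assumes that the necessary criteria of Figure 82 have already been checked and that the three cues have been detected elsewhere.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

BASE_SCORE = 20                              # initial score for SEQUENCE
CUE_SCORES = {"H2": 10, "H2b": 5, "H3": 5}   # scores from Figure 87


@dataclass
class PairFeatures:
    """Hypothetical feature bundle for a pair of clauses that already
    satisfies the necessary criteria for the SEQUENCE relation."""
    has_sequence_conjunction: bool   # cue H2: and, later, then...
    is_coordinated: bool             # cue H2b
    explicit_temporal_order: bool    # cue H3


def sequence_score(features: PairFeatures) -> Tuple[int, Dict[str, int]]:
    """Return the total heuristic score and the cues that contributed to it."""
    fired = {
        "H2": features.has_sequence_conjunction,
        "H2b": features.is_coordinated,
        "H3": features.explicit_temporal_order,
    }
    heurs = {name: CUE_SCORES[name] for name, hit in fired.items() if hit}
    return BASE_SCORE + sum(heurs.values()), heurs
```

Under these assumptions, a pair of clauses joined by and, where both H2 and H2b fire, scores 35, whereas a pair linked only by then scores 30. This is the sense in which the double identification strengthens the bond between coordinated clauses.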


Explicit indications of time are of great value in determining whether a

SEQUENCE relation is plausible. Criterion 9, “The event expressed in Clause2

does not temporally precede the event in Clause1; nor does the event expressed

in Clause2 occur within the time span covered by the event expressed in

Clause1” (Figure 82), is intended to exclude cases in which there is clear

counter-evidence, making a SEQUENCE relation unlikely. In Figure 89, for

example, the events described in clauses 2 through 7—conferences being held,

agreements being made, and so on—occur during the 1920s, the timeframe

described in clause 1. RASTA identifies the timeframe of the expression the

1920s by the presence of a definite article with a numeric year, together with

the presence of the plural suffix –s. The timeframe thus identified spans the

first day of 1920 to the last day of 1929. It is a matter of simple math to

determine that the dates 1920 (clause 2), 1921-1922 (clause 4), 1925 (clause 5)

and 1928 (clause 6) fall within this interval.
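The interval arithmetic just described can be sketched as follows. RASTA works from MEG's analysis of the temporal expression rather than from a string, so the pattern and function names here are illustrative assumptions; in particular, the sketch only accepts years ending in zero.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Hypothetical pattern: definite article, numeric year ending in 0, plural -s.
DECADE = re.compile(r"the\s+(\d{3}0)s", re.IGNORECASE)


def decade_span(expression: str) -> Optional[Tuple[date, date]]:
    """Map an expression such as 'the 1920s' to (first day, last day)."""
    match = DECADE.fullmatch(expression.strip())
    if match is None:
        return None
    start = int(match.group(1))
    return date(start, 1, 1), date(start + 9, 12, 31)


def year_within(year: int, span: Tuple[date, date]) -> bool:
    """True if every day of the given year falls inside the span."""
    first, last = span
    return first <= date(year, 1, 1) and date(year, 12, 31) <= last
```

With the span for the 1920s computed this way, each of the years mentioned in clauses 2 through 6 of Figure 89 falls inside the interval, while 1930 does not.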

Clause 1 describes a temporal interval within which the events

described in clauses 2 through 7 of Figure 89 occur, rather than describing an

event that precedes the events of the remaining clauses. RASTA therefore does

not posit a SEQUENCE relation between clause 1 and any of the following

clauses. Rather, clause 1, the topic sentence of this paragraph, is in an

ELABORATION relation with the SEQUENCE node that spans clauses 2 through 7.

Clauses 2 through 7 satisfy criterion 9, since the temporal order of the

events described matches the temporal order of the events in the world and

none of the clauses describes a temporal interval within which the events of any

of the other clauses occurs. Cue H3 identifies the appropriate sequencing of the

temporal expressions in each of the relevant clauses, leading RASTA to posit the

SEQUENCE node depicted in Figure 89.

1. During the 1920s, attempts were made to achieve a stable

peace.

2. The first was the establishment (1920) of the League of Nations

as a forum in which nations could settle their disputes.

3. The league's powers were limited to persuasion and various

levels of moral and economic sanctions that the members were

free to carry out as they saw fit.

4. At the Washington Conference of 1921-22, the principal naval

powers agreed to limit their navies according to a fixed ratio.

5. The Locarno Conference (1925) produced a treaty guarantee of

the German-French boundary and an arbitration agreement

between Germany and Poland.

6. In the Paris Peace Pact (1928), 63 countries, including all the

great powers except the USSR, renounced war as an

instrument of national policy

7. and pledged to resolve all disputes among them “by pacific

means.”

Figure 89 World War II

The identification of temporal expressions is relatively simple in Figure

89, since all references are to years. Very often, however, temporal expressions

in Encarta differ in granularity. In Figure 91, for example, there are two

references of the form month-day-year (clause 1 and clause 2) and one

reference of the form month-day (clause 3). A simple function in RASTA

compares dates, allowing for differing granularities. The steps followed in this

function are illustrated in Figure 90. As soon as the function is able to

determine whether one date precedes another, it terminates. For example, in

comparing the dates February 26, 1815 and March 20, 1815 in Figure 91,

RASTA compares the years and finds that they are the same. It then compares

the months, and finds that March follows February. Having determined this,

RASTA does not need to compare the days (step 3) or the time (step 4).

1. Compare the years.

2. If the years are the same or at least one temporal expression

does not include a year, compare the months.

3. If the months are the same or at least one temporal

expression does not include a month, compare the days.

4. If the days are the same or at least one temporal expression

does not include a day, compare the time of day.

Figure 90 Compare dates
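The steps in Figure 90 can be sketched as a comparison over partially specified dates. The record type, the representation of time of day as minutes since midnight, and the convention of returning -1, 1 or 0 are assumptions made for this sketch; the text does not describe how RASTA represents dates internally.

```python
from typing import NamedTuple, Optional


class PartialDate(NamedTuple):
    """A temporal expression in which any field may be missing (None)."""
    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[int] = None
    minutes: Optional[int] = None   # time of day, minutes since midnight


def compare_dates(a: PartialDate, b: PartialDate) -> int:
    """Return -1 if a precedes b, 1 if a follows b, 0 if undetermined.

    Fields are compared from coarsest to finest. A field that is equal,
    or missing from either date, is skipped; the function returns as
    soon as one field decides the order.
    """
    for x, y in zip(a, b):
        if x is None or y is None:
            continue
        if x != y:
            return -1 if x < y else 1
    return 0
```

For February 26, 1815 and March 20, 1815, the loop finds equal years, then returns at the month comparison without looking at the days, as described above.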

1. On February 26, 1815… Napoleon escaped from Elba…

2. and on March 20, 1815, he again ascended the throne.

3. On March 17 Austria, Great Britain, Prussia, and Russia each

agreed…


MEG provides robust handling of dates and times expressed in a wide

range of formats. More elaborate processing of date and time information has

not yet proven necessary for the analysis of Encarta. Narrative clauses

containing relative expressions of time, for example two months later, tend also

to contain other cues that enable the correct identification of the SEQUENCE

relation.

7. Constructing Trees

7.1 Introduction

In this chapter I present the complete process by which RASTA

computes representations of the structure of a written text, from positing

discourse relations between clauses to producing and evaluating RST trees. As

noted above (sections 3.6 and 6.1), the general strategies described here for

constructing RST trees given a set of hypothesized relations would not be

affected if a different set of RST relations were used.

7.2 The need for an improved algorithm

The algorithm presented in Marcu (1996, 1997a) represents a

considerable advance in the formalization of a procedure for constructing RST

trees. Still, Marcu’s algorithm has a number of weaknesses:

1. No method is given for positing RST relations for clauses that do not

contain cue phrases.

2. The algorithm suffers from combinatorial explosion–as the number

of relations increases, the number of RST trees produced increases

exponentially (section 4.2). It is an unavoidable fact that the number

of well-formed RST trees increases exponentially as the number of

hypothesized relations increases. However, a great many of the trees

that are produced by Marcu’s algorithm are subsequently rejected as

ill-formed. The algorithm would be greatly improved if ill-formed

trees were not even constructed in the first place.

3. The metric for evaluating trees is specific to genre, working well

only for texts with a right-branching structure.

4. Only binary-branching trees are produced, whereas (with the

exception of Matthiessen and Thompson (1988)) n-ary branching

trees have generally been proposed in the RST literature.

RASTA improves on Marcu’s algorithm in the following ways:

1. RASTA contains an explicit means for positing RST relations based

on an examination of a text that is not wholly dependent on the

presence of cue phrases.

2. RASTA does not produce ill-formed trees. As soon as RASTA detects

that its bottom-up construction of a tree would lead to an ill-formed

tree, it aborts processing for all trees that would have contained the

tree fragment constructed so far. Furthermore, since RASTA

produces preferred RST analyses before dispreferred ones, it is not

usually necessary to compute all possible trees in order to find a tree

that an RST analyst would consider to be the most plausible analysis

of a text.

3. RASTA has a general domain-independent metric for evaluating

trees. Trees can be ranked by summing the heuristic scores of the

relations used to construct them.

4. RASTA produces n-ary branching trees.

7.3 Identify terminal nodes

For each sentence in the data being analyzed, the syntactic constituent

corresponding to each node in the logical form is examined in order to

determine whether it is a terminal node in an RST diagram. Figure 92 gives the

conditions that must be met for a node to be considered to be a terminal node in

an RST diagram.

1. The head of the constituent is a verb or the constituent is an

elliptical clause.

2. The head of the constituent is not an auxiliary.

3. An object complement is only allowed in the deontic have to

construction, for example: The pontiff allowed most of the English

customs, but Henry had to bow to canon law… (Beckett, Thomas a)

4. The constituent is not a subject complement.

5. Parse workarounds: (a) If the parent of the constituent is an NP then

the constituent can only be a terminal node in an RST diagram if it is

a present participial clause.13 (b) Detached participial clauses whose

head is a past participle cannot be terminal nodes.14

6. If the constituent is a complement clause, then it cannot have an NP

or PP as its parent.

13 To resolve a misparse in which a detached participial clause is incorrectly

subordinated to an NP. For example: This bold strategy gave them an advantage, thus creating

confusion. (Waterloo, Battle of, discussed in section 6.7.12).

14 In the following example the clause led by Charles Martel is currently parsed by

MEG as a detached participial clause, rather than as a participial clause subordinate to the NP

the Franks: His army met the Franks, led by Charles Martel, near Tours, France, later that

year. (Abd-ar-Rahman)



7. The constituent cannot be a relative clause.

8. The constituent cannot have a relative clause as one of its ancestors

(in order to avoid undue granularity; see section 3.2).

Figure 92 Criteria for an RST terminal node

Condition 1 allows elliptical clauses as terminal nodes, as in the

following example, discussed in section 6.7.4.

1. Although timid,

2. the aardvark will fight

3. when it cannot flee.

Figure 93 Aardvark

7.4 Posit hypotheses

The result of identifying the terminal RST nodes according to the

criteria given in section 7.3 is a set of terminal nodes. Given this set of terminal

nodes, RASTA examines each pair of clauses in order to determine which

rhetorical relations to posit according to the criteria in section 6.7.

For a set of n clauses, n(n-1) pairs of clauses are examined to see if any

of the thirteen rhetorical relations (section 3.5) ought to be posited. In practice

these inspections can be performed at very little computational expense. For

example, a violation of any one of the necessary criteria for a rhetorical relation

is sufficient grounds to preclude further consideration of that relation.
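A minimal sketch of the pairwise inspection, assuming that clauses are represented by simple identifiers and that each necessary criterion is a function of two clauses (both are assumptions made for illustration):

```python
from itertools import permutations
from typing import Callable, Iterable, List, Sequence, Tuple

Criterion = Callable[[str, str], bool]


def ordered_pairs(clauses: Sequence[str]) -> List[Tuple[str, str]]:
    """All ordered pairs of distinct clauses: n(n-1) pairs for n clauses."""
    return list(permutations(clauses, 2))


def satisfies_all(criteria: Iterable[Criterion], c1: str, c2: str) -> bool:
    """Check the necessary criteria for one relation.

    all() stops at the first criterion that fails, so the remaining
    criteria are never evaluated for that pair and relation.
    """
    return all(criterion(c1, c2) for criterion in criteria)
```

For four clauses this yields twelve ordered pairs. Because evaluation stops at the first violated criterion, most pairs are dismissed after only one or two inexpensive tests.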

A set of hypothesized relations results from the pairwise examination of

clauses. Each of these hypotheses is a simple data structure consisting of

attributes and values. Figure 94 illustrates the data structure used to represent a

symmetric relation. The Nodename is an internal designation used within MEG

to refer to this hypothesis. The value of the Nodename attribute is simply the value

of the Pred attribute combined with a unique integer (i.e. the first record of this

type will be called RSTrec1, the second RSTrec2, and so on). The

RelationValue and TreeValue attributes contain the sum of the heuristic scores

of the cues which led to this relation being posited. (In subsequent processing,

the TreeValue attribute will be incremented as nodes are joined together; see

section 7.5.3.) The Relations attribute is a list of the names of relationships

which might hold between these two nodes. (As noted in section 3.6, it is

sometimes the case that several RST relations could equally well be said to hold

between two nodes.) Although the CONTRAST relation is symmetric, i.e. it

contains two nuclei, in the initial representation I distinguish a Nucleus and a

CoNucleus. This distinction is motivated solely by the desire to simplify

processing by creating a structure that is analogous to the Nucleus / Satellite

distinction in Figure 95. During the transformation of a binary to an n-ary tree,

the Nucleus and CoNucleus attributes become members of a list of nuclei. The

values of the Nucleus and CoNucleus attributes are pointers to nodes in the

logical form. Finally the attribute Heurs contains debugging information

concerning which cues were involved in hypothesizing this rhetorical relation

and the heuristic score associated with each cue, in this case cue H39

with a score of 10 and cue H4 with a score of 25. The following

excerpt was analyzed by RASTA to produce the hypothesized relation given in

Figure 94.

It is usually placed in the hyena family, Hyaenidae. Some

experts, however, place the aardwolf in a separate family,

Protelidae… (Aardwolf)

Nodename RSTrec1

RelationValue 35

TreeValue 35

Pred RSTrec

Relations (Contrast)

Nucleus place1

CoNucleus place2

Heurs (H39:10 H4:25)

Figure 94 Data structure of a hypothesized symmetrical rhetorical relation

Figure 95 illustrates the data structure used to represent an asymmetric

relation. This data structure differs from the one in Figure 94 in only one

attribute: the attribute Satellite occurs in place of the attribute CoNucleus.

This bold strategy gave them an advantage, creating

confusion. (Trafalgar, Battle of)

Nodename RSTrec1

RelationValue 15

TreeValue 15

Pred RSTrec

Relations (Result)

Nucleus give1

Satellite create1

Heurs (H22:15)

Figure 95 Data structure of a hypothesized asymmetric rhetorical relation
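The two records in Figure 94 and Figure 95 can be mirrored by a single record type. The sketch below is not MEG's representation: the class name, the use of strings in place of pointers into the logical form, and the module-level counter are assumptions. It does follow the text in deriving Nodename from Pred plus a unique integer and in setting RelationValue and TreeValue to the sum of the scores listed in Heurs.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, Optional, Tuple

_next_id = count(1)   # supplies the unique integer used in Nodename


@dataclass
class RSTRec:
    relations: Tuple[str, ...]            # candidate relation labels
    nucleus: str                          # stands in for a logical form node
    heurs: Dict[str, int]                 # cue name -> heuristic score
    co_nucleus: Optional[str] = None      # symmetric relations only
    satellite: Optional[str] = None       # asymmetric relations only
    pred: str = "RSTrec"
    nodename: str = field(init=False, default="")
    relation_value: int = field(init=False, default=0)
    tree_value: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        if (self.co_nucleus is None) == (self.satellite is None):
            raise ValueError("give exactly one of co_nucleus or satellite")
        self.nodename = f"{self.pred}{next(_next_id)}"
        self.relation_value = sum(self.heurs.values())
        # TreeValue starts equal to RelationValue and grows as nodes are joined.
        self.tree_value = self.relation_value
```

Constructing a record with the cues H39 (10) and H4 (25) gives a RelationValue of 35, as in Figure 94; a record with the single cue H22 (15) gives 15, as in Figure 95.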

7.5 Construct trees

Given a set of terminal nodes, and a set of relations that have been

hypothesized to hold between those terminal nodes, the task is to construct and

evaluate the possible RST trees. RASTA operates from the bottom up, permuting

the hypothesized relations and gathering terminal nodes into contiguous text

spans.

7.5.1 Promotion sets

Marcu (1996) employs the notion of a promotion set for an RST sub-

tree, similar to the syntactic notion of the head of a constituent. Promotion sets

are used to guide the production of RST trees, constraining the structures that

are produced: A relation can be said to hold between two text spans a and b if

and only if that same relation can be said to hold between the members of the

promotion set of a and the members of the promotion set of b. For a terminal

node, the promotion set consists only of the terminal node itself. For an

asymmetric RST sub-tree, the promotion set consists of a single element, the

nucleus. For a symmetric RST sub-tree, the promotion set consists of the union

of the promotion sets of the co-nuclei. The notion of a promotion set is central

to the algorithm used for constructing RST trees from the bottom up (section

7.5.3). During the production of an RST tree, RASTA observes the constraint

that the same relation must hold between all members of the promotion set of a

and all members of the promotion set of b, and is thus able to avoid the

production of ill-formed trees. The notion of a promotion set is perhaps best

explained by considering the bottom-up construction of the Rhetorical

Structure sub-trees given in Figure 96 and Figure 97, leaving aside for the

moment details about the basis for positing the relations.

Figure 96 depicts a binary-branching tree representing the structure of

an excerpt from the Abd-ar-Rahman article. (This structure would be converted

into an n-ary branching representation, as described in section 7.5.4). Clauses 2

and 3 are in a CIRCUMSTANCE relation, with clause 2 as the nucleus and clause 3

as the satellite. Since the CIRCUMSTANCE relation is asymmetric, the promotion

set of the text span covering clauses 2 and 3 consists of a single element, the

nucleus, clause 2. Clause 4 is in a SEQUENCE relation with the single member of

the promotion set of the text span covering clauses 2 and 3. This SEQUENCE

relation yields the text span covering clauses 2 through 4. Since the SEQUENCE

relation is symmetric, the promotion set of this text span is equal to the union

of the promotion sets of the two co-nuclei, i.e. {2, 4}. Since clause 1 is in a

SEQUENCE relation with both members of the promotion set of the text span

covering clauses 2 through 4 (i.e. clause 1 must be in a SEQUENCE relation with

clause 2 and with clause 4), clause 1 can be said to be in a SEQUENCE relation

with the text span covering clauses 2 through 4. The promotion set of this new

text span covering clauses 1 through 4 is the union of the promotion set of

clause 1 (the terminal node itself) and the promotion set of the text span

covering clauses 2 through 4 ({2, 4}), i.e. {1, 2, 4}.

1. He became governor of southern France in 721.

2. In 732, … , he led an army across the Pyrenees Mountains

into the dominions of the Franks.

3. when the growth of Frankish power menaced the Muslim

position in Spain

4. His army met the Franks,…

Figure 96 Binary-branching tree for Abd-ar-Rahman excerpt

In Figure 97, the ELABORATION relation must hold between the

promotion set of clause 1 (the terminal node itself) and both members of the

promotion set {2, 3}, i.e. there must be an ELABORATION relation between

clauses 1 and 2 and between clauses 1 and 3. Since the ELABORATION relation

is asymmetric, the promotion set of the resulting text span consists of a single

element, clause 1.

1. The aardwolf is classified as Proteles cristatus

2. It is usually placed in the hyena family, Hyaenidae


3. Some experts however, place the aardwolf in a separate

family, Protelidae…

Figure 97 Binary-branching tree for Aardwolf excerpt
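The bottom-up computation of promotion sets described for Figure 96 and Figure 97 can be sketched in Python. This is a minimal illustration under stated assumptions, not RASTA's implementation: the names RstNode, leaf and promotion_set are hypothetical, and the relation joining clauses 2 and 3 in Figure 97 is taken to be the symmetric CONTRAST relation discussed in section 7.6.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class RstNode:
    """A binary RST node; all names here are illustrative assumptions."""
    clause: Optional[int] = None        # set only for terminal nodes
    relation: Optional[str] = None
    symmetric: bool = False
    nucleus: Optional["RstNode"] = None
    other: Optional["RstNode"] = None   # satellite or co-nucleus


def leaf(n: int) -> RstNode:
    return RstNode(clause=n)


def promotion_set(node: RstNode) -> FrozenSet[int]:
    if node.clause is not None:
        # A terminal node promotes itself.
        return frozenset({node.clause})
    if node.symmetric:
        # Symmetric relation: union of the promotion sets of the co-nuclei.
        return promotion_set(node.nucleus) | promotion_set(node.other)
    # Asymmetric relation: only the nucleus is promoted.
    return promotion_set(node.nucleus)


# Figure 96: SEQUENCE(1, SEQUENCE(CIRCUMSTANCE(2, 3), 4))
span_2_3 = RstNode(relation="CIRCUMSTANCE", nucleus=leaf(2), other=leaf(3))
span_2_4 = RstNode(relation="SEQUENCE", symmetric=True,
                   nucleus=span_2_3, other=leaf(4))
fig96 = RstNode(relation="SEQUENCE", symmetric=True,
                nucleus=leaf(1), other=span_2_4)

# Figure 97: ELABORATION(1, CONTRAST(2, 3))
contrast_2_3 = RstNode(relation="CONTRAST", symmetric=True,
                       nucleus=leaf(2), other=leaf(3))
fig97 = RstNode(relation="ELABORATION", nucleus=leaf(1), other=contrast_2_3)

print(sorted(promotion_set(span_2_3)))  # [2]
print(sorted(promotion_set(fig96)))     # [1, 2, 4]
print(sorted(promotion_set(fig97)))     # [1]
```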

7.5.2 Group mutually exclusive hypotheses

RASTA often posits more than one relation between two terminal nodes.

Relations of the same type, i.e. symmetric or asymmetric, are merged into an

underspecified representation (section 3.6) if they have the same heuristic

score. In that case, what is represented is whether the relation is symmetric or

asymmetric, the heuristic score associated with that relationship and a list of

possible labels. Figure 98 illustrates the data structure used to represent a case

where two asymmetric relations, RESULT and ELABORATION, were posited for

the same two terminal nodes. The Relations attribute of one hypothesized

relation contained the value RESULT, and the Relations attribute of the other

hypothesized relation contained the value ELABORATION. In merging the two

relations into a single underspecified relation, a new Relations attribute was

constructed, containing the union of the Relations attributes of the two

relations: {RESULT, ELABORATION}.



Nodename RSTrec1

RelationValue 15

TreeValue 15

Pred RSTrec

Relations (Result Elaboration)

Nucleus give1

Satellite create1

Heurs (H22:15)

Figure 98 Data structure of an underspecified asymmetric rhetorical

relation
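The merging step illustrated by Figure 98 can be sketched as a grouping operation. This is only an illustration: the class names and the function merge_hypotheses are hypothetical, and the sketch assumes that two hypotheses are merged only when they also agree on which node is the nucleus. The third hypothesis (CIRCUMSTANCE, score 10) is invented purely to show that a different score blocks merging.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Hypothesis:
    relation: str
    nucleus: str
    other: str        # satellite or co-nucleus
    symmetric: bool
    score: int


@dataclass(frozen=True)
class Underspecified:
    relations: Tuple[str, ...]   # the possible labels
    nucleus: str
    other: str
    symmetric: bool
    score: int


def merge_hypotheses(hyps: List[Hypothesis]) -> List[Underspecified]:
    """Merge hypotheses of the same type and score into one record."""
    groups = defaultdict(list)
    for h in hyps:
        # Assumption: the nodes, the symmetry type and the heuristic score
        # must all match before two hypotheses are merged.
        groups[(h.nucleus, h.other, h.symmetric, h.score)].append(h.relation)
    return [Underspecified(tuple(labels), nuc, oth, sym, score)
            for (nuc, oth, sym, score), labels in groups.items()]


posited = [
    Hypothesis("RESULT", "give1", "create1", False, 15),
    Hypothesis("ELABORATION", "give1", "create1", False, 15),
    Hypothesis("CIRCUMSTANCE", "give1", "create1", False, 10),  # hypothetical
]
merged = merge_hypotheses(posited)
for record in merged:
    print(record.relations, record.score)
```

The first merged record corresponds to the Relations attribute (Result Elaboration) and the score of 15 shown in Figure 98.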

Even after appropriate relations have been merged into an

underspecified representation, there are likely to be mutually exclusive

relations in the set of relations used to construct RST trees. If relations are

successively applied in the construction of RST trees, then applying mutually

exclusive relations will clearly lead to wasted computational effort in the search

for possible RST trees. Mutually exclusive relations linking the same two

terminal nodes are therefore grouped together into sets called “bags”. For

example, if a SEQUENCE relation is applied to clauses a and b, then an

ELABORATION relation linking a and b cannot subsequently be applied. If there

is only one hypothesized relation linking two clauses, then a bag is created

consisting solely of that relation.

An important goal of RASTA is to produce preferred RST trees before

dispreferred ones by using the highest scoring hypotheses first. RASTA

therefore sorts the relations within each bag according to heuristic score.

Finally, a list of bags is produced, with each bag occurring in sorted order

according to the value of the highest ranking relation within it.
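The grouping and sorting just described can be sketched as follows, using the hypothesized relations and scores of the worked example in section 7.6 (Figure 111). The function name make_bags and the tuple layout are assumptions made for illustration. Ties between bags are broken here by the order in which the hypotheses were posited, which is one arbitrary but deterministic choice.

```python
from collections import defaultdict

# (relation number, name, clause pair, heuristic score), from Figure 111.
HYPOTHESES = [
    (1, "ELABORATION", (1, 2), 27),
    (2, "CONTRAST", (1, 3), 25),
    (3, "ELABORATION", (1, 3), 20),
    (4, "CONTRAST", (2, 3), 35),
    (5, "ELABORATION", (3, 4), 35),
    (6, "ASYMMETRIC-CONTRAST", (4, 5), 30),
]


def make_bags(hyps):
    """Group hypotheses linking the same two clauses, then sort."""
    by_pair = defaultdict(list)
    for hyp in hyps:
        by_pair[frozenset(hyp[2])].append(hyp)
    # Within each bag: highest heuristic score first.
    bags = [sorted(bag, key=lambda h: h[3], reverse=True)
            for bag in by_pair.values()]
    # Across bags: order by the score of each bag's best hypothesis.
    bags.sort(key=lambda bag: bag[0][3], reverse=True)
    return bags


bags = make_bags(HYPOTHESES)
for number, bag in enumerate(bags, start=1):
    print(number, [(h[0], h[3]) for h in bag])
```

Under these assumptions the resulting order of bags matches the one listed in Figure 110.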

7.5.3 Produce and rank binary-branching trees

In the outline of the algorithm below, I refer to the following variables:

SUBTREES: a list of the RST subtrees constructed so far. Since RASTA only

produces well-formed trees, all members of this list are guaranteed to be

well-formed trees. Initially, SUBTREES contains a list of the RST terminal

nodes. As nodes corresponding to contiguous text spans are grouped

together to form larger text spans, SUBTREES contains fewer and fewer

members. When SUBTREES contains a single member, RASTA has succeeded

in constructing a complete RST tree to represent the text. If RASTA

unsuccessfully applies all posited hypotheses in an effort to construct a tree,

then SUBTREES will contain more than a single element.

HYPOTHESES: a list of bags of hypotheses. Hypotheses within bags are

sorted according to heuristic score. Bags are initially sorted according to the

heuristic score of the first element.

ALLHYPOTHESES: an unordered list of all the hypotheses posited.

At the most abstract level, the algorithm can be described as follows:

Construct all RST trees compatible with the set of the

hypotheses by gathering up the text into contiguous text

spans. Store each unique analysis that covers the entire text.

Figure 99 gives a pseudo-code description of a function

CONSTRUCTTREE that constructs binary-branching RST trees. To aid the reader,

comments occur in italics following two forward slashes.

If allowed to run to completion, CONSTRUCTTREE would create all

possible well-formed RST trees that are compatible with the hypothesized

discourse relations. As actually implemented, however, the researcher specifies

a desired number of trees—usually ten or twenty. CONSTRUCTTREE then

produces either the stipulated number of trees or all possible trees, whichever is

the smaller number. Since the algorithm produces better trees first, it is usually

not necessary to produce many trees before an analysis is produced that an RST

analyst would consider to be plausible.

The recursive, back-tracking nature of CONSTRUCTTREE prevents the

construction of a great number of ill-formed trees. For example, consider an

imaginary set of five RST hypotheses, R1… R5, where applying R2 after R1

results in an invalid tree. Rather than attempting to construct RST trees by

testing all permutations of these five hypotheses and then examining the trees

only to discover that trees formed by applying {R1 R2 R3 R4 R5} or {R1 R2 R3 R5

R4} and so on were invalid, CONSTRUCTTREE applies R1, then R2. It

immediately determines that an ill-formed subtree results, and so does not

bother to complete the construction of any trees that would follow from those

first two steps. A total of six trees are thus not even produced, resulting in

considerable gains in efficiency.
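The figure of six follows from simple counting: once R1 and R2 are fixed as the first two steps, the remaining three hypotheses can be ordered in 3! = 6 ways. A short check, using the hypothetical labels R1 to R5 from the example:

```python
from itertools import permutations

hypotheses = ["R1", "R2", "R3", "R4", "R5"]

all_orders = list(permutations(hypotheses))
# Orders that begin by applying R1 and then R2 are never completed.
pruned = [order for order in all_orders if order[:2] == ("R1", "R2")]

print(len(all_orders))  # 120
print(len(pruned))      # 6
```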

Function ConstructTree (HYPOTHESES, SUBTREES)
Begin Function ConstructTree
    Let COPYHYPOTHESES be equal to a copy of the list HYPOTHESES.
    If the desired number of trees has been constructed
        Return.
    Else If SUBTREES has only one element:
        If this RST tree is not identical to one that has already been stored, then store it.
        Return.
    Else If COPYHYPOTHESES contains at least one element and SUBTREES has more than one element Then
        For each bag in COPYHYPOTHESES
            Let ONEBAG denote the current bag.
            Let REMAININGBAGS be equal to COPYHYPOTHESES except the current bag.
            If projections of elements in SUBTREES match the nucleus and other element (satellite or co-nucleus) specified by the hypothesized relations in ONEBAG, then
                For each hypothesis in ONEBAG, going from the highest scored hypothesis to the lowest scored:
                    Let ONEELEMENT denote the current hypothesis
                    1. Search in SUBTREES for elements with the promotions specified by ONEELEMENT.
                    2. Let NUC be the subtree whose promotion set includes the nucleus specified by ONEELEMENT.
                    3. Let OTHER be the subtree whose promotion set includes the other member (a satellite or a co-nucleus) specified by ONEELEMENT.
                    4. In ALLHYPOTHESES, there must be a hypothesized relation between every member of the promotion set of NUC and every member of the promotion set of OTHER. The relation must be the same as the one specified by ONEELEMENT.
                    If (4) is true // begin processing the subtrees whose promotions satisfy ONEELEMENT.
                        If combining the subtrees would result in an RST tree with crossing lines, then return.
                        Let REMAININGRSTSUBTREES equal SUBTREES.
                        Remove NUC and OTHER from REMAININGRSTSUBTREES.
                        Create a new subtree by joining NUC and OTHER as specified by ONEELEMENT.
                        Set the RelationValue attribute of this new subtree equal to the heuristic score of the hypothesis used to join these two nodes.
                        Set the TreeValue attribute of this new subtree equal to the heuristic score of the hypothesis used to join these two nodes plus the TreeValue of NUC plus the TreeValue of OTHER.
                        Add this new subtree as the first element of REMAININGRSTSUBTREES.
                        ConstructTree (REMAININGBAGS, REMAININGRSTSUBTREES).
                        If the desired number of trees have been constructed, then return.
                    End If // End processing the subtrees whose promotions satisfy ONEELEMENT.
                Do the next element in this bag until there are no elements left to do.
            Else // the projections are not found; this bag can therefore not apply in any subsequent permutation
                Remove ONEBAG from COPYHYPOTHESES
            End If
        Do the next bag in COPYHYPOTHESES until there are no bags left to do. // End the processing of the remaining bags.
    Else // HYPOTHESES is empty.
        Return.
    End If
End Function ConstructTree

Figure 99 Pseudo-code for constructing RST trees
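For readers who prefer running code, the control flow of Figure 99 can be approximated by the Python sketch below. It is a simplified reading of the pseudo-code, not RASTA's implementation. All identifiers (Hyp, Node, find, posited, construct_trees) are hypothetical, the test for crossing lines is replaced by the assumption that only adjacent text spans may be joined, and a failed join skips to the next candidate rather than returning. The input data are the hypothesized relations and bags of the worked example in section 7.6.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Hyp:
    name: str
    nucleus: int
    other: int          # satellite or co-nucleus
    symmetric: bool
    score: int


@dataclass(frozen=True)
class Node:
    start: int                  # first clause covered
    end: int                    # last clause covered
    promotion: FrozenSet[int]
    tree_value: int
    label: str


def leaf(i: int) -> Node:
    return Node(i, i, frozenset({i}), 0, str(i))


def find(subtrees: List[Node], clause: int) -> Optional[Node]:
    """Return the subtree whose promotion set contains the clause, if any."""
    return next((n for n in subtrees if clause in n.promotion), None)


def posited(h: Hyp, a: int, b: int, all_hyps: List[Hyp]) -> bool:
    """Was relation h.name hypothesized with a as nucleus and b as other?"""
    return any(
        g.name == h.name
        and ((g.nucleus, g.other) == (a, b)
             or (h.symmetric and (g.nucleus, g.other) == (b, a)))
        for g in all_hyps)


def construct_trees(bags, subtrees, all_hyps, found, limit):
    if len(found) >= limit:
        return
    if len(subtrees) == 1:
        if subtrees[0] not in found:
            found.append(subtrees[0])
        return
    live = list(bags)                                   # COPYHYPOTHESES
    for bag in list(live):
        remaining = [b for b in live if b is not bag]   # REMAININGBAGS
        a, b = find(subtrees, bag[0].nucleus), find(subtrees, bag[0].other)
        if a is None or b is None or a is b:
            live.remove(bag)    # projections not found: prune this bag
            continue
        if a.end + 1 != b.start and b.end + 1 != a.start:
            continue            # assumption: only adjacent spans may be joined
        for hyp in bag:         # highest scored hypothesis first
            nuc, other = find(subtrees, hyp.nucleus), find(subtrees, hyp.other)
            if not all(posited(hyp, x, y, all_hyps)
                       for x in nuc.promotion for y in other.promotion):
                continue        # condition (4) of Figure 99 fails
            promo = (nuc.promotion | other.promotion
                     if hyp.symmetric else nuc.promotion)
            joined = Node(min(a.start, b.start), max(a.end, b.end), promo,
                          hyp.score + nuc.tree_value + other.tree_value,
                          f"{hyp.name}({nuc.label}, {other.label})")
            rest = [n for n in subtrees if n is not a and n is not b]
            construct_trees(remaining, [joined] + rest, all_hyps, found, limit)
            if len(found) >= limit:
                return


ALL_HYPS = [
    Hyp("ELABORATION", 1, 2, False, 27),
    Hyp("CONTRAST", 1, 3, True, 25),
    Hyp("ELABORATION", 1, 3, False, 20),
    Hyp("CONTRAST", 2, 3, True, 35),
    Hyp("ELABORATION", 3, 4, False, 35),
    Hyp("ASYMMETRIC-CONTRAST", 4, 5, False, 30),
]
BAGS = [[ALL_HYPS[3]], [ALL_HYPS[4]], [ALL_HYPS[5]], [ALL_HYPS[0]],
        [ALL_HYPS[1], ALL_HYPS[2]]]

found: List[Node] = []
construct_trees(BAGS, [leaf(i) for i in range(1, 6)], ALL_HYPS, found, limit=20)
best = max(found, key=lambda n: n.tree_value)
print(best.tree_value, best.label)
```

On this input the best tree found by the sketch has a TreeValue of 127, the score reported for the preferred analysis in section 7.6.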

The trees produced by CONSTRUCTTREE are stored in a list. The

TreeValue attribute of the root node of each tree can be used to evaluate a tree;

since the TreeValue attribute is determined by adding the heuristic scores of the

relations used to construct the tree, a tree constructed by using relations with

high heuristic scores will have a greater TreeValue than a tree constructed by

using relations with low heuristic scores. Ideally, CONSTRUCTTREE ought to

produce highly ranked trees before low ranked ones. Unfortunately,

CONSTRUCTTREE occasionally produces trees out of sequence. To correct this

anomalous situation, the list of trees produced by CONSTRUCTTREE is sorted

according to the TreeValue attribute of the root node of each tree, to ensure that

a tree judged by an RST analyst to be the preferred analysis for the text occurs

as the top ranked tree, with alternative plausible analyses also occurring near

the top of the sorted list.

Why then does CONSTRUCTTREE occasionally produce trees out of

sequence? Consider the following hypothetical example of seven relations

grouped into three bags: bag A, containing the three relations {a1 a2 a3}, bag B,

containing {b1 b2}, and bag C, containing {c1 c2}. These bags are illustrated in

Table 1. Heuristic scores associated with the relations are given in parentheses.

A          B          C
a1 (15)    b1 (10)    c1 (5)
a2 (7)     b2 (5)     c2 (3)
a3 (2)

Table 1 Three bags of hypotheses

Given these hypothesized relations, RASTA would first order the bags into the

list {A, B, C}, and would then apply relations from each bag as illustrated in

Table 2.

Iteration   Hypotheses
1           a1 (15)    b1 (10)    c1 (5)
2           a1 (15)    b1 (10)    c2 (3)
3           a1 (15)    b2 (5)     c1 (5)
4           a1 (15)    b2 (5)     c2 (3)
5           a2 (7)     b1 (10)    c1 (5)
6           a2 (7)     b1 (10)    c2 (3)
7           a2 (7)     b2 (5)     c1 (5)
8           a2 (7)     b2 (5)     c2 (3)
…           …          …          …

Table 2 Permutations of hypotheses

As Table 2 shows, during the fifth and sixth iterations, CONSTRUCTTREE

applies relation a2, whose heuristic score is 7, before relation b1, whose

heuristic score is 10. Applying the relations in this order could lead to an RST

tree with a lower overall score than one that would be produced by applying

relation b1 before relation a2. This might suggest that a more elaborate method

is needed for permuting the relations in order to ensure that trees are never

produced out of sequence. In practice, however, producing trees by applying

hypotheses out of sequence does not create problems, for the following reasons:

1. Applying the hypotheses out of sequence often does not lead to

well-formed RST trees in any case.

2. The final list of valid RST trees is sorted according to the TreeValue

attribute of the root node, thus ensuring that the trees are correctly

ordered.

3. The primary focus of this research is on the top-ranked RST tree.

The exact ordering of other trees is not of paramount importance.

The heuristic scores of the hypothesized relations are continually

being adjusted (section 7.5.5) to ensure that the preferred analysis is

produced within the desired number of trees. If the preferred

analysis occurs within the desired number of trees and the optimal

heuristic scores are used, the preferred analysis will percolate to the

top of the list during the sorting phase mentioned in (2).
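The enumeration order of Table 2, and the effect of the sorting step in (2), can be reproduced with a few lines of Python. The sketch makes a strong simplifying assumption: every combination is treated as if it yielded a well-formed tree whose TreeValue is the sum of the three heuristic scores. The bag contents are the hypothetical ones of Table 1.

```python
from itertools import product

# Bags from Table 1: (relation, heuristic score), best hypothesis first.
A = [("a1", 15), ("a2", 7), ("a3", 2)]
B = [("b1", 10), ("b2", 5)]
C = [("c1", 5), ("c2", 3)]


def total(row):
    """Sum of the heuristic scores in one combination (assumed TreeValue)."""
    return sum(score for _, score in row)


# product() visits the combinations in the same order as Table 2.
rows = list(product(A, B, C))
totals = [total(row) for row in rows]
print(totals)

# Sorting afterwards restores the ranking by overall score.
ranked = sorted(rows, key=total, reverse=True)
print([total(row) for row in ranked])
```

In this simplified setting the ninth combination (a3, b1, c1) scores higher than the eighth (a2, b2, c2), so the generation order is not monotonic, while the sorted list is.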

7.5.4 Produce n-ary branching trees

The trees produced by CONSTRUCTTREE (section 7.5.3) are binary-

branching, whereas (with the exception of Matthiessen and Thompson (1988))

n-ary branching trees have generally been proposed in the RST literature. The

fact that n-ary branching trees are preferred over binary-branching ones in the

RST literature is not the only motivation for producing n-ary branching trees –

preliminary experiments suggest that n-ary branching trees are more useful for

producing summaries of texts by a method that prunes satellites in an RST tree.

For research purposes, RASTA produces as many binary-branching trees as

desired, and then transforms the top ranked tree into an n-ary branching tree.

The n-ary branching tree is derived from a binary-branching tree by means of a

simple tree traversal.

The derivation of an n-ary branching tree from a binary-branching one

relies crucially on the notion of nuclearity (section 4.2). For example, if two

text spans a and b are in a symmetric relation R to form an RST node N, then

the promotion set of the resulting node consists of the union of the promotion

set of a and the promotion set of b (section 7.5.1). A node c can only be in

relation R to the node N created by a and b if the relation R can be plausibly

hypothesized to hold between all members of the promotion set of N and all

members of the promotion set of c. This then amounts to saying that relation R

holds between any two members of the set that results from taking the union of

the promotion sets of a, b, and c. This is illustrated in Figure 100. A structure

with the form illustrated by Figure 100 (a) will be converted into a structure of

the form illustrated by Figure 100 (b), a structure like Figure 100 (c) will be

converted into a structure like Figure 100 (d) and a structure like Figure 100 (e)

will be converted into a structure like Figure 100 (f).

Figure 100 Corresponding binary and n-ary branching symmetric RST

trees

The case of n-ary branching asymmetric relations is similar to that of

symmetric n-ary branching relations, except that binary-branching asymmetric

relations can be transformed into n-ary branching ones irrespective of the

relation that holds between the nucleus and the satellite. Figure 101 illustrates

transformations of binary-branching asymmetric RST trees into n-ary branching

RST trees. In Figure 101 (a), for example, node 1 is in a Circumstance relation

with the text span covering nodes 2 through 3 if and only if node 1 is in a

Circumstance relation with node 3, since node 3 is the single member of the

promotion set of that text span. Since both node 1 and node 2 are in a

dependency relationship to node 3, the

binary-branching tree can be transformed into an n-ary branching structure like

that in Figure 101 (b). Similarly, the structure represented in Figure 101 (c) can

be transformed into the n-ary branching structure represented in Figure 101 (d).

Figure 101 Corresponding binary and n-ary branching asymmetric RST

trees

Figure 102 illustrates the transformation of a complex tree involving

symmetric and asymmetric relations.

Figure 102 Corresponding binary and n-ary branching complex RST trees

The function BinaryToNaryTree performs a depth-first traversal of an

RST tree, converting binary-branching structures to n-ary branching ones as it

returns to the root node. The pseudo-code for BinaryToNaryTree is given in

Figure 103.

Function BinaryToNaryTree (CURRENTNODE)
Begin Function BinaryToNaryTree
    If CURRENTNODE is not a terminal node
        BinaryToNaryTree (Nucleus(CURRENTNODE)) // Process the subtree inside the nucleus
        If CurrentNode has a co-nucleus
            BinaryToNaryTree (CoNucleus). // Process the subtree inside the co-nucleus
            If CURRENTNODE is the root of a structure like those illustrated in Figure 100 (a), (c) or (e) then
                Transform the structure into its n-ary branching counterpart.
            End If
        Else If CURRENTNODE has a satellite
            BinaryToNaryTree (Satellite); // Process the subtree inside the satellite.
            If CURRENTNODE is the root of a structure like the one illustrated in Figure 101 (a), i.e. a satellite modifying a subtree that contains a satellite modifying a subtree, then
                Transform the structure into its n-ary branching counterpart.
            End If
        End If // Does CurrentNode have a co-nucleus or a satellite?
    Else If CURRENTNODE is a terminal node
        Return.
    End If
End Function BinaryToNaryTree

Figure 103 Pseudo-code for the function BinaryToNaryTree
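A possible rendering of this traversal in Python is sketched below. It rests on assumptions about the shapes in Figure 100 and Figure 101: a symmetric node absorbs any co-nucleus that is itself a symmetric node with the same relation, and an asymmetric node whose nucleus is itself an asymmetric node hands its satellite up to the inner nucleus. The Span class and its field names are hypothetical, the textual order of satellites is not tracked, and the relation joining nodes 2 and 3 in the second example is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Span:
    clause: Optional[int] = None        # set only for terminal nodes
    relation: Optional[str] = None      # relation among co-nuclei, if symmetric
    nuclei: List["Span"] = field(default_factory=list)
    # Each satellite carries its own relation to the nucleus.
    satellites: List[Tuple[str, "Span"]] = field(default_factory=list)


def leaf(i: int) -> Span:
    return Span(clause=i)


def binary_to_nary(node: Span) -> Span:
    if node.clause is not None:
        return node                                   # terminal node
    # Depth-first: convert the children before looking at this node.
    nuclei = [binary_to_nary(n) for n in node.nuclei]
    sats = [(rel, binary_to_nary(s)) for rel, s in node.satellites]
    if len(nuclei) > 1:                               # symmetric node
        flat: List[Span] = []
        for n in nuclei:
            same = (len(n.nuclei) > 1 and n.relation == node.relation
                    and not n.satellites)
            if same:
                flat.extend(n.nuclei)                 # splice in co-nuclei
            else:
                flat.append(n)
        return Span(relation=node.relation, nuclei=flat, satellites=sats)
    head = nuclei[0]                                  # asymmetric node
    if len(head.nuclei) == 1 and head.satellites:
        # The nucleus is itself nucleus-plus-satellites: share one nucleus.
        return Span(nuclei=head.nuclei, satellites=head.satellites + sats)
    return Span(nuclei=[head], satellites=sats)


# Figure 96: SEQUENCE(1, SEQUENCE(CIRCUMSTANCE(2, 3), 4))
circ = Span(nuclei=[leaf(2)], satellites=[("CIRCUMSTANCE", leaf(3))])
fig96 = Span(relation="SEQUENCE",
             nuclei=[leaf(1), Span(relation="SEQUENCE", nuclei=[circ, leaf(4)])])
nary = binary_to_nary(fig96)
print(len(nary.nuclei))  # 3

# Shape of Figure 101 (a): node 1 modifies a span whose nucleus is node 3.
inner = Span(nuclei=[leaf(3)], satellites=[("ELABORATION", leaf(2))])
outer = Span(nuclei=[inner], satellites=[("CIRCUMSTANCE", leaf(1))])
hoisted = binary_to_nary(outer)
print(len(hoisted.satellites))  # 2
```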

7.5.5 Learning the heuristic scores

The heuristic scores presented in this study were derived by trial and

modification. The initial values used were based on the author’s intuitions as a

linguist. For example, conjunctions are extremely good discriminators of

particular discourse relations, whereas tense and aspect are weaker

discriminators. These initial heuristic scores were then modified to ensure that

preferred trees occurred at the top of the ranked list of RST trees.

In the course of my research I have been constructing a regression test

set. This is a file containing excerpts from Encarta together with their preferred

RST analyses. New data necessitate changes in RASTA. These new changes can

always be checked to determine whether they would prevent RASTA from

producing the preferred analyses in the regression test set. Very often, new data

suggest a new cue to a discourse relation, or a modification to an existing cue.

Associated with the new or modified cue is a heuristic score, which must be

adjusted until RASTA produces preferred analyses for both the new data and the

members of the regression test set.

Researchers developing grammars have access to annotated corpora

such as the Penn Treebank (Marcus et al. 1993). Such sources provide

externally verified analyses of part of speech and constituency, and are

invaluable for those desiring to evaluate grammars or to train grammars that

involve machine learning or a statistical component. Given a similar corpus of

texts annotated with RST analyses, it ought to be possible to automatically learn

the optimal values for the heuristic scores of the discourse cues. Unfortunately,

no widely available corpora of RST-analyzed texts exist. Hand-tuning in order

to determine the optimal heuristic scores is therefore still necessary.

Although the heuristic scores given in this dissertation suffice to

produce the desired RST analyses, it is possible that the actual scores are not

optimal. Since the heuristic scores guide RASTA to produce better trees first, it

might be possible to find a different set of heuristic scores that causes RASTA to

produce the same preferred analyses but to do so more quickly. An automated

learning algorithm could therefore test different heuristic scores for each cue

against the current regression files in order to determine whether a better set of

scores exists than the one currently in use. Since the space to be searched for a

better set of heuristic scores is so large (for some fifty heuristic scores with

possible values in the range 1-50, there are $50^{50}$ possible vectors to be tested), I

first measured the performance of the current set of heuristic scores. For the

tree that eventually emerged as RASTA’s number one choice, a note was made

of the order in which that tree had originally been produced, for example, the

eventual number one ranked tree was actually the third tree produced. Figure

104 gives the results for one regression test set, consisting of 59 excerpts. The

NthTree row gives the order in which the tree eventually ranked as number one

was produced. The Total Trees row gives the corresponding total number of

well-formed RST trees constructed for the excerpt. RASTA was instructed to

produce up to one thousand trees for each excerpt. For example, the first

excerpt in the regression set yielded twelve RST trees. The fifth tree produced

was the one with the highest overall score, and so when the trees were sorted it

ended up as the number one ranked tree.

NthTree       5  1  1  1  1  1  3  1  1  1  1  3  3  1  1  1  1  1  1 34
Total Trees  12  1  1  1  1  1  3  1  1  1  1  4  3  1  1  1  1  1  1 73

NthTree       1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  1  9
Total Trees   1  1  1  1  1  1  1  2  1  2  1  1  1  1  1  1  1  3  2 14

NthTree       1  1  1  1  3  1  1  1  1  1  1  1  1  1  1  1  1  1  1
Total Trees   1  1  1  2 12  1  1  1  1  1  1  3  1  1  1  1  1  1  1

Figure 104 Rankings of RST trees

For this regression set, I calculated some simple measures of the

performance of RASTA. The average value of the NthTree attribute was 1.92,

with a standard deviation of 4.39. With the exception of a single pathological

case (NthTree = 34), the number one ranked tree was produced within the first

nine trees.
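These summary figures can be recomputed directly from the NthTree row of Figure 104. The sketch below uses Python's standard statistics module; the variable name nth_tree is an arbitrary choice.

```python
from statistics import mean, pstdev

# NthTree values for the 59 excerpts, transcribed from Figure 104.
nth_tree = [
    5, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 34,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 9,
    1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
]

print(len(nth_tree))               # 59
print(round(mean(nth_tree), 2))    # 1.92
print(round(pstdev(nth_tree), 2))  # 4.39
print(sorted(nth_tree)[-2:])       # [9, 34]
```

The reported standard deviation of 4.39 corresponds to the population form (pstdev); the sample form gives a slightly larger value.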

The most common result of this experiment was that there was only a

single tree possible. Producing the number one ranked tree first when there is

only one possible tree might not appear to be an interesting result. In fact, this

result represents a ringing endorsement of RASTA. The number of possible trees

grows in proportion to the number of relations that RASTA hypothesizes. One

research goal is therefore to impose stringent conditions so that relations are

hypothesized where appropriate but not hypothesized too liberally, i.e. the

search for plausible analyses is often best constrained at the very earliest stages

of processing, when relations between nodes are being hypothesized.

Although there might exist a set of heuristic scores that would cause

RASTA to produce the same number one ranked trees but to do so more quickly,

the search for those numbers, even by automated means, is likely to produce

only marginal improvements in the system. This is perhaps not surprising. As a

linguist identifying cues to discourse structure, the author is able to rely on

intuitions as both a linguist and a native speaker of English to determine likely

ranges of values for the heuristic score associated with a cue. For example,

conjunctions associated with an RST relation are intuitively strong indicators.

We can therefore test a high initial value, for example 25, and determine the

effects. In some cases, the heuristic score of the new hypothesis or the scores of

existing hypotheses might have to be modified in order to achieve the preferred

analysis for the excerpt that motivated the new cue. The new cue and its

associated heuristic score are then tested to ensure that the interaction of the

new cue with other cues in the system does not cause any texts that were

previously analyzed correctly to now be analyzed incorrectly, i.e. to ensure that

no texts in the regression set are affected. While the search space within which

an ideal set of heuristic scores might be located is incredibly large, in practice

the space to be searched during the development of RASTA is extremely small,

since the space is constrained by the linguist’s intuitions. Since the automated

search for an optimal set of heuristic scores promises only marginal

improvements, it is perhaps best left for future research, once a great number of

RST analyses have been constructed and verified as appropriate for training a

learning algorithm.

7.6 Worked example

Let us now turn to a close examination of the operation of RASTA by means of

a worked example. The text in Figure 105 forms the basis of the worked

example.






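1. The aardwolf is classified as Proteles cristatus.

2. It is usually placed in the hyena family, Hyaenidae.

3. Some experts however, place the aardwolf in a separate family, Protelidae, because of certain anatomical differences between the aardwolf and the hyena.

4. For example, the aardwolf has five toes on its forefeet,

5. whereas the hyena has four.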

Figure 105 Aardwolf

The syntactic analyses and logical forms produced for the sentences in

this excerpt are given in Figure 106 to Figure 109.

Figure 106 Analysis of the first sentence

Figure 107 Analysis of the second sentence

Figure 108 Analysis of the third sentence

Figure 109 Analysis of the fourth sentence

The node labelled DUMMY in the logical form given in Figure 109

represents the unresolved elliptical head of an NP. Verb phrase anaphora is

handled well in MEG; NP anaphora however is still under development.

These syntactic parses contain a few minor errors. In Figure 108, for

example, the phrase because of certain anatomical differences between the

aardwolf and the hyena ought to be dependent on the VP whose head is the

verb place. In Figure 107, the logical form label MANR is not correct. Finally,

the classification Proteles cristatus ought to have been identified as a single

constituent, namely a noun-noun compound. Despite these minor errors in the

syntax and logical form, RASTA is still able to posit plausible representations of

the structure of this excerpt.

Given these parses and logical forms, RASTA is able to identify the five

clauses given in Figure 105 as being terminal nodes in an RST analysis. RASTA

then examines all pairs of clauses in this excerpt to produce the hypothesized

discourse relations given in Figure 111. These hypothesized relations are then

grouped into bags of mutually exclusive relations. Relations within bags are

ranked in descending order of their heuristic score. Finally, the bags are ranked

in descending order according to the heuristic score of their initial elements.

For this example, the hypothesized relations 2 and 3, which concern relations

joining clauses 1 and 3, are grouped together into a single bag. Other pairs of

clauses yielded only a single hypothesized relation or no hypothesized

relations. The bags are given in Figure 110. Note that the relative order of Bag1

and Bag2 is arbitrary, since both bags contain a single hypothesized relation

with a score of 35.

Bag # Relation number (from Figure 111) and score

1 4: Score = 35

2 5: Score = 35

3 6: Score = 30

4 1: Score = 27

5 2: Score = 25; 3: Score = 20

Figure 110 Bags for the excerpt

#  Name                  Clauses  Cues and bases for cues                           Total
1  ELABORATION           1, 2     H25a: Usually in Clause2.                         27
                                  H25: The clauses are not coordinated and they
                                  exhibit subject continuity since it is
                                  coreferential with The aardwolf.
2  CONTRAST              1, 3     H4: However in Clause3.                           25
3  ELABORATION           1, 3     H38: The syntactic subject of Clause3 is          20
                                  modified by some.
                                  H25: Clause1 is passive and the Dobj of Clause1
                                  has the same head as the Dobj of Clause3
                                  (aardwolf).
4  CONTRAST              2, 3     H39: The two clauses have the same main verb.     35
                                  H4: Clause3 contains however.
5  ELABORATION           3, 4     H24: Clause4 contains for example and is the      35
                                  sentence immediately following Clause3.
6  ASYMMETRIC-CONTRAST   4, 5     H20: Clause5 contains whereas.                    30

Figure 111 Hypothesized relations for the excerpt

RASTA occasionally makes reference to the original list of hypothesized

relations. This original list is called ORIGINALHYPOTHS in the discussion

below.

Each of the clauses identified as a terminal node initially has a single

projection, the clause itself. RASTA thus begins with the terminal nodes given in

Figure 112. (In the diagrams in this section, projections are written in curly

braces.) This list of nodes is referred to below as RSTNODES.

Figure 112 Terminal nodes and initial projections

RASTA begins with bag 1, and attempts to apply the first hypothesized

relation, relation 4. This relation specifies a CONTRAST relation between clause

2, It is usually placed in the hyena family, Hyaenidae, and clause 3, Some

experts however, place the aardwolf in a separate family, Protelidae, because

of certain anatomical differences between the aardwolf and the hyena. RASTA

searches RSTNODES for a node whose projections include clause 2 and a node

whose projections include clause 3. RASTA finds these two nodes. RASTA

removes the nodes from RSTNODES, and combines them to form a new node

covering clauses 2 and 3, and adds this new node back into RSTNODES.

RSTNODES now contains the elements given in Figure 113.

Figure 113 Contents of RSTNODES after applying hypothesis 4

RASTA now permutes the other bags, i.e. bags 2, 3, 4, 5. In the first

permutation, the first bag is bag 2. RASTA attempts to apply the first

hypothesized relation in bag 2, hypothesis 5, which specifies an ELABORATION

relation with clause 3, Some experts however, place the aardwolf in a separate

family, Protelidae, because of certain anatomical differences between the

aardwolf and the hyena, as the nucleus and clause 4, For example, the aardwolf

has five toes on its forefeet, as the satellite. RASTA searches in RSTNODES for a

node whose projections include clause 3 and a node whose projections include

clause 4. Nodes with these projections are found in RSTNODES. The node

whose projections include clause 3, the CONTRAST node resulting from the

application of the first hypothesis in bag 1, also includes clause 2 in its

projections. RASTA can only attach clause 4 as a satellite of this node if

ORIGINALHYPOTHS includes an ELABORATION relation with clause 2 as a

nucleus and clause 4 as a satellite. Since no such relation was hypothesized, it

does not occur in ORIGINALHYPOTHS. RASTA is therefore unable to attach

clause 4 as a satellite of this node.
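The test that blocks this attachment can be written as a small predicate over ORIGINALHYPOTHS. The sketch below covers only asymmetric relations, represents each hypothesis as a (relation, nucleus, other) triple taken from Figure 111, and uses the hypothetical function name can_attach.

```python
# (relation, nucleus or first clause, satellite or second clause), Figure 111.
ORIGINAL_HYPOTHS = {
    ("ELABORATION", 1, 2),
    ("CONTRAST", 1, 3),
    ("ELABORATION", 1, 3),
    ("CONTRAST", 2, 3),
    ("ELABORATION", 3, 4),
    ("ASYMMETRIC-CONTRAST", 4, 5),
}


def can_attach(relation, nucleus_projections, satellite_projections):
    """True if the relation was posited from every projection of the nucleus
    node to every projection of the satellite node."""
    return all((relation, n, s) in ORIGINAL_HYPOTHS
               for n in nucleus_projections
               for s in satellite_projections)


# Clause 4 as satellite of the CONTRAST node with projections {2, 3}:
blocked = can_attach("ELABORATION", {2, 3}, {4})
# The same CONTRAST node as satellite of clause 1:
allowed = can_attach("ELABORATION", {1}, {2, 3})
print(blocked, allowed)  # False True
```

The first call fails because no ELABORATION relation was hypothesized with clause 2 as nucleus and clause 4 as satellite.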

If bag 2 contained more hypothesized relations, RASTA would at this

stage move on to consider them. Since bag 2 only contains a single relation,

RASTA has completed processing of the current bag and moves on to bag 3.

The first hypothesized relation in bag 3, relation 6, specifies an

ASYMMETRICCONTRAST relation, with clause 4, For example, the aardwolf has

five toes on its forefeet, as the nucleus and clause 5, whereas the hyena has

four, as the satellite. RASTA finds nodes whose projections include these two

clauses and creates a new node covering clauses 4 and 5, as illustrated in Figure

114.


RASTA now permutes the other bags, i.e. bags 2, 4, 5. In the first

permutation, the first bag is bag 2. As noted above, bag 2 contains a single

hypothesized relation that cannot be applied, despite the presence of the

projections specified by the relation. RASTA therefore moves on to bag 4,

applying relation 1. Relation 1 specifies an ELABORATION relation with clause

1, The aardwolf is classified as Proteles cristatus, as the nucleus and clause 2,

It is usually placed in the hyena family, Hyaenidae, as the satellite. Nodes with

the requisite projections are found. Clause 2 occurs in a node with another

projection, clause 3. Since ORIGINALHYPOTHS contains an ELABORATION

relation, with clause 1 as the nucleus and clause 3 as the satellite, RASTA

constructs a new node covering clauses 1 through 3, as illustrated in Figure

115.


RASTA now permutes the other bags, i.e. bags 2 and 5. In the first

permutation, the first bag is bag 2. In RSTNODES, RASTA is unable to find the

two projections that the hypothesized relations in bag 2 cover, namely clauses 3

and 4. RASTA therefore prunes all nodes in the search space that follows from

the current permutation by removing bag 2 from further consideration. In this

particular example, bag 2 contains a single hypothesis and the removal of bag 2

leaves only a single bag, bag 5. Frequently,[15] however, a bag is removed and

several bags remain. One of these remaining bags is removed and so on, with

the result that the search space is considerably reduced. Measurements of

RASTA’s execution indicate that pruning the search space reduces the number of

passes through the loop that moves from one bag to the next by approximately

one third.

[15] In cases where a trace of the program execution would span too many pages for any human reader to endure.

RASTA now moves on to consider bag 5. As with bag 2, RASTA is not

able to find both projections specified by the hypothesized relations in bag 5.

RASTA therefore removes bag 5 from further consideration. Since no bags now

remain, RASTA backs up to the point illustrated in Figure 114, and continues

processing. Eventually, after RASTA has pursued other dead ends, RSTNODES

contains the two nodes illustrated in Figure 116.

Figure 116 Contents of RSTNODES after further processing

RASTA then attempts to apply hypothesized relation 1 from bag 4. This

relation specifies an ELABORATION relation with clause 1, The aardwolf is

classified as Proteles cristatus, as the head and clause 2, It is usually placed in

the hyena family, Hyaenidae, as a satellite. Both clause 1 and clause 2 are

available in the projections of nodes in RSTNODES. Clause 2 occurs as the

projection of a node whose projections also include clause 3, Some experts

however, place the aardwolf in a separate family, Protelidae, because of

certain anatomical differences between the aardwolf and the hyena. Because

ORIGINALHYPOTHS also includes an ELABORATION relation with clause 1 as the

nucleus and clause 3 as the satellite, RASTA joins clause 1 and the CONTRAST

node that covers clauses 2 through 5. RSTNODES now contains a single node.

This single node is an RST tree covering clauses 1 through 5, as illustrated in

Figure 117.

Figure 117 First complete RST tree for Aardwolf excerpt

As a final stage, RASTA converts the binary-branching tree to an n-ary

branching tree. For this particular tree, the result of this conversion is a tree

with exactly the same form as the tree in Figure 117.

The tree produced first for this excerpt happens to be the one that I

consider to be the preferred analysis for this text. If left to run, however, RASTA

produces other analyses. The tree given in Figure 117 has an overall score of

127. All the other trees produced by RASTA have scores less than 127. Since

RASTA sorts the trees according to their overall score, no subsequent tree ousts

the tree in Figure 117 from its number one position.

8. RASTA’s contributions to the field

8.1 Introduction

RASTA is a discourse analysis module that efficiently constructs RST

trees to represent the structure of written texts. Having presented in chapters 6

and 7 the processes by which RASTA constructs these representations, let us

turn to a consideration of the practical and theoretical implications of the

approach adopted in this dissertation.

8.2 Identifying rhetorical relations

There is a spectrum of opinion in the discourse literature concerning the

manner in which rhetorical relations might be identified. At one extreme are

those who attempt to identify rhetorical relations primarily on the basis of cue

words and phrases (Sumita et al. 1992; Ono et al. 1994; Kurohashi and Nagao

1994; Marcu 1997a). At the other end of the spectrum are those who make

unrestricted appeals to knowledge extrinsic to the text (Hobbs 1979; Polanyi

1988). An agnostic middle view is articulated by Mann and Thompson (1986),

who concede that the form of a text sometimes correlates with RST relations but

ask:

“Are there other, more subtle attributes of form in text which

might be signaling the relations in the absence of conjunctions

or subordinators? For example, could the sequence of

declarative mood followed by imperative mood be signaling

“solutionhood”…? We doubt that there are such signals

expressed in text form on several grounds… Whatever other

signals there are, they must be derivable from large units of texts

as well as from single sentences, but large units of text can have

very diverse forms. We recognize that such patterns can be

suggestive of a relation, or perhaps restrict the range of possible

relations, but we do not believe that there are undiscovered

signal forms, and we do not believe that text form can ever

provide a definitive basis for describing how relational

propositions can be discerned” (Mann and Thompson 1986:71-

72)

RASTA is motivated by a functional perspective on language: a writer

manipulates elements of form (morphology, syntax, lexical choice) to achieve a

desired effect, including a desired rhetorical effect. Therefore, strewn

throughout a text we can expect to find cues to the writer’s rhetorical

intentions. As Mann and Thompson observe, there might not be a one-to-one

relationship between these cues and rhetorical relations. RASTA’s use of cues

allows for a less direct mapping between elements of form and rhetorical

relations: a given cue can indicate several relations, and a given relation can be

identified by the convergence of a cluster of cues. Furthermore, the notion that

the correspondence between formal cues and rhetorical relations is probabilistic

rather than deterministic is encoded in RASTA in the numerical weights

associated with the cues.
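The many-to-many mapping between cues and relations can be illustrated with a small sketch. The cue names and weights below are invented for illustration, and summing the weights of converging cues is an assumption about how the evidence is combined; the actual cues and their scores are those listed in chapter 6.

```python
from collections import defaultdict

# Hypothetical cue table: a cue may point to several relations,
# each with its own numerical weight.
CUES = {
    "cue_phrase_for_example": {"ELABORATION": 35},
    "preposed_participial_clause": {"CIRCUMSTANCE": 20},
    "subject_coreference": {"ELABORATION": 10, "SEQUENCE": 5},
}

def score_relations(observed_cues):
    """Sum the weights of all observed cues per relation (assumed combination)."""
    scores = defaultdict(int)
    for cue in observed_cues:
        for relation, weight in CUES.get(cue, {}).items():
            scores[relation] += weight
    # Most strongly supported hypothesis first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked = score_relations(["cue_phrase_for_example", "subject_coreference"])
```

Here two cues converge on ELABORATION, while one of them also lends weaker support to a second relation, which is the sense in which the correspondence is probabilistic rather than deterministic.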

RASTA draws on a rich syntactic analysis in considering formal cues,

giving it an advantage over more superficial analyses of texts (e.g. Sumita et al.

1992; Ono et al. 1994; Marcu 1997a). The identification of terminal nodes for

an RST analysis, for example, which proves so difficult for a simple pattern-

matching method (section 4.2.6), is a simple affair given a syntactic analysis

and criteria based on that analysis (section 7.3). Similarly, confusion about

whether to analyze some strings as cue phrases treated as single lexical items or

as segments with internal constituency is resolved by a full syntactic analysis

(section 4.2.6). Finally, this rich syntactic analysis allows RASTA to make

surprisingly powerful deductions about syntactic correlates of rhetorical

relations. For example, a detached participial clause is usually in a

CIRCUMSTANCE relation to the main clause if it is preposed (cue H13, section

6.7.3), as illustrated in example (1), but in a RESULT relation if postposed (cue

H22, section 6.7.12), as illustrated in example (2).

(1) Leaving port on October 19 and 20, Villeneuve’s fleet was

intercepted by Nelson’s fleet on the morning of October 21.

(Trafalgar, Battle of)

(2) This bold strategy created confusion, giving the British fleet an

advantage. (Trafalgar, Battle of)
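The positional contrast between examples (1) and (2) can be stated as a simple rule. The sketch below is hypothetical: it assumes that the character offsets of the participial clause and the main clause are already known from the syntactic analysis, and it returns only the relation that the position suggests, not a final decision.

```python
def participial_relation(participial_start, main_start):
    """Return the relation suggested by the position of a detached participial clause."""
    if participial_start < main_start:
        return "CIRCUMSTANCE"   # preposed, as in example (1); cf. cue H13
    return "RESULT"             # postposed, as in example (2); cf. cue H22

# Example (1): the participial clause precedes the main clause.
preposed = participial_relation(participial_start=0, main_start=36)
# Example (2): the participial clause follows the main clause.
postposed = participial_relation(participial_start=37, main_start=0)
```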

During syntactic analysis, MEG makes use of a semantic network to

resolve ambiguous syntactic dependencies (section 5.3.2). RASTA, however,

does not make any additional use of a semantic network or other external form

of knowledge representation (see section 8.3). Rather, RASTA performs its

analyses based solely on its examination of the syntactic portrait and the logical

form. RASTA is thus located at a mid-point in the spectrum mentioned above,

making use of more than cue phrases to identify rhetorical relations, but not

making reference to extrinsic knowledge.

Finally, it must be emphasized that RASTA only needs to discriminate

between a small set of relations. The cues listed in chapter 6 do not constitute

an exhaustive list of the lexical, morphological, and syntactic correlates of the

RST relations, but rather a sufficient set of criteria for distinguishing the

relations.

8.3 Representations of knowledge

In some approaches to identifying rhetorical relations, external

representations of knowledge play a crucial role. Hobbs, for example, envisages

a system that would encode “those things a speaker of English generally knows

and can expect his listener to know” (Hobbs 1979:71). Similarly Polanyi’s

(1988) Linguistic Discourse Model (section 4.5) relies crucially on amorphous

real-world knowledge and unspecified inferential processes. There is no doubt

that people draw on vast resources about events and entities or knowledge of

genre conventions in addition to formal cues to discourse structure. Any

attempt to mimic current understanding of the nature of those additional

resources is, however, likely to involve great computational expense and

complexity. Although MEG contains an enormous semantic network, MINDNET

(section 5.7), RASTA does not make reference to it. A compelling reason for

RASTA to eschew MINDNET is that it is computationally expensive to reason

using such a vast resource. A more theoretical motivation, however, concerns

the nature of rhetorical relations. A writer structures a text to express intended

rhetorical relations. The relations that the writer actually chooses might not be

those that would be said to exist in the abstract. For example, reasoning in the

abstract might suggest that a causal relationship could be hypothesized between

an event of sneezing and an event of dying: sneezing is a means by which fatal

diseases can be communicated, so one person’s sneezing might cause another

person’s death. A writer describing two such events might choose to emphasize

a causal relationship. Alternatively, the writer might choose to represent the

events as merely occurring in a temporal sequence, without emphasizing

causality. Similarly, a writer might choose to indicate a causal relationship

between two events that reasoning with an external representation would not

suggest were causally related. By restricting its analyses to what is motivated

by the text, RASTA avoids spurious reasoning.

In some cases, a simple examination of MINDNET might appear to

suffice to identify a rhetorical relation. Mann and Thompson (1988:273)

observe that in an ELABORATION relation, there is often one of the following

relations between an element in the nucleus and an element in the satellite:

set/member, abstract/instance, whole/part, process/step, object/attribute,

generalization/specific. A whole/part relation holds between clauses 3 and 4 in

Figure 118, as can be deduced by the following chain of reasoning: aardwolves

are animals; animals have bodies; bodies have forefeet; forefeet have toes;

therefore aardwolves have toes. Although a chain of reasoning like this could

easily be performed using MINDNET, the presence of the cue phrase For

example obviates such elaborate reasoning.





3. … between the aardwolf and the hyena

4. For example, the aardwolf has five toes on its forefeet…

Figure 118 Aardwolf

If future research should exhaust the possibilities for identifying

superficial cues to discourse structure, it may well prove necessary to examine

MINDNET to identify unclear rhetorical relations. Should that prove necessary,

it would be desirable to constrain the use of MINDNET, examining it in ways

suggested by the linguistic form.

8.4 Constructing and evaluating trees

As discussed above in sections 4.2.5 and 4.2.6, there are two

major problems for a discourse component attempting to construct RST

representations of the structure of a text:

1. How can combinatorial explosion be avoided? As more and more

relations are posited, the number of well-formed RST trees

compatible with those relations increases exponentially.

2. How can alternative analyses be evaluated?

Marcu (1996, 1997a), the most complete description in the literature of

an algorithm for constructing RST representations, does not address (1).

Concerning (2), Marcu suggests a metric that favors right-branching trees. This

metric, however, only appears to be valid for certain genres (section 4.2.6).

In RASTA, the solution to both (1) and (2) lies in the use of heuristic

scores associated with cues to discourse structure. The relations with the

highest heuristic scores are applied first in an effort to construct an RST tree

(chapter 7). Since better trees are produced first, the algorithm does not need to

produce all possible trees, thus avoiding exponential explosion. Finally, the

metric used in RASTA to evaluate trees is independent of genre—the heuristic

score associated with a tree is computed from the heuristic scores of the

relations used in constructing the tree. “Better” trees are those built from better

hypotheses. Although this metric for evaluating trees is independent of genre,

the methods for determining the discourse relations might vary with genre (see

section 8.5).
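The role of the heuristic scores in both ordering hypotheses and evaluating trees can be sketched as follows. The text states only that a tree's score is computed from the scores of the relations used to build it; taking the sum is an assumption made here for illustration, and the function names and scores are hypothetical.

```python
def order_hypotheses(hypotheses):
    """Return (relation, score) pairs with the highest heuristic score first."""
    return sorted(hypotheses, key=lambda h: h[1], reverse=True)

def tree_score(relations_used):
    """Score a tree from the relations used to build it (assumed here: their sum)."""
    return sum(score for _, score in relations_used)

def rank_trees(trees):
    """Sort candidate trees so that the best-scoring analysis comes first."""
    return sorted(trees, key=tree_score, reverse=True)

# Two hypothetical analyses of the same text, each a list of applied relations.
tree_a = [("ELABORATION", 30), ("CONTRAST", 20)]
tree_b = [("ELABORATION", 30), ("CIRCUMSTANCE", 5)]
best_first = rank_trees([tree_b, tree_a])
```

Because Python's sort is stable, trees with equal scores keep the order in which they were produced, so a later tree displaces an earlier one only if its score is strictly higher.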

8.5 Genre

Despite being limited to a single genre, namely encyclopedia articles,

the work described in this dissertation represents a more explicit and complete

implementation of a discourse processing model than any hitherto described in

the literature.

The particular set of relations used in this study (section 3.6) was

motivated by the encyclopedia genre. As noted in section 3.6, articles in

Encarta are primarily concerned with the organization of information

according to ideational and textual relations subordinated to an overarching

speech act like DESCRIBE or EXPLAIN. For other genres, a slightly different set

of relations might be needed. The effectiveness of the techniques employed in

RASTA for identifying rhetorical relations leads us to expect that linguistic cues

could be identified for similar relations in other genres. The methods for

constructing an RST tree, given a set of hypothesized relations, are not

constrained in any way by genre.

Perhaps the biggest stumbling block to applying RASTA to other genres

is the potential clash discussed in section 3.6 between intentional and

informational representations of a text (Ford 1986; Moore and Pollack 1992).

This obstacle could perhaps be overcome by constructing alternative analyses.

From the perspective of Systemic Functional Grammar (Halliday 1985), for

example, a text could be said to have both informational and interpersonal

aspects. These different aspects could be modeled independently. In many cases

the resulting analyses would converge in identifying discourse constituents.

Systematic divergences would constitute a rich area for future study on the

nature of discourse.

Finally, in this study only small excerpts have been considered,

typically a paragraph in size or smaller. This limitation has been for

performance reasons and because it is easier for a human analyst to verify the

analyses produced. It is claimed within RST that the same analytical framework

can be applied equally well to excerpts of one or two sentences as to much

larger excerpts.

9. Potential Applications for RASTA

9.1 Introduction

Given the ability to automatically construct plausible representations of

discourse structure, many exciting areas of research become possible. In this

chapter I make brief mention of a few areas: text summarization, the creation of

semantic networks, information retrieval, and the quantitative analysis of

discourse patterns.

9.2 Text summarization

Mann and Thompson (1988:266-268), in discussing the notion of

nuclearity, note that deletion of nuclear text spans will tend to make a text

incoherent, but deletion of satellite text spans does not result in such

incoherence. They claim

“If units that only function as satellites and never as nuclei are

deleted, we should still have a coherent text with a message

resembling that of the original; it should be something like a

synopsis of the original text. If, however, we delete all units that

function as nuclei anywhere in the text, the result should be

incoherent and the central message difficult or impossible to

comprehend.” (Mann and Thompson 1988:267-268)

Human-generated summaries frequently involve not only the deletion of

material but also a reformulation of the content, paraphrasing the output to

maintain coherence. Such reformulation and paraphrasing are difficult to

implement within computer natural language generation systems. A method

that would produce a reasonable, coherent synopsis of a text simply by omitting

less nucleic material therefore holds considerable appeal. Ono et al. (1994)

sketchily describe an RST-based summarization method whose base level units

are sentences. Marcu (1997b) provides a full description of a text summarizer

that consists of a simple tree-traversal algorithm that prunes nodes from an RST

tree. Nodes that are not pruned constitute the summary. Marcu claims that his

method has a granularity at the level of the clause, although in fact many of his

terminal nodes are not clauses (see section 4.2.6). The algorithm that Marcu

describes would work equally well on RST trees that did not have any non-

clausal terminal nodes. In particular, Marcu’s algorithm would work well with

the output of RASTA.
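The deletion-based idea of Mann and Thompson can be sketched as a tree traversal that omits satellites. This is a simplified illustration, not Marcu's algorithm: the `Node` class and `summarize` function are hypothetical, and the flat tree used for the conjunctivitis excerpt (Figure 120) is a simplification of the actual RST analysis.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str = ""            # non-empty only for terminal nodes
    role: str = "nucleus"     # "nucleus" or "satellite"
    children: list = field(default_factory=list)

def summarize(node, max_satellite_depth=0, depth=0):
    """Return terminal texts reached through at most max_satellite_depth satellites."""
    if node.role == "satellite":
        depth += 1
    if depth > max_satellite_depth:
        return []             # prune this satellite and everything beneath it
    if not node.children:
        return [node.text]
    kept = []
    for child in node.children:
        kept.extend(summarize(child, max_satellite_depth, depth))
    return kept

tree = Node(children=[
    Node("The acute form of conjunctivitis is commonly called pinkeye."),
    Node("It can be caused by either bacterial or viral infection", role="satellite"),
    Node("and is often epidemic.", role="satellite"),
    Node("In newborn babies it may result from several kinds of cocci", role="satellite"),
])
synopsis = summarize(tree)
fuller = summarize(tree, max_satellite_depth=1)
```

Raising `max_satellite_depth` admits progressively less nucleic material, which gives a crude way to control the length of the synopsis.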

I am currently experimenting with a prototype summarizer that

manipulates the output of RASTA. This prototype summarizer performs a tree

traversal in the same manner as Marcu’s algorithm, but instead of deleting

nodes, it presents nodes in a nested form. The output is presented in a hypertext

format, allowing a reader to selectively expand nodes to yield more detail.

Figure 119 illustrates a hypertext view within the Microsoft Word 97

word processor of the Abd-ar-Rahman excerpt discussed in section 7.5.1. In

Figure 119, the reader has decided to expand the text subordinate to the third of

the narrative clauses, His army met the Franks…. The plus sign beside a node

indicates that a node can be expanded, i.e. that the node has one or more

satellites. A minus sign indicates that a node cannot be expanded, either

because its satellites are already displayed or because it does not have any

satellites. In this example, there are no instances of satellites with the same

relation to a nucleus being grouped together.

Figure 119 Hypertext view of Abd-ar-Rahman text

Satellite nodes that are in the same rhetorical relation to a nucleus could

be grouped together, thereby imposing additional structure on the output.

Figure 120 gives the RST analysis for a small excerpt.

1. The acute form of conjunctivitis is commonly called pinkeye.

2. It can be caused by either bacterial or viral infection

3. and is often epidemic.

4. In newborn babies it may result from several kinds of cocci,

especially the gonococcus (gonorrheal conjunctivitis), or from

a strain of the parasitic bacterium Chlamydia trachomatis

(inclusive conjunctivitis).

Figure 120 Conjunctivitis

In the excerpt illustrated in Figure 120, terminal nodes 2 and 4 are both

in a cause relation to node 1. These discontiguous nodes could be grouped and

displayed under a single heading in a hypertext viewer, as illustrated in Figure

121. The information given in square brackets represents a hypertext node that

a reader could click on to expand it and read the text that it represents. The

hypertext node includes a count of the number of clauses beneath it, to give the

reader an indication of how much content lies beneath. (The description “More

information” has been used as a synonym for the technical term

ELABORATION.)

The acute form of conjunctivitis is commonly called pinkeye.

[Causes:2]

[More information:1]

Figure 121 Hypertext view of conjunctivitis text

Clicking on the hypertext node [Causes:2] would cause the text of that

node to be displayed, as illustrated in Figure 122.

The acute form of conjunctivitis is commonly called pinkeye.

[Causes:2]

1. It can be caused by either bacterial or viral

infection

2. In newborn babies it may result from several

kinds of cocci, especially the gonococcus

(gonorrheal conjunctivitis), or from a strain of

the parasitic bacterium Chlamydia trachomatis

(inclusive conjunctivitis).

[More information:1]

Figure 122 Hypertext view of conjunctivitis text

Clearly, these methods of displaying a summary of a text to the user in

such a way as to enable the user to explore in more detail areas of interest

require further research and user interface testing.
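The grouped display of Figure 121 can be sketched as follows. The function name `collapsed_view` and the input format are hypothetical; the display labels are those shown in Figure 121, and the assignment of clause 3 to ELABORATION is inferred from the "More information" node in that figure.

```python
# Display names for relations, as shown in Figure 121.
LABELS = {"CAUSE": "Causes", "ELABORATION": "More information"}

def collapsed_view(nucleus_text, satellites):
    """Build the collapsed hypertext view for one nucleus.

    satellites: list of (relation, clause_text) pairs in text order.
    Satellites in the same relation are grouped under one node with a clause count.
    """
    groups = {}
    for relation, text in satellites:
        groups.setdefault(relation, []).append(text)
    lines = [nucleus_text]
    for relation, texts in groups.items():
        lines.append(f"[{LABELS.get(relation, relation)}:{len(texts)}]")
    return lines

view = collapsed_view(
    "The acute form of conjunctivitis is commonly called pinkeye.",
    [
        ("CAUSE", "It can be caused by either bacterial or viral infection"),
        ("ELABORATION", "and is often epidemic."),
        ("CAUSE", "In newborn babies it may result from several kinds of cocci"),
    ],
)
```

Grouping by relation is what brings the discontiguous clauses 2 and 4 together under a single heading.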

9.3 The creation of semantic networks

A semantic network like MINDNET (section 5) can be constructed by

automatically extracting information from single sentences in a lexicon (see for

example Jensen and Binot 1987; Klavans et al. 1993; Montemagni and

Vanderwende 1993; Dolan 1995; Dolan et al. 1993; Richardson 1997;

Vanderwende 1995a, 1995b). By applying domain-specific rules to extract

information from a discourse representation, similar information could be

extracted from extended text. For example, in a description of an animal in an

article in Encarta 96 there is often a section which lists the body parts of the

animal, a section on reproduction and a section on the animal’s life cycle.

9.4 Information retrieval

The field of information retrieval is dominated by statistical techniques,

which rate documents according to how closely the words in them match words

in a search query. Frequently, however, documents selected by these statistical

techniques as containing relevant words with statistically interesting

frequencies are not about the topic described by the search query. These ratings

could be improved by biasing the statistical weighting in favor of material

which occurs in more nucleic sections of text, since nucleic material is most

centrally involved in realizing the writer’s communicative goals (section 3.2).
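One way of biasing term weights toward nucleic material is sketched below. The discount scheme is an assumption made purely for illustration: each unit is assumed to carry a count of the satellite links between it and the root of the RST tree, and occurrences of a term are discounted by a fixed factor per link. The function name and the factor of 0.5 are hypothetical.

```python
def weighted_term_count(units, term, satellite_discount=0.5):
    """Count occurrences of term, discounting units that lie deeper in satellites.

    units: list of (text, satellite_depth) pairs; fully nucleic units have depth 0.
    """
    total = 0.0
    for text, depth in units:
        occurrences = text.lower().split().count(term.lower())
        total += occurrences * (satellite_discount ** depth)
    return total

units = [
    ("The aardwolf is classified as Proteles cristatus,", 0),
    ("It is usually placed in the hyena family, Hyaenidae,", 1),
]
nucleic_hit = weighted_term_count(units, "aardwolf")
satellite_hit = weighted_term_count(units, "hyena")
```

A term that occurs in the nucleus contributes its full count, while the same term in a satellite contributes less, so documents that merely mention a query term in passing would be rated lower.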

A reliable discourse processing component could also be used during

the display of the documents returned in response to a search query to highlight

the section of the document which is most relevant to the database query.

Rather than using crude techniques like displaying the text that occurs two or

three lines before and after key terms from the search query, it would be

possible to display the text of the coherent RST subtree that contains those key

terms.
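Selecting the coherent subtree that contains the key terms can be sketched as a search for the lowest node whose text span includes all of them. The `Span` class and both functions are hypothetical, substring matching stands in for whatever term matching a retrieval system would use, and the tree below simply reuses examples (1) and (2) from section 8.2 as illustrative data.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    text: str = ""                      # non-empty only for terminal nodes
    children: list = field(default_factory=list)

def span_text(node):
    """Return the text covered by a node, in left-to-right order."""
    if not node.children:
        return node.text
    return " ".join(span_text(child) for child in node.children)

def smallest_subtree(node, terms):
    """Return the lowest node whose text contains every term, or None."""
    text = span_text(node).lower()
    if not all(term.lower() in text for term in terms):
        return None
    for child in node.children:
        found = smallest_subtree(child, terms)
        if found is not None:
            return found
    return node

tree = Span(children=[
    Span(children=[
        Span("Leaving port on October 19 and 20,"),
        Span("Villeneuve's fleet was intercepted by Nelson's fleet."),
    ]),
    Span(children=[
        Span("This bold strategy created confusion,"),
        Span("giving the British fleet an advantage."),
    ]),
])
hit = smallest_subtree(tree, ["strategy", "advantage"])
```

The node returned covers whole discourse constituents, so the displayed passage begins and ends at rhetorically meaningful boundaries rather than at an arbitrary number of lines around the match.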

9.5 Quantitative analysis of discourse patterns

For the discourse linguist, perhaps the most exciting potential use of a

computational discourse analysis component is to enable further study of

discourse itself. The labor required to perform an RST analysis of a text is a

serious impediment to research that would take RST analyses as the basis for

higher-level generalizations. If this tedium could be relieved by an automated

computational analysis, the linguist would be freed to consider issues such as

the correlation of discourse structure with genre, the frequency of specific

rhetorical relations, depth of embedding, and anaphoric usage, to name but a

handful of potential areas of study.

Quantitative results from such study could be used to improve the

efficiency of the computational discourse analysis component itself. Richardson

(1994) describes a technique for improving the performance of a rule-based

syntactic sentence parser. During training, the parser is run over many

sentences, gathering statistics about the rules that ultimately resulted in good

parses. The parser then incorporates the results of the training session. The

syntactic rules that most often led to good parses are applied first, causing a

demonstrable improvement in the performance of the parser in converging on

good parses. In a similar fashion, RASTA could eventually be trained by

automatically processing many excerpts and gathering statistics about preferred

relations and observing the most common configurations. The information thus

obtained could be used after training to guide RASTA in much the same way as

the intuitively formulated restrictions on “thinking flow” of Sumita et al.

(1992).
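The kind of training envisaged here can be sketched in a few lines. This is a speculative illustration, not a description of Richardson's technique or of any existing RASTA component: the function names are hypothetical, and multiplying a heuristic score by the observed relative frequency of a relation is only one of many ways the gathered statistics might be used.

```python
from collections import Counter

def relation_priors(preferred_analyses):
    """Estimate the relative frequency of each relation in preferred trees.

    preferred_analyses: list of trees, each a list of relation names.
    """
    counts = Counter(rel for tree in preferred_analyses for rel in tree)
    total = sum(counts.values())
    return {rel: n / total for rel, n in counts.items()}

def rescore(hypotheses, priors, floor=0.01):
    """Scale each (relation, heuristic_score) by the relation's observed frequency."""
    scored = [(rel, score * priors.get(rel, floor)) for rel, score in hypotheses]
    return sorted(scored, key=lambda h: h[1], reverse=True)

priors = relation_priors([
    ["ELABORATION", "ELABORATION", "CONTRAST"],
    ["ELABORATION"],
])
reordered = rescore([("CONTRAST", 40), ("ELABORATION", 20)], priors)
```

In this toy example the statistics reverse the order suggested by the raw heuristic scores, so the more frequently preferred relation would be tried first.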

10. Conclusion

This dissertation has described RASTA, a discourse processing

component that computes representations of the structure of written discourse.

RASTA builds on previous research in the field of computational discourse

analysis within the framework of RST, most notably Marcu (1996, 1997a).

RASTA directly addresses the problem of combinatorial explosion—as

more rhetorical relations are hypothesized as connecting two clauses, the

number of well-formed RST analyses for a text increases exponentially. RASTA

manages this combinatorial explosion by assigning heuristic scores to the

relations hypothesized and using those scores to guide it in constructing trees.

More likely hypotheses are applied first in a bottom-up algorithm that links

together contiguous text spans. Those same heuristic scores provide a genre-

independent method for evaluating trees—better trees are those that were

formed by the application of more likely rhetorical relations.

In this dissertation, the actual formal cues used by RASTA to identify

discourse structure have been described. Cue words and phrases are an

important source of information in RASTA, as they would no doubt be in any

discourse analysis component. RASTA is however unusual in the field of

discourse research in the extent to which it is able to recognize cues to

discourse structure by analyzing syntactic analyses and logical forms.

A relatively uncontroversial set of thirteen rhetorical relations has

proven sufficient for the analysis of articles in Encarta. The techniques for

identifying relations would still be applicable, however, if a slightly different

set of rhetorical relations were used. The efficient techniques for constructing

RST trees on the basis of a set of hypothesized rhetorical relations would not

require any modification should a different set of relations be used. The issue

of a suitable taxonomy of discourse relations, so central to work in natural

language generation (see section 3.5), was found to be unimportant for the task

of identifying discourse relations. RASTA is able to reliably discriminate among

the thirteen relations employed using a rudimentary classification of the RST

relations as either symmetric or asymmetric.

For the sake of focusing in depth on issues of efficiency, the research

described here has been limited to the text of Encarta. The search for cues in

the actual text of Encarta, without reference to a semantic network or reference

to models of real world knowledge, has been very successful. In extending

RASTA to other genres, it may prove necessary to use information beyond that

available from the syntactic analysis and logical form. A careful search for cues

in the form of the text in other genres ought however to prove amply

rewarding.

Finally, I have outlined some directions for what is surely developing

into an exciting research area.

Bibliography

ACL = Association for Computational Linguistics

ISI/RS = Information Sciences Institute Report Series

Ballard, D., R. Conrad and R. Longacre. 1971. “The deep and surface

grammar of interclausal relations.” Foundations of Language 4:70-

118.

Dolan, William B. 1995. “Metaphor as an emergent property of machine-

readable dictionaries.” In Proceedings of the AAAI 1995 Spring

Symposium Series, Stanford, California, 1995. 27-32.

Dolan, William, Lucy Vanderwende and Stephen D. Richardson. 1993.

“Automatically deriving structured knowledge bases from on-line

dictionaries.” In Proceedings of the Pacific Association for

Computational Linguistics, April 21-24, 1993, Vancouver, British

Columbia. 5-14.

Ford, Cécilia E. 1986. “Overlapping relations in text structure.” In

DeLancey, Scott and Russell S. Tomlin (eds.), Proceedings of the

Second Annual Meeting of the Pacific Linguistics Conference. 107-

123.

Fox, Barbara A. 1987. Discourse Structure and Anaphora. Cambridge

Studies in Linguistics 48. Cambridge: Cambridge University Press.

Fukumoto, Jun’ichi and Jun’ichi Tsujii. 1994. “Breaking down rhetorical

relations for the purpose of analyzing discourse structures.” COLING

94: The 15th International Conference on Computational Linguistics,

August 5-9, 1994, Kyoto, Japan. Proceedings, vol. 2:1177-1183.

Haiman, John and Sandra A. Thompson, (eds.). 1988. Clause Combining in

Grammar and Discourse. John Benjamins: Amsterdam and

Philadelphia.

Haiman, John. 1980. “The Iconicity of Grammar: Isomorphism and

Motivation.” Language 56:515-540.

Halliday, M.A.K. 1985. An Introduction to Functional Grammar. Edward

Arnold Press: Baltimore.

Halliday, M.A.K. and Ruqaiya Hasan. 1976. Cohesion in English. Longman:

London.

Heidorn, George E. 1972. “Natural language inputs to a simulation

programming system.” Ph.D. dissertation, Yale University. (Also

published as Technical Report NPS-55HD72101A. Naval

Postgraduate School: Monterey.)

Hobbs, J. R. 1979. “Coherence and coreference.” Cognitive Science 3:67-90.

Houghton Mifflin. 1992. The American Heritage Dictionary of the English

Language, Third Edition. Houghton Mifflin: Boston.

Hovy, Eduard H. 1988. Planning Coherent Multisentential Text. ISI/RS-88-

208. Reprinted from Proceedings of the 26th Meeting of the ACL,

Buffalo, New York, 1988.

Hovy, Eduard H. 1990. “Parsimonious and profligate approaches to the

question of discourse structure relations.” In Proceedings of the 5th

International Workshop on Natural Language Generation,

Pittsburgh. 128-136.

Jensen, K. and J.L. Binot. 1987. “Disambiguating prepositional phrase

attachments by using on-line dictionary definitions.” Computational

Linguistics 13:251-260.

Jensen, Karen, George Heidorn and Stephen Richardson (eds.). 1993.

Natural Language Processing: The PLNLP Approach. Kluwer:

Boston/Dordrecht/London.

Klavans, Judith, Martin Chodorow and Nina Wacholder. 1993. “Building a

knowledge base from parsed definitions.” In Jensen, Heidorn and

Richardson (1993). 119-133.

Knott, Alistair and Robert Dale. 1995. “Using linguistic phenomena to

motivate a set of coherence relations.” Discourse Processes 18:35-

62.

Kuhn, H.P. 1958. “The automatic creation of literature abstracts.” IBM

Journal, April 1958:159-165.

Kurohashi, Sadao and Makoto Nagao. 1994. “Automatic detection of

discourse structure by checking surface information in sentences.”

COLING 94: The 15th International Conference on Computational

Linguistics, August 5-9, 1994, Kyoto, Japan. Proceedings, vol.

2:1123-1127.

Labov, William. 1972. Language in the Inner City: Studies in the Black

English Vernacular—Conduct and Communication. Philadelphia:

University of Pennsylvania Press.

Litman, Diane J. and Rebecca J. Passonneau. 1995. "Combining multiple

knowledge sources for discourse segmentation." In Proceedings of

the 33rd Meeting, 26-30 June, Massachusetts Institute of

Technology, Cambridge, Massachusetts, USA. Association for

Computational Linguistics. 108-115.

Longacre, R. 1976. An Anatomy of Speech Notions. Ghent: The Peter de

Ridder Press.

Maier, Elisabeth and Eduard H. Hovy. 1991. “A metafunctionally motivated

taxonomy for discourse structure relations.” ms.

Mann, William C. and Sandra A. Thompson. 1986. “Relational Propositions

in Discourse.” Discourse Processes 9:57-90. Also available as

Information Sciences Institute Research Report 83-115, 4676

Admiralty Way, Marina del Rey, CA 90292-6695.

Mann, William C. and Sandra A. Thompson. 1987. Rhetorical Structure

Theory: A theory of text organization. ISI/RS-87-190.

Mann, William C. and Sandra A. Thompson. 1988. “Rhetorical Structure

Theory: Toward a functional theory of text organization.” Text

8:243-281. (Also published as Mann and Thompson 1987).

Marcu, Daniel. 1996. “Building Up Rhetorical Structure Trees.” In

Proceedings of the Thirteenth National Conference on Artificial

Intelligence, vol. 2. 1069-1074, Portland, Oregon, August 1996.

Marcu, Daniel. 1997a. “The rhetorical parsing of natural language texts.” In

Proceedings of the Thirty-fifth Annual Meeting of the Association for

Computational Linguistics. 96-103.

Marcu, Daniel. 1997b. “From discourse structures to text summaries.” In

Proceedings of the ACL ’97/EACL ’97 Workshop on Intelligent

Scalable Text Summarization, Madrid, Spain, July 11, 1997. 82-88.

Marcus, Mitchell P., Beatrice Santorini and Mary Ann Marcinkiewicz. 1993.

“Building a large annotated corpus of English: The Penn Treebank.”

Computational Linguistics 19:313-330.

Matthiessen, Christian and Thompson, Sandra A. 1988. “The structure of

discourse and ‘subordination’.” In Haiman and Thompson (eds.).

1988:275-329.

McKeown, K.R. 1985. Text Generation: Using Discourse Strategies and

Focus Constraints to Generate Natural Language Text. Cambridge

University Press: Cambridge.

Microsoft Corporation. 1995. Encarta® 96 Encyclopedia. Microsoft:

Redmond.

Montemagni, Simonetta and Lucy Vanderwende. 1993. “Structural patterns

versus string patterns for extracting semantic information from

dictionaries.” In Jensen, Heidorn and Richardson (eds.), 1993. 149-

159.

Moore, Johanna D. and Martha E. Pollack. 1992. “A problem for RST: The

need for multi-level discourse analysis.” Computational Linguistics

18:537-544.

Ono, Kenji, Kazuo Sumita and Seiji Miike. 1994. “Abstract generation

based on rhetorical structure extraction.” COLING 94: The 15th

International Conference on Computational Linguistics, August 5-9,

1994, Kyoto, Japan. Proceedings, vol. 1:344-348.

Pentheroudakis, Joseph and Lucy Vanderwende. 1993. “Automatically

identifying morphological relations in machine-readable

dictionaries.” Making Sense of Words: Ninth Annual Conference for

the UW Centre for the New OED and Text Research, September 27-28,

1993. Oxford, England. 114-131.



Polanyi, Livia. 1982. “Linguistic and social constraints on storytelling.”

Journal of Pragmatics 6:509-524.

Polanyi, Livia. 1988. “A formal model of the structure of discourse.”

Journal of Pragmatics 12:601-638.

Proctor, P. (ed.). 1978. Longman Dictionary of Contemporary English.

London: Longman Group.

Redeker, Gisela. 1990. “Ideational and pragmatic markers of discourse

structure.” Journal of Pragmatics 14:367-381.

Richardson, Stephen D. 1994. “Bootstrapping statistical processing into a

rule-based natural language parser.” In The Balancing Act:

Combining Symbolic and Statistical Approaches to Language.

Proceedings of the Workshop, Las Cruces, New Mexico. 96-103.

Richardson, Stephen D. 1997. Determining Similarity and Inferring

Relations in a Lexical Knowledge Base. Ph.D. dissertation. The City

University of New York.

Richardson, Stephen D., Lucy Vanderwende and William Dolan. 1993.

“Combining dictionary-based methods for natural language

analysis.” In Proceedings of the TMI-93, Kyoto, Japan. 69-79.

Sanders, Ted J.M. 1992. Discourse Structure and Coherence: Aspects of a

Cognitive Theory of Discourse Representation. Lundegem:

Nevelland.

Sanders, Ted J.M. and Carel van Wijk. 1996. “PISA: A Procedure for

analyzing the structure of explanatory texts.” Text 16:91-132.

Sanders, Ted J.M., W.P.M. Spooren and L.G.M. Noordman. 1992. “Toward

a taxonomy of coherence relations.” Discourse Processes 15:1-35.

Sanders, Ted J.M., W.P.M. Spooren and L.G.M. Noordman. 1993.

“Coherence relations in a cognitive theory of discourse

representation.” Cognitive Linguistics 4:93-133.

Sidner, Candace L. 1983. “Focusing and discourse.” Discourse Processes

6:107-130.

Sumita, K., K. Ono, T. Chino, T. Ukita, and S. Amano. 1992. “A discourse

structure analyzer for Japanese text.” In Proceedings of the

International Conference on Fifth Generation Computer Systems,

1992. 1133-1140.

Thompson, Sandra A. 1983. “Grammar and Discourse: The English

Detached Participial Clause.” In Klein-Andreu, Flora (ed.),

Discourse Perspectives on Syntax. Academic Press: New York. 43-65.

Vander Linden, Keith. 1993. Speaking of Actions: Choosing Rhetorical

Status and Grammatical Form in Instructional Text Generation.

Ph.D. dissertation. University of Colorado, Boulder. Published as

Technical Report CU-CS-???-93, University of Colorado, Boulder.

Vanderwende, Lucy H. 1995a. The Analysis of Noun Sequences Using

Semantic Information Extracted from On-Line Dictionaries. Ph.D.

dissertation, Georgetown University.

Vanderwende, Lucy H. 1995b. “Ambiguity in the acquisition of lexical

information.” In Proceedings of the AAAI 1995 Spring Symposium

Series, working notes of the symposium on representation and

acquisition of lexical knowledge, 174-179.

Wu, Horng Jyh P. and Steven L. Lytinen. 1990. “Coherence relation

reasoning in persuasive discourse.” In Proceedings of the Twelfth

Annual Conference of the Cognitive Science Society. 503-510.

Zechner, Klaus. 1996. “Fast generation of abstracts from general domain

text corpora by extracting relevant sentences.” In COLING 96: The

16th International Conference on Computational Linguistics, August

5-9, 1996, Copenhagen, Denmark. Proceedings, vol. 2:986-989.
