

Automatic Verb Classification Based on Statistical Distributions of Argument Structure

Paola Merlo*    Suzanne Stevenson†

University of Geneva University of Toronto

Automatic acquisition of lexical knowledge is critical to a wide range of natural language processing tasks. Especially important is knowledge about verbs, which are the primary source of relational information in a sentence: the predicate-argument structure that relates an action or state to its participants (i.e., who did what to whom). In this work, we report on supervised learning experiments to automatically classify three major types of English verbs, based on their argument structure; specifically, the thematic roles they assign to participants. We use linguistically-motivated statistical indicators extracted from large annotated corpora to train the classifier, achieving 69.8% accuracy for a task whose baseline is 34%, and whose expert-based upper bound we calculate at 86.5%. A detailed analysis of the performance of the algorithm and of its errors confirms that the proposed features capture properties related to the argument structure of the verbs. Our results validate our hypotheses that knowledge about thematic relations is crucial for verb classification, and that it can be gleaned from a corpus by automatic means. We thus demonstrate an effective combination of deeper linguistic knowledge with the robustness and scalability of statistical techniques.

1. Introduction

Automatic acquisition of lexical knowledge is critical to a wide range of natural language processing (NLP) tasks (Boguraev and Pustejovsky 1996). Especially important is knowledge about verbs, which are the primary source of relational information in a sentence: the predicate-argument structure that relates an action or state to its participants (i.e., who did what to whom). In facing the task of automatic acquisition of knowledge about verbs, two basic questions must be addressed:

• What information about verbs and their relational properties needs to be learned?

• What information can in practice be learned through automatic means?

In answering these questions, some approaches to lexical acquisition have focused on learning syntactic information about verbs, by automatically extracting subcategorization frames from a corpus or machine-readable dictionary (Brent 1993; Briscoe and Carroll 1997; Dorr 1997; Lapata 1999; Manning 1993; McCarthy and Korhonen 1998).

* Linguistics Department, University of Geneva, 2 rue de Candolle, 1211 Geneva 4, Switzerland. E-mail: [email protected]

† Department of Computer Science, University of Toronto, 6 King's College Road, Toronto, ON M5S 3H5, Canada. E-mail: [email protected]

© 2001 Association for Computational Linguistics

Computational Linguistics, Volume 27, Number 3

Table 1
Examples of verbs from the three optionally intransitive classes

Unergative      The horse raced past the barn
                The jockey raced the horse past the barn

Unaccusative    The butter melted in the pan
                The cook melted the butter in the pan

Object-Drop     The boy played
                The boy played soccer

Other work has attempted to learn deeper semantic properties, such as selectional restrictions (Resnik 1996; Riloff and Schmelzenbach 1998), verbal aspect (Klavans and Chodorow 1992; Siegel 1999), or lexical-semantic verb classes such as those proposed by Levin (1993) (Aone and McKee 1996; McCarthy 2000; Lapata and Brew 1999; Schulte im Walde 2000). In this paper, we focus on argument structure, the thematic roles assigned by a verb to its arguments, as the way in which the relational semantics of the verb is represented at the syntactic level.

Specifically, our proposal is to automatically classify verbs based on argument structure properties, using statistical corpus-based methods. We address the problem of classification because it provides a means for lexical organization which can effectively capture generalizations over verbs (Palmer 2000). Within the context of classification, the use of argument structure provides a finer discrimination among verbs than that induced by subcategorization frames (as we see below in our example classes, which allow the same subcategorizations but differ in thematic assignment), but a coarser classification than that proposed by Levin (in which classes such as ours are further subdivided according to more detailed semantic properties). This level of classification granularity appears to be appropriate for numerous language engineering tasks. Because knowledge of argument structure captures fundamental participant/event relations, it is crucial in parsing and generation (e.g., Srinivas and Joshi [1999]; Stede [1998]), in machine translation (Dorr 1997), and in information retrieval (Klavans and Kan 1998) and extraction (Riloff and Schmelzenbach 1998). Our use of statistical corpus-based methods to achieve this level of classification is motivated by our hypothesis that class-based differences in argument structure are reflected in statistics over the usages of the component verbs, and that those statistics can be automatically extracted from a large annotated corpus.

The particular classification problem within which we investigate this hypothesis is the task of learning the three major classes of optionally intransitive verbs in English: unergative, unaccusative, and object-drop verbs. (For the unergative/unaccusative distinction, see Perlmutter [1978]; Burzio [1986]; Levin and Rappaport Hovav [1995].) Table 1 shows an example of a verb from each class, in its transitive and intransitive usages. These three classes are motivated by theoretical linguistic properties (see discussion and references below, and in Stevenson and Merlo [1997b]; Merlo and Stevenson [2000b]). Furthermore, it appears that the classes capture typological distinctions that are useful for machine translation (for example, causative unergatives are ungrammatical in many languages), as well as processing distinctions that are useful for generating naturally occurring language (for example, reduced relatives with unergative verbs are awkward, but they are acceptable, and in fact often preferred to full relatives, for unaccusative and object-drop verbs) (Stevenson and Merlo 1997b; Merlo and Stevenson 1998).


Merlo and Stevenson    Statistical Verb Classification

Table 2
Summary of thematic role assignments by class

Classes         Transitive Subject        Transitive Object    Intransitive Subject

Unergative      Agent (of Causation)      Agent                Agent
Unaccusative    Agent (of Causation)      Theme                Theme
Object-Drop     Agent                     Theme                Agent

The question, then, is what underlies these distinctions. We identify the property that precisely distinguishes among these three classes as that of argument structure, i.e., the thematic roles assigned by the verbs. The thematic roles for each class, and their mapping to subject and object positions, are summarized in Table 2. Note that verbs across these three classes allow the same subcategorization frames (taking an NP object, or occurring intransitively); thus classification based on subcategorization alone would not distinguish them. On the other hand, each of the three classes is comprised of multiple Levin classes, because the latter reflect more detailed semantic distinctions among the verbs (Levin 1993); thus classification based on Levin's labeling would miss generalizations across the three broader classes. By contrast, as shown in Table 2, each class has a unique pattern of thematic assignments, which categorizes the verbs precisely into the three classes of interest.
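Because each class corresponds to a unique triple of role assignments, the mapping in Table 2 can be encoded directly as a lookup from thematic pattern to class. The following minimal Python sketch is our own illustration (the role labels are taken from Table 2; the function name is invented):

```python
# Thematic role patterns from Table 2: each verb class is uniquely
# identified by its (transitive subject, transitive object,
# intransitive subject) role assignments.
THEMATIC_GRID = {
    ("Agent of Causation", "Agent", "Agent"): "unergative",
    ("Agent of Causation", "Theme", "Theme"): "unaccusative",
    ("Agent", "Theme", "Agent"): "object-drop",
}

def class_for_pattern(trans_subj, trans_obj, intrans_subj):
    """Look up the verb class determined by a pattern of thematic roles."""
    return THEMATIC_GRID.get((trans_subj, trans_obj, intrans_subj))

# The pattern of 'melt' picks out the unaccusative class:
print(class_for_pattern("Agent of Causation", "Theme", "Theme"))  # unaccusative
```

The point of the encoding is simply that the lookup is one-to-one: no two classes share a pattern, which is what licenses classification by argument structure.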

Although the granularity of our classification differs from Levin's, we draw on her hypothesis that semantic properties of verbs are reflected in their syntactic behavior. The behavior that Levin focuses on is the notion of diathesis alternation, an alternation in the expression of the arguments of a verb, such as the different mappings between transitive and intransitive that our verbs undergo. Whether a verb participates in a particular diathesis alternation or not is a key factor in Levin's approach to classification. We, like others in a computational framework, have extended this idea by showing that statistics over the alternants of a verb effectively capture information about its class (Lapata 1999; McCarthy 2000; Lapata and Brew 1999).

In our specific task, we analyze the pattern of thematic assignments given in Table 2 to develop statistical indicators that are able to determine the class of an optionally intransitive verb by capturing information across its transitive and intransitive alternants. These indicators serve as input to a machine learning algorithm, under a supervised training methodology, which produces an automatic classification system for our three verb classes. Since we rely on patterns of behavior across multiple occurrences of a verb, we begin with the problem of assigning a single class to the entire set of usages of a verb within the corpus. For example, we measure properties across all occurrences of a word such as raced in order to assign a single classification to the lexical entry for the verb race. This contrasts with work classifying individual occurrences of a verb in each local context, which has typically relied on training that includes instances of the verbs to be classified, essentially developing a bias that is used in conjunction with the local context to determine the best classification for new instances of previously seen verbs. By contrast, our method assigns a classification to verbs that have not previously been seen in the training data. Thus, while we do not as yet assign different classes to the instances of a verb, we can assign a single predominant class to new verbs that have never been encountered.

To preview our results, we demonstrate that combining just five numerical indicators automatically extracted from large text corpora is sufficient to reduce the error rate in this classification task by more than 50% over chance. Specifically, we achieve almost 70% accuracy in a task whose baseline (chance) performance is 34%, and whose expert-based upper bound is calculated at 86.5%.
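The supervised setup can be sketched schematically. The learning algorithm the authors use is not described at this point in the paper, so the fragment below stands in a simple nearest-centroid learner over the five indicators; all feature values are invented for illustration and are not taken from the paper's data:

```python
from statistics import mean

# Each training verb is a vector of the five relative-frequency indicators
# (transitivity, causativity, animacy, passive, VBN) with its known class.
# All numbers below are made up purely for illustration.
train = [
    ([0.23, 0.05, 0.60, 0.07, 0.21], "unergative"),
    ([0.18, 0.08, 0.55, 0.05, 0.25], "unergative"),
    ([0.40, 0.35, 0.25, 0.30, 0.45], "unaccusative"),
    ([0.45, 0.30, 0.30, 0.28, 0.40], "unaccusative"),
    ([0.70, 0.06, 0.58, 0.40, 0.50], "object-drop"),
    ([0.75, 0.09, 0.62, 0.35, 0.48], "object-drop"),
]

def centroids(data):
    """Average the feature vectors of each class."""
    by_class = {}
    for vec, label in data:
        by_class.setdefault(label, []).append(vec)
    return {label: [mean(dim) for dim in zip(*vecs)]
            for label, vecs in by_class.items()}

def classify(vec, cents):
    """Assign the class whose centroid is nearest (squared Euclidean)."""
    return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, cents[c])))

cents = centroids(train)
# An unseen verb: never in the training data, classified by its profile alone.
print(classify([0.72, 0.07, 0.60, 0.38, 0.47], cents))  # object-drop
```

Note that, as in the paper's methodology, the test verb itself never appears in training; only its aggregate feature profile is used.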

Beyond the interest for the particular classification task at hand, this work addresses more general issues concerning verb class distinctions based in argument structure. We evaluate our hypothesis that such distinctions are reflected in statistics over corpora through a computational experimental methodology, in which we investigate, as indicated, each of the subhypotheses below in the context of the three verb classes under study:

• Lexical features capture argument structure differences between verb classes.1

• The linguistically distinctive features exhibit distributional differences across the verb classes that are apparent within linguistic experience (i.e., they can be collected from text).

• The statistical distributions of (some of) the features contribute to learning the classifications of the verbs.

In the following sections, we show that all three hypotheses above are borne out. In Section 2, we describe the argument structure distinctions of our three verb classes in more detail. In support of the first hypothesis above, we discuss lexical correlates of the underlying differences in thematic assignments that distinguish the three verb classes under investigation. In Section 3, we show how to approximate these features by simple syntactic counts, and how to perform these counts on available corpora. We confirm the second hypothesis above by showing that the differences in distribution predicted by the underlying argument structures are largely found in the data. In Section 4, in a series of machine learning experiments and a detailed analysis of errors, we confirm the third hypothesis by showing that the differences in the distribution of the extracted features are successfully used for verb classification. Section 5 evaluates the significance of these results by comparing the program's accuracy to an expert-based upper bound. We conclude the paper with a discussion of its contributions, comparison to related work, and suggestions for future extensions.

2. Deriving Classification Features from Argument Structure

Our task is to automatically build a classifier that can distinguish the three major classes of optionally intransitive verbs in English. As described above, these classes are differentiated by their argument structures. In the first subsection below, we elaborate on our description of the thematic role assignments for each of the verb classes under investigation: unergative, unaccusative, and object-drop. This analysis yields a distinctive pattern of thematic assignment for each class. (For more detailed discussion concerning the linguistic properties of these classes and the behavior of their component verbs, please see Stevenson and Merlo [1997b]; Merlo and Stevenson [2000b].)

Of course, the key to any automatic classification task is to determine a set of useful features for discriminating the items to be classified. In the second subsection below,

1 By lexical, we mean features that we think are likely stored in the lexicon, because they are properties of words and not of phrases or sentences. Note, however, that some lexical features may not necessarily be stored with individual words; indeed, the motivation for classifying verbs, to capture generalizations within each class, suggests otherwise.


we show how the analysis of thematic distinctions enables us to determine lexical properties that we hypothesize will exhibit useful, detectable frequency differences in our corpora, and thus serve as the machine learning features for our classification experiments.

2.1 The Argument Structure Distinctions

The verb classes are exemplified below in sentences repeated from Table 1, for ease of exposition.

Unergative (1a) The horse raced past the barn

(1b) The jockey raced the horse past the barn

Unaccusative (2a) The butter melted in the pan

(2b) The cook melted the butter in the pan

Object-Drop (3a) The boy played

(3b) The boy played soccer

The example sentences illustrate that all three classes participate in a diathesis alternation that relates a transitive and intransitive form of the verb. However, according to Levin (1993), each class exhibits a different type of diathesis alternation, which is determined by the particular semantic relations of the arguments to the verb. We make these distinctions explicit by drawing on a standard notion of thematic role, as each class has a distinct pattern of thematic assignments (i.e., different argument structures).

We assume here that a thematic role is a label taken from a fixed inventory of grammaticalized semantic relations; for example, an Agent is the doer of an action, and a Theme is the entity undergoing an event (Gruber 1965). While admitting that such notions as Agent and Theme lack formal definitions (in our work, and in the literature more widely), the distinctions are clear enough to discriminate our three verb classes. For our purposes, these roles can simply be thought of as semantic labels which are non-decomposable, but there is nothing in our approach that rests on this assumption. Thus, our approach would also be compatible with a feature-based definition of participant roles, as long as the features capture such general distinctions as, for example, the doer of an action and the entity acted upon (Dowty 1991).

Note that, in our focus on verb class distinctions, we have not considered finer-grained features that rely on more specific semantic features, such as, for example, that the subject of the intransitive melt must be something that can change from solid to liquid. While this type of feature may be important for semantic distinctions among individual verbs, it thus far seems irrelevant to the level of verb classification that we adopt, which groups verbs more broadly according to syntactic and (somewhat coarser-grained) semantic properties.

Our analysis of thematic assignment, which was summarized in Table 2 (repeated here as Table 3), is elaborated here for each verb class. The sentences in (1) above illustrate the relevant alternants of an unergative verb, race. Unergatives are intransitive action verbs whose transitive form, as in (1b), can be the causative counterpart of the intransitive form (1a). The type of causative alternation that unergatives participate in is the "induced action alternation", according to Levin (1993). For our thematic analysis, we note that the subject of an intransitive activity verb is specified to be an Agent. The subject of the transitive form is indicated by the label Agent of Causation, which indicates that the thematic role assigned to the subject is marked as the role which is


Table 3
Summary of thematic assignments

Classes         Transitive Subject        Transitive Object    Intransitive Subject

Unergative      Agent (of Causation)      Agent                Agent
Unaccusative    Agent (of Causation)      Theme                Theme
Object-Drop     Agent                     Theme                Agent

introduced with the causing event. In a causative alternation, the semantic argument of the subject of the intransitive surfaces as the object of the transitive (Brousseau and Ritter 1991; Hale and Keyser 1993; Levin 1993; Levin and Rappaport Hovav 1995). For unergatives, this argument is an Agent, and thus the alternation yields an object in the transitive form that receives an Agent thematic role (Cruse 1972). These thematic assignments are shown in the first row of Table 3.

The sentences in (2) illustrate the corresponding forms of an unaccusative verb, melt. Unaccusatives are intransitive change-of-state verbs, as in (2a); the transitive counterpart for these verbs also exhibits a causative alternation, as in (2b). This is the "causative/inchoative alternation" (Levin 1993). Like unergatives, the subject of a transitive unaccusative is marked as the Agent of Causation. Unlike unergatives, though, the alternating argument of an unaccusative (the subject of the intransitive form that becomes the object of the transitive) is an entity undergoing a change of state without active participation, and is therefore a Theme. The resulting pattern of thematic assignments is indicated in the second row of Table 3.

The sentences in (3) use an object-drop verb, play. These are activity verbs that exhibit a non-causative diathesis alternation in which the object is simply optional. This is dubbed "the unexpressed object alternation" (Levin 1993), and has several subtypes that we do not distinguish here. The thematic assignment for these verbs is simply Agent for the subject (in both transitive and intransitive forms), and Theme for the optional object; see the last row of Table 3.

For further details and support of this analysis, please see the discussion in Stevenson and Merlo (1997b) and Merlo and Stevenson (2000b). For our purposes here, the important fact to note is that each of the three classes can be uniquely identified by the pattern of thematic assignments across the two alternants of the verbs.

2.2 Features for Automatic Classification

Our next task, then, is to derive from these thematic patterns useful features for automatically classifying the verbs. In what follows, we refer to the columns of Table 3 to explain how we expect the thematic distinctions to give rise to distributional properties which, when appropriately approximated through corpus counts, will discriminate across the three classes.

Transitivity. Consider the first two columns of thematic roles in Table 3, which illustrate the role assignment in the transitive construction. The Prague school's notion of linguistic markedness (Jakobson 1971; Trubetzkoy 1939) enables us to establish a scale of markedness of these thematic assignments, and make a principled prediction about their frequency of occurrence. Typical tests to determine the unmarked element of a pair or scale are simplicity (the unmarked element is simpler), distribution (the unmarked member is more widely attested across languages), and frequency (the unmarked member is more frequent) (Greenberg 1966; Moravcsik and Wirth 1983). The claim of markedness theory is that once an element has been identified by one test as the unmarked element of a scale, then all other tests will be correlated. The three thematic assignments appear to be ranked on a scale by the simplicity and distribution tests, as we describe below. From this we can conclude that frequency, as a third correlated test, should also be ranked by the same scale, and we can therefore make predictions about the expected frequencies of the three thematic assignments.

First, we note that the specification of an Agent of Causation for transitive unergatives (such as race) and unaccusatives (such as melt) indicates a causative construction. Causative constructions relate two events, the causing event and the core event described by the intransitive verb; the Agent of Causation is the Agent of the causing event. This double event structure can be considered as more complex than the single event that is found in a transitive object-drop verb (such as play) (Stevenson and Merlo 1997b). The simplicity test thus indicates that the causative unergatives and unaccusatives are marked in comparison to the transitive object-drop verbs.

We further observe that the causative transitive of an unergative verb has an Agent thematic role in object position, which is subordinated to the Agent of Causation in subject position, yielding an unusual "double agentive" thematic structure. This lexical causativization of unergatives (in contrast to analytic causativization) is a distributionally rarer phenomenon, found in fewer languages, than lexical causatives of unaccusatives. In asking native speakers about our verbs, we have found that lexical causatives of unergative verbs are not attested in Italian, French, German, Portuguese, Gungbe (Kwa family), and Czech. On the other hand, the lexical causatives are possible for unaccusative verbs (i.e., where the object is a Theme) in all these languages. Vietnamese appears to allow a very restricted form of causativization of unergatives, limited to only those cases that have a comitative reading. The typological distribution test thus indicates that unergatives are more marked than unaccusatives in the transitive form.

From these observations, we can conclude that unergatives (such as race) have the most marked transitive argument structure, unaccusatives (such as melt) have an intermediately marked transitive argument structure, and object-drops (such as play) have the least marked transitive argument structure of the three. Under the assumptions of markedness theory outlined above, we then predict that unergatives are the least frequent in the transitive, that unaccusatives have intermediate frequency in the transitive, and that object-drop verbs are the most frequent in the transitive.

Causativity. Due to the causative alternation of unergatives and unaccusatives, the thematic role of the subject of the intransitive is identical to that of the object of the transitive, as shown in the second and third columns of thematic roles in Table 3. Given the identity of thematic role mapped to subject and object positions across the two alternants, we expect to observe the same noun occurring at times as subject of the verb, and at other times as object of the verb. In contrast, for object-drop verbs, the thematic role of the subject of the intransitive is identical to that of the subject of the transitive, not the object of the transitive. We therefore expect that it will be less common for the same noun to occur in subject and object position across instances of the same object-drop verb.

Thus, we hypothesize that this pattern of thematic role assignments will be reflected in a differential amount of usage across the classes of the same nouns as subjects and objects for a given verb. Generally, we would expect that causative verbs (in our case, the unergative and unaccusative verbs) would have a greater degree of overlap of nouns in subject and object position than non-causative transitive verbs (in our case, the object-drop verbs). However, since the causative is a transitive use, and the transitive use of unergatives is expected to be rare (see above), we do not expect unergatives to exhibit a high degree of detectable overlap in a corpus. Thus, this overlap of subjects and objects should primarily distinguish unaccusatives (predicted to have high overlap of subjects and objects) from the other two classes (each of which is predicted to have low [detectable] overlap of subjects and objects).
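One plausible way to quantify this subject/object overlap from corpus data is sketched below. The record format and the particular count are our own illustration; the overlap statistic the paper actually uses is not defined at this point in the text:

```python
def causativity_overlap(usages):
    """Proportion of a verb's subject/object slots filled by a noun that is
    attested in both positions for that verb.
    `usages` is a list of (subject_head, object_head) pairs, with
    object_head set to None for intransitive occurrences."""
    subjects = [s for s, o in usages]
    objects = [o for s, o in usages if o is not None]
    both = set(subjects) & set(objects)
    hits = sum(1 for noun in subjects + objects if noun in both)
    return hits / (len(subjects) + len(objects))

# For a causative (unaccusative) verb like 'melted', the same noun
# ('butter') appears both as intransitive subject and as transitive object:
melt_usages = [("butter", None), ("cook", "butter"), ("ice", None)]
print(causativity_overlap(melt_usages))  # 0.5
```

For an object-drop verb, whose intransitive subject shares its role with the transitive subject rather than the object, the same computation is expected to yield a value near zero.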

Animacy. Finally, considering the roles in the first and last columns of thematic assignments in Table 3, we observe that unergative and object-drop verbs assign an agentive role to their subject in both the transitive and intransitive, while unaccusatives assign an agentive role to their subject only in the transitive. Under the assumption that the intransitive use of unaccusatives is not rare, we then expect that unaccusatives will occur less often overall with an agentive subject than will the other two verb classes. (The assumption that unaccusatives are not rare in the intransitive is based on the linguistic complexity of the causative transitive alternant, and is borne out in our corpus analysis.) On the further assumption that Agents tend to be animate entities, more so than Themes are, we expect that unaccusatives will occur less frequently with an animate subject compared to unergative and object-drop verbs. Note the importance of our use of frequency distributions: the claim is not that only Agents can be animate, but rather that nouns that receive an Agent role will more often be animate than nouns that receive a Theme role.
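A crude corpus proxy for this prediction is to count how often a verb's subject is drawn from a set of reliably animate tokens. The pronoun heuristic below is a stand-in assumption of ours, not necessarily the approximation the paper adopts:

```python
# Personal pronouns as a (rough, assumed) proxy for animate subjects.
ANIMATE_PRONOUNS = {"i", "we", "you", "he", "she", "they"}

def animate_subject_ratio(subject_heads):
    """Fraction of subject tokens judged animate under the pronoun proxy."""
    hits = sum(1 for s in subject_heads if s.lower() in ANIMATE_PRONOUNS)
    return hits / len(subject_heads)

# The prediction: subjects of an unaccusative like 'melted' (often Themes
# such as 'butter', 'ice') should score lower than those of 'raced'.
print(animate_subject_ratio(["He", "they", "butter", "ice"]))  # 0.5
```

Because the feature is a frequency, not a categorical test, occasional animate Themes (or inanimate Agents) do not undermine it; only the relative proportions across classes matter.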

Additional Features. The above interactions between thematic roles and the syntactic expressions of arguments thus lead to three features whose distributional properties appear promising for distinguishing unergative, unaccusative, and object-drop verbs: transitivity, causativity, and animacy of subject. We also investigate two additional syntactic features: the use of the passive or active voice, and the use of the past participle or simple past part-of-speech (POS) tag (VBN or VBD in the Penn Treebank style). These features are related to the transitive/intransitive alternation, since a passive use implies a transitive use of the verb, as well as to the use of a past participle form of the verb.2

Table 4 summarizes the features we derive from the thematic properties, and our expectations concerning their frequency of use. We hypothesize that these five features will exhibit distributional differences in the observed usages of the verbs that can be used for classification. In the next section, we describe the actual corpus counts that we develop to approximate the features we have identified. (Notice that the counts will be imperfect approximations to the thematic knowledge, beyond the inevitable errors due to automatic extraction from large, automatically annotated corpora. Even when the counts are precise, they only constitute an approximation to the actual thematic notions, since the features we are using are not logically implied by the knowledge we want to capture, but only statistically correlated.)

3. Data Collection and Analysis

Clearly, some of the features we've proposed are difficult (e.g., the passive use) or impossible (e.g., animate subject use) to automatically extract with high accuracy from a

2 For our sample verbs, the statistical correlation between the transitive and passive features is highly significant (N = 59, R = .44, p < .001), as is the correlation between the transitive and past participle features (N = 59, R = .36, p < .005). (Since, as explained in the next section, our features are expressed as proportions, e.g., percent transitive use out of detected transitive and intransitive use, correlations of intransitivity with passive or past participle use have the same magnitude, but are negative.)


Table 4
The features and expected behavior

Feature          Expected Frequency Pattern    Explanation

Transitivity     Unerg < Unacc < ObjDrop       Unaccusatives and unergatives have a causative transitive, hence lower transitive use. Furthermore, unergatives have an agentive object, hence very low transitive use.

Causativity      Unerg, ObjDrop < Unacc        Object-drop verbs do not have a causal agent, hence low "causative" use. Unergatives are rare in the transitive, hence low causative use.

Animacy          Unacc < Unerg, ObjDrop        Unaccusatives have a Theme subject in the intransitive, hence lower use of animate subjects.

Passive Voice    Unerg < Unacc < ObjDrop       Passive implies transitive use, hence correlated with the transitive feature.

VBN Tag          Unerg < Unacc < ObjDrop       Passive implies past participle use (VBN), hence correlated with transitive (and passive).

large corpus, given the current state of annotation. However, we do assume that currently available corpora, such as the Wall Street Journal (WSJ), provide a representative and large enough sample of language from which to gather corpus counts that can approximate the distributional patterns of the verb class alternations. Our work draws on two text corpora: one, an automatically tagged combined corpus of 65 million words (primarily WSJ); the second, an automatically parsed corpus of 29 million words (a subset of the WSJ text from the first corpus). Using these corpora, we develop counting procedures that yield relative frequency distributions for approximations to the five linguistic features we have determined, over a sample of verbs from our three classes.
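As a concrete sketch of the kind of counting procedure involved, the fragment below derives three of the five features as relative frequencies from per-occurrence records. The record format is invented here for illustration; the paper's actual counts come from the tagged and parsed corpora just described:

```python
def feature_profile(usages):
    """Relative-frequency approximations for three features of one verb.
    Each usage record (format assumed for this sketch):
      {"has_object": bool,              # from a parsed corpus
       "voice": "active" | "passive",
       "tag": "VBD" | "VBN"}            # Penn Treebank POS tag
    """
    n = len(usages)
    # A passive occurrence implies an underlying transitive use (see text).
    trans = sum(1 for u in usages if u["has_object"] or u["voice"] == "passive")
    passive = sum(1 for u in usages if u["voice"] == "passive")
    vbn = sum(1 for u in usages if u["tag"] == "VBN")
    return {"transitivity": trans / n, "passive": passive / n, "vbn": vbn / n}

occurrences = [
    {"has_object": True,  "voice": "active",  "tag": "VBD"},
    {"has_object": False, "voice": "active",  "tag": "VBD"},
    {"has_object": False, "voice": "passive", "tag": "VBN"},
    {"has_object": False, "voice": "active",  "tag": "VBD"},
]
print(feature_profile(occurrences))
# {'transitivity': 0.5, 'passive': 0.25, 'vbn': 0.25}
```

Expressing each count as a proportion of a verb's total occurrences is what makes the profiles of frequent and infrequent verbs comparable, and it is also why intransitivity correlates negatively with the passive and VBN features (Footnote 2).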

3.1 Materials and Method
We chose a set of 20 verbs from each class, based primarily on the classification in Levin (1993).3 The complete list of verbs appears in Table 5; the group 1/group 2 designation is explained below in the section on counting. As indicated in the table, unergatives are manner-of-motion verbs (from the "run" class in Levin); unaccusatives are change-of-state verbs (from several of the classes in Levin's change-of-state super-class); object-drop verbs were taken from a variety of classes in Levin's classification, all of which undergo the unexpressed object alternation. The most frequently used classes are verbs of change of possession, image-creation verbs, and verbs of creation and transformation. The selection of verbs was based partly on our intuitive judgment that the verbs were likely to be used with sufficient frequency in the WSJ. Also, each

3 We used an equal number of verbs from each class in order to have a balanced group of items. One potential disadvantage of this decision is that each verb class is represented equally, even though they may not be equally frequent in the corpora. Although we lose the relative frequency information among the classes that could provide a better bias for assigning a default classification (i.e., the most frequent one), we have the advantage that our classifier will be equally informed (in terms of number of exemplars) about each class.

Note that there are only 19 unaccusative verbs because ripped, which was initially counted in the unaccusatives, was then excluded from the analysis, as it occurred mostly in a very different usage in the corpus (as a verb particle in ripped off) from the intended optionally intransitive usage.


Computational Linguistics Volume 27 Number 3

Table 5
Verbs used in the experiments.

Class Name    Description                      Selected Verbs

Unergative    manner of motion                 jumped, rushed, marched, leaped, floated, raced, hurried, wandered, vaulted, paraded (group 1); galloped, glided, hiked, hopped, jogged, scooted, scurried, skipped, tiptoed, trotted (group 2)

Unaccusative  change of state                  opened, exploded, flooded, dissolved, cracked, hardened, boiled, melted, fractured, solidified (group 1); collapsed, cooled, folded, widened, changed, cleared, divided, simmered, stabilized (group 2)

Object-Drop   unexpressed object alternation   played, painted, kicked, carved, reaped, washed, danced, yelled, typed, knitted (group 1); borrowed, inherited, organized, rented, sketched, cleaned, packed, studied, swallowed, called (group 2)

verb presents the same form in the simple past and in the past participle (the regular "-ed" form). In order to simplify the counting procedure, we included only the "-ed" form of the verb, on the assumption that counts on this single verb form would approximate the distribution of the features across all forms of the verb. Additionally, as far as we were able given the preceding constraints, we selected verbs that could occur in the transitive and in the passive. Finally, we aimed for a frequency cut-off of 10 occurrences or more for each verb, although for unergatives we had to use one verb (jogged) that occurred only 8 times, in order to have 20 verbs that satisfied the other criteria above.

In performing this kind of corpus analysis, one has to recognize the fact that current corpus annotations do not distinguish verb senses. In these counts, we did not distinguish a core sense of the verb from an extended use of the verb. So, for instance, the sentence Consumer spending jumped 1.7% in February after a sharp drop the month before (WSJ 1987) is counted as an occurrence of the manner-of-motion verb jump in its intransitive form. This particular sense extension has a transitive alternant, but not a causative transitive (i.e., Consumer spending jumped the barrier, but not Low taxes jumped consumer spending). Thus, while the possible subcategorizations remain the same, rates of transitivity and causativity may be different than for the literal manner-of-motion sense. This is an unavoidable result of using simple automatic extraction methods, given the current state of annotation of corpora.

For each occurrence of each verb, we counted whether it was in a transitive or intransitive use (trans), in a passive or active use (pass), in a past participle or simple past use (vbn), in a causative or non-causative use (caus), and with an animate subject or not (anim).4 Note that, except for the vbn feature, for which we simply extract the POS tag from the corpus, all other counts are approximations to the actual linguistic behaviour of the verb, as we describe in detail below.

4 One additional feature was recorded: the log frequency of the verb in the 65 million word corpus, motivated by the conjecture that the frequency of a verb may help in predicting its class. In our machine learning experiments, however, this conjecture was not borne out, as the frequency feature did not improve performance. This is the case for experiments on all of the verbs, as well as for separate experiments on the group 1 verbs (which were matched across the classes for frequency) and the group 2 verbs (which were not). We therefore limit discussion here to the thematically-motivated features.


The first three counts (trans, pass, vbn) were performed on the tagged ACL/DCI corpus available from the Linguistic Data Consortium, which includes the Brown Corpus (of one million words) and years 1987–1989 of the Wall Street Journal, a combined corpus in excess of 65 million words. The counts for these features proceeded as follows:

• trans: A number, a pronoun, a determiner, an adjective, or a noun were considered to be indication of a potential object of the verb. A verb occurrence preceded by forms of the verb be, or immediately followed by a potential object, was counted as transitive; otherwise, the occurrence was counted as intransitive (specifically, if the verb was followed by a punctuation sign (commas, colons, full stops), or by a conjunction, a particle, a date, or a preposition).

• pass: A main verb (i.e., tagged VBD) was counted as active. A token with tag VBN was also counted as active if the closest preceding auxiliary was have, while it was counted as passive if the closest preceding auxiliary was be.

• vbn: The counts for VBN/VBD were simply done based on the POS label within the tagged corpus.

Each of the above three counts was normalized over all occurrences of the "-ed" form of the verb, yielding a single relative frequency measure for each verb for that feature: i.e., percent transitive (versus intransitive) use, percent passive (versus active) use, and percent VBN (versus VBD) use, respectively.
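The transitivity heuristic above can be sketched in a few lines. This is not the authors' code: the exact set of "potential object" tags and the list of be-forms are our own assumptions for illustration (Penn Treebank tags are assumed, since the corpus is tagged in that style).

```python
# Assumed be-forms and Penn Treebank "potential object" tags; the paper
# does not list the exact sets it used.
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}
OBJECT_TAGS = {"CD", "PRP", "DT", "JJ", "NN", "NNS", "NNP", "NNPS"}  # number, pronoun, det, adj, noun

def trans_ratio(occurrences):
    """Each occurrence of the '-ed' verb form is a pair:
    (word preceding the verb, POS tag of the following token).
    An occurrence is counted as transitive if preceded by a form of 'be'
    (a passive, which implies transitivity) or followed by a potential
    object; otherwise it is intransitive."""
    trans = sum(1 for prev, nxt in occurrences
                if prev.lower() in BE_FORMS or nxt in OBJECT_TAGS)
    return trans / len(occurrences)
```

For example, `trans_ratio([("was", "IN"), ("They", "DT"), ("day", ",")])` counts the first occurrence as transitive via the be-form, the second via the potential object, and the third as intransitive, giving 2/3.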

The last two counts (caus and anim) were performed on a parsed version of the 1988 year of the Wall Street Journal, so that we could extract subjects and objects of the verbs more accurately. This corpus of 29 million words was provided to us by Michael Collins, and was automatically parsed with the parser described in Collins (1997).5 The counts and their justification are described here:

• caus: As discussed above, the object of a causative transitive is the same semantic argument of the verb as the subject of the intransitive. The causative feature was approximated by the following steps, intended to capture the degree to which the subject of a verb can also occur as its object. Specifically, for each verb occurrence, the subject and object (if there was one) were extracted from the parsed corpus. The observed subjects across all occurrences of the verb were placed into one multiset of nouns, and the observed objects into a second multiset of nouns. (A multiset, or bag, was used so that our representation indicated the number of times each noun was used as either subject or object.) Then the proportion of overlap between the two multisets was calculated. We define overlap as the largest multiset of elements belonging to both the

5 Readers might be concerned about the portability of this method to languages for which no large parsed corpus is available. It is possible that using a fully parsed corpus is not necessary. Our results were replicated in English without the need for a fully parsed corpus (Anoop Sarkar, p.c., citing a project report by Wootiporn Tripasai). Our method was applied to 23 million words of the WSJ that were automatically tagged with Ratnaparkhi's maximum entropy tagger (Ratnaparkhi 1996) and chunked with the partial parser CASS (Abney 1996). The results are very similar to ours (best accuracy 66.6%), suggesting that a more accurate tagger than the one used on our corpus might in fact be sufficient to overcome the fact that no full parse is available.


subject and the object multisets; e.g., the overlap between {a, a, a, b} and {a} is {a, a, a}. The proportion is the ratio between the cardinality of the overlap multiset and the sum of the cardinalities of the subject and object multisets. For example, for the simple sets of characters above, the ratio would be 3/5, yielding a value of .60 for the caus feature.

• anim: A problem with a feature like animacy is that it requires either manual determination of the animacy of extracted subjects, or reference to an on-line resource such as WordNet for determining animacy. To approximate animacy with a feature that can be extracted automatically, and without reference to a resource external to the corpus, we take advantage of the well-attested animacy hierarchy, according to which pronouns are the most animate (Silverstein 1976; Dixon 1994). The hypothesis is that the words I, we, you, she, he, and they most often refer to animate entities. This hypothesis was confirmed by extracting 100 occurrences of the pronoun they, which can be either animate or inanimate, from our 65 million word corpus. The occurrences immediately preceded a verb. After eliminating repetitions, 94 occurrences were left, which were classified by hand, yielding 71 animate pronouns, 11 inanimate pronouns, and 12 unclassified occurrences (for lack of sufficient context to recover the antecedent of the pronoun with certainty). Thus at least 76% of usages of they were animate; we assume the percentage of animate usages of the other pronouns to be even higher. Since the hypothesis was confirmed, we count pronouns (other than it) in subject position (Kariaeva [1999]; cf. Aone and McKee [1996]). The values for the feature were determined by automatically extracting all subject/verb tuples including our 59 example verbs from the parsed corpus, and computing the ratio of occurrences of pronoun subjects to all subjects for each verb.
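The caus overlap computation can be sketched as below. Note that the prose definition of "overlap" is slightly ambiguous; the reading implemented here (for each noun occurring both as subject and as object, take the larger of its two counts) is simply one that reproduces the worked example, and the function name is ours.

```python
from collections import Counter

def caus_score(subjects, objects):
    """Proportion of subject/object multiset overlap: one reading of the
    caus feature that reproduces the paper's worked example."""
    subj, obj = Counter(subjects), Counter(objects)
    # For each noun occurring both as subject and as object, take the
    # larger of its two counts, so {a,a,a,b} vs {a} overlaps in {a,a,a}.
    overlap = sum(max(subj[n], obj[n]) for n in subj.keys() & obj.keys())
    return overlap / (len(subjects) + len(objects))

print(caus_score(["a", "a", "a", "b"], ["a"]))  # 3/5 = 0.6
```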
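The anim feature then reduces to a simple ratio. The pronoun list follows the text (with it excluded); the function name is our own.

```python
ANIM_PRONOUNS = {"i", "we", "you", "she", "he", "they"}  # "it" is excluded

def anim_score(subjects):
    """Ratio of (non-'it') pronoun subjects to all subjects of a verb:
    an automatically extractable proxy for subject animacy."""
    pron = sum(1 for s in subjects if s.lower() in ANIM_PRONOUNS)
    return pron / len(subjects)

print(anim_score(["They", "spending", "she", "rates"]))  # 2/4 = 0.5
```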

Finally, as indicated in Table 5, the verbs are designated as belonging to "group 1" or "group 2". All the verbs are treated equally in our data analysis and in the machine learning experiments, but this designation does indicate a difference in details of the counting procedures described above. The verbs in group 1 had been used in an earlier study in which it was important to minimize noisy data (Stevenson and Merlo 1997a), so they generally underwent greater manual intervention in the counts. In adding group 2 for the classification experiment, we chose to minimize the intervention, in order to demonstrate that the classification process is robust enough to withstand the resulting noise in the data.

For group 2, the transitivity, voice, and VBN counts were done automatically, without any manual intervention. For group 1, these three counts were done automatically by regular expression patterns, and then subjected to correction, partly by hand and partly automatically, by one of the authors. For transitivity, the adjustments vary for the individual verbs. Most of the reassignments from a transitive to an intransitive labelling occurred when the following noun was not the direct object, but rather a measure phrase or a date. Most of the reassignments from intransitive to transitive occurred when a particle or a preposition following the verb did not introduce a prepositional phrase, but instead indicated a passive form (by) or was part of a phrasal verb. Some verbs were mostly used adjectivally, in which case they were excluded from the transitivity counts. For voice, the required adjustments included cases of coordination of the past participle, when the verb was preceded by a conjunction or a comma.


Table 6
Aggregated relative frequency data for the five features. E = unergatives, A = unaccusatives, O = object-drops.

              trans         pass          vbn           caus          anim
Class  N    Mean   SD    Mean   SD    Mean   SD    Mean   SD    Mean   SD

E      20   0.23   0.23  0.07   0.12  0.21   0.26  0.00   0.00  0.25   0.24
A      19   0.40   0.24  0.33   0.27  0.65   0.27  0.12   0.14  0.07   0.09
O      20   0.62   0.25  0.31   0.26  0.65   0.23  0.04   0.07  0.15   0.14

These were collected and classified by hand as passive or active, based on intuition. Similarly, partial adjustments to the VBN counts were made by hand.

For the causativity feature, subjects and objects were determined by manual inspection of the corpus for verbs belonging to group 1, while they were extracted automatically from the parsed corpus for group 2. The group 1 verbs were sampled in three ways, depending on total frequency. For verbs with less than 150 occurrences, all instances of the verbs were used for subject/object extraction. For verbs whose total frequency was greater than 150, but whose VBD frequency was in the range 100–200, we extracted subjects and objects of the VBD occurrences only. For higher frequency verbs, we used only the first 100 VBD occurrences.6 The same script for computing the overlap of the extracted subjects and objects was then used on the resulting subject/verb and verb/object tuples for both group 1 and group 2 verbs.

The animacy feature was calculated over subject/verb tuples, extracted automatically for both groups of verbs from the parsed corpus.

3.2 Data Analysis
The data collection described above yields the following data points in total: trans, 27,403; pass, 20,481; vbn, 36,297; caus, 11,307; anim, 7,542. (Different features yield different totals because they were sampled independently, and the search patterns to extract some features are more imprecise than others.) The aggregate means by class of the normalized frequencies for all verbs are shown in Table 6; item by item distributions are provided in Appendix A, and raw counts are available from the authors. Note that aggregate means are shown for illustration purposes only: all machine learning experiments are performed on the individual normalized frequencies for each verb, as given in Appendix A.

The observed distributions of each feature are indeed roughly as expected, according to the description in Section 2. Unergatives show a very low relative frequency of the trans feature, followed by unaccusatives, then object-drop verbs. Unaccusative verbs show a high frequency of the caus feature, and a low frequency of the anim feature, compared to the other classes. Somewhat unexpectedly, object-drop verbs exhibit a non-zero mean caus value (almost half the verbs have a caus value greater than zero), leading to a three-way causative distinction among the verb classes. We suspect that the approximation that we used for causative use, the overlap between subjects

6 For this last set of high-frequency verbs (exploded, jumped, opened, played, rushed), we used the first 100 occurrences as the simplest way to collect the sample. In response to an anonymous reviewer's concern, we later verified that these counts were not different from counts obtained by random sampling of 100 VBD occurrences. A paired t-test of the two sets of counts (first 100 sampling and random sampling) indicates that the two sets of counts are not statistically different (t = 1.283, DF = 4, p = 0.2687).


Table 7
Manually (Man) and automatically (Aut) calculated features for a random sample of verbs. T = trans, P = pass, V = vbn, C = caus, A = anim.

       Unergative                 Unaccusative                 Object-Drop
     hopped       scurried     folded       stabilized     inherited    swallowed
     Man   Aut    Man   Aut    Man   Aut    Man   Aut      Man   Aut    Man   Aut

T    0.21  0.21   0.00  0.00   0.71  0.23   0.24  0.18     1.00  0.64   0.96  0.35
P    0.00  0.00   0.00  0.00   0.44  0.33   0.19  0.13     0.39  0.13   0.54  0.44
V    0.03  0.00   0.10  0.00   0.56  0.73   0.71  0.92     0.56  0.60   0.64  0.79
C    0.00  0.00   0.00  0.00   0.54  0.00   0.24  0.35     0.00  0.06   0.00  0.04
A    0.93  1.00   0.90  0.14   0.23  0.00   0.02  0.00     0.58  0.32   0.35  0.22

and objects for a verb, also captures a "reciprocity" effect for some object-drop verbs (such as call), in which subjects and objects can be similar types of entities. Finally, although expected to be redundant indicators of transitivity, pass and vbn, unlike trans, have very similar values for unaccusative and object-drop verbs, indicating that their distributions are sensitive to factors we have not yet investigated.

One issue we must address is how precisely the automatic counts reflect the actual linguistic behaviour of the verbs. That is, we must be assured that the patterns we note in the data in Table 6 are accurate reflections of the differential behaviour of the verb classes, and not an artifact of the way in which we estimate the features, or a result of inaccuracies in the counts. In order to evaluate the accuracy of our feature counts, we selected two verbs from each class, and determined the "true" value of each feature for each of those six verbs through manual counting. The six verbs were randomly selected from the group 2 subset of the verbs, since counts for group 2 verbs (as explained above) had not undergone manual correction. This allows us to determine the accuracy of the fully automatic counting procedures. The selected verbs (and their frequencies) are: hopped (29), scurried (21), folded (189), stabilized (286), inherited (357), swallowed (152). For verbs that had a frequency of over 100 in the "-ed" form, we performed the manual counts on the first 100 occurrences.

Table 7 shows the results of the manual counts, reported as proportions to facilitate comparison to the normalized automatic counts, shown in adjoining columns. We observe first that, overall, most errors in the automatic counts occur in the unaccusative and object-drop verbs. While tagging errors affect the vbn feature for all of the verbs somewhat, we note that trans and pass are consistently underestimated for unaccusative and object-drop verbs. These errors make the unaccusative and object-drop feature values more similar to each other, and therefore potentially harder to distinguish. Furthermore, because the trans and pass values are underestimated by the automatic counts, and therefore lower in value, they are also closer to the values for the unergative verbs. For the caus feature, we predict the highest values for the unaccusative verbs, and while that prediction is confirmed, the automatic counts for that class also show the most errors. Finally, although the general pattern of higher values for the anim feature of unergatives and object-drop verbs is preserved in the automatic counts, the feature is underestimated for almost all the verbs, again making the values for that feature closer across the classes than they are in reality.

We conclude that, although there are inaccuracies in all the counts, the general patterns expected based on our analysis of the verb classes hold in both the manual and automatic counts. Errors in the estimating and counting procedures are therefore


not likely to be responsible for the pattern of data in Table 6 above, which generally matches our predictions. Furthermore, the errors, at least for this random sample of verbs, occur in a direction that makes our task of distinguishing the classes more difficult, and indicate that developing more accurate search patterns may sharpen the class distinctions and improve the classification performance.

4. Experiments in Classification

In this section, we turn to our computational experiments, which investigate whether the statistical indicators of thematic properties that we have developed can in fact be used to classify verbs. Recall that the task we have set ourselves is that of automatically learning the best class for a set of usages of a verb, as opposed to classifying individual occurrences of the verb. The frequency distributions of our features yield a vector for each verb that represents the estimated values for the verb on each dimension, across the entire corpus:

Vector template: [verb-name, trans, pass, vbn, caus, anim, class]

Example: [opened, .69, .09, .21, .16, .36, unacc]

The resulting set of 59 vectors constitutes the data for our machine learning experiments. We use this data to train an automatic classifier to determine, given the feature values for a new verb (not from the training set), which of the three major classes of English optionally intransitive verbs it belongs to.
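Concretely, one verb's data point might be represented as below. The container and field spellings are our own (in particular, `passv` stands in for the pass feature, since `pass` is a Python keyword); the values are the normalized proportions from the example vector above.

```python
from collections import namedtuple

# Hypothetical container for one verb's normalized feature vector;
# "passv" stands in for the pass feature (pass is a Python keyword).
VerbVector = namedtuple("VerbVector",
                        ["verb", "trans", "passv", "vbn", "caus", "anim", "cls"])

opened = VerbVector("opened", 0.69, 0.09, 0.21, 0.16, 0.36, "unacc")
features = [opened.trans, opened.passv, opened.vbn, opened.caus, opened.anim]
```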

4.1 Experimental Methodology
In pilot experiments on a subset of the features, we investigated a number of supervised machine learning methods that produce automatic classifiers (decision tree induction, rule learning, and two types of neural networks), as well as hierarchical clustering; see Stevenson et al. (1999) for more detail. Because we achieved approximately the same level of performance in all cases, we narrowed our further experimentation to the publicly available version of the C5.0 machine learning system (http://www.rulequest.com), a newer version of C4.5 (Quinlan 1992), due to its ease of use and wide availability. The C5.0 system generates both decision trees and corresponding rule sets from a training set of known classifications. In our experiments, we found little to no difference in performance between the trees and rule sets, and report only the rule set results.

In the experiments below, we follow two methodologies in training and testing, each of which tests a subset of cases held out from the training data. Thus, in all cases, the results we report are on test data that was never seen in training.7

The first training and testing methodology we follow is 10-fold cross-validation. In this approach, the system randomly divides the data into ten parts, and runs ten times on a different 90% training data/10% test data split, yielding an average accuracy and standard error across the ten test sets. This training methodology is very useful for

7 One anonymous reviewer raised the concern that we do not test on verbs that were unseen by the authors prior to finalizing the specific features to count. However, this does not reduce the generality of our results. The features we use are motivated by linguistic theory, and derived from the set of thematic properties that discriminate the verb classes. It is therefore very unlikely that they are skewed to the particular verbs we have chosen. Furthermore, our cross-validation experiments, described in the next subsection, show that our results hold across a very large number of randomly selected subsets of this sample of verbs.


our application, as it yields performance measures across a large number of training data/test data splits, avoiding the problems of outliers in a single random selection from a relatively small data set such as ours.

The second methodology is a single hold-out training and testing approach. Here the system is run N times, where N is the size of the data set (i.e., the 59 verbs in our case), each time holding out a single data vector as the test case and using the remaining N-1 vectors as the training set. The single hold-out methodology yields an overall accuracy rate (when the results are averaged across all N trials), but also, unlike cross-validation, gives us classification results on each individual data vector. This property enables us to analyze differential performance on the individual verbs, and across the different verb classes.
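The single hold-out protocol itself is straightforward to sketch. The paper's classifier was C5.0; the nearest-centroid stand-in below is purely illustrative, chosen only to make the train/test loop runnable, and all names are our own.

```python
from collections import defaultdict

def single_holdout_accuracy(vectors, labels, classify):
    """Hold out each data vector in turn, train on the remaining N-1,
    and record whether the held-out item is classified correctly."""
    correct = 0
    for i in range(len(vectors)):
        train_X = vectors[:i] + vectors[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        correct += classify(train_X, train_y, vectors[i]) == labels[i]
    return correct / len(vectors)

def nearest_centroid(train_X, train_y, x):
    """Toy stand-in for C5.0: assign x to the class with the nearest
    mean feature vector (squared Euclidean distance)."""
    by_class = defaultdict(list)
    for v, y in zip(train_X, train_y):
        by_class[y].append(v)
    def dist(cls):
        vs = by_class[cls]
        centroid = [sum(col) / len(vs) for col in zip(*vs)]
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(by_class, key=dist)
```

On a tiny, well-separated toy data set, the loop classifies every held-out point correctly; on the real 59-verb data, each of the 59 runs contributes one per-verb prediction.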

Under both training and testing methodologies, the baseline (chance) performance in this task, a three-way classification, is 33.9%. In the single hold-out methodology, there are 59 test cases, with 20, 19, and 20 verbs from the unergative, unaccusative, and object-drop classes, respectively. Chance performance of picking a single class label as a default and assigning it to all cases would yield at most 20 out of the 59 cases correct, or 33.9%. For the cross-validation methodology, the determination of a baseline is slightly more complex, as we are testing on a random selection of 10% of the full data set in each run. The 33.9% figure represents the expected relative proportion of a test set that would be labelled correctly by assignment of a default class label to the entire test set. Although the precise make-up of the test cases varies, on average the test set will represent the class membership proportions of the entire set of verbs. Thus, as with the single hold-out approach, chance accuracy corresponds to a maximum of 20/59, or 33.9%, of the test set being labelled correctly.

The theoretical maximum accuracy for the task is, of course, 100%, although in Section 5 we discuss some classification results from human experts, which indicate that a more realistic expectation is much lower (around 87%).

4.2 Results Using 10-Fold Cross-Validation
We first report the results of experiments using a training methodology of 10-fold cross-validation, repeated 50 times. This means that the 10-fold cross-validation procedure is repeated for 50 different random divisions of the data. The numbers reported are the averages of the results over all the trials. That is, the average accuracy and standard error from each random division of the data (a single cross-validation run, including 10 training and test sets) are averaged across the 50 different random divisions. This large number of experimental trials gives us a very tight bound on the mean accuracy reported, enabling us to determine with high confidence the statistical significance of differences in results.

Table 8 shows that performance of classification using individual features varies greatly, from little above the baseline to almost 22% above the baseline, or a reduction of a third of the error rate, a very good result for a single feature. (All reported accuracies in Table 8 are statistically distinct at the p < .01 level, using an ANOVA [df = 249, F = 33472] with a Tukey-Kramer post test.)

The first line of Table 9 shows that the combination of all features achieves an accuracy of 69.8%, which is 35.9% over the baseline, for a reduction in the error rate of 54%. This is a rather considerable result, given the very low baseline (33.9%). Moreover, recall that our training and testing sets are always disjoint (cf. Lapata and Brew [1999], Siegel [1999]); in other words, we are predicting the classification of verbs that were never seen in the training corpus, the hardest situation for a classification algorithm.
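The reduction-in-error-rate figure follows directly from the accuracy and the baseline; a one-line check (the function name is ours):

```python
def error_reduction(accuracy, baseline):
    """Fraction of the baseline error rate eliminated by the classifier."""
    return (accuracy - baseline) / (1.0 - baseline)

print(round(error_reduction(0.698, 0.339), 2))  # 0.54: 54% of the error is eliminated
```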

The second through sixth lines of Table 9 show the accuracy achieved on each subset of features that results from removing a single feature. This allows us to evaluate


Table 8
Percent accuracy and standard error of the verb classification task using each feature individually, under a training methodology of 10-fold cross-validation repeated 50 times.

Feature   Accuracy   SE

caus      55.7       0.1
vbn       52.5       0.5
pass      50.2       0.5
trans     47.1       0.4
anim      35.3       0.5

Table 9
Percent accuracy and standard error of the verb classification task using features in combination, under a training methodology of 10-fold cross-validation repeated 50 times.

   Features Used               Feature Not Used   Accuracy   SE

1  trans pass vbn caus anim    -                  69.8       0.5
2  trans vbn caus anim         pass               69.8       0.5
3  trans pass vbn anim         caus               67.3       0.6
4  trans pass caus anim        vbn                66.5       0.5
5  trans pass vbn caus         anim               63.2       0.6
6  pass vbn caus anim          trans              61.6       0.6

the contribution of each feature to the performance of the classification process, by comparing the performance of the subset without it to the performance using the full set of features. We see that the removal of pass (second line) has no effect on the results, while removal of the remaining features yields a 2–8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant at the p < .05 level, except between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 3752] with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted by the individual feature behavior. For example, caus, which is the best individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, anim and trans, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.

The general behaviour in classification based on individual features and on size 4 and size 5 subsets of features is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance; Table 10 summarizes this information. In the first data column, the table shows the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size n subsets in which each feature occurs. For example, the second data cell in the second row (54.9%) indicates the average accuracy of all subsets of size 2 that contain the feature vbn. The last row of the table indicates the average accuracy, for each feature, of all subsets containing that feature.


Table 10
Average percent accuracy of feature subsets, by subset size and by sets of each size including each feature.

Subset   Mean Accuracy    Mean Accuracy of Subsets that Include Each Feature
Size     by Subset Size   vbn     pass    trans   anim    caus

1        48.2             52.5    50.2    47.1    35.3    55.7
2        55.1             54.9    52.8    56.4    58.0    57.6
3        60.5             60.1    58.5    62.3    61.1    60.5
4        65.7             65.5    64.7    66.7    66.3    65.3
5        69.8             69.8    69.8    69.8    69.8    69.8

Mean Acc/Feature          60.6    59.2    60.5    58.1    61.8

The first observation, that more features perform better, is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that, on average, larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation, that the performance of individual features is not always a predictor of their performance in combination, is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above average performance for all subsets of size 2 or greater.

4.3 Results Using Single Hold-Out Methodology
One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system, but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case, and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case, and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, which we turn to in Section 4.4.

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance; in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being


Merlo and Stevenson, Statistical Verb Classification

Table 11
Percent accuracy of the verb classification task, using features in combination, under a single hold-out training methodology.

                                    Feature     Accuracy
      Features Used                 Not Used    on All Verbs
   1  trans pass vbn caus anim      --          69.5%
   2  trans vbn caus anim           pass        71.2%
   3  trans pass vbn anim           caus        62.7%
   4  trans pass caus anim          vbn         61.0%
   5  trans pass vbn caus           anim        61.0%
   6  pass vbn caus anim            trans       64.4%

Table 12
F score (%) of classification within each class, under a single hold-out training methodology.

                                    Feature     F score (%)   F score (%)   F score (%)
      Features Used                 Not Used    for Unergs    for Unaccs    for Objdrops
   1  trans pass vbn caus anim      --          73.9          68.6          64.9
   2  trans vbn caus anim           pass        76.2          75.7          61.6
   3  trans pass vbn anim           caus        65.1          60.0          62.8
   4  trans pass caus anim          vbn         66.7          65.0          51.3
   5  trans pass vbn caus           anim        72.7          47.0          60.0
   6  pass vbn caus anim            trans       78.1          51.5          61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5-8% negative effect on performance, again similar to the cross-validation results. (Although note that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall (see footnote 8).

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9) than the unaccusative and object-drop verbs (F scores of 68.6 and 64.9, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs, that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence, here we use the balanced F score, which calculates an overall measure of performance as 2PR / (P + R), in which P (precision) is truePositives / (truePositives + falsePositives) and R (recall) is truePositives / (truePositives + falseNegatives).
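The measures in footnote 8 translate directly into code. The example numbers are read off the all-features panel of Table 14: 17 unergatives correctly labelled (20 minus the 3 errors in the first row) and 9 verbs of the other classes labelled unergative (4 + 5 in the first column):

```python
def f_score(true_pos, false_pos, false_neg):
    """Balanced F score: 2PR / (P + R), with P and R as in footnote 8."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Unergatives under all five features: TP = 17, FP = 9, FN = 3,
# reproducing the 73.9 entry in the first line of Table 12.
unerg_f = f_score(true_pos=17, false_pos=9, false_neg=3)   # ≈ 0.739
```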


Computational Linguistics, Volume 27, Number 3

frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8) but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.
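The grouping used in this analysis can be sketched as follows; the three verbs and their frequencies are invented for illustration, not taken from the paper's data:

```python
import math

def errors_by_frequency(verbs):
    """Bin (verb, corpus frequency, correctly classified?) triples by log
    frequency and return (errors, total verbs) per bin, as in the
    analysis above."""
    bins = {"<2": [0, 0], "2-3": [0, 0], ">3": [0, 0]}   # [errors, verbs]
    for _, freq, correct in verbs:
        logf = math.log10(freq)
        band = "<2" if logf < 2 else ("2-3" if logf <= 3 else ">3")
        bins[band][1] += 1
        if not correct:
            bins[band][0] += 1
    return {band: tuple(counts) for band, counts in bins.items()}

# Toy data: one verb per frequency band, the middle one misclassified.
toy = [("jump", 50, True), ("float", 500, False), ("cut", 5000, True)]
rates = errors_by_frequency(toy)
```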

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging (see footnote 9). The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification

We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans, pass, vbn, caus, anim) with the class labels assigned using each size 4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature in going from the four-feature to five-feature set. Thus, we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2, and summarized here in Table 13.

We illustrate the data with a set of confusion matrices, in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors, but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently "uncontrolled", and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.


Table 13
Expected class discriminations for each feature.

      Feature           Expected Frequency Pattern
      Transitivity      Unerg < Unacc < ObjDrop
      Causativity       Unerg, ObjDrop < Unacc
      Animacy           Unacc < Unerg, ObjDrop
      Passive Voice     Unerg < Unacc < ObjDrop
      VBN Tag           Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features compared to two of the four-feature sets. (E = unergatives, A = unaccusatives, O = object-drops.)

                               Assigned Class
                  All features      w/o caus        w/o anim
                   E   A   O       E   A   O       E   A   O
      Actual  E    -   1   2       -   4   2       -   2   2
      Class   A    4   -   3       5   -   2       5   -   6
              O    5   3   -       4   5   -       3   5   -

Table 15
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features and for three of the four-feature sets. (E = unergatives, A = unaccusatives, O = object-drops.)

                               Assigned Class
                  All features     w/o trans       w/o pass        w/o vbn
                   E   A   O      E   A   O       E   A   O       E   A   O
      Actual  E    -   1   2      -   2   2       -   1   3       -   1   5
      Class   A    4   -   3      3   -   7       1   -   4       2   -   4
              O    5   3   -      2   5   -       5   3   -       4   6   -

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
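Computed over an error matrix, the confusability of a class pair is just the sum of its two off-diagonal cells. The counts below are those of the all-features panel of Table 14:

```python
def confusability(errors, c1, c2):
    """Errors confusing c1 and c2: verbs of c1 labelled c2, plus verbs of
    c2 labelled c1 (the two cells opposite the diagonal)."""
    return errors[c1][c2] + errors[c2][c1]

# Off-diagonal error counts, all-features panel of Table 14
# (E = unergative, A = unaccusative, O = object-drop).
errors = {
    "E": {"E": 0, "A": 1, "O": 2},
    "A": {"E": 4, "A": 0, "O": 3},
    "O": {"E": 5, "A": 3, "O": 0},
}
```

With all five features, for example, the unergative/unaccusative confusability is 1 + 4 = 5 errors and the unaccusative/object-drop confusability is 3 + 3 = 6.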

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features


we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature, compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives, and between unaccusatives and object-drops, is 9 and 7 errors respectively; but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature, compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added: the confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors from 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated to trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans (that is, also making a three-way distinction among the classes), although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method in evaluating our approach to verb classification.


Table 16
Percent agreement (Agr) and pair-wise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

               Program          E1               E2               E3
               Agr     K        Agr     K        Agr     K        Agr     K
      E1       59%     .36
      E2       68%     .50      75%     .59
      E3       66%     .49      70%     .53      77%     .66
      Levin    69.5%   .54      71%     .56      86.5%   .80      83%     .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results, provided by the non-forced-choice task, where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from Chicago University Press. The verbs were to be classified into the three target classes (unergative, unaccusative, and object-drop), which were described in the instructions (see footnote 10). (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program (see footnote 11). Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject, and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training (which yields an accuracy of 69.5%), because those results provide the classification for each of the individual verbs.


an intuitive measure. However, this measure does not take into account how much the experts agree beyond the agreement expected by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55-57), using the z distribution to determine significance (p < 0.001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0, for no agreement above chance, to 1, for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect then that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
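A Cohen-style kappa over two label sequences can be sketched as below; note that the paper computes Kappa following Klauer (1987), which this simple two-rater formulation may not match in every detail:

```python
from collections import Counter

def kappa(labels_a, labels_b):
    """Agreement between two raters, corrected for the agreement expected
    by chance given each rater's own label proportions."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)
```

Two raters who always agree get K = 1; raters whose agreement is exactly what their label proportions predict by chance get K = 0.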

Evaluating the experts' performance, summarized in Table 16, we can note two things, which confirm our expectations. First, the task is difficult: it is not performed at 100% accuracy (or close to it) even by trained experts when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = 0.6) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
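The majority-vote combination, with its fallback to the most accurate expert when the three labels are all different, can be sketched as:

```python
from collections import Counter

def combine_experts(votes_per_verb, best_expert):
    """Majority label per verb; when no label gets at least two of the
    three votes, fall back on the expert with the best gold-standard
    agreement (index best_expert)."""
    combined = []
    for votes in votes_per_verb:          # one (e1, e2, e3) tuple per verb
        label, count = Counter(votes).most_common(1)[0]
        combined.append(label if count >= 2 else votes[best_expert])
    return combined

decisions = combine_experts(
    [("unerg", "unerg", "unacc"),         # majority: unerg
     ("unerg", "unacc", "objdrop")],      # no majority: defer to expert 1
    best_expert=1,
)
```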

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task (86.5%) as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than one of the human experts, in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.

One interesting question is whether experts and program disagree on the same verbs, and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard. However, in 9 cases, at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3), and comparably many on unaccusatives and object-drops (7 and 8 respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This


clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1,000), and those with log frequency over 3 (i.e., frequency over 1,000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops); the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops); and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification, nor the accuracy of the expert that had the best agreement with Levin, were linearly affected by frequency. For the majority vote: verbs with frequency less than 100 yield an accuracy of 92% (K = .84); verbs with frequency between 100 and 1,000, accuracy 80% (K = .69); and verbs with frequency over 1,000, accuracy 90% (K = .82). For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5% (K = .74); verbs with frequency between 100 and 1,000, accuracy 84% (K = .76); and verbs with frequency over 1,000, accuracy 90% (K = .82).

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest-frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs, taken from Levin's (1993) electronic index, as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = 0.64) with each other, and 86% (K = 0.80) and 69% (K = 0.57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = 0.56] with each other, and 68% [K = 0.55] and 60.5% [K = 0.46] with the gold standard, respectively.) These results show that the forced-choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally, in syntactic alternations, giving rise to differences in subcategorization frames, or the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs, their argument structure, are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989] and Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a


"change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples, meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998, 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful in predicting semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993], Sanfilippo and Poznanski [1992], Manning [1993], Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed, which can work


on a rough collection of texts, and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and the work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition, or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE, and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (an error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (an error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy, compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multilingual texts. Like ours, their work goes beyond statistics over subcategorizations to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.
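The relative error-rate reductions cited throughout this section all follow one formula: the fraction of the baseline's error that the system eliminates. A minimal sketch (the function name is ours, not from any of the cited papers), checked against Lapata and Brew's figures:

```python
def error_reduction(accuracy, baseline, ceiling=1.0):
    """Fraction of the baseline error eliminated by the system.

    With ceiling=1.0 this is (accuracy - baseline) / (1 - baseline);
    a ceiling below 1.0 (e.g. a human upper bound) rescales the error range.
    """
    return (accuracy - baseline) / (ceiling - baseline)

# Lapata and Brew (1999), as cited in the text:
print(round(error_reduction(0.918, 0.657), 2))  # 0.76
print(round(error_reduction(0.839, 0.613), 2))  # 0.58
```

The same formula with a lowered ceiling reproduces reductions computed against an expert upper bound rather than perfect accuracy.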


Merlo and Stevenson: Statistical Verb Classification

Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and a recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45%, with balanced precision and recall.[12] The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
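The balanced F-score is the harmonic mean of precision and recall; the 45% figure we derive for Schulte im Walde's results can be reproduced as follows (a standard formula, not code from either paper):

```python
def f_score(precision, recall, beta=1.0):
    """F-measure; beta = 1 weights precision and recall equally."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision 61%, recall 36%, as cited in the text:
print(round(f_score(0.61, 0.36), 2))  # 0.45
```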

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, for any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, for a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.
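McCarthy's detection strategy can be illustrated schematically: compare the distributions of fillers in two argument slots, and posit an alternation when the slots of alternating frames are more similar than those of non-alternating ones. The sketch below uses cosine similarity over raw head-word counts; the verb and counts are invented for illustration, and McCarthy's actual measure compares probability distributions over WordNet classes rather than head words:

```python
from math import sqrt

def cosine(d1, d2):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(c * d2.get(w, 0) for w, c in d1.items())
    norm1 = sqrt(sum(c * c for c in d1.values()))
    norm2 = sqrt(sum(c * c for c in d2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Hypothetical slot fillers for a causative verb like "melt":
intransitive_subjects = {"butter": 5, "ice": 4, "snow": 2}
transitive_objects    = {"butter": 6, "ice": 3, "wax": 1}
transitive_subjects   = {"cook": 4, "sun": 3}

# The alternating slots (intransitive subject / transitive object) share
# fillers, so they score higher than the non-alternating pair:
assert cosine(intransitive_subjects, transitive_objects) > \
       cosine(intransitive_subjects, transitive_subjects)
```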

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.[13] This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and of Lapata and Brew (1999) in attempting to learn semantic notions from distributions of indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

[12] A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

[13] Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and of animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to the external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at the relative frequency of subcategorization frames (as with our trans feature), or the relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).
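The approximations described here can be made concrete. The sketch below gives one plausible formulation (ours, for illustration; the counts are invented, and the exact feature definitions in the experimental sections of the paper differ in detail): anim as the proportion of pronoun subjects, and caus as the degree of overlap between the head nouns observed in subject and in object position of the same verb:

```python
def anim(pronoun_subject_count, subject_count):
    """Proportion of subjects that are pronouns, as a proxy for animacy."""
    return pronoun_subject_count / subject_count if subject_count else 0.0

def caus(subject_heads, object_heads):
    """Share of subject/object tokens whose head also occurs in the other slot."""
    overlap = set(subject_heads) & set(object_heads)
    tokens = subject_heads + object_heads
    return sum(1 for h in tokens if h in overlap) / len(tokens) if tokens else 0.0

print(anim(12, 50))                             # 0.24
print(round(caus(["ice", "sun"], ["ice"]), 2))  # 0.67: "ice" fills both slots
```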

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept which proposes that "regular" ambiguity in classification (i.e., sets of verbs that have the same multi-way classifications according to Levin [1993]) can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form, in themselves, a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction, or of the words in the construction). Thus the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.
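As a toy illustration of the task (not the paper's actual learner, which is a decision-tree induction system; see Quinlan [1992] in the references), the normalized feature vectors from Appendix A can drive even a simple nearest-neighbour classifier. Here two training verbs per class are chosen for illustration, with one held-out verb:

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Normalized feature vectors (vbn, pass, trans, caus, anim) from Appendix A;
# two training verbs per class, selected here purely for illustration.
TRAIN = {
    "jumped":   ("unergative",   (0.09, 0.00, 0.03, 0.00, 0.20)),
    "rushed":   ("unergative",   (0.22, 0.12, 0.20, 0.00, 0.10)),
    "exploded": ("unaccusative", (0.34, 0.02, 0.66, 0.37, 0.04)),
    "opened":   ("unaccusative", (0.21, 0.09, 0.69, 0.16, 0.36)),
    "painted":  ("object-drop",  (0.72, 0.18, 0.71, 0.00, 0.38)),
    "played":   ("object-drop",  (0.38, 0.16, 0.24, 0.00, 0.00)),
}

def classify(features):
    """Label of the nearest training vector (1-nearest-neighbour)."""
    label, _ = min(TRAIN.values(), key=lambda lv: dist(lv[1], features))
    return label

# "cleared", held out; Appendix A lists it as unaccusative:
print(classify((0.58, 0.40, 0.50, 0.31, 0.06)))  # unaccusative
```

Even this crude learner separates the held-out verb correctly on these few points; the paper's results show what a proper decision-tree learner achieves over all 59 verbs.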

Returning to our original questions of what can, and need, be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and a learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.

Acknowledgments
We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM), the United States NSF (grants 9702331 and 9818322 to SS), the Canadian NSERC (grant to SS), the University of Toronto, and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15. University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: the Kappa statistic. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.


Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe Zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: An overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Computers and the Humanities, Special Issue: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1–2):217–222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.


Pinker, Steven. 1989. Learnability and Cognition: The Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Associates, Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Department of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.


Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs
             Freq   vbn    pass   trans  caus   anim

Min Value       8   0.00   0.00   0.00   0.00   0.00
Max Value    4088   1.00   0.39   0.74   0.00   1.00

floated       176   0.43   0.26   0.74   0.00   0.17
hurried        86   0.40   0.31   0.50   0.00   0.37
jumped       4088   0.09   0.00   0.03   0.00   0.20
leaped        225   0.09   0.00   0.05   0.00   0.13
marched       238   0.10   0.01   0.09   0.00   0.12
paraded        33   0.73   0.39   0.46   0.00   0.50
raced         123   0.01   0.00   0.06   0.00   0.15
rushed        467   0.22   0.12   0.20   0.00   0.10
vaulted        54   0.00   0.00   0.41   0.00   0.03
wandered       67   0.02   0.00   0.03   0.00   0.32
galloped       12   1.00   0.00   0.00   0.00   0.00
glided         14   0.00   0.00   0.08   0.00   0.50
hiked          25   0.28   0.12   0.29   0.00   0.40
hopped         29   0.00   0.00   0.21   0.00   1.00
jogged          8   0.29   0.00   0.29   0.00   0.33
scooted        10   0.00   0.00   0.43   0.00   0.00
scurried       21   0.00   0.00   0.00   0.00   0.14
skipped        82   0.22   0.02   0.64   0.00   0.16
tiptoed        12   0.17   0.00   0.00   0.00   0.00
trotted        37   0.19   0.17   0.07   0.00   0.18

Unaccusative Verbs
             Freq   vbn    pass   trans  caus   anim

Min Value      13   0.16   0.00   0.02   0.00   0.00
Max Value    5543   0.95   0.80   0.76   0.41   0.36

boiled         58   0.92   0.70   0.42   0.00   0.00
cracked       175   0.61   0.19   0.76   0.02   0.14
dissolved     226   0.51   0.58   0.71   0.05   0.11
exploded      409   0.34   0.02   0.66   0.37   0.04
flooded       235   0.47   0.57   0.44   0.04   0.03
fractured      55   0.95   0.76   0.51   0.00   0.00
hardened      123   0.92   0.55   0.56   0.12   0.00
melted         70   0.80   0.44   0.02   0.00   0.19
opened       3412   0.21   0.09   0.69   0.16   0.36
solidified     34   0.65   0.21   0.68   0.00   0.12
collapsed     950   0.16   0.00   0.16   0.01   0.02
cooled        232   0.85   0.21   0.29   0.13   0.11
folded        189   0.73   0.33   0.23   0.00   0.00
widened      1155   0.18   0.02   0.13   0.41   0.01
changed      5543   0.73   0.23   0.47   0.22   0.08
cleared      1145   0.58   0.40   0.50   0.31   0.06
divided      1539   0.93   0.80   0.17   0.10   0.05
simmered       13   0.83   0.00   0.09   0.00   0.00
stabilized    286   0.92   0.13   0.18   0.35   0.00


Object-Drop Verbs
             Freq   vbn    pass   trans  caus   anim

Min Value      39   0.10   0.04   0.21   0.00   0.00
Max Value   15063   0.95   0.99   1.00   0.24   0.42

carved        185   0.85   0.66   0.98   0.00   0.00
danced         88   0.22   0.14   0.37   0.00   0.00
kicked        308   0.30   0.18   0.97   0.00   0.33
knitted        39   0.95   0.99   0.93   0.00   0.00
painted       506   0.72   0.18   0.71   0.00   0.38
played       2689   0.38   0.16   0.24   0.00   0.00
reaped        172   0.56   0.05   0.90   0.00   0.22
typed          57   0.81   0.74   0.81   0.00   0.00
washed        137   0.79   0.60   1.00   0.00   0.00
yelled         74   0.10   0.04   0.38   0.00   0.00
borrowed     1188   0.77   0.15   0.60   0.13   0.19
inherited     357   0.60   0.13   0.64   0.06   0.32
organized    1504   0.85   0.38   0.65   0.18   0.07
rented        232   0.72   0.22   0.61   0.00   0.42
sketched       44   0.67   0.17   0.44   0.00   0.20
cleaned       160   0.83   0.47   0.21   0.05   0.21
packed        376   0.84   0.12   0.40   0.05   0.19
studied       901   0.66   0.17   0.57   0.05   0.11
swallowed     152   0.79   0.44   0.35   0.04   0.22
called      15063   0.56   0.22   0.72   0.24   0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.

Accuracy  SE   Features                     Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim     57.3      0.5  trans caus
69.8      0.5  trans vbn caus anim          57.3      0.5  pass vbn anim
67.3      0.6  trans pass vbn anim          56.7      0.5  pass caus anim
66.7      0.5  trans vbn anim               55.7      0.5  vbn caus
66.5      0.5  trans pass caus anim         55.7      0.1  caus
64.4      0.5  trans vbn caus               55.4      0.4  pass caus
63.2      0.6  trans pass vbn caus          55.0      0.6  trans pass vbn
63.0      0.5  trans pass anim              54.7      0.4  trans pass
62.9      0.4  trans caus anim              54.2      0.5  trans vbn
62.1      0.5  caus anim                    52.5      0.5  vbn
61.7      0.5  trans pass caus              50.9      0.5  pass anim
61.6      0.6  pass vbn caus anim           50.2      0.6  pass vbn
60.1      0.4  vbn caus anim                50.2      0.5  pass
59.5      0.6  trans anim                   47.1      0.4  trans
59.4      0.5  vbn anim                     35.3      0.5  anim
57.4      0.6  pass vbn caus
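The overlap test described above can be written out directly. The sketch below (our code, for illustration) applies it to the first and third rows of the table (the full feature set versus trans pass vbn anim):

```python
def ci95(accuracy, se, t=2.01):
    """95% confidence interval for df = 49 (t = 2.01)."""
    return (accuracy - t * se, accuracy + t * se)

def significantly_different(acc1, se1, acc2, se2):
    """True if the two 95% confidence intervals do not overlap (p < .05)."""
    lo1, hi1 = ci95(acc1, se1)
    lo2, hi2 = ci95(acc2, se2)
    return hi1 < lo2 or hi2 < lo1

print(significantly_different(69.8, 0.5, 67.3, 0.6))  # True
print(significantly_different(69.8, 0.5, 69.8, 0.5))  # False
```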


Table 1
Examples of verbs from the three optionally intransitive classes.

Unergative    The horse raced past the barn.
              The jockey raced the horse past the barn.
Unaccusative  The butter melted in the pan.
              The cook melted the butter in the pan.
Object-Drop   The boy played.
              The boy played soccer.

Other work has attempted to learn deeper semantic properties, such as selectional restrictions (Resnik 1996; Riloff and Schmelzenbach 1998), verbal aspect (Klavans and Chodorow 1992; Siegel 1999), or lexical-semantic verb classes such as those proposed by Levin (1993) (Aone and McKee 1996; McCarthy 2000; Lapata and Brew 1999; Schulte im Walde 2000). In this paper, we focus on argument structure, the thematic roles assigned by a verb to its arguments, as the way in which the relational semantics of the verb is represented at the syntactic level.

Specifically, our proposal is to automatically classify verbs, based on argument structure properties, using statistical corpus-based methods. We address the problem of classification because it provides a means for lexical organization which can effectively capture generalizations over verbs (Palmer 2000). Within the context of classification, the use of argument structure provides a finer discrimination among verbs than that induced by subcategorization frames (as we see below in our example classes, which allow the same subcategorizations but differ in thematic assignment), but a coarser classification than that proposed by Levin (in which classes such as ours are further subdivided according to more detailed semantic properties). This level of classification granularity appears to be appropriate for numerous language engineering tasks. Because knowledge of argument structure captures fundamental participant/event relations, it is crucial in parsing and generation (e.g., Srinivas and Joshi [1999]; Stede [1998]), in machine translation (Dorr 1997), and in information retrieval (Klavans and Kan 1998) and extraction (Riloff and Schmelzenbach 1998). Our use of statistical corpus-based methods to achieve this level of classification is motivated by our hypothesis that class-based differences in argument structure are reflected in statistics over the usages of the component verbs, and that those statistics can be automatically extracted from a large annotated corpus.

The particular classification problem within which we investigate this hypothesis is the task of learning the three major classes of optionally intransitive verbs in English: unergative, unaccusative, and object-drop verbs. (For the unergative/unaccusative distinction, see Perlmutter [1978]; Burzio [1986]; Levin and Rappaport Hovav [1995].) Table 1 shows an example of a verb from each class, in its transitive and intransitive usages. These three classes are motivated by theoretical linguistic properties (see discussion and references below, and in Stevenson and Merlo [1997b]; Merlo and Stevenson [2000b]). Furthermore, it appears that the classes capture typological distinctions that are useful for machine translation (for example, causative unergatives are ungrammatical in many languages), as well as processing distinctions that are useful for generating naturally occurring language (for example, reduced relatives with unergative verbs are awkward, but they are acceptable, and in fact often preferred to full relatives, for unaccusative and object-drop verbs) (Stevenson and Merlo 1997b; Merlo and Stevenson 1998).


Merlo and Stevenson        Statistical Verb Classification

Table 2
Summary of thematic role assignments by class

                           Transitive                      Intransitive
Classes          Subject                  Object           Subject

Unergative       Agent (of Causation)     Agent            Agent
Unaccusative     Agent (of Causation)     Theme            Theme
Object-Drop      Agent                    Theme            Agent

The question, then, is what underlies these distinctions. We identify the property that precisely distinguishes among these three classes as that of argument structure, i.e., the thematic roles assigned by the verbs. The thematic roles for each class, and their mapping to subject and object positions, are summarized in Table 2. Note that verbs across these three classes allow the same subcategorization frames (taking an NP object or occurring intransitively); thus classification based on subcategorization alone would not distinguish them. On the other hand, each of the three classes is comprised of multiple Levin classes, because the latter reflect more detailed semantic distinctions among the verbs (Levin 1993); thus classification based on Levin's labeling would miss generalizations across the three broader classes. By contrast, as shown in Table 2, each class has a unique pattern of thematic assignments, which categorize the verbs precisely into the three classes of interest.

Although the granularity of our classification differs from Levin's, we draw on her hypothesis that semantic properties of verbs are reflected in their syntactic behavior. The behavior that Levin focuses on is the notion of diathesis alternation: an alternation in the expression of the arguments of a verb, such as the different mappings between transitive and intransitive that our verbs undergo. Whether a verb participates in a particular diathesis alternation or not is a key factor in Levin's approach to classification. We, like others in a computational framework, have extended this idea by showing that statistics over the alternants of a verb effectively capture information about its class (Lapata 1999; McCarthy 2000; Lapata and Brew 1999).

In our specific task, we analyze the pattern of thematic assignments given in Table 2 to develop statistical indicators that are able to determine the class of an optionally intransitive verb by capturing information across its transitive and intransitive alternants. These indicators serve as input to a machine learning algorithm under a supervised training methodology, which produces an automatic classification system for our three verb classes. Since we rely on patterns of behavior across multiple occurrences of a verb, we begin with the problem of assigning a single class to the entire set of usages of a verb within the corpus. For example, we measure properties across all occurrences of a word such as raced in order to assign a single classification to the lexical entry for the verb race. This contrasts with work classifying individual occurrences of a verb in each local context, which has typically relied on training that includes instances of the verbs to be classified: essentially developing a bias that is used in conjunction with the local context to determine the best classification for new instances of previously seen verbs. By contrast, our method assigns a classification to verbs that have not previously been seen in the training data. Thus, while we do not as yet assign different classes to the instances of a verb, we can assign a single predominant class to new verbs that have never been encountered.

Computational Linguistics        Volume 27, Number 3

To preview our results, we demonstrate that combining just five numerical indicators, automatically extracted from large text corpora, is sufficient to reduce the error rate in this classification task by more than 50% over chance. Specifically, we achieve almost 70% accuracy in a task whose baseline (chance) performance is 34%, and whose expert-based upper bound is calculated at 86.5%.
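As a quick sanity check on the error-reduction arithmetic, using only the accuracy figures quoted above (the variable names here are ours, for illustration):

```python
# Verifying the "error rate reduced by more than 50%" claim
# from the reported accuracy figures.
chance_accuracy = 0.34      # baseline (chance) accuracy
system_accuracy = 0.698     # reported classifier accuracy

chance_error = 1 - chance_accuracy          # about 0.66
system_error = 1 - system_accuracy          # about 0.302
reduction = (chance_error - system_error) / chance_error

print(round(reduction, 3))  # 0.542, i.e., more than a 50% error reduction
```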

Beyond the interest for the particular classification task at hand, this work addresses more general issues concerning verb class distinctions based in argument structure. We evaluate our hypothesis that such distinctions are reflected in statistics over corpora through a computational experimental methodology, in which we investigate, as indicated, each of the subhypotheses below in the context of the three verb classes under study:

° Lexical features capture argument structure differences between verb classes.1

° The linguistically distinctive features exhibit distributional differences across the verb classes that are apparent within linguistic experience (i.e., they can be collected from text).

° The statistical distributions of (some of) the features contribute to learning the classifications of the verbs.

In the following sections, we show that all three hypotheses above are borne out. In Section 2, we describe the argument structure distinctions of our three verb classes in more detail. In support of the first hypothesis above, we discuss lexical correlates of the underlying differences in thematic assignments that distinguish the three verb classes under investigation. In Section 3, we show how to approximate these features by simple syntactic counts, and how to perform these counts on available corpora. We confirm the second hypothesis above by showing that the differences in distribution predicted by the underlying argument structures are largely found in the data. In Section 4, in a series of machine learning experiments and a detailed analysis of errors, we confirm the third hypothesis by showing that the differences in the distribution of the extracted features are successfully used for verb classification. Section 5 evaluates the significance of these results by comparing the program's accuracy to an expert-based upper bound. We conclude the paper with a discussion of its contributions, comparison to related work, and suggestions for future extensions.

2. Deriving Classification Features from Argument Structure

Our task is to automatically build a classifier that can distinguish the three major classes of optionally intransitive verbs in English. As described above, these classes are differentiated by their argument structures. In the first subsection below, we elaborate on our description of the thematic role assignments for each of the verb classes under investigation: unergative, unaccusative, and object-drop. This analysis yields a distinctive pattern of thematic assignment for each class. (For more detailed discussion concerning the linguistic properties of these classes and the behavior of their component verbs, please see Stevenson and Merlo [1997b], Merlo and Stevenson [2000b].)

Of course, the key to any automatic classification task is to determine a set of useful features for discriminating the items to be classified. In the second subsection below,

1 By lexical, we mean features that we think are likely stored in the lexicon, because they are properties of words and not of phrases or sentences. Note, however, that some lexical features may not necessarily be stored with individual words; indeed, the motivation for classifying verbs, to capture generalizations within each class, suggests otherwise.


we show how the analysis of thematic distinctions enables us to determine lexical properties that we hypothesize will exhibit useful, detectable frequency differences in our corpora, and thus serve as the machine learning features for our classification experiments.

2.1 The Argument Structure Distinctions
The verb classes are exemplified below in sentences repeated from Table 1 for ease of exposition.

Unergative     (1a) The horse raced past the barn.
               (1b) The jockey raced the horse past the barn.

Unaccusative   (2a) The butter melted in the pan.
               (2b) The cook melted the butter in the pan.

Object-Drop    (3a) The boy played.
               (3b) The boy played soccer.

The example sentences illustrate that all three classes participate in a diathesis alternation that relates a transitive and intransitive form of the verb. However, according to Levin (1993), each class exhibits a different type of diathesis alternation, which is determined by the particular semantic relations of the arguments to the verb. We make these distinctions explicit by drawing on a standard notion of thematic role, as each class has a distinct pattern of thematic assignments (i.e., different argument structures).

We assume here that a thematic role is a label taken from a fixed inventory of grammaticalized semantic relations; for example, an Agent is the doer of an action, and a Theme is the entity undergoing an event (Gruber 1965). While admitting that such notions as Agent and Theme lack formal definitions (in our work, and in the literature more widely), the distinctions are clear enough to discriminate our three verb classes. For our purposes, these roles can simply be thought of as semantic labels which are non-decomposable, but there is nothing in our approach that rests on this assumption. Thus, our approach would also be compatible with a feature-based definition of participant roles, as long as the features capture such general distinctions as, for example, the doer of an action and the entity acted upon (Dowty 1991).

Note that, in our focus on verb class distinctions, we have not considered finer-grained features that rely on more specific semantic features, such as, for example, that the subject of the intransitive melt must be something that can change from solid to liquid. While this type of feature may be important for semantic distinctions among individual verbs, it thus far seems irrelevant to the level of verb classification that we adopt, which groups verbs more broadly according to syntactic and (somewhat coarser-grained) semantic properties.

Our analysis of thematic assignment, which was summarized in Table 2, repeated here as Table 3, is elaborated here for each verb class. The sentences in (1) above illustrate the relevant alternants of an unergative verb, race. Unergatives are intransitive action verbs whose transitive form, as in (1b), can be the causative counterpart of the intransitive form (1a). The type of causative alternation that unergatives participate in is the "induced action alternation," according to Levin (1993). For our thematic analysis, we note that the subject of an intransitive activity verb is specified to be an Agent. The subject of the transitive form is indicated by the label Agent of Causation, which indicates that the thematic role assigned to the subject is marked as the role which is


Table 3
Summary of thematic assignments

                           Transitive                      Intransitive
Classes          Subject                  Object           Subject

Unergative       Agent (of Causation)     Agent            Agent
Unaccusative     Agent (of Causation)     Theme            Theme
Object-Drop      Agent                    Theme            Agent

introduced with the causing event. In a causative alternation, the semantic argument of the subject of the intransitive surfaces as the object of the transitive (Brousseau and Ritter 1991; Hale and Keyser 1993; Levin 1993; Levin and Rappaport Hovav 1995). For unergatives, this argument is an Agent, and thus the alternation yields an object in the transitive form that receives an Agent thematic role (Cruse 1972). These thematic assignments are shown in the first row of Table 3.

The sentences in (2) illustrate the corresponding forms of an unaccusative verb, melt. Unaccusatives are intransitive change-of-state verbs, as in (2a); the transitive counterpart for these verbs also exhibits a causative alternation, as in (2b). This is the "causative/inchoative alternation" (Levin 1993). Like unergatives, the subject of a transitive unaccusative is marked as the Agent of Causation. Unlike unergatives, though, the alternating argument of an unaccusative (the subject of the intransitive form that becomes the object of the transitive) is an entity undergoing a change of state without active participation, and is therefore a Theme. The resulting pattern of thematic assignments is indicated in the second row of Table 3.

The sentences in (3) use an object-drop verb, play. These are activity verbs that exhibit a non-causative diathesis alternation, in which the object is simply optional. This is dubbed "the unexpressed object alternation" (Levin 1993), and has several subtypes that we do not distinguish here. The thematic assignment for these verbs is simply Agent for the subject (in both transitive and intransitive forms), and Theme for the optional object; see the last row of Table 3.

For further details and support of this analysis, please see the discussion in Stevenson and Merlo (1997b) and Merlo and Stevenson (2000b). For our purposes here, the important fact to note is that each of the three classes can be uniquely identified by the pattern of thematic assignments across the two alternants of the verbs.
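The class-identifying patterns of Table 3 can be made concrete as a small lookup table. This is an illustrative sketch only; the structure and names are ours, not part of the original system:

```python
# Thematic assignment patterns from Table 3, one entry per verb class.
# Keys: (alternant, grammatical position) -> thematic role.
THEMATIC_PATTERNS = {
    "unergative": {
        ("transitive", "subject"): "Agent (of Causation)",
        ("transitive", "object"): "Agent",
        ("intransitive", "subject"): "Agent",
    },
    "unaccusative": {
        ("transitive", "subject"): "Agent (of Causation)",
        ("transitive", "object"): "Theme",
        ("intransitive", "subject"): "Theme",
    },
    "object-drop": {
        ("transitive", "subject"): "Agent",
        ("transitive", "object"): "Theme",
        ("intransitive", "subject"): "Agent",
    },
}

# Each class has a unique pattern, so the mapping is invertible:
patterns = [tuple(sorted(p.items())) for p in THEMATIC_PATTERNS.values()]
assert len(patterns) == len(set(patterns))
```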

2.2 Features for Automatic Classification
Our next task, then, is to derive from these thematic patterns useful features for automatically classifying the verbs. In what follows, we refer to the columns of Table 3 to explain how we expect the thematic distinctions to give rise to distributional properties which, when appropriately approximated through corpus counts, will discriminate across the three classes.

Transitivity. Consider the first two columns of thematic roles in Table 3, which illustrate the role assignment in the transitive construction. The Prague school's notion of linguistic markedness (Jakobson 1971; Trubetzkoy 1939) enables us to establish a scale of markedness of these thematic assignments, and make a principled prediction about their frequency of occurrence. Typical tests to determine the unmarked element of a pair or scale are simplicity (the unmarked element is simpler), distribution (the unmarked member is more widely attested across languages), and frequency (the unmarked member is more frequent) (Greenberg 1966; Moravcsik and Wirth 1983). The claim of markedness theory is that, once an element has been identified by one test as the unmarked element of a scale, then all other tests will be correlated. The three thematic assignments appear to be ranked on a scale by the simplicity and distribution tests, as we describe below. From this we can conclude that frequency, as a third correlated test, should also be ranked by the same scale, and we can therefore make predictions about the expected frequencies of the three thematic assignments.

First, we note that the specification of an Agent of Causation for transitive unergatives (such as race) and unaccusatives (such as melt) indicates a causative construction. Causative constructions relate two events, the causing event and the core event described by the intransitive verb; the Agent of Causation is the Agent of the causing event. This double event structure can be considered as more complex than the single event that is found in a transitive object-drop verb (such as play) (Stevenson and Merlo 1997b). The simplicity test thus indicates that the causative unergatives and unaccusatives are marked in comparison to the transitive object-drop verbs.

We further observe that the causative transitive of an unergative verb has an Agent thematic role in object position, which is subordinated to the Agent of Causation in subject position, yielding an unusual "double agentive" thematic structure. This lexical causativization of unergatives (in contrast to analytic causativization) is a distributionally rarer phenomenon, found in fewer languages, than lexical causatives of unaccusatives. In asking native speakers about our verbs, we have found that lexical causatives of unergative verbs are not attested in Italian, French, German, Portuguese, Gungbe (Kwa family), and Czech. On the other hand, the lexical causatives are possible for unaccusative verbs (i.e., where the object is a Theme) in all these languages. Vietnamese appears to allow a very restricted form of causativization of unergatives, limited to only those cases that have a comitative reading. The typological distribution test thus indicates that unergatives are more marked than unaccusatives in the transitive form.

From these observations, we can conclude that unergatives (such as race) have the most marked transitive argument structure, unaccusatives (such as melt) have an intermediately marked transitive argument structure, and object-drops (such as play) have the least marked transitive argument structure of the three. Under the assumptions of markedness theory outlined above, we then predict that unergatives are the least frequent in the transitive, that unaccusatives have intermediate frequency in the transitive, and that object-drop verbs are the most frequent in the transitive.

Causativity. Due to the causative alternation of unergatives and unaccusatives, the thematic role of the subject of the intransitive is identical to that of the object of the transitive, as shown in the second and third columns of thematic roles in Table 3. Given the identity of thematic role mapped to subject and object positions across the two alternants, we expect to observe the same noun occurring at times as subject of the verb, and at other times as object of the verb. In contrast, for object-drop verbs, the thematic role of the subject of the intransitive is identical to that of the subject of the transitive, not the object of the transitive. We therefore expect that it will be less common for the same noun to occur in subject and object position across instances of the same object-drop verb.

Thus, we hypothesize that this pattern of thematic role assignments will be reflected in a differential amount of usage, across the classes, of the same nouns as subjects and objects for a given verb. Generally, we would expect that causative verbs (in our case, the unergative and unaccusative verbs) would have a greater degree of overlap of nouns in subject and object position than non-causative transitive verbs (in our case, the object-drop verbs). However, since the causative is a transitive use, and the transitive use of unergatives is expected to be rare (see above), we do not expect unergatives to exhibit a high degree of detectable overlap in a corpus. Thus, this overlap of subjects and objects should primarily distinguish unaccusatives (predicted to have high overlap of subjects and objects) from the other two classes (each of which is predicted to have low [detectable] overlap of subjects and objects).

Animacy. Finally, considering the roles in the first and last columns of thematic assignments in Table 3, we observe that unergative and object-drop verbs assign an agentive role to their subject in both the transitive and intransitive, while unaccusatives assign an agentive role to their subject only in the transitive. Under the assumption that the intransitive use of unaccusatives is not rare, we then expect that unaccusatives will occur less often overall with an agentive subject than will the other two verb classes. (The assumption that unaccusatives are not rare in the intransitive is based on the linguistic complexity of the causative transitive alternant, and is borne out in our corpus analysis.) On the further assumption that Agents tend to be animate entities, more so than Themes are, we expect that unaccusatives will occur less frequently with an animate subject, compared to unergative and object-drop verbs. Note the importance of our use of frequency distributions: the claim is not that only Agents can be animate, but rather that nouns that receive an Agent role will more often be animate than nouns that receive a Theme role.

Additional Features. The above interactions between thematic roles and the syntactic expressions of arguments thus lead to three features whose distributional properties appear promising for distinguishing unergative, unaccusative, and object-drop verbs: transitivity, causativity, and animacy of subject. We also investigate two additional syntactic features: the use of the passive or active voice, and the use of the past participle or simple past part-of-speech (POS) tag (VBN or VBD in the Penn Treebank style). These features are related to the transitive/intransitive alternation, since a passive use implies a transitive use of the verb, as well as to the use of a past participle form of the verb.2

Table 4 summarizes the features we derive from the thematic properties, and our expectations concerning their frequency of use. We hypothesize that these five features will exhibit distributional differences in the observed usages of the verbs that can be used for classification. In the next section, we describe the actual corpus counts that we develop to approximate the features we have identified. (Notice that the counts will be imperfect approximations to the thematic knowledge, beyond the inevitable errors due to automatic extraction from large, automatically annotated corpora. Even when the counts are precise, they only constitute an approximation to the actual thematic notions, since the features we are using are not logically implied by the knowledge we want to capture, but only statistically correlated.)

3. Data Collection and Analysis

Clearly, some of the features we've proposed are difficult (e.g., the passive use) or impossible (e.g., animate subject use) to automatically extract with high accuracy from a

2 For our sample verbs, the statistical correlation between the transitive and passive features is highly significant (N = 59, R = .44, p < .001), as is the correlation between the transitive and past participle features (N = 59, R = .36, p < .005). (Since, as explained in the next section, our features are expressed as proportions, e.g., percent transitive use out of detected transitive and intransitive use, correlations of intransitivity with passive or past participle use have the same magnitude, but are negative.)


Table 4
The features and expected behavior

Feature         Expected Frequency Pattern    Explanation

Transitivity    Unerg < Unacc < ObjDrop       Unaccusatives and unergatives have a causative
                                              transitive, hence lower transitive use. Furthermore,
                                              unergatives have an agentive object, hence very low
                                              transitive use.

Causativity     Unerg, ObjDrop < Unacc        Object-drop verbs do not have a causal agent, hence
                                              low "causative" use. Unergatives are rare in the
                                              transitive, hence low causative use.

Animacy         Unacc < Unerg, ObjDrop        Unaccusatives have a Theme subject in the
                                              intransitive, hence lower use of animate subjects.

Passive Voice   Unerg < Unacc < ObjDrop       Passive implies transitive use, hence correlated
                                              with transitive feature.

VBN Tag         Unerg < Unacc < ObjDrop       Passive implies past participle use (VBN), hence
                                              correlated with transitive (and passive).

large corpus, given the current state of annotation. However, we do assume that currently available corpora, such as the Wall Street Journal (WSJ), provide a representative and large enough sample of language from which to gather corpus counts that can approximate the distributional patterns of the verb class alternations. Our work draws on two text corpora: one, an automatically tagged combined corpus of 65 million words (primarily WSJ); the second, an automatically parsed corpus of 29 million words (a subset of the WSJ text from the first corpus). Using these corpora, we develop counting procedures that yield relative frequency distributions for approximations to the five linguistic features we have determined, over a sample of verbs from our three classes.

3.1 Materials and Method
We chose a set of 20 verbs from each class, based primarily on the classification in Levin (1993).3 The complete list of verbs appears in Table 5; the group 1/group 2 designation is explained below, in the section on counting. As indicated in the table, unergatives are manner-of-motion verbs (from the "run" class in Levin); unaccusatives are change-of-state verbs (from several of the classes in Levin's change-of-state super-class); while object-drop verbs were taken from a variety of classes in Levin's classification, all of which undergo the unexpressed object alternation. The most frequently used classes are verbs of change of possession, image-creation verbs, and verbs of creation and transformation. The selection of verbs was based partly on our intuitive judgment that the verbs were likely to be used with sufficient frequency in the WSJ. Also, each

3 We used an equal number of verbs from each class in order to have a balanced group of items. One potential disadvantage of this decision is that each verb class is represented equally, even though they may not be equally frequent in the corpora. Although we lose the relative frequency information among the classes that could provide a better bias for assigning a default classification (i.e., the most frequent one), we have the advantage that our classifier will be equally informed (in terms of number of exemplars) about each class.

Note that there are only 19 unaccusative verbs, because ripped, which was initially counted in the unaccusatives, was then excluded from the analysis, as it occurred mostly in a very different usage in the corpus (as verb particle in ripped off) from the intended optionally intransitive usage.


Table 5
Verbs used in the experiments

Class Name     Description           Selected Verbs

Unergative     manner of motion      jumped, rushed, marched, leaped, floated, raced, hurried,
                                     wandered, vaulted, paraded (group 1); galloped, glided,
                                     hiked, hopped, jogged, scooted, scurried, skipped,
                                     tiptoed, trotted (group 2)

Unaccusative   change of state       opened, exploded, flooded, dissolved, cracked, hardened,
                                     boiled, melted, fractured, solidified (group 1); collapsed,
                                     cooled, folded, widened, changed, cleared, divided,
                                     simmered, stabilized (group 2)

Object-Drop    unexpressed object    played, painted, kicked, carved, reaped, washed, danced,
               alternation           yelled, typed, knitted (group 1); borrowed, inherited,
                                     organized, rented, sketched, cleaned, packed, studied,
                                     swallowed, called (group 2)

verb presents the same form in the simple past and in the past participle (the regular "-ed" form). In order to simplify the counting procedure, we included only the "-ed" form of the verb, on the assumption that counts on this single verb form would approximate the distribution of the features across all forms of the verb. Additionally, as far as we were able given the preceding constraints, we selected verbs that could occur in the transitive and in the passive. Finally, we aimed for a frequency cut-off of 10 occurrences or more for each verb, although for unergatives we had to use one verb (jogged) that only occurred 8 times, in order to have 20 verbs that satisfied the other criteria above.

In performing this kind of corpus analysis, one has to recognize the fact that current corpus annotations do not distinguish verb senses. In these counts, we did not distinguish a core sense of the verb from an extended use of the verb. So, for instance, the sentence Consumer spending jumped 1.7% in February after a sharp drop the month before (WSJ 1987) is counted as an occurrence of the manner-of-motion verb jump in its intransitive form. This particular sense extension has a transitive alternant, but not a causative transitive (i.e., Consumer spending jumped the barrier, but not Low taxes jumped consumer spending). Thus, while the possible subcategorizations remain the same, rates of transitivity and causativity may be different than for the literal manner-of-motion sense. This is an unavoidable result of using simple automatic extraction methods, given the current state of annotation of corpora.

For each occurrence of each verb, we counted whether it was in a transitive or intransitive use (trans), in a passive or active use (pass), in a past participle or simple past use (vbn), in a causative or non-causative use (caus), and with an animate subject or not (anim).4 Note that, except for the vbn feature, for which we simply extract the POS tag from the corpus, all other counts are approximations to the actual linguistic behaviour of the verb, as we describe in detail below.

4 One additional feature was recorded, the log frequency of the verb in the 65 million word corpus, motivated by the conjecture that the frequency of a verb may help in predicting its class. In our machine learning experiments, however, this conjecture was not borne out, as the frequency feature did not improve performance. This is the case for experiments on all of the verbs, as well as for separate experiments on the group 1 verbs (which were matched across the classes for frequency) and the group 2 verbs (which were not). We therefore limit discussion here to the thematically-motivated features.


The first three counts (trans, pass, vbn) were performed on the tagged ACL/DCI corpus available from the Linguistic Data Consortium, which includes the Brown Corpus (of one million words) and years 1987–1989 of the Wall Street Journal, a combined corpus in excess of 65 million words. The counts for these features proceeded as follows:

° trans: A number, a pronoun, a determiner, an adjective, or a noun were considered to be indication of a potential object of the verb. A verb occurrence preceded by forms of the verb be, or immediately followed by a potential object, was counted as transitive; otherwise, the occurrence was counted as intransitive (specifically, if the verb was followed by a punctuation sign, i.e., commas, colons, or full stops, or by a conjunction, a particle, a date, or a preposition).

° pass: A main verb (i.e., tagged VBD) was counted as active. A token with tag VBN was also counted as active if the closest preceding auxiliary was have, while it was counted as passive if the closest preceding auxiliary was be.

° vbn: The counts for VBN/VBD were simply done based on the POS label within the tagged corpus.

Each of the above three counts was normalized over all occurrences of the "-ed" form of the verb, yielding a single relative frequency measure for each verb for that feature, i.e., percent transitive (versus intransitive) use, percent active (versus passive) use, and percent VBN (versus VBD) use, respectively.
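The tag-based heuristics above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: the token representation, the exact contents of OBJECT_TAGS and BE_FORMS, and the function names are our own simplifying assumptions.

```python
# Rough sketch of the trans/pass heuristics over a Penn-Treebank-tagged corpus.

OBJECT_TAGS = {"CD", "PRP", "DT", "JJ", "NN", "NNS", "NNP"}  # number, pronoun, det, adj, noun
BE_FORMS = {"be", "is", "are", "was", "were", "been", "being"}

def is_transitive(prev_word, next_tag):
    """trans: count as transitive if preceded by a form of 'be' (a passive,
    hence transitive, use) or immediately followed by a potential object."""
    return prev_word.lower() in BE_FORMS or next_tag in OBJECT_TAGS

def is_passive(verb_tag, closest_preceding_aux):
    """pass: VBD is active; a VBN token is passive iff the closest preceding
    auxiliary is a form of 'be' (with 'have' it is a perfect, hence active)."""
    if verb_tag == "VBD":
        return False
    return closest_preceding_aux.lower() in BE_FORMS

def relative_frequency(observations):
    """Normalize a per-occurrence boolean feature to a single value per verb,
    e.g., percent transitive use over all '-ed' occurrences of the verb."""
    return sum(observations) / len(observations)
```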

The last two counts (caus and anim) were performed on a parsed version of the 1988 year of the Wall Street Journal, so that we could extract subjects and objects of the verbs more accurately. This corpus of 29 million words was provided to us by Michael Collins, and was automatically parsed with the parser described in Collins (1997).5 The counts and their justification are described here:

° caus: As discussed above, the object of a causative transitive is the same semantic argument of the verb as the subject of the intransitive. The causative feature was approximated by the following steps, intended to capture the degree to which the subject of a verb can also occur as its object. Specifically, for each verb occurrence, the subject and object (if there was one) were extracted from the parsed corpus. The observed subjects across all occurrences of the verb were placed into one multiset of nouns, and the observed objects into a second multiset of nouns. (A multiset, or bag, was used so that our representation indicated the number of times each noun was used as either subject or object.) Then the proportion of overlap between the two multisets was calculated. We define overlap as the largest multiset of elements belonging to both the subject and the object multisets; e.g., the overlap between {a, a, a, b} and {a} is {a, a, a}. The proportion is the ratio between the cardinality of the overlap multiset and the sum of the cardinality of the subject and object multisets. For example, for the simple sets of characters above, the ratio would be 3/5, yielding a value of 60% for the caus feature.

5 Readers might be concerned about the portability of this method to languages for which no large parsed corpus is available. It is possible that using a fully parsed corpus is not necessary. Our results were replicated in English without the need for a fully parsed corpus (Anoop Sarkar, p.c., citing a project report by Wootiporn Tripasai). Our method was applied to 23 million words of the WSJ that were automatically tagged with Ratnaparkhi's maximum entropy tagger (Ratnaparkhi 1996) and chunked with the partial parser CASS (Abney 1996). The results are very similar to ours (best accuracy 66.6%), suggesting that a more accurate tagger than the one used on our corpus might in fact be sufficient to overcome the fact that no full parse is available.
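The overlap computation can be sketched with a standard multiset (bag) representation. This is our reading of the definition, under which the overlap keeps, for each noun type occurring in both bags, the larger of its two counts; that reading reproduces the {a, a, a, b} versus {a} example, and the function name is ours:

```python
from collections import Counter

def caus_feature(subjects, objects):
    """Approximate the caus feature: proportion of overlap between the
    multiset of observed subjects and the multiset of observed objects."""
    subj, obj = Counter(subjects), Counter(objects)
    # Overlap: the largest multiset whose element types occur in both bags,
    # i.e., for each shared noun, the larger of its subject/object counts.
    overlap = sum(max(subj[n], obj[n]) for n in subj.keys() & obj.keys())
    total = sum(subj.values()) + sum(obj.values())
    return overlap / total if total else 0.0

# The worked example from the text: {a, a, a, b} vs. {a} -> 3/5 = 60%.
print(caus_feature(["a", "a", "a", "b"], ["a"]))  # 0.6
```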

° anim: A problem with a feature like animacy is that it requires either manual determination of the animacy of extracted subjects, or reference to an on-line resource such as WordNet for determining animacy. To approximate animacy with a feature that can be extracted automatically, and without reference to a resource external to the corpus, we take advantage of the well-attested animacy hierarchy, according to which pronouns are the most animate (Silverstein 1976; Dixon 1994). The hypothesis is that the words I, we, you, she, he, and they most often refer to animate entities. This hypothesis was confirmed by extracting 100 occurrences of the pronoun they, which can be either animate or inanimate, from our 65 million word corpus. The occurrences immediately preceded a verb. After eliminating repetitions, 94 occurrences were left, which were classified by hand, yielding 71 animate pronouns, 11 inanimate pronouns, and 12 unclassified occurrences (for lack of sufficient context to recover the antecedent of the pronoun with certainty). Thus, at least 76% of usages of they were animate; we assume the percentage of animate usages of the other pronouns to be even higher. Since the hypothesis was confirmed, we count pronouns (other than it) in subject position (Kariaeva [1999]; cf. Aone and McKee [1996]). The values for the feature were determined by automatically extracting all subject/verb tuples including our 59 example verbs from the parsed corpus, and computing the ratio of occurrences of pronoun subjects to all subjects for each verb.
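The anim value for a verb thus reduces to a simple ratio over its extracted subjects. A minimal sketch (with an invented function name, not the authors' script):

```python
# Pronouns assumed animate per the animacy hierarchy; "it" is excluded
PRONOUNS = {"i", "we", "you", "she", "he", "they"}

def anim_feature(subjects):
    """Ratio of (assumed animate) pronoun subjects to all subjects."""
    if not subjects:
        return 0.0
    return sum(s.lower() in PRONOUNS for s in subjects) / len(subjects)

print(anim_feature(["they", "he", "company", "prices"]))  # 0.5
```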

Finally, as indicated in Table 5, the verbs are designated as belonging to "group 1" or "group 2". All the verbs are treated equally in our data analysis and in the machine learning experiments, but this designation does indicate a difference in details of the counting procedures described above. The verbs in group 1 had been used in an earlier study in which it was important to minimize noisy data (Stevenson and Merlo 1997a), so they generally underwent greater manual intervention in the counts. In adding group 2 for the classification experiment, we chose to minimize the intervention, in order to demonstrate that the classification process is robust enough to withstand the resulting noise in the data.

For group 2, the transitivity, voice, and VBN counts were done automatically, without any manual intervention. For group 1, these three counts were done automatically by regular expression patterns, and then subjected to correction, partly by hand and partly automatically, by one of the authors. For transitivity, the adjustments vary for the individual verbs. Most of the reassignments from a transitive to an intransitive labelling occurred when the following noun was not the direct object, but rather a measure phrase or a date. Most of the reassignments from intransitive to transitive occurred when a particle or a preposition following the verb did not introduce a prepositional phrase, but instead indicated a passive form (by) or was part of a phrasal verb. Some verbs were mostly used adjectivally, in which case they were excluded from the transitivity counts. For voice, the required adjustments included cases of coordination of the past participle, when the verb was preceded by a conjunction or a comma.


Merlo and Stevenson Statistical Verb Classification

Table 6
Aggregated relative frequency data for the five features. E = unergatives; A = unaccusatives; O = object-drops

trans pass vbn caus anim

Class N Mean SD Mean SD Mean SD Mean SD Mean SD

E     20   .23  .23   .07  .12   .21  .26   .00  .00   .25  .24
A     19   .40  .24   .33  .27   .65  .27   .12  .14   .07  .09
O     20   .62  .25   .31  .26   .65  .23   .04  .07   .15  .14

These were collected and classified by hand as passive or active, based on intuition. Similarly, partial adjustments to the VBN counts were made by hand.

For the causativity feature, subjects and objects were determined by manual inspection of the corpus for verbs belonging to group 1, while they were extracted automatically from the parsed corpus for group 2. The group 1 verbs were sampled in three ways, depending on total frequency. For verbs with less than 150 occurrences, all instances of the verbs were used for subject/object extraction. For verbs whose total frequency was greater than 150, but whose VBD frequency was in the range 100–200, we extracted subjects and objects of the VBD occurrences only. For higher frequency verbs, we used only the first 100 VBD occurrences.6 The same script for computing the overlap of the extracted subjects and objects was then used on the resulting subject/verb and verb/object tuples for both group 1 and group 2 verbs.

The animacy feature was calculated over subject/verb tuples extracted automatically, for both groups of verbs, from the parsed corpus.

3.2 Data Analysis
The data collection described above yields the following data points in total: trans 27,403; pass 20,481; vbn 36,297; caus 11,307; anim 7,542. (Different features yield different totals because they were sampled independently, and the search patterns to extract some features are more imprecise than others.) The aggregate means by class of the normalized frequencies for all verbs are shown in Table 6; item by item distributions are provided in Appendix A, and raw counts are available from the authors. Note that aggregate means are shown for illustration purposes only: all machine learning experiments are performed on the individual normalized frequencies for each verb, as given in Appendix A.

The observed distributions of each feature are indeed roughly as expected according to the description in Section 2. Unergatives show a very low relative frequency of the trans feature, followed by unaccusatives, then object-drop verbs. Unaccusative verbs show a high frequency of the caus feature and a low frequency of the anim feature, compared to the other classes. Somewhat unexpectedly, object-drop verbs exhibit a non-zero mean caus value (almost half the verbs have a caus value greater than zero), leading to a three-way causative distinction among the verb classes. We suspect that the approximation that we used for causative use, the overlap between subjects

6 For this last set of high-frequency verbs (exploded, jumped, opened, played, rushed), we used the first 100 occurrences as the simplest way to collect the sample. In response to an anonymous reviewer's concern, we later verified that these counts were not different from counts obtained by random sampling of 100 VBD occurrences. A paired t-test of the two sets of counts (first-100 sampling and random sampling) indicates that the two sets of counts are not statistically different (t = 1.283, DF = 4, p = 0.2687).
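The paired t statistic used in this footnote can be computed directly from the two sets of counts; a minimal stdlib-only sketch, using made-up toy numbers rather than the actual counts:

```python
import math
import statistics

def paired_t(x, y):
    """t statistic for a paired t-test: mean of the pairwise differences
    divided by the standard error of those differences."""
    d = [a - b for a, b in zip(x, y)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

# Toy illustration with invented per-verb counts
print(paired_t([10, 12, 9, 11, 13], [11, 12, 8, 12, 13]))
```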


Table 7
Manually (Man) and automatically (Aut) calculated features for a random sample of verbs. T = trans; P = pass; V = vbn; C = caus; A = anim

Unergative Unaccusative Object-Drop

hopped scurried folded stabilized inherited swallowed

Man Aut Man Aut Man Aut Man Aut Man Aut Man Aut

T   .21   .21    .00   .00    .71   .23    .24   .18   1.00   .64    .96   .35
P   .00   .00    .00   .00    .44   .33    .19   .13    .39   .13    .54   .44
V   .03   .00    .10   .00    .56   .73    .71   .92    .56   .60    .64   .79
C   .00   .00    .00   .00    .54   .00    .24   .35    .00   .06    .00   .04
A   .93  1.00    .90   .14    .23   .00    .02   .00    .58   .32    .35   .22

and objects for a verb, also captures a "reciprocity" effect for some object-drop verbs (such as call), in which subjects and objects can be similar types of entities. Finally, although expected to be a redundant indicator of transitivity, pass and vbn, unlike trans, have very similar values for unaccusative and object-drop verbs, indicating that their distributions are sensitive to factors we have not yet investigated.

One issue we must address is how precisely the automatic counts reflect the actual linguistic behaviour of the verbs. That is, we must be assured that the patterns we note in the data in Table 6 are accurate reflections of the differential behaviour of the verb classes, and not an artifact of the way in which we estimate the features, or a result of inaccuracies in the counts. In order to evaluate the accuracy of our feature counts, we selected two verbs from each class and determined the "true" value of each feature for each of those six verbs through manual counting. The six verbs were randomly selected from the group 2 subset of the verbs, since counts for group 2 verbs (as explained above) had not undergone manual correction. This allows us to determine the accuracy of the fully automatic counting procedures. The selected verbs (and their frequencies) are hopped (29), scurried (21), folded (189), stabilized (286), inherited (357), swallowed (152). For verbs that had a frequency of over 100 in the "-ed" form, we performed the manual counts on the first 100 occurrences.

Table 7 shows the results of the manual counts, reported as proportions to facilitate comparison to the normalized automatic counts shown in adjoining columns. We observe first that, overall, most errors in the automatic counts occur in the unaccusative and object-drop verbs. While tagging errors affect the vbn feature for all of the verbs somewhat, we note that trans and pass are consistently underestimated for unaccusative and object-drop verbs. These errors make the unaccusative and object-drop feature values more similar to each other, and therefore potentially harder to distinguish. Furthermore, because the trans and pass values are underestimated by the automatic counts, and therefore lower in value, they are also closer to the values for the unergative verbs. For the caus feature, we predict the highest values for the unaccusative verbs, and while that prediction is confirmed, the automatic counts for that class also show the most errors. Finally, although the general pattern of higher values for the anim feature of unergatives and object-drop verbs is preserved in the automatic counts, the feature is underestimated for almost all the verbs, again making the values for that feature closer across the classes than they are in reality.

We conclude that although there are inaccuracies in all the counts, the general patterns expected based on our analysis of the verb classes hold in both the manual and automatic counts. Errors in the estimating and counting procedures are therefore


not likely to be responsible for the pattern of data in Table 6 above, which generally matches our predictions. Furthermore, the errors, at least for this random sample of verbs, occur in a direction that makes our task of distinguishing the classes more difficult, and indicates that developing more accurate search patterns may possibly sharpen the class distinctions and improve the classification performance.

4. Experiments in Classification

In this section, we turn to our computational experiments that investigate whether the statistical indicators of thematic properties that we have developed can in fact be used to classify verbs. Recall that the task we have set ourselves is that of automatically learning the best class for a set of usages of a verb, as opposed to classifying individual occurrences of the verb. The frequency distributions of our features yield a vector for each verb that represents the estimated values for the verb on each dimension across the entire corpus.

Vector template: [verb-name trans pass vbn caus anim class]

Example: [opened .69 .09 .21 .16 .36 unacc]

The resulting set of 59 vectors constitutes the data for our machine learning experiments. We use this data to train an automatic classifier to determine, given the feature values for a new verb (not from the training set), which of the three major classes of English optionally intransitive verbs it belongs to.

4.1 Experimental Methodology
In pilot experiments on a subset of the features, we investigated a number of supervised machine learning methods that produce automatic classifiers (decision tree induction, rule learning, and two types of neural networks), as well as hierarchical clustering; see Stevenson et al. (1999) for more detail. Because we achieved approximately the same level of performance in all cases, we narrowed our further experimentation to the publicly available version of the C5.0 machine learning system (http://www.rulequest.com), a newer version of C4.5 (Quinlan 1992), due to its ease of use and wide availability. The C5.0 system generates both decision trees and corresponding rule sets from a training set of known classifications. In our experiments, we found little to no difference in performance between the trees and rule sets, and report only the rule set results.

In the experiments below, we follow two methodologies in training and testing, each of which tests a subset of cases held out from the training data. Thus, in all cases, the results we report are on test data that was never seen in training.7

The first training and testing methodology we follow is 10-fold cross-validation. In this approach, the system randomly divides the data into ten parts, and runs ten times on a different 90%-training-data/10%-test-data split, yielding an average accuracy and standard error across the ten test sets. This training methodology is very useful for

7 One anonymous reviewer raised the concern that we do not test on verbs that were unseen by the authors prior to finalizing the specific features to count. However, this does not reduce the generality of our results. The features we use are motivated by linguistic theory, and derived from the set of thematic properties that discriminate the verb classes. It is therefore very unlikely that they are skewed to the particular verbs we have chosen. Furthermore, our cross-validation experiments, described in the next subsection, show that our results hold across a very large number of randomly selected subsets of this sample of verbs.


our application, as it yields performance measures across a large number of training data/test data sets, avoiding the problems of outliers in a single random selection from a relatively small data set such as ours.

The second methodology is a single hold-out training and testing approach. Here the system is run N times, where N is the size of the data set (i.e., the 59 verbs in our case), each time holding out a single data vector as the test case and using the remaining N-1 vectors as the training set. The single hold-out methodology yields an overall accuracy rate (when the results are averaged across all N trials), but also, unlike cross-validation, gives us classification results on each individual data vector. This property enables us to analyze differential performance on the individual verbs, and across the different verb classes.
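The single hold-out (leave-one-out) protocol can be sketched as follows. Note that we substitute a simple nearest-centroid classifier for C5.0 purely so the example is self-contained; the code illustrates the evaluation loop, not the actual learner.

```python
def leave_one_out_accuracy(vectors, labels):
    """Hold out each vector in turn, train on the rest, test on it."""
    correct = 0
    for i in range(len(vectors)):
        # Training set: every case except the held-out case i
        by_class = {}
        for j, (v, l) in enumerate(zip(vectors, labels)):
            if j != i:
                by_class.setdefault(l, []).append(v)
        # Stand-in classifier: assign the class with the nearest centroid
        centroids = {l: [sum(col) / len(vs) for col in zip(*vs)]
                     for l, vs in by_class.items()}
        pred = min(centroids,
                   key=lambda l: sum((a - b) ** 2
                                     for a, b in zip(vectors[i], centroids[l])))
        correct += (pred == labels[i])
    return correct / len(vectors)

# Toy data: two well-separated classes
vecs = [[0.1, 0.2], [0.0, 0.3], [0.9, 0.8], [1.0, 0.7]]
labs = ["unerg", "unacc", "unacc", "unacc"][:2] + ["unacc", "unacc"]
labs = ["unerg", "unerg", "unacc", "unacc"]
print(leave_one_out_accuracy(vecs, labs))  # 1.0
```

Averaging the per-trial outcomes gives the overall accuracy rate, while the recorded per-case predictions support the per-verb and per-class analysis described in the text.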

Under both training and testing methodologies, the baseline (chance) performance in this task, a three-way classification, is 33.9%. In the single hold-out methodology, there are 59 test cases, with 20, 19, and 20 verbs each from the unergative, unaccusative, and object-drop classes, respectively. Chance performance of picking a single class label as a default and assigning it to all cases would yield at most 20 out of the 59 cases correct, or 33.9%. For the cross-validation methodology, the determination of a baseline is slightly more complex, as we are testing on a random selection of 10% of the full data set in each run. The 33.9% figure represents the expected relative proportion of a test set that would be labelled correctly by assignment of a default class label to the entire test set. Although the precise make-up of the test cases varies, on average the test set will represent the class membership proportions of the entire set of verbs. Thus, as with the single hold-out approach, chance accuracy corresponds to a maximum of 20/59, or 33.9%, of the test set being labelled correctly.

The theoretical maximum accuracy for the task is of course 100%, although in Section 5 we discuss some classification results from human experts that indicate that a more realistic expectation is much lower (around 87%).

4.2 Results Using 10-Fold Cross-Validation
We first report the results of experiments using a training methodology of 10-fold cross-validation, repeated 50 times. This means that the 10-fold cross-validation procedure is repeated for 50 different random divisions of the data. The numbers reported are the averages of the results over all the trials. That is, the average accuracy and standard error from each random division of the data (a single cross-validation run, including 10 training and test sets) are averaged across the 50 different random divisions. This large number of experimental trials gives us a very tight bound on the mean accuracy reported, enabling us to determine with high confidence the statistical significance of differences in results.

Table 8 shows that performance of classification using individual features varies greatly, from little above the baseline to almost 22% above the baseline, or a reduction of a third of the error rate, a very good result for a single feature. (All reported accuracies in Table 8 are statistically distinct at the p < .01 level, using an ANOVA [df = 249, F = 33472] with a Tukey-Kramer post test.)

The first line of Table 9 shows that the combination of all features achieves an accuracy of 69.8%, which is 35.9% over the baseline, for a reduction in the error rate of 54%. This is a rather considerable result, given the very low baseline (33.9%). Moreover, recall that our training and testing sets are always disjoint (cf. Lapata and Brew [1999]; Siegel [1999]); in other words, we are predicting the classification of verbs that were never seen in the training corpus, the hardest situation for a classification algorithm.

The second through sixth lines of Table 9 show the accuracy achieved on each subset of features that results from removing a single feature. This allows us to evaluate


Table 8
Percent accuracy and standard error of the verb classification task, using each feature individually, under a training methodology of 10-fold cross-validation repeated 50 times

Feature Accuracy SE

caus    55.7   .1
vbn     52.5   .5
pass    50.2   .5
trans   47.1   .4
anim    35.3   .5

Table 9
Percent accuracy and standard error of the verb classification task, using features in combination, under a training methodology of 10-fold cross-validation repeated 50 times

Features Used                    Feature Not Used   Accuracy   SE

1  trans pass vbn caus anim      –       69.8   .5
2  trans vbn caus anim           pass    69.8   .5
3  trans pass vbn anim           caus    67.3   .6
4  trans pass caus anim          vbn     66.5   .5
5  trans pass vbn caus           anim    63.2   .6
6  pass vbn caus anim            trans   61.6   .6

the contribution of each feature to the performance of the classification process, by comparing the performance of the subset without it to the performance using the full set of features. We see that the removal of pass (second line) has no effect on the results, while removal of the remaining features yields a 2–8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant at the p < .05 level, except for between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 3752] with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted by the individual feature behavior. For example, caus, which is the best individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, anim and trans, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.

The general behaviour in classification based on individual features, and on size 4 and size 5 subsets of features, is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance; Table 10 summarizes this information. In the first data column, the table illustrates the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size n subsets in which each feature occurs. For example, the second data cell in the second row (54.9) indicates the average accuracy of all subsets of size 2 that contain the feature vbn. The last row of the table indicates, for each feature, the average accuracy of all subsets containing that feature.


Table 10
Average percent accuracy of feature subsets, by subset size and by sets of each size including each feature

Subset   Mean Accuracy     Mean Accuracy of Subsets that Include Each Feature
Size     by Subset Size    vbn    pass   trans   anim   caus

1        48.2              52.5   50.2   47.1    35.3   55.7
2        55.1              54.9   52.8   56.4    58.0   57.6
3        60.5              60.1   58.5   62.3    61.1   60.5
4        65.7              65.5   64.7   66.7    66.3   65.3
5        69.8              69.8   69.8   69.8    69.8   69.8

Mean Acc/Feature           60.6   59.2   60.5    58.1   61.8

The first observation, that more features perform better, is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that, on average, larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation, that the performance of individual features is not always a predictor of their performance in combination, is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above average performance for all subsets of size 2 or greater.

4.3 Results Using Single Hold-Out Methodology
One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system, but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case, and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case, and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb, and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, which we turn to in Section 4.4.

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance; in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being


Table 11
Percent accuracy of the verb classification task, using features in combination, under a single hold-out training methodology

Features Used                    Feature Not Used   Accuracy on All Verbs

1  trans pass vbn caus anim      –       69.5
2  trans vbn caus anim           pass    71.2
3  trans pass vbn anim           caus    62.7
4  trans pass caus anim          vbn     61.0
5  trans pass vbn caus           anim    61.0
6  pass vbn caus anim            trans   64.4

Table 12
F score of classification within each class, under a single hold-out training methodology

                                 Feature    F score (%)   F score (%)   F score (%)
Features Used                    Not Used   for Unergs    for Unaccs    for Objdrops

1  trans pass vbn caus anim      –          73.9          68.6          64.9
2  trans vbn caus anim           pass       76.2          75.7          61.6
3  trans pass vbn anim           caus       65.1          60.0          62.8
4  trans pass caus anim          vbn        66.7          65.0          51.3
5  trans pass vbn caus           anim       72.7          47.0          60.0
6  pass vbn caus anim            trans      78.1          51.5          61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5–8% negative effect on performance, again similar to the cross-validation results. (Although note that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9) than the unaccusative and object-drop verbs (F scores of 68.6 and 64.9, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs, that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence here we use the balanced F score, which calculates an overall measure of performance as 2PR/(P + R), in which P (precision) is truePositives / (truePositives + falsePositives), and R (recall) is truePositives / (truePositives + falseNegatives).
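The balanced F score of the footnote is straightforward to compute; a minimal sketch (the function name is ours):

```python
def f_score(tp, fp, fn):
    """Balanced F score: harmonic mean of precision and recall."""
    p = tp / (tp + fp)   # precision
    r = tp / (tp + fn)   # recall
    return 2 * p * r / (p + r)

# Illustration: 17 true positives, 9 false positives, 3 false negatives
print(round(f_score(17, 9, 3), 3))  # 0.739
```

Note that the balanced F simplifies to 2tp / (2tp + fp + fn), which is why only the three error counts are needed.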


frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8), but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns, for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification
We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans, pass, vbn, caus, anim) with the class labels assigned using each size 4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature, in going from the four-feature to five-feature set. Thus we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2, and summarized here in Table 13.

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors, but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently "uncontrolled", and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.


Table 13
Expected class discriminations for each feature

Feature Expected Frequency Pattern

Transitivity     Unerg < Unacc < ObjDrop

Causativity      Unerg, ObjDrop < Unacc

Animacy          Unacc < Unerg, ObjDrop

Passive Voice    Unerg < Unacc < ObjDrop

VBN Tag          Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features compared to two of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops

Assigned Class

            All features     w/o caus        w/o anim

            E   A   O        E   A   O       E   A   O

Actual  E   –   1   2        –   4   2       –   2   2
Class   A   4   –   3        5   –   2       5   –   6
        O   5   3   –        4   5   –       3   5   –

Table 15
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops

Assigned Class

            All features     w/o trans       w/o pass        w/o vbn

            E   A   O        E   A   O       E   A   O       E   A   O

Actual  E   –   1   2        –   2   2       –   1   3       –   2   5
Class   A   4   –   3        3   –   7       1   –   4       2   –   4
        O   5   3   –        2   5   –       5   3   –       4   6   –

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
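Reading confusability off a matrix amounts to summing the two cells opposite the diagonal. A small sketch (hypothetical names, with the error counts of Table 14's all-features panel as the example):

```python
def confusability(matrix, c1, c2):
    """Errors in which actual class c1 was assigned c2, plus those in
    which actual class c2 was assigned c1 (the two cells opposite the
    diagonal of the confusion matrix)."""
    return matrix[c1][c2] + matrix[c2][c1]

# All-features panel of Table 14 (rows: actual class; columns: assigned)
errors = {
    "E": {"E": 0, "A": 1, "O": 2},
    "A": {"E": 4, "A": 0, "O": 3},
    "O": {"E": 5, "A": 3, "O": 0},
}
print(confusability(errors, "E", "A"))  # 5
print(confusability(errors, "A", "O"))  # 6
```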

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features


Computational Linguistics Volume 27 Number 3

we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives and between unaccusatives and object-drops is 9 and 7 errors, respectively, but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors from 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated to trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans—that is, also making a three-way distinction among the classes—although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method in evaluating our approach to verb classification.


Table 16
Percent agreement (Agr) and pair-wise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

         Program        E1             E2             E3

         Agr     K      Agr     K     Agr     K      Agr     K

E1       59%     .36
E2       68%     .50    75%     .59
E3       66%     .49    70%     .53   77%     .66
Levin    69.5%   .54    71%     .56   86.5%   .80    83%     .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results—provided by the non-forced-choice task—where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index available from Chicago University Press. The verbs were to be classified into the three target classes—unergative, unaccusative, and object-drop—which were described in the instructions.10 (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training—which yields an accuracy of 69.5%—because those results provide the classification for each of the individual verbs.


an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55–57) (using the z distribution to determine significance; p < .0001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
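The chance-corrected agreement just described can be illustrated with the standard Cohen formulation of Kappa for two annotators (a sketch; the paper follows Klauer's [1987] calculation, which this may differ from in detail). Note how expected chance agreement is computed from each annotator's own class proportions, which is why equal percent agreement can yield different Kappa values.

```python
from collections import Counter

def cohen_kappa(labels1, labels2):
    """Chance-corrected agreement between two annotators on the same items.

    kappa = (observed - expected) / (1 - expected), where expected chance
    agreement is the sum over classes of the product of each annotator's
    proportion of that class.
    """
    assert len(labels1) == len(labels2)
    n = len(labels1)
    observed = sum(a == b for a, b in zip(labels1, labels2)) / n
    p1, p2 = Counter(labels1), Counter(labels2)
    expected = sum((p1[c] / n) * (p2[c] / n)
                   for c in set(labels1) | set(labels2))
    return (observed - expected) / (1 - expected)

# Two hypothetical annotators agreeing on 3 of 4 verbs:
agreement = cohen_kappa(["E", "E", "A", "O"], ["E", "A", "A", "O"])
print(round(agreement, 2))  # 0.64
```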

Evaluating the experts' performance summarized in Table 16, we can remark two things, which confirm our expectations. First, the task is difficult—i.e., not performed at 100% (or close) even by trained experts when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .6) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
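The majority-vote combination can be sketched as follows (our reading of the procedure in the text; the names and the toy data are illustrative): each verb takes the label given by at least two experts, falling back to the most accurate expert's label when no majority exists.

```python
from collections import Counter

def majority_vote(expert_labels, tiebreak_labels):
    """Combine expert judgments per verb by majority; fall back to the
    most accurate expert's label when no label gets at least two votes."""
    combined = []
    for votes, fallback in zip(expert_labels, tiebreak_labels):
        label, count = Counter(votes).most_common(1)[0]
        combined.append(label if count >= 2 else fallback)
    return combined

# Three experts on three hypothetical verbs; the last verb has no majority.
votes = [("E", "E", "A"), ("A", "A", "A"), ("E", "A", "O")]
best_expert = ["E", "A", "O"]
print(majority_vote(votes, best_expert))  # ['E', 'A', 'O']
```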

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task—86.5%—as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than one of the human experts, in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.
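The roughly 68% figure appears to follow from treating the expert upper bound, rather than 100%, as the ceiling (our reconstruction of the calculation, with the chance baseline of about 34% from the abstract):

```python
def error_reduction(accuracy, baseline, upper_bound):
    """Fraction of the achievable gain over chance actually realized,
    taking the expert upper bound as the maximum achievable accuracy."""
    return (accuracy - baseline) / (upper_bound - baseline)

r = error_reduction(0.695, 0.34, 0.865)
print(round(100 * r))  # 68
```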

One interesting question is whether experts and program disagree on the same verbs, and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard. However, in 9 cases at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3) and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This


clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1,000), and those with log frequency over 3 (i.e., frequency over 1,000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops), the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops), and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote, verbs with frequency less than 100 yield an accuracy of 92% (K = .84), verbs with frequency between 100 and 1,000 an accuracy of 80% (K = .69), and verbs with frequency over 1,000 an accuracy of 90% (K = .82). For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5% (K = .74), verbs with frequency between 100 and 1,000 an accuracy of 84% (K = .76), and verbs with frequency over 1,000 an accuracy of 90% (K = .82).
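The log-frequency binning used above can be written directly (a sketch; the band names are ours, the thresholds are the ones stated in the text):

```python
import math

def frequency_band(freq):
    """Bin a verb by corpus frequency: low (< 100, i.e. log10 < 2),
    mid (100 to 1,000), high (> 1,000, i.e. log10 > 3)."""
    logf = math.log10(freq)
    if logf < 2:
        return "low"
    return "mid" if logf <= 3 else "high"

print(frequency_band(85))    # low
print(frequency_band(400))   # mid
print(frequency_band(2500))  # high
```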

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs, taken from Levin's (1993) electronic index as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = .64) with each other, and 86% (K = .80) and 69% (K = .57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = .56] with each other, and 68% [K = .55] and 60.5% [K = .46] with the gold standard, respectively.) These results show that the forced-


choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally, in syntactic alternations giving rise to differences in subcategorization frames or the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs—their argument structure—are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989], Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a


"change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998, 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993], Sanfilippo and Poznanski [1992], Manning [1993], Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed which can work


on a rough collection of texts, and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and the work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes—more similar to our situation, in that subcategorization alone cannot distinguish the classes—they achieve 83.9% accuracy compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multi-lingual texts. Like ours, their work goes beyond statistics over subcategorizations, to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.


Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45%, with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
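The 45% figure follows from the standard balanced F-score, the harmonic mean of precision and recall:

```python
def f_score(precision, recall):
    """Balanced (F1) harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Schulte im Walde's reported 61% precision and 36% recall:
print(round(100 * f_score(0.61, 0.36)))  # 45
```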

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition, we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words—i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and Lapata and Brew (1999) in attempting to learn semantic notions from distributions of

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.


indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators—trans, vbn, pass—are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators—caus and anim—can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification—using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions—frequencies of the transitive use and animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to the external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature), or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).
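The pronoun-based approximation of subject animacy can be sketched as follows. The pronoun list, the invented (verb, subject) pairs, and the function name are illustrative assumptions, not the extraction patterns actually used in the paper.

```python
# Sketch: approximate an ANIM-like feature as the fraction of a verb's subjects
# that are personal pronouns, on the idea that pronoun subjects are
# overwhelmingly animate. The (verb, subject) pairs are invented; the paper
# derives subjects from automatically tagged and parsed corpora.
from collections import defaultdict

PRONOUNS = {"i", "we", "you", "he", "she", "they"}

def anim_feature(verb_subject_pairs):
    """Relative frequency of pronoun subjects, per verb type."""
    totals = defaultdict(int)
    pronoun_counts = defaultdict(int)
    for verb, subj in verb_subject_pairs:
        totals[verb] += 1
        if subj.lower() in PRONOUNS:
            pronoun_counts[verb] += 1
    return {v: pronoun_counts[v] / totals[v] for v in totals}

pairs = [("raced", "he"), ("raced", "horse"), ("raced", "they"),
         ("melted", "butter"), ("melted", "snow")]
print(anim_feature(pairs))  # raced: ~0.67, melted: 0.0
```

The point of the approximation is that it needs no semantic lexicon: a closed list of pronouns and a subject-extraction heuristic suffice.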

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

Merlo and Stevenson   Statistical Verb Classification

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets, introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification—i.e., sets of verbs that have the same multi-way classifications according to Levin (1993)—can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.
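The notion of an intersective set can be made concrete in a few lines: verbs are grouped by their full set of class labels, and any group whose label set spans two or more classes is a candidate intersective set. The verb-to-class assignments below are invented placeholders, not Levin's actual class memberships.

```python
# Sketch of intersective sets (Dang et al. 1998): verbs sharing the same
# multi-way classification form a coherent (sub)class of their own.
from collections import defaultdict

def intersective_sets(assignments):
    """Group verbs by their full set of class labels; label sets of size
    two or more identify candidate intersective sets."""
    groups = defaultdict(set)
    for verb, classes in assignments.items():
        groups[frozenset(classes)].add(verb)
    return {labels: verbs for labels, verbs in groups.items() if len(labels) > 1}

# Toy assignments with hypothetical class labels A, B, C:
toy = {"roll": {"A", "B"}, "float": {"A", "B"}, "race": {"C"}}
print(intersective_sets(toy))  # one group: {'roll', 'float'} under labels {A, B}
```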

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction, or of the words in the construction). Thus, the kind of corpus investigation we are advocating—founded on in-depth linguistic analysis—holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to contribute positively to classification.

We remark, however, that deep linguistic analysis cannot be eliminated—in our approach it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of


the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.
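The reported reductions follow from the figures given in the conclusions (33.9% chance, 86.5% expert upper bound, 69.8% achieved); the helper below is our own illustration of the arithmetic, not code from the paper.

```python
# Error-rate reduction: the fraction of the baseline-to-ceiling error that the
# classifier eliminates. With ceiling = 100% this is the reduction over chance;
# with ceiling = 86.5% it is the reduction relative to the expert upper bound.
def error_reduction(accuracy, baseline, ceiling=100.0):
    """Fraction of achievable error eliminated, given a baseline and ceiling."""
    return (accuracy - baseline) / (ceiling - baseline)

vs_chance = error_reduction(69.8, 33.9)                # ceiling = 100%
vs_expert = error_reduction(69.8, 33.9, ceiling=86.5)  # expert upper bound
print(round(vs_chance * 100), round(vs_expert * 100))  # 54 68
```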

Acknowledgments

We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM), the United States NSF (grants 9702331 and 9818322 to SS), the Canadian NSERC (grant to SS), the University of Toronto, and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions, and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal, Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.

Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe Zéro. In Selected Writings, volume 2, 2nd edition. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal, Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal, Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness—an overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1–2):217–222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.

Pinker, Steven. 1989. Learnability and Cognition: the Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Associates, Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Department of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD, University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.


Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs

             Freq   vbn   pass  trans  caus  anim
Min Value       8  0.00   0.00   0.00  0.00  0.00
Max Value    4088  1.00   0.39   0.74  0.00  1.00

floated       176  0.43   0.26   0.74  0.00  0.17
hurried        86  0.40   0.31   0.50  0.00  0.37
jumped       4088  0.09   0.00   0.03  0.00  0.20
leaped        225  0.09   0.00   0.05  0.00  0.13
marched       238  0.10   0.01   0.09  0.00  0.12
paraded        33  0.73   0.39   0.46  0.00  0.50
raced         123  0.01   0.00   0.06  0.00  0.15
rushed        467  0.22   0.12   0.20  0.00  0.10
vaulted        54  0.00   0.00   0.41  0.00  0.03
wandered       67  0.02   0.00   0.03  0.00  0.32
galloped       12  1.00   0.00   0.00  0.00  0.00
glided         14  0.00   0.00   0.08  0.00  0.50
hiked          25  0.28   0.12   0.29  0.00  0.40
hopped         29  0.00   0.00   0.21  0.00  1.00
jogged          8  0.29   0.00   0.29  0.00  0.33
scooted        10  0.00   0.00   0.43  0.00  0.00
scurried       21  0.00   0.00   0.00  0.00  0.14
skipped        82  0.22   0.02   0.64  0.00  0.16
tiptoed        12  0.17   0.00   0.00  0.00  0.00
trotted        37  0.19   0.17   0.07  0.00  0.18

Unaccusative Verbs

             Freq   vbn   pass  trans  caus  anim
Min Value      13  0.16   0.00   0.02  0.00  0.00
Max Value    5543  0.95   0.80   0.76  0.41  0.36

boiled         58  0.92   0.70   0.42  0.00  0.00
cracked       175  0.61   0.19   0.76  0.02  0.14
dissolved     226  0.51   0.58   0.71  0.05  0.11
exploded      409  0.34   0.02   0.66  0.37  0.04
flooded       235  0.47   0.57   0.44  0.04  0.03
fractured      55  0.95   0.76   0.51  0.00  0.00
hardened      123  0.92   0.55   0.56  0.12  0.00
melted         70  0.80   0.44   0.02  0.00  0.19
opened       3412  0.21   0.09   0.69  0.16  0.36
solidified     34  0.65   0.21   0.68  0.00  0.12
collapsed     950  0.16   0.00   0.16  0.01  0.02
cooled        232  0.85   0.21   0.29  0.13  0.11
folded        189  0.73   0.33   0.23  0.00  0.00
widened      1155  0.18   0.02   0.13  0.41  0.01
changed      5543  0.73   0.23   0.47  0.22  0.08
cleared      1145  0.58   0.40   0.50  0.31  0.06
divided      1539  0.93   0.80   0.17  0.10  0.05
simmered       13  0.83   0.00   0.09  0.00  0.00
stabilized    286  0.92   0.13   0.18  0.35  0.00


Object-Drop Verbs

             Freq   vbn   pass  trans  caus  anim
Min Value      39  0.10   0.04   0.21  0.00  0.00
Max Value   15063  0.95   0.99   1.00  0.24  0.42

carved        185  0.85   0.66   0.98  0.00  0.00
danced         88  0.22   0.14   0.37  0.00  0.00
kicked        308  0.30   0.18   0.97  0.00  0.33
knitted        39  0.95   0.99   0.93  0.00  0.00
painted       506  0.72   0.18   0.71  0.00  0.38
played       2689  0.38   0.16   0.24  0.00  0.00
reaped        172  0.56   0.05   0.90  0.00  0.22
typed          57  0.81   0.74   0.81  0.00  0.00
washed        137  0.79   0.60   1.00  0.00  0.00
yelled         74  0.10   0.04   0.38  0.00  0.00
borrowed     1188  0.77   0.15   0.60  0.13  0.19
inherited     357  0.60   0.13   0.64  0.06  0.32
organized    1504  0.85   0.38   0.65  0.18  0.07
rented        232  0.72   0.22   0.61  0.00  0.42
sketched       44  0.67   0.17   0.44  0.00  0.20
cleaned       160  0.83   0.47   0.21  0.05  0.21
packed        376  0.84   0.12   0.40  0.05  0.19
studied       901  0.66   0.17   0.57  0.05  0.11
swallowed     152  0.79   0.44   0.35  0.04  0.22
called      15063  0.56   0.22   0.72  0.24  0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.
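The overlap test just described can be written directly; the helper names are ours, and 2.01 is the t value given above for df = 49.

```python
# Two results differ significantly at p < .05 only when their 95% confidence
# intervals (accuracy +/- 2.01 standard errors, df = 49) do not overlap.
T_CRIT = 2.01

def interval(accuracy, se):
    return (accuracy - T_CRIT * se, accuracy + T_CRIT * se)

def significantly_different(acc1, se1, acc2, se2):
    lo1, hi1 = interval(acc1, se1)
    lo2, hi2 = interval(acc2, se2)
    return hi1 < lo2 or hi2 < lo1  # disjoint intervals

# Full feature set (69.8, SE 0.5) vs. trans pass vbn anim (67.3, SE 0.6):
# the intervals [68.8, 70.8] and [66.1, 68.5] do not overlap.
print(significantly_different(69.8, 0.5, 67.3, 0.6))  # True
```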

Accuracy  SE   Features                      Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim      57.3      0.5  trans caus
69.8      0.5  trans vbn caus anim           57.3      0.5  pass vbn anim
67.3      0.6  trans pass vbn anim           56.7      0.5  pass caus anim
66.7      0.5  trans vbn anim                55.7      0.5  vbn caus
66.5      0.5  trans pass caus anim          55.7      0.1  caus
64.4      0.5  trans vbn caus                55.4      0.4  pass caus
63.2      0.6  trans pass vbn caus           55.0      0.6  trans pass vbn
63.0      0.5  trans pass anim               54.7      0.4  trans pass
62.9      0.4  trans caus anim               54.2      0.5  trans vbn
62.1      0.5  caus anim                     52.5      0.5  vbn
61.7      0.5  trans pass caus               50.9      0.5  pass anim
61.6      0.6  pass vbn caus anim            50.2      0.6  pass vbn
60.1      0.4  vbn caus anim                 50.2      0.5  pass
59.5      0.6  trans anim                    47.1      0.4  trans
59.4      0.5  vbn anim                      35.3      0.5  anim
57.4      0.6  pass vbn caus


Table 2
Summary of thematic role assignments by class

                Transitive                        Intransitive
Classes         Subject                 Object    Subject

Unergative      Agent (of Causation)    Agent     Agent
Unaccusative    Agent (of Causation)    Theme     Theme
Object-Drop     Agent                   Theme     Agent

The question, then, is what underlies these distinctions. We identify the property that precisely distinguishes among these three classes as that of argument structure—i.e., the thematic roles assigned by the verbs. The thematic roles for each class, and their mapping to subject and object positions, are summarized in Table 2. Note that verbs across these three classes allow the same subcategorization frames (taking an NP object, or occurring intransitively); thus classification based on subcategorization alone would not distinguish them. On the other hand, each of the three classes is comprised of multiple Levin classes, because the latter reflect more detailed semantic distinctions among the verbs (Levin 1993); thus classification based on Levin's labeling would miss generalizations across the three broader classes. By contrast, as shown in Table 2, each class has a unique pattern of thematic assignments, which categorizes the verbs precisely into the three classes of interest.

Although the granularity of our classification differs from Levin's, we draw on her hypothesis that semantic properties of verbs are reflected in their syntactic behavior. The behavior that Levin focuses on is the notion of diathesis alternation—an alternation in the expression of the arguments of a verb, such as the different mappings between transitive and intransitive that our verbs undergo. Whether a verb participates in a particular diathesis alternation or not is a key factor in Levin's approach to classification. We, like others in a computational framework, have extended this idea by showing that statistics over the alternants of a verb effectively capture information about its class (Lapata 1999; McCarthy 2000; Lapata and Brew 1999).

In our specific task, we analyze the pattern of thematic assignments given in Table 2 to develop statistical indicators that are able to determine the class of an optionally intransitive verb, by capturing information across its transitive and intransitive alternants. These indicators serve as input to a machine learning algorithm under a supervised training methodology, which produces an automatic classification system for our three verb classes. Since we rely on patterns of behavior across multiple occurrences of a verb, we begin with the problem of assigning a single class to the entire set of usages of a verb within the corpus. For example, we measure properties across all occurrences of a word such as raced, in order to assign a single classification to the lexical entry for the verb race. This contrasts with work classifying individual occurrences of a verb in each local context, which has typically relied on training that includes instances of the verbs to be classified—essentially developing a bias that is used in conjunction with the local context to determine the best classification for new instances of previously seen verbs. By contrast, our method assigns a classification to verbs that have not previously been seen in the training data. Thus, while we do not as yet assign different classes to the instances of a verb, we can assign a single predominant class to new verbs that have never been encountered.
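The type-level aggregation described here can be sketched as follows. The token records and the bare transitive/intransitive labels are invented for illustration; the paper extracts such counts with pattern matching over tagged and parsed corpora.

```python
# Sketch of type-level feature aggregation: a single relative-frequency value
# (here a stand-in for the trans feature) is computed over ALL corpus tokens
# of a verb, so the lexical entry gets one score rather than per-token labels.
from collections import Counter

tokens = [
    ("raced", "intransitive"), ("raced", "intransitive"),
    ("raced", "transitive"),   ("raced", "intransitive"),
    ("melted", "transitive"),  ("melted", "intransitive"),
]

def trans_feature(token_records):
    """For each verb type, the fraction of its tokens used transitively."""
    total = Counter(v for v, _ in token_records)
    trans = Counter(v for v, use in token_records if use == "transitive")
    return {v: trans[v] / total[v] for v in total}

print(trans_feature(tokens))  # {'raced': 0.25, 'melted': 0.5}
```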

To preview our results, we demonstrate that combining just five numerical indicators automatically extracted from large text corpora is sufficient to reduce the error


rate in this classification task by more than 50% over chance. Specifically, we achieve almost 70% accuracy in a task whose baseline (chance) performance is 34%, and whose expert-based upper bound is calculated at 86.5%.

Beyond the interest for the particular classification task at hand, this work addresses more general issues concerning verb class distinctions based in argument structure. We evaluate our hypothesis that such distinctions are reflected in statistics over corpora through a computational experimental methodology, in which we investigate, as indicated, each of the subhypotheses below in the context of the three verb classes under study:

° Lexical features capture argument structure differences between verb classes.¹

° The linguistically distinctive features exhibit distributional differences across the verb classes that are apparent within linguistic experience (i.e., they can be collected from text).

° The statistical distributions of (some of) the features contribute to learning the classifications of the verbs.

In the following sections, we show that all three hypotheses above are borne out. In Section 2, we describe the argument structure distinctions of our three verb classes in more detail. In support of the first hypothesis above, we discuss lexical correlates of the underlying differences in thematic assignments that distinguish the three verb classes under investigation. In Section 3, we show how to approximate these features by simple syntactic counts, and how to perform these counts on available corpora. We confirm the second hypothesis above by showing that the differences in distribution predicted by the underlying argument structures are largely found in the data. In Section 4, in a series of machine learning experiments and a detailed analysis of errors, we confirm the third hypothesis by showing that the differences in the distribution of the extracted features are successfully used for verb classification. Section 5 evaluates the significance of these results by comparing the program's accuracy to an expert-based upper bound. We conclude the paper with a discussion of its contributions, comparison to related work, and suggestions for future extensions.

2. Deriving Classification Features from Argument Structure

Our task is to automatically build a classifier that can distinguish the three major classes of optionally intransitive verbs in English. As described above, these classes are differentiated by their argument structures. In the first subsection below, we elaborate on our description of the thematic role assignments for each of the verb classes under investigation—unergative, unaccusative, and object-drop. This analysis yields a distinctive pattern of thematic assignment for each class. (For more detailed discussion concerning the linguistic properties of these classes and the behavior of their component verbs, please see Stevenson and Merlo [1997b] and Merlo and Stevenson [2000b].)

Of course, the key to any automatic classification task is to determine a set of useful features for discriminating the items to be classified. In the second subsection below,

¹ By lexical we mean features that we think are likely stored in the lexicon, because they are properties of words and not of phrases or sentences. Note, however, that some lexical features may not necessarily be stored with individual words—indeed, the motivation for classifying verbs to capture generalizations within each class suggests otherwise.


we show how the analysis of thematic distinctions enables us to determine lexical properties that we hypothesize will exhibit useful, detectable frequency differences in our corpora, and thus serve as the machine learning features for our classification experiments.

2.1 The Argument Structure Distinctions

The verb classes are exemplified below, in sentences repeated from Table 1 for ease of exposition:

Unergative    (1a) The horse raced past the barn.
              (1b) The jockey raced the horse past the barn.

Unaccusative  (2a) The butter melted in the pan.
              (2b) The cook melted the butter in the pan.

Object-Drop   (3a) The boy played.
              (3b) The boy played soccer.

The example sentences illustrate that all three classes participate in a diathesis alternation that relates a transitive and intransitive form of the verb. However, according to Levin (1993), each class exhibits a different type of diathesis alternation, which is determined by the particular semantic relations of the arguments to the verb. We make these distinctions explicit by drawing on a standard notion of thematic role, as each class has a distinct pattern of thematic assignments (i.e., different argument structures).

We assume here that a thematic role is a label taken from a fixed inventory of grammaticalized semantic relations; for example, an Agent is the doer of an action, and a Theme is the entity undergoing an event (Gruber 1965). While admitting that such notions as Agent and Theme lack formal definitions (in our work, and in the literature more widely), the distinctions are clear enough to discriminate our three verb classes. For our purposes, these roles can simply be thought of as semantic labels which are non-decomposable, but there is nothing in our approach that rests on this assumption. Thus, our approach would also be compatible with a feature-based definition of participant roles, as long as the features capture such general distinctions as, for example, the doer of an action and the entity acted upon (Dowty 1991).

Note that, in our focus on verb class distinctions, we have not considered finer-grained features that rely on more specific semantic properties—such as, for example, that the subject of the intransitive melt must be something that can change from solid to liquid. While this type of feature may be important for semantic distinctions among individual verbs, it thus far seems irrelevant to the level of verb classification that we adopt, which groups verbs more broadly according to syntactic and (somewhat coarser-grained) semantic properties.

Our analysis of thematic assignment—which was summarized in Table 2, repeated here as Table 3—is elaborated here for each verb class. The sentences in (1) above illustrate the relevant alternants of an unergative verb, race. Unergatives are intransitive action verbs whose transitive form, as in (1b), can be the causative counterpart of the intransitive form (1a). The type of causative alternation that unergatives participate in is the "induced action alternation", according to Levin (1993). For our thematic analysis, we note that the subject of an intransitive activity verb is specified to be an Agent. The subject of the transitive form is indicated by the label Agent of Causation, which indicates that the thematic role assigned to the subject is marked as the role which is

Computational Linguistics Volume 27 Number 3

Table 3
Summary of thematic assignments

                          Transitive                     Intransitive
Classes         Subject                  Object          Subject
Unergative      Agent (of Causation)     Agent           Agent
Unaccusative    Agent (of Causation)     Theme           Theme
Object-Drop     Agent                    Theme           Agent

introduced with the causing event. In a causative alternation, the semantic argument of the subject of the intransitive surfaces as the object of the transitive (Brousseau and Ritter 1991; Hale and Keyser 1993; Levin 1993; Levin and Rappaport Hovav 1995). For unergatives, this argument is an Agent, and thus the alternation yields an object in the transitive form that receives an Agent thematic role (Cruse 1972). These thematic assignments are shown in the first row of Table 3.

The sentences in (2) illustrate the corresponding forms of an unaccusative verb, melt. Unaccusatives are intransitive change-of-state verbs, as in (2a); the transitive counterpart for these verbs also exhibits a causative alternation, as in (2b). This is the "causative/inchoative alternation" (Levin 1993). Like unergatives, the subject of a transitive unaccusative is marked as the Agent of Causation. Unlike unergatives, though, the alternating argument of an unaccusative (the subject of the intransitive form that becomes the object of the transitive) is an entity undergoing a change of state without active participation, and is therefore a Theme. The resulting pattern of thematic assignments is indicated in the second row of Table 3.

The sentences in (3) use an object-drop verb, play. These are activity verbs that exhibit a non-causative diathesis alternation in which the object is simply optional. This is dubbed "the unexpressed object alternation" (Levin 1993), and has several subtypes that we do not distinguish here. The thematic assignment for these verbs is simply Agent for the subject (in both transitive and intransitive forms) and Theme for the optional object; see the last row of Table 3.

For further details and support of this analysis, please see the discussion in Stevenson and Merlo (1997b) and Merlo and Stevenson (2000b). For our purposes here, the important fact to note is that each of the three classes can be uniquely identified by the pattern of thematic assignments across the two alternants of the verbs.

2.2 Features for Automatic Classification

Our next task, then, is to derive from these thematic patterns useful features for automatically classifying the verbs. In what follows, we refer to the columns of Table 3 to explain how we expect the thematic distinctions to give rise to distributional properties which, when appropriately approximated through corpus counts, will discriminate across the three classes.

Transitivity. Consider the first two columns of thematic roles in Table 3, which illustrate the role assignment in the transitive construction. The Prague school's notion of linguistic markedness (Jakobson 1971; Trubetzkoy 1939) enables us to establish a scale of markedness of these thematic assignments, and make a principled prediction about their frequency of occurrence. Typical tests to determine the unmarked element of a pair or scale are simplicity (the unmarked element is simpler), distribution (the unmarked member is more widely attested across languages), and frequency (the unmarked member is more frequent) (Greenberg 1966; Moravcsik and Wirth 1983). The claim of markedness theory is that once an element has been identified by one test as the unmarked element of a scale, then all other tests will be correlated. The three thematic assignments appear to be ranked on a scale by the simplicity and distribution tests, as we describe below. From this we can conclude that frequency, as a third correlated test, should also be ranked by the same scale, and we can therefore make predictions about the expected frequencies of the three thematic assignments.

First, we note that the specification of an Agent of Causation for transitive unergatives (such as race) and unaccusatives (such as melt) indicates a causative construction. Causative constructions relate two events, the causing event and the core event described by the intransitive verb; the Agent of Causation is the Agent of the causing event. This double event structure can be considered as more complex than the single event that is found in a transitive object-drop verb (such as play) (Stevenson and Merlo 1997b). The simplicity test thus indicates that the causative unergatives and unaccusatives are marked in comparison to the transitive object-drop verbs.

We further observe that the causative transitive of an unergative verb has an Agent thematic role in object position, which is subordinated to the Agent of Causation in subject position, yielding an unusual "double agentive" thematic structure. This lexical causativization of unergatives (in contrast to analytic causativization) is a distributionally rarer phenomenon, found in fewer languages, than lexical causatives of unaccusatives. In asking native speakers about our verbs, we have found that lexical causatives of unergative verbs are not attested in Italian, French, German, Portuguese, Gungbe (Kwa family), and Czech. On the other hand, the lexical causatives are possible for unaccusative verbs (i.e., where the object is a Theme) in all these languages. Vietnamese appears to allow a very restricted form of causativization of unergatives, limited to only those cases that have a comitative reading. The typological distribution test thus indicates that unergatives are more marked than unaccusatives in the transitive form.

From these observations we can conclude that unergatives (such as race) have the most marked transitive argument structure, unaccusatives (such as melt) have an intermediately marked transitive argument structure, and object-drops (such as play) have the least marked transitive argument structure of the three. Under the assumptions of markedness theory outlined above, we then predict that unergatives are the least frequent in the transitive, that unaccusatives have intermediate frequency in the transitive, and that object-drop verbs are the most frequent in the transitive.

Causativity. Due to the causative alternation of unergatives and unaccusatives, the thematic role of the subject of the intransitive is identical to that of the object of the transitive, as shown in the second and third columns of thematic roles in Table 3. Given the identity of thematic role mapped to subject and object positions across the two alternants, we expect to observe the same noun occurring at times as subject of the verb and at other times as object of the verb. In contrast, for object-drop verbs, the thematic role of the subject of the intransitive is identical to that of the subject of the transitive, not the object of the transitive. We therefore expect that it will be less common for the same noun to occur in subject and object position across instances of the same object-drop verb.

Thus we hypothesize that this pattern of thematic role assignments will be reflected in a differential amount of usage across the classes of the same nouns as subjects and objects for a given verb. Generally, we would expect that causative verbs (in our case, the unergative and unaccusative verbs) would have a greater degree of overlap of nouns in subject and object position than non-causative transitive verbs (in our case, the object-drop verbs). However, since the causative is a transitive use, and the transitive use of unergatives is expected to be rare (see above), we do not expect unergatives to exhibit a high degree of detectable overlap in a corpus. Thus this overlap of subjects and objects should primarily distinguish unaccusatives (predicted to have high overlap of subjects and objects) from the other two classes (each of which is predicted to have low [detectable] overlap of subjects and objects).

Animacy. Finally, considering the roles in the first and last columns of thematic assignments in Table 3, we observe that unergative and object-drop verbs assign an agentive role to their subject in both the transitive and intransitive, while unaccusatives assign an agentive role to their subject only in the transitive. Under the assumption that the intransitive use of unaccusatives is not rare, we then expect that unaccusatives will occur less often overall with an agentive subject than will the other two verb classes. (The assumption that unaccusatives are not rare in the intransitive is based on the linguistic complexity of the causative transitive alternant, and is borne out in our corpus analysis.) On the further assumption that Agents tend to be animate entities more so than Themes are, we expect that unaccusatives will occur less frequently with an animate subject compared to unergative and object-drop verbs. Note the importance of our use of frequency distributions: the claim is not that only Agents can be animate, but rather that nouns that receive an Agent role will more often be animate than nouns that receive a Theme role.

Additional Features. The above interactions between thematic roles and the syntactic expressions of arguments thus lead to three features whose distributional properties appear promising for distinguishing unergative, unaccusative, and object-drop verbs: transitivity, causativity, and animacy of subject. We also investigate two additional syntactic features: the use of the passive or active voice, and the use of the past participle or simple past part-of-speech (POS) tag (VBN or VBD in the Penn Treebank style). These features are related to the transitive/intransitive alternation, since a passive use implies a transitive use of the verb, as well as to the use of a past participle form of the verb.2

Table 4 summarizes the features we derive from the thematic properties and our expectations concerning their frequency of use. We hypothesize that these five features will exhibit distributional differences in the observed usages of the verbs that can be used for classification. In the next section, we describe the actual corpus counts that we develop to approximate the features we have identified. (Notice that the counts will be imperfect approximations to the thematic knowledge, beyond the inevitable errors due to automatic extraction from large automatically annotated corpora. Even when the counts are precise, they only constitute an approximation to the actual thematic notions, since the features we are using are not logically implied by the knowledge we want to capture, but only statistically correlated.)

3. Data Collection and Analysis

Clearly, some of the features we've proposed are difficult (e.g., the passive use) or impossible (e.g., animate subject use) to automatically extract with high accuracy from a

2 For our sample verbs, the statistical correlation between the transitive and passive features is highly significant (N = 59, R = .44, p < .001), as is the correlation between the transitive and past participle features (N = 59, R = .36, p < .005). (Since, as explained in the next section, our features are expressed as proportions, e.g., percent transitive use out of detected transitive and intransitive use, correlations of intransitivity with passive or past participle use have the same magnitude but are negative.)

Table 4
The features and expected behavior

Feature         Expected Frequency Pattern   Explanation

Transitivity    Unerg < Unacc < ObjDrop      Unaccusatives and unergatives have a causative
                                             transitive, hence lower transitive use. Furthermore,
                                             unergatives have an agentive object, hence very low
                                             transitive use.

Causativity     Unerg, ObjDrop < Unacc       Object-drop verbs do not have a causal agent, hence
                                             low "causative" use. Unergatives are rare in the
                                             transitive, hence low causative use.

Animacy         Unacc < Unerg, ObjDrop       Unaccusatives have a Theme subject in the
                                             intransitive, hence lower use of animate subjects.

Passive Voice   Unerg < Unacc < ObjDrop      Passive implies transitive use, hence correlated
                                             with transitive feature.

VBN Tag         Unerg < Unacc < ObjDrop      Passive implies past participle use (VBN), hence
                                             correlated with transitive (and passive).

large corpus, given the current state of annotation. However, we do assume that currently available corpora, such as the Wall Street Journal (WSJ), provide a representative and large enough sample of language from which to gather corpus counts that can approximate the distributional patterns of the verb class alternations. Our work draws on two text corpora: one, an automatically tagged combined corpus of 65 million words (primarily WSJ); the second, an automatically parsed corpus of 29 million words (a subset of the WSJ text from the first corpus). Using these corpora, we develop counting procedures that yield relative frequency distributions for approximations to the five linguistic features we have determined, over a sample of verbs from our three classes.

3.1 Materials and Method

We chose a set of 20 verbs from each class, based primarily on the classification in Levin (1993).3 The complete list of verbs appears in Table 5; the group 1/group 2 designation is explained below in the section on counting. As indicated in the table, unergatives are manner-of-motion verbs (from the "run" class in Levin); unaccusatives are change-of-state verbs (from several of the classes in Levin's change-of-state super-class); while object-drop verbs were taken from a variety of classes in Levin's classification, all of which undergo the unexpressed object alternation. The most frequently used classes are verbs of change of possession, image-creation verbs, and verbs of creation and transformation. The selection of verbs was based partly on our intuitive judgment that the verbs were likely to be used with sufficient frequency in the WSJ. Also, each

3 We used an equal number of verbs from each class in order to have a balanced group of items. One potential disadvantage of this decision is that each verb class is represented equally, even though they may not be equally frequent in the corpora. Although we lose the relative frequency information among the classes that could provide a better bias for assigning a default classification (i.e., the most frequent one), we have the advantage that our classifier will be equally informed (in terms of number of exemplars) about each class.

Note that there are only 19 unaccusative verbs because ripped, which was initially counted in the unaccusatives, was then excluded from the analysis, as it occurred mostly in a very different usage in the corpus (as a verb particle in ripped off) from the intended optionally intransitive usage.

Table 5
Verbs used in the experiments

Class          Description           Selected Verbs

Unergative     manner of motion      jumped, rushed, marched, leaped, floated, raced,
                                     hurried, wandered, vaulted, paraded (group 1);
                                     galloped, glided, hiked, hopped, jogged, scooted,
                                     scurried, skipped, tiptoed, trotted (group 2)

Unaccusative   change of state       opened, exploded, flooded, dissolved, cracked,
                                     hardened, boiled, melted, fractured, solidified
                                     (group 1); collapsed, cooled, folded, widened,
                                     changed, cleared, divided, simmered, stabilized
                                     (group 2)

Object-Drop    unexpressed object    played, painted, kicked, carved, reaped, washed,
               alternation           danced, yelled, typed, knitted (group 1); borrowed,
                                     inherited, organized, rented, sketched, cleaned,
                                     packed, studied, swallowed, called (group 2)

verb presents the same form in the simple past and in the past participle (the regular "-ed" form). In order to simplify the counting procedure, we included only the "-ed" form of the verb, on the assumption that counts on this single verb form would approximate the distribution of the features across all forms of the verb. Additionally, as far as we were able given the preceding constraints, we selected verbs that could occur in the transitive and in the passive. Finally, we aimed for a frequency cut-off of 10 occurrences or more for each verb, although for unergatives we had to use one verb (jogged) that only occurred 8 times, in order to have 20 verbs that satisfied the other criteria above.

In performing this kind of corpus analysis, one has to recognize the fact that current corpus annotations do not distinguish verb senses. In these counts, we did not distinguish a core sense of the verb from an extended use of the verb. So, for instance, the sentence Consumer spending jumped 1.7% in February after a sharp drop the month before (WSJ 1987) is counted as an occurrence of the manner-of-motion verb jump in its intransitive form. This particular sense extension has a transitive alternant, but not a causative transitive (i.e., Consumer spending jumped the barrier, but not Low taxes jumped consumer spending). Thus, while the possible subcategorizations remain the same, rates of transitivity and causativity may be different than for the literal manner-of-motion sense. This is an unavoidable result of using simple automatic extraction methods, given the current state of annotation of corpora.

For each occurrence of each verb, we counted whether it was in a transitive or intransitive use (trans), in a passive or active use (pass), in a past participle or simple past use (vbn), in a causative or non-causative use (caus), and with an animate subject or not (anim).4 Note that, except for the vbn feature, for which we simply extract the POS tag from the corpus, all other counts are approximations to the actual linguistic behaviour of the verb, as we describe in detail below.

4 One additional feature was recorded, the log frequency of the verb in the 65 million word corpus, motivated by the conjecture that the frequency of a verb may help in predicting its class. In our machine learning experiments, however, this conjecture was not borne out, as the frequency feature did not improve performance. This is the case for experiments on all of the verbs, as well as for separate experiments on the group 1 verbs (which were matched across the classes for frequency) and the group 2 verbs (which were not). We therefore limit discussion here to the thematically-motivated features.

The first three counts (trans, pass, vbn) were performed on the tagged ACL/DCI corpus available from the Linguistic Data Consortium, which includes the Brown Corpus (of one million words) and years 1987–1989 of the Wall Street Journal, a combined corpus in excess of 65 million words. The counts for these features proceeded as follows:

° trans: A number, a pronoun, a determiner, an adjective, or a noun were considered to be indications of a potential object of the verb. A verb occurrence preceded by forms of the verb be, or immediately followed by a potential object, was counted as transitive; otherwise the occurrence was counted as intransitive (specifically, if the verb was followed by a punctuation sign (commas, colons, full stops), or by a conjunction, a particle, a date, or a preposition).

° pass: A main verb (i.e., tagged VBD) was counted as active. A token with tag VBN was also counted as active if the closest preceding auxiliary was have, while it was counted as passive if the closest preceding auxiliary was be.

° vbn: The counts for VBN/VBD were simply done based on the POS label within the tagged corpus.

Each of the above three counts was normalized over all occurrences of the "-ed" form of the verb, yielding a single relative frequency measure for each verb for that feature, i.e., percent transitive (versus intransitive) use, percent passive (versus active) use, and percent VBN (versus VBD) use, respectively.
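The three tagged-corpus heuristics and the normalization step can be sketched as follows. This is a minimal illustration, not the authors' actual extraction scripts: the token format, the set of object-signalling tags, and the auxiliary-tracking logic are all simplified assumptions on our part.

```python
from collections import Counter

# Simplified Penn Treebank tags that signal a potential object after the verb
OBJECT_TAGS = {"CD", "PRP", "DT", "JJ", "NN", "NNS", "NNP", "NNPS"}
BE_FORMS = {"be", "is", "are", "was", "were", "been", "being"}
HAVE_FORMS = {"have", "has", "had", "having"}

def feature_frequencies(tagged_tokens, verb):
    """Relative frequencies of trans, pass, and vbn for one '-ed' verb form,
    approximated from a list of (word, POS tag) pairs as described above."""
    counts = Counter()
    total = 0
    last_aux = None  # closest preceding auxiliary ('be' or 'have')
    for i, (word, tag) in enumerate(tagged_tokens):
        low = word.lower()
        if low in BE_FORMS:
            last_aux = "be"
        elif low in HAVE_FORMS:
            last_aux = "have"
        elif tag in {".", ":"}:  # reset at clause-final punctuation
            last_aux = None
        if low != verb:
            continue
        total += 1
        if tag == "VBN":
            counts["vbn"] += 1        # past participle, versus simple past (VBD)
            if last_aux == "be":
                counts["pass"] += 1   # closest preceding auxiliary be -> passive
        next_tag = tagged_tokens[i + 1][1] if i + 1 < len(tagged_tokens) else None
        if last_aux == "be" or next_tag in OBJECT_TAGS:
            counts["trans"] += 1      # preceding be, or a potential object follows
    total = total or 1
    # Normalize each count over all '-ed' occurrences of the verb
    return {f: counts[f] / total for f in ("trans", "pass", "vbn")}
```

On a three-sentence toy stream containing one passive, one transitive, and one intransitive use of melted, this yields trans 2/3, pass 1/3, vbn 1/3.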

The last two counts (caus and anim) were performed on a parsed version of the 1988 year of the Wall Street Journal, so that we could extract subjects and objects of the verbs more accurately. This corpus of 29 million words was provided to us by Michael Collins, and was automatically parsed with the parser described in Collins (1997).5 The counts and their justification are described here:

° caus: As discussed above, the object of a causative transitive is the same semantic argument of the verb as the subject of the intransitive. The causative feature was approximated by the following steps, intended to capture the degree to which the subject of a verb can also occur as its object. Specifically, for each verb occurrence, the subject and object (if there was one) were extracted from the parsed corpus. The observed subjects across all occurrences of the verb were placed into one multiset of nouns, and the observed objects into a second multiset of nouns. (A multiset, or bag, was used so that our representation indicated the number of times each noun was used as either subject or object.) Then the proportion of overlap between the two multisets was calculated. We define overlap as the largest multiset of elements belonging to both the subject and the object multisets; e.g., the overlap between {a, a, a, b} and {a} is {a, a, a}. The proportion is the ratio between the cardinality of the overlap multiset and the sum of the cardinalities of the subject and object multisets. For example, for the simple sets of characters above, the ratio would be 3/5, yielding a value of 60% for the caus feature.

5 Readers might be concerned about the portability of this method to languages for which no large parsed corpus is available. It is possible that using a fully parsed corpus is not necessary. Our results were replicated in English without the need for a fully parsed corpus (Anoop Sarkar, p.c., citing a project report by Wootiporn Tripasai). Our method was applied to 23 million words of the WSJ that were automatically tagged with Ratnaparkhi's maximum entropy tagger (Ratnaparkhi 1996) and chunked with the partial parser CASS (Abney 1996). The results are very similar to ours (best accuracy 66.6%), suggesting that a more accurate tagger than the one used on our corpus might in fact be sufficient to overcome the fact that no full parse is available.
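The overlap computation can be sketched directly from this definition. The reading of "largest multiset" implemented here (restrict each multiset to the noun types shared by both, then take the larger side) is our interpretation, chosen because it reproduces the worked example in the text; the function name is ours.

```python
from collections import Counter

def caus_feature(subjects, objects):
    """Proportion of subject/object overlap for one verb, approximating caus.
    Overlap is read as the largest multiset whose element types occur in both
    the subject and the object multisets (an assumption consistent with the
    worked example: {a, a, a, b} vs. {a} -> {a, a, a})."""
    subj, obj = Counter(subjects), Counter(objects)
    shared = set(subj) & set(obj)                 # noun types seen in both roles
    subj_tokens = sum(subj[w] for w in shared)    # shared-type tokens as subject
    obj_tokens = sum(obj[w] for w in shared)      # shared-type tokens as object
    overlap = max(subj_tokens, obj_tokens)
    total = sum(subj.values()) + sum(obj.values())
    return overlap / total if total else 0.0
```

On the text's example, `caus_feature(["a", "a", "a", "b"], ["a"])` returns 0.6, the 60% value given above.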

° anim: A problem with a feature like animacy is that it requires either manual determination of the animacy of extracted subjects, or reference to an on-line resource such as WordNet for determining animacy. To approximate animacy with a feature that can be extracted automatically, and without reference to a resource external to the corpus, we take advantage of the well-attested animacy hierarchy, according to which pronouns are the most animate (Silverstein 1976; Dixon 1994). The hypothesis is that the words I, we, you, she, he, and they most often refer to animate entities. This hypothesis was confirmed by extracting 100 occurrences of the pronoun they, which can be either animate or inanimate, from our 65 million word corpus. The occurrences immediately preceded a verb. After eliminating repetitions, 94 occurrences were left, which were classified by hand, yielding 71 animate pronouns, 11 inanimate pronouns, and 12 unclassified occurrences (for lack of sufficient context to recover the antecedent of the pronoun with certainty). Thus at least 76% of usages of they were animate; we assume the percentage of animate usages of the other pronouns to be even higher. Since the hypothesis was confirmed, we count pronouns (other than it) in subject position (Kariaeva [1999]; cf. Aone and McKee [1996]). The values for the feature were determined by automatically extracting all subject/verb tuples including our 59 example verbs from the parsed corpus, and computing the ratio of occurrences of pronoun subjects to all subjects for each verb.
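A minimal sketch of this ratio, assuming subject/verb tuples have already been extracted from a parsed corpus (the tuple format and function name are ours):

```python
# Pronouns treated as (mostly) animate; 'it' is deliberately excluded
ANIMATE_PRONOUNS = {"i", "we", "you", "she", "he", "they"}

def anim_feature(subject_verb_tuples, verb):
    """Ratio of pronoun subjects to all subjects for one verb: the
    automatically extractable approximation to animate-subject use."""
    subjects = [subj.lower() for subj, v in subject_verb_tuples if v == verb]
    if not subjects:
        return 0.0
    return sum(s in ANIMATE_PRONOUNS for s in subjects) / len(subjects)
```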

Finally, as indicated in Table 5, the verbs are designated as belonging to "group 1" or "group 2". All the verbs are treated equally in our data analysis and in the machine learning experiments, but this designation does indicate a difference in details of the counting procedures described above. The verbs in group 1 had been used in an earlier study in which it was important to minimize noisy data (Stevenson and Merlo 1997a), so they generally underwent greater manual intervention in the counts. In adding group 2 for the classification experiment, we chose to minimize the intervention, in order to demonstrate that the classification process is robust enough to withstand the resulting noise in the data.

For group 2, the transitivity, voice, and VBN counts were done automatically, without any manual intervention. For group 1, these three counts were done automatically by regular expression patterns, and then subjected to correction, partly by hand and partly automatically, by one of the authors. For transitivity, the adjustments vary for the individual verbs. Most of the reassignments from a transitive to an intransitive labelling occurred when the following noun was not the direct object, but rather a measure phrase or a date. Most of the reassignments from intransitive to transitive occurred when a particle or a preposition following the verb did not introduce a prepositional phrase, but instead indicated a passive form (by) or was part of a phrasal verb. Some verbs were mostly used adjectivally, in which case they were excluded from the transitivity counts. For voice, the required adjustments included cases of coordination of the past participle, when the verb was preceded by a conjunction or a comma.

Table 6
Aggregated relative frequency data for the five features. E = unergatives; A = unaccusatives; O = object-drops.

               trans          pass           vbn            caus           anim
Class   N   Mean    SD     Mean    SD     Mean    SD     Mean    SD     Mean    SD
E      20   0.23   0.23    0.07   0.12    0.21   0.26    0.00   0.00    0.25   0.24
A      19   0.40   0.24    0.33   0.27    0.65   0.27    0.12   0.14    0.07   0.09
O      20   0.62   0.25    0.31   0.26    0.65   0.23    0.04   0.07    0.15   0.14

These were collected and classified by hand as passive or active, based on intuition. Similarly, partial adjustments to the VBN counts were made by hand.

For the causativity feature, subjects and objects were determined by manual inspection of the corpus for verbs belonging to group 1, while they were extracted automatically from the parsed corpus for group 2. The group 1 verbs were sampled in three ways, depending on total frequency. For verbs with less than 150 occurrences, all instances of the verbs were used for subject/object extraction. For verbs whose total frequency was greater than 150, but whose VBD frequency was in the range 100–200, we extracted subjects and objects of the VBD occurrences only. For higher frequency verbs, we used only the first 100 VBD occurrences.6 The same script for computing the overlap of the extracted subjects and objects was then used on the resulting subject/verb and verb/object tuples, for both group 1 and group 2 verbs.

The animacy feature was calculated over subject/verb tuples extracted automatically, for both groups of verbs, from the parsed corpus.

3.2 Data Analysis

The data collection described above yields the following data points in total: trans 27,403; pass 20,481; vbn 36,297; caus 11,307; anim 7,542. (Different features yield different totals because they were sampled independently, and the search patterns to extract some features are more imprecise than others.) The aggregate means by class of the normalized frequencies for all verbs are shown in Table 6; item-by-item distributions are provided in Appendix A, and raw counts are available from the authors. Note that aggregate means are shown for illustration purposes only: all machine learning experiments are performed on the individual normalized frequencies for each verb, as given in Appendix A.

The observed distributions of each feature are indeed roughly as expected, according to the description in Section 2. Unergatives show a very low relative frequency of the trans feature, followed by unaccusatives, then object-drop verbs. Unaccusative verbs show a high frequency of the caus feature and a low frequency of the anim feature, compared to the other classes. Somewhat unexpectedly, object-drop verbs exhibit a non-zero mean caus value (almost half the verbs have a caus value greater than zero), leading to a three-way causative distinction among the verb classes. We suspect that the approximation that we used for causative use, the overlap between subjects

6 For this last set of high-frequency verbs (exploded, jumped, opened, played, rushed), we used the first 100 occurrences as the simplest way to collect the sample. In response to an anonymous reviewer's concern, we later verified that these counts were not different from counts obtained by random sampling of 100 VBD occurrences. A paired t-test of the two sets of counts (first-100 sampling and random sampling) indicates that the two sets of counts are not statistically different (t = 1.283, DF = 4, p = 0.2687).
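The paired t-test used in this comparison can be reproduced with a few lines of standard-library code. The counts below are invented purely for illustration; they are not the verb counts from the study.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(xs, ys):
    """Paired t statistic for two matched samples:
    t = mean(d) / (stdev(d) / sqrt(n)), with d the pairwise differences
    and DF = n - 1 (DF = 4 for five matched verbs, as in the footnote)."""
    d = [x - y for x, y in zip(xs, ys)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# Hypothetical counts for five verbs under the two sampling schemes
first_100 = [10, 12, 9, 11, 13]
random_100 = [9, 11, 9, 10, 12]
```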

Table 7
Manually (Man) and automatically (Aut) calculated features for a random sample of verbs. T = trans; P = pass; V = vbn; C = caus; A = anim.

         Unergative                   Unaccusative                 Object-Drop
    hopped        scurried      folded        stabilized     inherited     swallowed
    Man    Aut    Man    Aut    Man    Aut    Man    Aut     Man    Aut    Man    Aut
T   0.21   0.21   0.00   0.00   0.71   0.23   0.24   0.18    1.00   0.64   0.96   0.35
P   0.00   0.00   0.00   0.00   0.44   0.33   0.19   0.13    0.39   0.13   0.54   0.44
V   0.03   0.00   0.10   0.00   0.56   0.73   0.71   0.92    0.56   0.60   0.64   0.79
C   0.00   0.00   0.00   0.00   0.54   0.00   0.24   0.35    0.00   0.06   0.00   0.04
A   0.93   1.00   0.90   0.14   0.23   0.00   0.02   0.00    0.58   0.32   0.35   0.22

and objects for a verb, also captures a "reciprocity" effect for some object-drop verbs (such as call), in which subjects and objects can be similar types of entities. Finally, although expected to be redundant indicators of transitivity, pass and vbn, unlike trans, have very similar values for unaccusative and object-drop verbs, indicating that their distributions are sensitive to factors we have not yet investigated.

One issue we must address is how precisely the automatic counts reflect the actual linguistic behaviour of the verbs. That is, we must be assured that the patterns we note in the data in Table 6 are accurate reflections of the differential behaviour of the verb classes, and not an artifact of the way in which we estimate the features, or a result of inaccuracies in the counts. In order to evaluate the accuracy of our feature counts, we selected two verbs from each class and determined the "true" value of each feature for each of those six verbs through manual counting. The six verbs were randomly selected from the group 2 subset of the verbs, since counts for group 2 verbs (as explained above) had not undergone manual correction. This allows us to determine the accuracy of the fully automatic counting procedures. The selected verbs (and their frequencies) are: hopped (29), scurried (21), folded (189), stabilized (286), inherited (357), swallowed (152). For verbs that had a frequency of over 100 in the "-ed" form, we performed the manual counts on the first 100 occurrences.

Table 7 shows the results of the manual counts, reported as proportions to facilitate comparison to the normalized automatic counts, shown in adjoining columns. We observe first that, overall, most errors in the automatic counts occur in the unaccusative and object-drop verbs. While tagging errors affect the vbn feature for all of the verbs somewhat, we note that trans and pass are consistently underestimated for unaccusative and object-drop verbs. These errors make the unaccusative and object-drop feature values more similar to each other, and therefore potentially harder to distinguish. Furthermore, because the trans and pass values are underestimated by the automatic counts, and therefore lower in value, they are also closer to the values for the unergative verbs. For the caus feature, we predict the highest values for the unaccusative verbs, and while that prediction is confirmed, the automatic counts for that class also show the most errors. Finally, although the general pattern of higher values for the anim feature of unergatives and object-drop verbs is preserved in the automatic counts, the feature is underestimated for almost all the verbs, again making the values for that feature closer across the classes than they are in reality.

We conclude that, although there are inaccuracies in all the counts, the general patterns expected based on our analysis of the verb classes hold in both the manual and automatic counts. Errors in the estimating and counting procedures are therefore

not likely to be responsible for the pattern of data in Table 6 above, which generally matches our predictions. Furthermore, the errors, at least for this random sample of verbs, occur in a direction that makes our task of distinguishing the classes more difficult, and indicates that developing more accurate search patterns may possibly sharpen the class distinctions and improve the classification performance.

4. Experiments in Classification

In this section, we turn to our computational experiments, which investigate whether the statistical indicators of thematic properties that we have developed can in fact be used to classify verbs. Recall that the task we have set ourselves is that of automatically learning the best class for a set of usages of a verb, as opposed to classifying individual occurrences of the verb. The frequency distributions of our features yield a vector for each verb that represents the estimated values for the verb on each dimension, across the entire corpus:

Vector template: [verb-name, trans, pass, vbn, caus, anim, class]

Example: [opened, .69, .09, .21, .16, .36, unacc]

The resulting set of 59 vectors constitutes the data for our machine learning experiments. We use this data to train an automatic classifier to determine, given the feature values for a new verb (not from the training set), which of the three major classes of English optionally intransitive verbs it belongs to.
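The paper trains the C5.0 system on these vectors (see Section 4.1); since C5.0 is an external proprietary tool, the sketch below substitutes a simple nearest-centroid classifier over vectors of the same shape, just to make the data representation concrete. Only the vector for opened is taken from the example above; every other vector, and the test vector, is invented for illustration.

```python
# Each verb is represented by a vector of its five normalized feature
# values [trans, pass, vbn, caus, anim] plus a class label. Only the
# vector for "opened" is taken from the text; the rest are invented.

def nearest_centroid_train(data):
    """Compute the mean vector (centroid) of each class."""
    sums, counts = {}, {}
    for vec, label in data:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def nearest_centroid_predict(centroids, vec):
    """Assign the class whose centroid is nearest (squared Euclidean)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(centroids[label], vec))

training = [
    ([.69, .09, .21, .16, .36], "unacc"),    # opened (from the text)
    ([.75, .12, .25, .20, .30], "unacc"),    # invented
    ([.20, .05, .10, .01, .70], "unerg"),    # invented
    ([.25, .08, .12, .02, .75], "unerg"),    # invented
    ([.85, .50, .45, .05, .65], "objdrop"),  # invented
    ([.80, .45, .40, .04, .60], "objdrop"),  # invented
]
centroids = nearest_centroid_train(training)
print(nearest_centroid_predict(centroids, [.22, .06, .11, .02, .72]))  # unerg
```

This is only a stand-in for the decision-tree/rule-set learner actually used; the point is the vector-per-verb representation, not the choice of classifier.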

4.1 Experimental Methodology
In pilot experiments on a subset of the features, we investigated a number of supervised machine learning methods that produce automatic classifiers (decision tree induction, rule learning, and two types of neural networks), as well as hierarchical clustering; see Stevenson et al. (1999) for more detail. Because we achieved approximately the same level of performance in all cases, we narrowed our further experimentation to the publicly available version of the C5.0 machine learning system (http://www.rulequest.com), a newer version of C4.5 (Quinlan 1992), due to its ease of use and wide availability. The C5.0 system generates both decision trees and corresponding rule sets from a training set of known classifications. In our experiments, we found little to no difference in performance between the trees and rule sets, and report only the rule set results.

In the experiments below, we follow two methodologies in training and testing, each of which tests a subset of cases held out from the training data. Thus, in all cases, the results we report are on test data that was never seen in training.7

The first training and testing methodology we follow is 10-fold cross-validation. In this approach, the system randomly divides the data into ten parts, and runs ten times on a different 90%-training-data/10%-test-data split, yielding an average accuracy and standard error across the ten test sets. This training methodology is very useful for

7 One anonymous reviewer raised the concern that we do not test on verbs that were unseen by the authors prior to finalizing the specific features to count. However, this does not reduce the generality of our results. The features we use are motivated by linguistic theory, and derived from the set of thematic properties that discriminate the verb classes. It is therefore very unlikely that they are skewed to the particular verbs we have chosen. Furthermore, our cross-validation experiments, described in the next subsection, show that our results hold across a very large number of randomly selected subsets of this sample of verbs.


Computational Linguistics Volume 27 Number 3

our application, as it yields performance measures across a large number of training-data/test-data sets, avoiding the problems of outliers in a single random selection from a relatively small data set such as ours.
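The 10-fold procedure just described can be sketched as follows. The "classifier" used in the demonstration is only a majority-class stand-in that keeps the sketch self-contained; run on dummy data with the paper's 20/19/20 class proportions, it also illustrates why the chance baseline discussed below sits near 20/59.

```python
import random
from collections import Counter

def cross_validation(data, train_fn, predict_fn, k=10, seed=0):
    """Randomly divide `data` into k parts; train on k-1 parts and
    test on the held-out part, k times; return per-fold accuracies.
    (The paper further averages such runs over 50 random divisions.)"""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    accuracies = []
    for i, test in enumerate(folds):
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        model = train_fn(train)
        correct = sum(predict_fn(model, x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return accuracies

# Stand-in "classifier": always predict the training set's majority class.
train_majority = lambda train: Counter(y for _, y in train).most_common(1)[0][0]
predict_majority = lambda model, x: model

# 59 dummy verbs with the paper's class proportions (20/19/20).
data = ([(None, "unerg")] * 20 + [(None, "unacc")] * 19
        + [(None, "objdrop")] * 20)
accs = cross_validation(data, train_majority, predict_majority)
mean_acc = sum(accs) / len(accs)  # close to the 33.9% chance baseline
```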

The second methodology is a single hold-out training and testing approach. Here the system is run N times, where N is the size of the data set (i.e., the 59 verbs in our case), each time holding out a single data vector as the test case and using the remaining N-1 vectors as the training set. The single hold-out methodology yields an overall accuracy rate (when the results are averaged across all N trials), but also, unlike cross-validation, gives us classification results on each individual data vector. This property enables us to analyze differential performance on the individual verbs, and across the different verb classes.
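A minimal sketch of this single hold-out (leave-one-out) regime, demonstrated here with a toy 1-nearest-neighbour model on invented one-dimensional data rather than C5.0 on verb vectors:

```python
def single_hold_out(data, train_fn, predict_fn):
    """Run N trials, each holding out one vector as the test case and
    training on the remaining N-1; return the per-item (gold, assigned)
    labels and the overall accuracy across all N trials."""
    results = []
    for i, (x, gold) in enumerate(data):
        model = train_fn(data[:i] + data[i + 1:])
        results.append((gold, predict_fn(model, x)))
    accuracy = sum(g == a for g, a in results) / len(results)
    return results, accuracy

# Toy 1-nearest-neighbour "classifier" on 1-dimensional vectors.
train_1nn = lambda train: train
def predict_1nn(model, x):
    return min(model, key=lambda item: abs(item[0][0] - x[0]))[1]

# Invented, well-separated data for illustration.
data = [([0.1], "unerg"), ([0.2], "unerg"), ([0.9], "unacc"), ([1.0], "unacc")]
results, accuracy = single_hold_out(data, train_1nn, predict_1nn)
print(accuracy)  # 1.0 on this toy data
```

The per-item `results` list is what makes the per-verb and per-class analyses of Sections 4.3 and 4.4 possible.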

Under both training and testing methodologies, the baseline (chance) performance in this task, a three-way classification, is 33.9%. In the single hold-out methodology, there are 59 test cases, with 20, 19, and 20 verbs each from the unergative, unaccusative, and object-drop classes, respectively. Chance performance of picking a single class label as a default and assigning it to all cases would yield at most 20 out of the 59 cases correct, or 33.9%. For the cross-validation methodology, the determination of a baseline is slightly more complex, as we are testing on a random selection of 10% of the full data set in each run. The 33.9% figure represents the expected relative proportion of a test set that would be labelled correctly by assignment of a default class label to the entire test set. Although the precise make-up of the test cases varies, on average the test set will represent the class membership proportions of the entire set of verbs. Thus, as with the single hold-out approach, chance accuracy corresponds to a maximum of 20/59, or 33.9%, of the test set being labelled correctly.

The theoretical maximum accuracy for the task is, of course, 100%, although in Section 5 we discuss some classification results from human experts that indicate that a more realistic expectation is much lower (around 87%).

4.2 Results Using 10-Fold Cross-Validation
We first report the results of experiments using a training methodology of 10-fold cross-validation repeated 50 times. This means that the 10-fold cross-validation procedure is repeated for 50 different random divisions of the data. The numbers reported are the averages of the results over all the trials. That is, the average accuracy and standard error from each random division of the data (a single cross-validation run, including 10 training and test sets) are averaged across the 50 different random divisions. This large number of experimental trials gives us a very tight bound on the mean accuracy reported, enabling us to determine with high confidence the statistical significance of differences in results.

Table 8 shows that performance of classification using individual features varies greatly, from little above the baseline to almost 22% above the baseline, or a reduction of a third of the error rate, a very good result for a single feature. (All reported accuracies in Table 8 are statistically distinct at the p < .01 level, using an ANOVA [df = 249, F = 33472] with a Tukey-Kramer post test.)

The first line of Table 9 shows that the combination of all features achieves an accuracy of 69.8%, which is 35.9% over the baseline, for a reduction in the error rate of 54%. This is a rather considerable result, given the very low baseline (33.9%). Moreover, recall that our training and testing sets are always disjoint (cf. Lapata and Brew [1999], Siegel [1999]); in other words, we are predicting the classification of verbs that were never seen in the training corpus, the hardest situation for a classification algorithm.

The second through sixth lines of Table 9 show the accuracy achieved on each subset of features that results from removing a single feature. This allows us to evaluate


Table 8
Percent accuracy and standard error of the verb classification task, using each feature individually, under a training methodology of 10-fold cross-validation repeated 50 times.

Feature   Accuracy   SE

caus      55.7       .1
vbn       52.5       .5
pass      50.2       .5
trans     47.1       .4
anim      35.3       .5

Table 9
Percent accuracy and standard error of the verb classification task, using features in combination, under a training methodology of 10-fold cross-validation repeated 50 times.

    Features Used              Feature Not Used   Accuracy   SE

1   trans pass vbn caus anim   -                  69.8       .5
2   trans vbn caus anim        pass               69.8       .5
3   trans pass vbn anim        caus               67.3       .6
4   trans pass caus anim       vbn                66.5       .5
5   trans pass vbn caus        anim               63.2       .6
6   pass vbn caus anim         trans              61.6       .6

the contribution of each feature to the performance of the classification process, by comparing the performance of the subset without it to the performance using the full set of features. We see that the removal of pass (second line) has no effect on the results, while removal of the remaining features yields a 2-8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant at the p < .05 level, except for between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 3752] with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted by the individual feature behavior. For example, caus, which is the best individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, anim and trans, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.

The general behaviour in classification based on individual features, and on size 4 and size 5 subsets of features, is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance; Table 10 summarizes this information. In the first data column, the table illustrates the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size n subsets in which each feature occurs. For example, the second data cell in the second row (54.9) indicates the average accuracy of all subsets of size 2 that contain the feature vbn. The last row of the table indicates, for each feature, the average accuracy of all subsets containing that feature.


Table 10
Average percent accuracy of feature subsets, by subset size and by sets of each size including each feature.

Subset   Mean Accuracy    Mean Accuracy of Subsets that Include Each Feature
Size     by Subset Size   vbn     pass    trans   anim    caus

1        48.2             52.5    50.2    47.1    35.3    55.7
2        55.1             54.9    52.8    56.4    58.0    57.6
3        60.5             60.1    58.5    62.3    61.1    60.5
4        65.7             65.5    64.7    66.7    66.3    65.3
5        69.8             69.8    69.8    69.8    69.8    69.8

Mean Acc/Feature          60.6    59.2    60.5    58.1    61.8

The first observation, that more features perform better, is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that, on average, larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation, that the performance of individual features is not always a predictor of their performance in combination, is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above average performance for all subsets of size 2 or greater.
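The Table 10 summary can be computed mechanically from per-subset accuracies. The sketch below enumerates all 31 non-empty feature subsets and aggregates by size and by feature; the accuracy values it uses are invented (a simple linear function of subset size), not the paper's measured ones.

```python
from itertools import combinations

FEATURES = ["trans", "pass", "vbn", "caus", "anim"]

def table10_summary(subset_accuracy):
    """From a map frozenset(features) -> accuracy, compute the mean
    accuracy by subset size, and the mean accuracy of the size-n
    subsets that include each feature (the layout of Table 10)."""
    by_size, by_size_feat = {}, {}
    for subset, acc in subset_accuracy.items():
        by_size.setdefault(len(subset), []).append(acc)
        for feat in subset:
            by_size_feat.setdefault((len(subset), feat), []).append(acc)
    mean = lambda xs: sum(xs) / len(xs)
    return ({n: mean(a) for n, a in by_size.items()},
            {k: mean(a) for k, a in by_size_feat.items()})

# Illustrative accuracies only (not the paper's): each subset scores
# 0.3 plus 0.07 per feature, so larger subsets do better.
accs = {frozenset(s): 0.3 + 0.07 * n
        for n in range(1, 6) for s in combinations(FEATURES, n)}
size_means, feature_means = table10_summary(accs)
```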

4.3 Results Using Single Hold-Out Methodology
One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system, but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case, and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case, and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb, and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, which we turn to in Section 4.4.

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance; in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being


Table 11
Percent accuracy of the verb classification task, using features in combination, under a single hold-out training methodology.

    Features Used              Feature Not Used   Accuracy on All Verbs

1   trans pass vbn caus anim   -                  69.5
2   trans vbn caus anim        pass               71.2
3   trans pass vbn anim        caus               62.7
4   trans pass caus anim       vbn                61.0
5   trans pass vbn caus        anim               61.0
6   pass vbn caus anim         trans              64.4

Table 12
F score of classification within each class, under a single hold-out training methodology.

    Features Used              Feature    F (%) for   F (%) for   F (%) for
                               Not Used   Unergs      Unaccs      Objdrops

1   trans pass vbn caus anim   -          73.9        68.6        64.9
2   trans vbn caus anim        pass       76.2        75.7        61.6
3   trans pass vbn anim        caus       65.1        60.0        62.8
4   trans pass caus anim       vbn        66.7        65.0        51.3
5   trans pass vbn caus        anim       72.7        47.0        60.0
6   pass vbn caus anim         trans      78.1        51.5        61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5-8% negative effect on performance, again similar to the cross-validation results. (Although note that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9) than the unaccusative and object-drop verbs (F scores of 68.6 and 64.9, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.
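The balanced F score used in Table 12 (defined in footnote 8) can be computed per class from (gold, assigned) label pairs; a minimal sketch, run on invented toy results:

```python
def f_score_per_class(pairs):
    """pairs: (gold, assigned) labels for each verb.  Per class,
    P = TP/(TP+FP), R = TP/(TP+FN), balanced F = 2PR/(P+R)."""
    labels = {g for g, _ in pairs} | {a for _, a in pairs}
    scores = {}
    for lab in labels:
        tp = sum(g == lab and a == lab for g, a in pairs)
        fp = sum(g != lab and a == lab for g, a in pairs)
        fn = sum(g == lab and a != lab for g, a in pairs)
        if tp == 0:
            scores[lab] = 0.0
            continue
        p, r = tp / (tp + fp), tp / (tp + fn)
        scores[lab] = 2 * p * r / (p + r)
    return scores

# Invented toy results: gold label vs. label assigned by a classifier.
pairs = [("unerg", "unerg"), ("unerg", "unacc"),
         ("unacc", "unacc"), ("objdrop", "objdrop")]
scores = f_score_per_class(pairs)
```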

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs; that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results, there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence, here we use the balanced F score, which calculates an overall measure of performance as 2PR/(P + R), in which P (precision) is truePositives / (truePositives + falsePositives), and R (recall) is truePositives / (truePositives + falseNegatives).


frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8), but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.
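The frequency binning used in this analysis can be sketched as follows; the verb frequencies and correctness flags in the demonstration are invented, not the paper's.

```python
import math

def errors_by_log_frequency(verbs):
    """verbs: list of (corpus_frequency, classified_correctly) pairs.
    Bin by log10 frequency (< 2, 2-3, > 3) and return
    (errors, total) per bin, as in the analysis above."""
    bins = {"<2": [0, 0], "2-3": [0, 0], ">3": [0, 0]}
    for freq, correct in verbs:
        lf = math.log10(freq)
        key = "<2" if lf < 2 else ("2-3" if lf <= 3 else ">3")
        bins[key][1] += 1            # total verbs in this bin
        bins[key][0] += 0 if correct else 1  # errors in this bin
    return {k: tuple(v) for k, v in bins.items()}

# Invented data: (frequency, was the verb classified correctly?)
demo = [(29, True), (95, False), (189, True), (357, True), (5000, False)]
rates = errors_by_log_frequency(demo)
```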

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification
We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans, pass, vbn, caus, anim) with the class labels assigned using each size 4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature, in going from the four-feature to five-feature set. Thus, we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2, and summarized here in Table 13.

We illustrate the data with a set of confusion matrices, in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors, but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb, in that the action is inherently "uncontrolled", and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.


Table 13
Expected class discriminations for each feature.

Feature          Expected Frequency Pattern

Transitivity     Unerg < Unacc < ObjDrop
Causativity      Unerg < ObjDrop < Unacc
Animacy          Unacc < Unerg < ObjDrop
Passive Voice    Unerg < Unacc < ObjDrop
VBN Tag          Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features compared to two of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops.

                          Assigned Class

              All features    w/o caus       w/o anim
              E    A    O     E    A    O    E    A    O

Actual   E         1    2          4    2         2    2
Class    A    4         3     5         2    5         6
         O    5    3          4    5         3    5

Table 15
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops.

                          Assigned Class

              All features   w/o trans     w/o pass      w/o vbn
              E   A   O      E   A   O     E   A   O     E   A   O

Actual   E        1   2          2   2         1   3         1   5
Class    A    4       3      3       7     1       4     2       4
         O    5   3          2   5         5   3         4   6

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes, in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
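This two-cell computation can be sketched directly. The matrix below encodes the error cells of the "all features" panel of Table 14 (the diagonal is zero only because the table reports misclassifications, not correct assignments):

```python
def confusability(matrix, c1, c2):
    """Number of errors confusing two classes: verbs of c1 assigned
    c2, plus verbs of c2 assigned c1 (the two cells opposite the
    diagonal of the confusion matrix)."""
    return matrix[c1][c2] + matrix[c2][c1]

# Error cells from the "all features" panel of Table 14
# (rows = actual class, columns = assigned class).
all_features = {
    "unerg":   {"unerg": 0, "unacc": 1, "objdrop": 2},
    "unacc":   {"unerg": 4, "unacc": 0, "objdrop": 3},
    "objdrop": {"unerg": 5, "unacc": 3, "objdrop": 0},
}
print(confusability(all_features, "unacc", "objdrop"))  # 6, as in the text
```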

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features


we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature, compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives, and between unaccusatives and object-drops, is 9 and 7 errors, respectively; but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature, compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors of 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated to trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans (that is, also making a three-way distinction among the classes), although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method to evaluating our approach to verb classification.


Table 16
Percent agreement (Agr) and pairwise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

         Program      E1           E2           E3
         Agr    K     Agr    K     Agr    K     Agr    K

E1       59     .36
E2       68     .50   75     .59
E3       66     .49   70     .53   77     .66
Levin    69.5   .54   71     .56   86.5   .80   83     .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult, and not likely to be performed at 100% accuracy even by experts; it is also likely to show differences in classification between experts. We report here the results of two experiments, which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results, provided by the non-forced-choice task, where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from Chicago University Press. The verbs were to be classified into the three target classes (unergative, unaccusative, and object-drop), which were described in the instructions.10 (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject, and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training (which yields an accuracy of 69.5%), because those results provide the classification for each of the individual verbs.


an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55-57) (using the z distribution to determine significance; p < .0001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance, to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
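The paper computes Kappa following Klauer (1987); the sketch below is the standard two-rater formulation, K = (Po - Pe) / (1 - Pe), where chance agreement Pe is derived from each rater's label proportions, and it may differ in detail from Klauer's multi-rater version. The ratings in the demonstration are invented.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """K = (Po - Pe) / (1 - Pe): observed agreement corrected for the
    chance agreement implied by each rater's label proportions."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Invented ratings for illustration.
rater_1 = ["unerg", "unerg", "unacc", "unacc"]
rater_2 = ["unerg", "unacc", "unacc", "unacc"]
print(round(cohen_kappa(rater_1, rater_2), 2))  # 0.5
```

Note how the same 75% observed agreement would yield a different K if the raters used the labels in different proportions, which is exactly the point made in the text.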

Evaluating the experts' performance, summarized in Table 16, we can remark two things, which confirm our expectations. First, the task is difficult (i.e., not performed at 100%, or close to it, even by trained experts) when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a three-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%; K = .6) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
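The majority-vote combination with its fallback can be sketched in a few lines; the labels used in the demonstration are invented.

```python
from collections import Counter

def majority_vote(expert_labels, tiebreak_label):
    """Assign the label given by at least two of the three experts;
    when all three disagree, fall back to the label assigned by the
    most accurate expert (passed in as `tiebreak_label`)."""
    label, count = Counter(expert_labels).most_common(1)[0]
    return label if count >= 2 else tiebreak_label

# Two experts agree: the majority label wins.
print(majority_vote(["unerg", "unerg", "unacc"], "unacc"))   # unerg
# All three disagree: fall back to the most accurate expert's label.
print(majority_vote(["unerg", "unacc", "objdrop"], "unacc"))  # unacc
```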

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task, 86.5%, as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 15% less than one of the human experts in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.
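The 68% figure follows from the standard calculation of error-rate reduction between a baseline and a ceiling; a quick check with the paper's numbers plugged in:

```python
def error_reduction(accuracy, baseline, upper_bound=1.0):
    """Proportion of the achievable error (between baseline and ceiling)
    that the classifier eliminates."""
    return (accuracy - baseline) / (upper_bound - baseline)

# Chance baseline 33.9%, expert-based ceiling 86.5%, program accuracy 69.5%:
# (0.695 - 0.339) / (0.865 - 0.339) is approximately 0.68.
```

The same formula with a ceiling of 100% gives the 54% reduction over chance reported in the conclusions.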

One interesting question is whether experts and program disagree on the same verbs and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard; however, in 9 cases, at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3) and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

Merlo and Stevenson, Statistical Verb Classification

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1,000), and those with log frequency over 3 (i.e., frequency over 1,000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drop); the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops); and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote, verbs with frequency less than 100 yield an accuracy of 92% (K = .84); verbs with frequency between 100 and 1,000, accuracy 80% (K = .69); and verbs with frequency over 1,000, accuracy 90% (K = .82). For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5% (K = .74); verbs with frequency between 100 and 1,000, accuracy 84% (K = .76); and verbs with frequency over 1,000, accuracy 90% (K = .82).
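The grouping described above is a simple base-10 log binning of corpus frequency; a sketch (band names are illustrative):

```python
import math

def frequency_group(freq):
    """Assign a verb to a frequency band by order of magnitude."""
    log_f = math.log10(freq)
    if log_f < 2:
        return "low"    # frequency below 100
    if log_f <= 3:
        return "mid"    # frequency between 100 and 1,000
    return "high"       # frequency above 1,000
```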

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs taken from Levin's (1993) electronic index, as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = 0.64) with each other, and 86% (K = 0.80) and 69% (K = 0.57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = 0.56] with each other, and 68% [K = 0.55] and 60.5% [K = 0.46] with the gold standard, respectively.) These results show that the forced-choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

Computational Linguistics, Volume 27, Number 3

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally, in syntactic alternations, giving rise to differences in subcategorization frames or the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs (their argument structure) are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989]; Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a "change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998; Siegel 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993]; Sanfilippo and Poznanski [1992]; Manning [1993]; Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary, reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed, which can work on a rough collection of texts, and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition, or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE, and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multilingual texts. Like ours, their work goes beyond statistics over subcategorizations to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.


Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45% with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
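Our F-score calculation here is the usual balanced harmonic mean of precision and recall, applied to Schulte im Walde's reported 61% precision and 36% recall:

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# f_score(0.61, 0.36) is approximately 0.45, the figure cited in the text.
```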

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition, we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, that is, test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and Lapata and Brew (1999) in attempting to learn semantic notions from distributions of indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature), or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations, would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification (i.e., sets of verbs that have the same multi-way classifications according to Levin [1993]) can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form, in themselves, a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction, or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large, syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach, it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.

Acknowledgments

We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM); the United States NSF (grants 9702331 and 9818322 to SS); the Canadian NSERC (grant to SS); the University of Toronto; and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions, and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: The Kappa statistic. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr Bonnie 1997 Large-scale dictionaryconstruction for foreign language tutoringand interlingual machine translationMachine Translation 12(4)1ndash55

Dorr Bonnie Joe Garman and AmyWeinberg 1995 From syntactic encodingsto thematic roles Building lexical entriesfor interlingual MT Journal of MachineTranslation 9(3)71ndash100


Merlo and Stevenson          Statistical Verb Classification

Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relations. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe zéro. In Selected Writings, volume 2, 2d edition. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: The case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness – an overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Computers and the Humanities (Special Issue on SENSEVAL-98: Evaluating Word Sense Disambiguation Systems), 34(1–2):217–222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.


Computational Linguistics Volume 27 Number 3

Pinker, Steven. 1989. Learnability and Cognition: The Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: An information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Associates, Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using Machine Learning Methods to Combine Corpus-Based Indicators for Aspectual Classification of Clauses. Ph.D. thesis, Department of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.



Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs
            Freq   vbn   pass  trans  caus  anim

Min Value      8   0.00  0.00  0.00   0.00  0.00
Max Value   4088   1.00  0.39  0.74   0.00  1.00

floated      176   0.43  0.26  0.74   0.00  0.17
hurried       86   0.40  0.31  0.50   0.00  0.37
jumped      4088   0.09  0.00  0.03   0.00  0.20
leaped       225   0.09  0.00  0.05   0.00  0.13
marched      238   0.10  0.01  0.09   0.00  0.12
paraded       33   0.73  0.39  0.46   0.00  0.50
raced        123   0.01  0.00  0.06   0.00  0.15
rushed       467   0.22  0.12  0.20   0.00  0.10
vaulted       54   0.00  0.00  0.41   0.00  0.03
wandered      67   0.02  0.00  0.03   0.00  0.32
galloped      12   1.00  0.00  0.00   0.00  0.00
glided        14   0.00  0.00  0.08   0.00  0.50
hiked         25   0.28  0.12  0.29   0.00  0.40
hopped        29   0.00  0.00  0.21   0.00  1.00
jogged         8   0.29  0.00  0.29   0.00  0.33
scooted       10   0.00  0.00  0.43   0.00  0.00
scurried      21   0.00  0.00  0.00   0.00  0.14
skipped       82   0.22  0.02  0.64   0.00  0.16
tiptoed       12   0.17  0.00  0.00   0.00  0.00
trotted       37   0.19  0.17  0.07   0.00  0.18

Unaccusative Verbs
            Freq   vbn   pass  trans  caus  anim

Min Value     13   0.16  0.00  0.02   0.00  0.00
Max Value   5543   0.95  0.80  0.76   0.41  0.36

boiled        58   0.92  0.70  0.42   0.00  0.00
cracked      175   0.61  0.19  0.76   0.02  0.14
dissolved    226   0.51  0.58  0.71   0.05  0.11
exploded     409   0.34  0.02  0.66   0.37  0.04
flooded      235   0.47  0.57  0.44   0.04  0.03
fractured     55   0.95  0.76  0.51   0.00  0.00
hardened     123   0.92  0.55  0.56   0.12  0.00
melted        70   0.80  0.44  0.02   0.00  0.19
opened      3412   0.21  0.09  0.69   0.16  0.36
solidified    34   0.65  0.21  0.68   0.00  0.12
collapsed    950   0.16  0.00  0.16   0.01  0.02
cooled       232   0.85  0.21  0.29   0.13  0.11
folded       189   0.73  0.33  0.23   0.00  0.00
widened     1155   0.18  0.02  0.13   0.41  0.01
changed     5543   0.73  0.23  0.47   0.22  0.08
cleared     1145   0.58  0.40  0.50   0.31  0.06
divided     1539   0.93  0.80  0.17   0.10  0.05
simmered      13   0.83  0.00  0.09   0.00  0.00
stabilized   286   0.92  0.13  0.18   0.35  0.00



Object-Drop Verbs
            Freq   vbn   pass  trans  caus  anim

Min Value     39   0.10  0.04  0.21   0.00  0.00
Max Value  15063   0.95  0.99  1.00   0.24  0.42

carved       185   0.85  0.66  0.98   0.00  0.00
danced        88   0.22  0.14  0.37   0.00  0.00
kicked       308   0.30  0.18  0.97   0.00  0.33
knitted       39   0.95  0.99  0.93   0.00  0.00
painted      506   0.72  0.18  0.71   0.00  0.38
played      2689   0.38  0.16  0.24   0.00  0.00
reaped       172   0.56  0.05  0.90   0.00  0.22
typed         57   0.81  0.74  0.81   0.00  0.00
washed       137   0.79  0.60  1.00   0.00  0.00
yelled        74   0.10  0.04  0.38   0.00  0.00
borrowed    1188   0.77  0.15  0.60   0.13  0.19
inherited    357   0.60  0.13  0.64   0.06  0.32
organized   1504   0.85  0.38  0.65   0.18  0.07
rented       232   0.72  0.22  0.61   0.00  0.42
sketched      44   0.67  0.17  0.44   0.00  0.20
cleaned      160   0.83  0.47  0.21   0.05  0.21
packed       376   0.84  0.12  0.40   0.05  0.19
studied      901   0.66  0.17  0.57   0.05  0.11
swallowed    152   0.79  0.44  0.35   0.04  0.22
called     15063   0.56  0.22  0.72   0.24  0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.

Accuracy  SE   Features                        Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim        57.3      0.5  trans caus
69.8      0.5  trans vbn caus anim             57.3      0.5  pass vbn anim
67.3      0.6  trans pass vbn anim             56.7      0.5  pass caus anim
66.7      0.5  trans vbn anim                  55.7      0.5  vbn caus
66.5      0.5  trans pass caus anim            55.7      0.1  caus
64.4      0.5  trans vbn caus                  55.4      0.4  pass caus
63.2      0.6  trans pass vbn caus             55.0      0.6  trans pass vbn
63.0      0.5  trans pass anim                 54.7      0.4  trans pass
62.9      0.4  trans caus anim                 54.2      0.5  trans vbn
62.1      0.5  caus anim                       52.5      0.5  vbn
61.7      0.5  trans pass caus                 50.9      0.5  pass anim
61.6      0.6  pass vbn caus anim              50.2      0.6  pass vbn
60.1      0.4  vbn caus anim                   50.2      0.5  pass
59.5      0.6  trans anim                      47.1      0.4  trans
59.4      0.5  vbn anim                        35.3      0.5  anim
57.4      0.6  pass vbn caus
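The overlap check described in this appendix can be carried out mechanically; the following sketch implements it, with the two example comparisons drawn from the table above (the helper names are our own, not from the paper):

```python
def confidence_interval(accuracy, std_error, t_crit=2.01):
    """95% confidence interval for an accuracy, given its standard
    error and the critical t value for df = 49 (t = 2.01)."""
    margin = t_crit * std_error
    return accuracy - margin, accuracy + margin

def significantly_different(acc1, se1, acc2, se2):
    """Two results differ at p < .05 iff their 95% CIs do not overlap."""
    lo1, hi1 = confidence_interval(acc1, se1)
    lo2, hi2 = confidence_interval(acc2, se2)
    return hi1 < lo2 or hi2 < lo1

# Full feature set (69.8, SE 0.5) vs. 'anim' alone (35.3, SE 0.5):
print(significantly_different(69.8, 0.5, 35.3, 0.5))  # True: intervals do not overlap
# A result compared against itself can never be significant:
print(significantly_different(69.8, 0.5, 69.8, 0.5))  # False
```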




rate in this classification task by more than 50% over chance. Specifically, we achieve almost 70% accuracy in a task whose baseline (chance) performance is 34%, and whose expert-based upper bound is calculated at 86.5%.

Beyond the interest for the particular classification task at hand, this work addresses more general issues concerning verb class distinctions based in argument structure. We evaluate our hypothesis that such distinctions are reflected in statistics over corpora through a computational experimental methodology, in which we investigate, as indicated, each of the subhypotheses below in the context of the three verb classes under study:

• Lexical features capture argument structure differences between verb classes.1

• The linguistically distinctive features exhibit distributional differences across the verb classes that are apparent within linguistic experience (i.e., they can be collected from text).

• The statistical distributions of (some of) the features contribute to learning the classifications of the verbs.

In the following sections we show that all three hypotheses above are borne out. In Section 2, we describe the argument structure distinctions of our three verb classes in more detail. In support of the first hypothesis above, we discuss lexical correlates of the underlying differences in thematic assignments that distinguish the three verb classes under investigation. In Section 3, we show how to approximate these features by simple syntactic counts, and how to perform these counts on available corpora. We confirm the second hypothesis above by showing that the differences in distribution predicted by the underlying argument structures are largely found in the data. In Section 4, in a series of machine learning experiments and a detailed analysis of errors, we confirm the third hypothesis by showing that the differences in the distribution of the extracted features are successfully used for verb classification. Section 5 evaluates the significance of these results by comparing the program's accuracy to an expert-based upper bound. We conclude the paper with a discussion of its contributions, comparison to related work, and suggestions for future extensions.

2. Deriving Classification Features from Argument Structure

Our task is to automatically build a classifier that can distinguish the three major classes of optionally intransitive verbs in English. As described above, these classes are differentiated by their argument structures. In the first subsection below, we elaborate on our description of the thematic role assignments for each of the verb classes under investigation: unergative, unaccusative, and object-drop. This analysis yields a distinctive pattern of thematic assignment for each class. (For more detailed discussion concerning the linguistic properties of these classes and the behavior of their component verbs, please see Stevenson and Merlo [1997b] and Merlo and Stevenson [2000b].)

Of course, the key to any automatic classification task is to determine a set of useful features for discriminating the items to be classified. In the second subsection below,

1 By lexical we mean features that we think are likely stored in the lexicon, because they are properties of words and not of phrases or sentences. Note, however, that some lexical features may not necessarily be stored with individual words; indeed, the motivation for classifying verbs, to capture generalizations within each class, suggests otherwise.



we show how the analysis of thematic distinctions enables us to determine lexical properties that we hypothesize will exhibit useful, detectable frequency differences in our corpora, and thus serve as the machine learning features for our classification experiments.

2.1 The Argument Structure Distinctions
The verb classes are exemplified below in sentences repeated from Table 1 for ease of exposition:

Unergative    (1a) The horse raced past the barn.
              (1b) The jockey raced the horse past the barn.

Unaccusative  (2a) The butter melted in the pan.
              (2b) The cook melted the butter in the pan.

Object-Drop   (3a) The boy played.
              (3b) The boy played soccer.

The example sentences illustrate that all three classes participate in a diathesis alternation that relates a transitive and an intransitive form of the verb. However, according to Levin (1993), each class exhibits a different type of diathesis alternation, which is determined by the particular semantic relations of the arguments to the verb. We make these distinctions explicit by drawing on a standard notion of thematic role, as each class has a distinct pattern of thematic assignments (i.e., different argument structures).

We assume here that a thematic role is a label taken from a fixed inventory of grammaticalized semantic relations; for example, an Agent is the doer of an action, and a Theme is the entity undergoing an event (Gruber 1965). While admitting that such notions as Agent and Theme lack formal definitions (in our work and in the literature more widely), the distinctions are clear enough to discriminate our three verb classes. For our purposes, these roles can simply be thought of as semantic labels which are non-decomposable, but there is nothing in our approach that rests on this assumption. Thus, our approach would also be compatible with a feature-based definition of participant roles, as long as the features capture such general distinctions as, for example, the doer of an action and the entity acted upon (Dowty 1991).

Note that in our focus on verb class distinctions, we have not considered finer-grained features that rely on more specific semantic properties, such as, for example, that the subject of the intransitive melt must be something that can change from solid to liquid. While this type of feature may be important for semantic distinctions among individual verbs, it thus far seems irrelevant to the level of verb classification that we adopt, which groups verbs more broadly according to syntactic and (somewhat coarser-grained) semantic properties.

Our analysis of thematic assignment, which was summarized in Table 2 (repeated here as Table 3), is elaborated here for each verb class. The sentences in (1) above illustrate the relevant alternants of an unergative verb, race. Unergatives are intransitive action verbs whose transitive form, as in (1b), can be the causative counterpart of the intransitive form (1a). The type of causative alternation that unergatives participate in is the "induced action alternation," according to Levin (1993). For our thematic analysis, we note that the subject of an intransitive activity verb is specified to be an Agent. The subject of the transitive form is indicated by the label Agent of Causation, which indicates that the thematic role assigned to the subject is marked as the role which is



Table 3
Summary of thematic assignments.

                      Transitive                        Intransitive
Classes       Subject                 Object     Subject

Unergative    Agent (of Causation)    Agent      Agent
Unaccusative  Agent (of Causation)    Theme      Theme
Object-Drop   Agent                   Theme      Agent

introduced with the causing event. In a causative alternation, the semantic argument of the subject of the intransitive surfaces as the object of the transitive (Brousseau and Ritter 1991; Hale and Keyser 1993; Levin 1993; Levin and Rappaport Hovav 1995). For unergatives, this argument is an Agent, and thus the alternation yields an object in the transitive form that receives an Agent thematic role (Cruse 1972). These thematic assignments are shown in the first row of Table 3.

The sentences in (2) illustrate the corresponding forms of an unaccusative verb, melt. Unaccusatives are intransitive change-of-state verbs, as in (2a); the transitive counterpart for these verbs also exhibits a causative alternation, as in (2b). This is the "causative/inchoative alternation" (Levin 1993). Like unergatives, the subject of a transitive unaccusative is marked as the Agent of Causation. Unlike unergatives, though, the alternating argument of an unaccusative (the subject of the intransitive form that becomes the object of the transitive) is an entity undergoing a change of state without active participation, and is therefore a Theme. The resulting pattern of thematic assignments is indicated in the second row of Table 3.

The sentences in (3) use an object-drop verb, play. These are activity verbs that exhibit a non-causative diathesis alternation, in which the object is simply optional. This is dubbed "the unexpressed object alternation" (Levin 1993), and has several subtypes that we do not distinguish here. The thematic assignment for these verbs is simply Agent for the subject (in both transitive and intransitive forms) and Theme for the optional object; see the last row of Table 3.

For further details and support of this analysis, please see the discussion in Stevenson and Merlo (1997b) and Merlo and Stevenson (2000b). For our purposes here, the important fact to note is that each of the three classes can be uniquely identified by the pattern of thematic assignments across the two alternants of the verbs.
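The one-to-one correspondence between classes and thematic patterns can be made concrete by encoding Table 3 directly; this is a simple illustrative encoding for exposition, not a component of the classifier itself:

```python
# Thematic-assignment patterns from Table 3, encoded as the triple
# (transitive subject, transitive object, intransitive subject).
# Each triple is unique, so it identifies exactly one verb class.
THEMATIC_PATTERNS = {
    ("Agent of Causation", "Agent", "Agent"): "unergative",
    ("Agent of Causation", "Theme", "Theme"): "unaccusative",
    ("Agent", "Theme", "Agent"): "object-drop",
}

# Looking up the melt-type pattern recovers its class:
print(THEMATIC_PATTERNS[("Agent of Causation", "Theme", "Theme")])  # unaccusative
```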

2.2 Features for Automatic Classification
Our next task, then, is to derive from these thematic patterns useful features for automatically classifying the verbs. In what follows, we refer to the columns of Table 3 to explain how we expect the thematic distinctions to give rise to distributional properties which, when appropriately approximated through corpus counts, will discriminate across the three classes.

Transitivity. Consider the first two columns of thematic roles in Table 3, which illustrate the role assignment in the transitive construction. The Prague School's notion of linguistic markedness (Jakobson 1971; Trubetzkoy 1939) enables us to establish a scale of markedness of these thematic assignments, and make a principled prediction about their frequency of occurrence. Typical tests to determine the unmarked element of a pair or scale are simplicity (the unmarked element is simpler), distribution (the unmarked member is more widely attested across languages), and frequency (the unmarked member is more frequent) (Greenberg 1966; Moravcsik and Wirth 1983). The claim of markedness theory is that once an element has been identified by one test as the unmarked element of a scale, then all other tests will be correlated. The three thematic assignments appear to be ranked on a scale by the simplicity and distribution tests, as we describe below. From this we can conclude that frequency, as a third correlated test, should also be ranked by the same scale, and we can therefore make predictions about the expected frequencies of the three thematic assignments.

First, we note that the specification of an Agent of Causation for transitive unergatives (such as race) and unaccusatives (such as melt) indicates a causative construction. Causative constructions relate two events, the causing event and the core event described by the intransitive verb; the Agent of Causation is the Agent of the causing event. This double event structure can be considered as more complex than the single event that is found in a transitive object-drop verb (such as play) (Stevenson and Merlo 1997b). The simplicity test thus indicates that the causative unergatives and unaccusatives are marked in comparison to the transitive object-drop verbs.

We further observe that the causative transitive of an unergative verb has an Agent thematic role in object position, which is subordinated to the Agent of Causation in subject position, yielding an unusual "double agentive" thematic structure. This lexical causativization of unergatives (in contrast to analytic causativization) is a distributionally rarer phenomenon, found in fewer languages, than lexical causativization of unaccusatives. In asking native speakers about our verbs, we have found that lexical causatives of unergative verbs are not attested in Italian, French, German, Portuguese, Gungbe (Kwa family), and Czech. On the other hand, the lexical causatives are possible for unaccusative verbs (i.e., where the object is a Theme) in all these languages. Vietnamese appears to allow a very restricted form of causativization of unergatives, limited to only those cases that have a comitative reading. The typological distribution test thus indicates that unergatives are more marked than unaccusatives in the transitive form.

From these observations, we can conclude that unergatives (such as race) have the most marked transitive argument structure, unaccusatives (such as melt) have an intermediately marked transitive argument structure, and object-drops (such as play) have the least marked transitive argument structure of the three. Under the assumptions of markedness theory outlined above, we then predict that unergatives are the least frequent in the transitive, that unaccusatives have intermediate frequency in the transitive, and that object-drop verbs are the most frequent in the transitive.

Causativity. Due to the causative alternation of unergatives and unaccusatives, the thematic role of the subject of the intransitive is identical to that of the object of the transitive, as shown in the second and third columns of thematic roles in Table 3. Given the identity of thematic role mapped to subject and object positions across the two alternants, we expect to observe the same noun occurring at times as subject of the verb and at other times as object of the verb. In contrast, for object-drop verbs, the thematic role of the subject of the intransitive is identical to that of the subject of the transitive, not the object of the transitive. We therefore expect that it will be less common for the same noun to occur in subject and object position across instances of the same object-drop verb.

Thus, we hypothesize that this pattern of thematic role assignments will be reflected in a differential amount of usage across the classes of the same nouns as subjects and objects for a given verb. Generally, we would expect that causative verbs (in our case, the unergative and unaccusative verbs) would have a greater degree of overlap of nouns in subject and object position than non-causative transitive verbs (in our case, the object-drop verbs). However, since the causative is a transitive use, and the transitive use of unergatives is expected to be rare (see above), we do not expect unergatives to exhibit a high degree of detectable overlap in a corpus. Thus, this overlap of subjects and objects should primarily distinguish unaccusatives (predicted to have high overlap of subjects and objects) from the other two classes (each of which is predicted to have low [detectable] overlap of subjects and objects).
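As an illustration of how such overlap might be quantified, the sketch below computes the proportion of a verb's subject and object usages whose head noun appears in both positions. The extraction of head nouns and this particular overlap measure are assumptions for exposition, not necessarily the measure used in the paper:

```python
def causativity_overlap(subjects, objects):
    """Proportion of a verb's subject/object usages whose head noun
    appears both as a subject and as an object of that verb.
    `subjects` and `objects` are lists of head nouns observed in each
    position (a hypothetical extraction step would supply them)."""
    shared = set(subjects) & set(objects)
    uses = len(subjects) + len(objects)
    if uses == 0:
        return 0.0
    overlapping = sum(1 for noun in subjects + objects if noun in shared)
    return overlapping / uses

# Unaccusative pattern: the same nouns alternate between positions.
print(causativity_overlap(["butter", "cook", "ice"], ["butter", "ice"]))  # 0.8
# Object-drop pattern: subjects and objects are disjoint.
print(causativity_overlap(["boy", "girl"], ["soccer", "piano"]))  # 0.0
```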

Animacy. Finally, considering the roles in the first and last columns of thematic assignments in Table 3, we observe that unergative and object-drop verbs assign an agentive role to their subject in both the transitive and intransitive, while unaccusatives assign an agentive role to their subject only in the transitive. Under the assumption that the intransitive use of unaccusatives is not rare, we then expect that unaccusatives will occur less often overall with an agentive subject than will the other two verb classes. (The assumption that unaccusatives are not rare in the intransitive is based on the linguistic complexity of the causative transitive alternant, and is borne out in our corpus analysis.) On the further assumption that Agents tend to be animate entities more so than Themes are, we expect that unaccusatives will occur less frequently with an animate subject compared to unergative and object-drop verbs. Note the importance of our use of frequency distributions: the claim is not that only Agents can be animate, but rather that nouns that receive an Agent role will more often be animate than nouns that receive a Theme role.
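A deliberately crude sketch of an animate-subject feature, under the illustrative assumption that personal pronouns can stand in for animacy (the actual counting procedure in the paper may differ):

```python
# Personal pronouns that (almost) always denote animate entities.
ANIMATE_PRONOUNS = {"i", "we", "you", "he", "she", "they",
                    "me", "us", "him", "her", "them"}

def animate_subject_ratio(subjects):
    """Fraction of a verb's extracted subjects that are animate
    pronouns -- a rough lower bound on animate-subject use."""
    if not subjects:
        return 0.0
    animate = sum(1 for s in subjects if s.lower() in ANIMATE_PRONOUNS)
    return animate / len(subjects)

print(animate_subject_ratio(["He", "the butter", "she", "the door"]))  # 0.5
```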

Additional Features. The above interactions between thematic roles and the syntactic expressions of arguments thus lead to three features whose distributional properties appear promising for distinguishing unergative, unaccusative, and object-drop verbs: transitivity, causativity, and animacy of subject. We also investigate two additional syntactic features: the use of the passive or active voice, and the use of the past participle or simple past part-of-speech (POS) tag (VBN or VBD in the Penn Treebank style). These features are related to the transitive/intransitive alternation, since a passive use implies a transitive use of the verb, as well as to the use of a past participle form of the verb.2

Table 4 summarizes the features we derive from the thematic properties, and our expectations concerning their frequency of use. We hypothesize that these five features will exhibit distributional differences in the observed usages of the verbs that can be used for classification. In the next section, we describe the actual corpus counts that we develop to approximate the features we have identified. (Notice that the counts will be imperfect approximations to the thematic knowledge, beyond the inevitable errors due to automatic extraction from large, automatically annotated corpora. Even when the counts are precise, they only constitute an approximation to the actual thematic notions, since the features we are using are not logically implied by the knowledge we want to capture, but only statistically correlated with it.)

3. Data Collection and Analysis

Clearly, some of the features we've proposed are difficult (e.g., the passive use) or impossible (e.g., animate subject use) to automatically extract with high accuracy from a

2 For our sample verbs, the statistical correlation between the transitive and passive features is highly significant (N = 59, R = .44, p < .001), as is the correlation between the transitive and past participle features (N = 59, R = .36, p < .005). (Since, as explained in the next section, our features are expressed as proportions (e.g., percent transitive use out of detected transitive and intransitive use), correlations of intransitivity with passive or past participle use have the same magnitude but are negative.)



Table 4
The features and expected behavior.

Feature         Expected Frequency Pattern    Explanation

Transitivity    Unerg < Unacc < ObjDrop       Unaccusatives and unergatives have a causative transitive, hence lower transitive use. Furthermore, unergatives have an agentive object, hence very low transitive use.

Causativity     Unerg, ObjDrop < Unacc        Object-drop verbs do not have a causal agent, hence low "causative" use. Unergatives are rare in the transitive, hence low causative use.

Animacy         Unacc < Unerg, ObjDrop        Unaccusatives have a Theme subject in the intransitive, hence lower use of animate subjects.

Passive Voice   Unerg < Unacc < ObjDrop       Passive implies transitive use, hence correlated with the transitive feature.

VBN Tag         Unerg < Unacc < ObjDrop       Passive implies past participle use (VBN), hence correlated with transitive (and passive).

large corpus, given the current state of annotation. However, we do assume that currently available corpora, such as the Wall Street Journal (WSJ), provide a representative and large enough sample of language from which to gather corpus counts that can approximate the distributional patterns of the verb class alternations. Our work draws on two text corpora: one, an automatically tagged combined corpus of 65 million words (primarily WSJ); the second, an automatically parsed corpus of 29 million words (a subset of the WSJ text from the first corpus). Using these corpora, we develop counting procedures that yield relative frequency distributions for approximations to the five linguistic features we have determined, over a sample of verbs from our three classes.
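As a rough sketch of how such relative-frequency features might be computed from a POS-tagged corpus: the heuristics below (a following determiner, noun, or pronoun as a transitivity cue; Penn Treebank tags) are illustrative assumptions, not the paper's exact counting procedure:

```python
from collections import Counter

def feature_distributions(tagged_tokens, verb):
    """Rough relative-frequency features for one verb, given a
    POS-tagged token stream of (word, tag) pairs with Penn tags.
    Treating 'followed by a noun phrase' as transitive is an
    illustrative simplification."""
    counts = Counter()
    for i, (word, tag) in enumerate(tagged_tokens):
        if word != verb or tag not in ("VBD", "VBN"):
            continue
        counts[tag] += 1  # simple past (VBD) vs. past participle (VBN)
        next_tag = tagged_tokens[i + 1][1] if i + 1 < len(tagged_tokens) else None
        if next_tag in ("DT", "NN", "NNS", "NNP", "PRP"):
            counts["trans"] += 1  # crude transitivity cue
    total = counts["VBD"] + counts["VBN"]
    if total == 0:
        return {}
    return {"vbn": counts["VBN"] / total, "trans": counts["trans"] / total}

sample = [("the", "DT"), ("cook", "NN"), ("melted", "VBD"),
          ("the", "DT"), ("butter", "NN"), (".", "."),
          ("the", "DT"), ("butter", "NN"), ("melted", "VBD"), (".", ".")]
print(feature_distributions(sample, "melted"))  # {'vbn': 0.0, 'trans': 0.5}
```

Expressing each count as a proportion of the verb's detected uses, rather than a raw frequency, matches the normalized feature values reported in Appendix A.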

3.1 Materials and Method
We chose a set of 20 verbs from each class, based primarily on the classification in Levin (1993).3 The complete list of verbs appears in Table 5; the group 1/group 2 designation is explained below in the section on counting. As indicated in the table, unergatives are manner-of-motion verbs (from the "run" class in Levin); unaccusatives are change-of-state verbs (from several of the classes in Levin's change-of-state super-class); while object-drop verbs were taken from a variety of classes in Levin's classification, all of which undergo the unexpressed object alternation. The most frequently used classes are verbs of change of possession, image-creation verbs, and verbs of creation and transformation. The selection of verbs was based partly on our intuitive judgment that the verbs were likely to be used with sufficient frequency in the WSJ. Also, each

3 We used an equal number of verbs from each class in order to have a balanced group of items. One potential disadvantage of this decision is that each verb class is represented equally, even though they may not be equally frequent in the corpora. Although we lose the relative frequency information among the classes that could provide a better bias for assigning a default classification (i.e., the most frequent one), we have the advantage that our classifier will be equally informed (in terms of number of exemplars) about each class.

Note that there are only 19 unaccusative verbs because ripped, which was initially counted in the unaccusatives, was then excluded from the analysis, as it occurred mostly in a very different usage in the corpus (as verb plus particle in ripped off) from the intended optionally intransitive usage.


Computational Linguistics, Volume 27, Number 3

Table 5
Verbs used in the experiments.

Class Name     Description                      Selected Verbs

Unergative     manner of motion                 jumped, rushed, marched, leaped, floated, raced, hurried, wandered, vaulted, paraded (group 1); galloped, glided, hiked, hopped, jogged, scooted, scurried, skipped, tiptoed, trotted (group 2)

Unaccusative   change of state                  opened, exploded, flooded, dissolved, cracked, hardened, boiled, melted, fractured, solidified (group 1); collapsed, cooled, folded, widened, changed, cleared, divided, simmered, stabilized (group 2)

Object-Drop    unexpressed object alternation   played, painted, kicked, carved, reaped, washed, danced, yelled, typed, knitted (group 1); borrowed, inherited, organized, rented, sketched, cleaned, packed, studied, swallowed, called (group 2)

verb presents the same form in the simple past and in the past participle (the regular "-ed" form). In order to simplify the counting procedure, we included only the "-ed" form of the verb, on the assumption that counts on this single verb form would approximate the distribution of the features across all forms of the verb. Additionally, as far as we were able given the preceding constraints, we selected verbs that could occur in the transitive and in the passive. Finally, we aimed for a frequency cut-off of 10 occurrences or more for each verb, although for unergatives we had to use one verb (jogged) that only occurred 8 times in order to have 20 verbs that satisfied the other criteria above.

In performing this kind of corpus analysis, one has to recognize the fact that current corpus annotations do not distinguish verb senses. In these counts, we did not distinguish a core sense of the verb from an extended use of the verb. So, for instance, the sentence Consumer spending jumped 1.7% in February after a sharp drop the month before (WSJ 1987) is counted as an occurrence of the manner-of-motion verb jump in its intransitive form. This particular sense extension has a transitive alternant but not a causative transitive (i.e., Consumer spending jumped the barrier but not Low taxes jumped consumer spending). Thus, while the possible subcategorizations remain the same, rates of transitivity and causativity may be different than for the literal manner-of-motion sense. This is an unavoidable result of using simple automatic extraction methods, given the current state of annotation of corpora.

For each occurrence of each verb, we counted whether it was in a transitive or intransitive use (trans), in a passive or active use (pass), in a past participle or simple past use (vbn), in a causative or non-causative use (caus), and with an animate subject or not (anim).4 Note that, except for the vbn feature, for which we simply extract the POS tag from the corpus, all other counts are approximations to the actual linguistic behaviour of the verb, as we describe in detail below.

4 One additional feature was recorded—the log frequency of the verb in the 65-million-word corpus—motivated by the conjecture that the frequency of a verb may help in predicting its class. In our machine learning experiments, however, this conjecture was not borne out, as the frequency feature did not improve performance. This is the case for experiments on all of the verbs, as well as for separate experiments on the group 1 verbs (which were matched across the classes for frequency) and the group 2 verbs (which were not). We therefore limit discussion here to the thematically motivated features.


Merlo and Stevenson, Statistical Verb Classification

The first three counts (trans, pass, vbn) were performed on the tagged ACL/DCI corpus available from the Linguistic Data Consortium, which includes the Brown Corpus (of one million words) and years 1987–1989 of the Wall Street Journal, a combined corpus in excess of 65 million words. The counts for these features proceeded as follows:

° trans: A number, a pronoun, a determiner, an adjective, or a noun was considered to be an indication of a potential object of the verb. A verb occurrence preceded by forms of the verb be, or immediately followed by a potential object, was counted as transitive; otherwise the occurrence was counted as intransitive (specifically, if the verb was followed by a punctuation sign—commas, colons, full stops—or by a conjunction, a particle, a date, or a preposition).

° pass: A main verb (i.e., tagged VBD) was counted as active. A token with tag VBN was also counted as active if the closest preceding auxiliary was have, while it was counted as passive if the closest preceding auxiliary was be.

° vbn: The counts for VBN/VBD were simply done based on the POS label within the tagged corpus.

Each of the above three counts was normalized over all occurrences of the "-ed" form of the verb, yielding a single relative frequency measure for each verb for that feature; i.e., percent transitive (versus intransitive) use, percent active (versus passive) use, and percent VBN (versus VBD) use, respectively.
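The trans heuristic and its normalization can be sketched as a short pass over POS tags. This is our illustrative reconstruction, assuming Penn Treebank tags; the tag set and function names are ours, not the authors' actual extraction patterns.

```python
# Illustrative sketch only: a crude trans count over pre-extracted tags of the
# token immediately following each "-ed" occurrence of a verb.
OBJECT_TAGS = {"CD", "PRP", "DT", "JJ", "NN", "NNS", "NNP", "NNPS"}

def is_potential_object(tag):
    """A following number, pronoun, determiner, adjective, or noun
    signals a potential direct object."""
    return tag in OBJECT_TAGS

def trans_score(following_tags):
    """Relative frequency of transitive use over all '-ed' occurrences."""
    if not following_tags:
        return 0.0
    transitive = sum(1 for tag in following_tags if is_potential_object(tag))
    return transitive / len(following_tags)

# e.g. three occurrences: two followed by a determiner or noun, one by a preposition
print(trans_score(["DT", "NN", "IN"]))
```

The pass and vbn counts would follow the same shape, keyed off the VBD/VBN tags and the nearest preceding auxiliary.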

The last two counts (caus and anim) were performed on a parsed version of the 1988 year of the Wall Street Journal, so that we could extract subjects and objects of the verbs more accurately. This corpus of 29 million words was provided to us by Michael Collins and was automatically parsed with the parser described in Collins (1997).5 The counts and their justification are described here.

° caus: As discussed above, the object of a causative transitive is the same semantic argument of the verb as the subject of the intransitive. The causative feature was approximated by the following steps, intended to capture the degree to which the subject of a verb can also occur as its object. Specifically, for each verb occurrence, the subject and object (if there was one) were extracted from the parsed corpus. The observed subjects across all occurrences of the verb were placed into one multiset of nouns, and the observed objects into a second multiset of nouns. (A multiset, or bag, was used so that our representation indicated the number of times each noun was used as either subject or object.) Then the proportion of overlap between the two multisets was calculated. We define overlap as the largest multiset of elements belonging to both the subject and the object multisets; e.g., the overlap between {a, a, a, b} and {a} is {a, a, a}. The proportion is the ratio between the cardinality of the overlap multiset and the sum of the cardinality of the subject and object multisets. For example, for the simple sets of characters above, the ratio would be 3/5, yielding a value of 60% for the caus feature.

5 Readers might be concerned about the portability of this method to languages for which no large parsed corpus is available. It is possible that using a fully parsed corpus is not necessary. Our results were replicated in English without the need for a fully parsed corpus (Anoop Sarkar, p.c., citing a project report by Wootiporn Tripasai). Our method was applied to 23 million words of the WSJ that were automatically tagged with Ratnaparkhi's maximum entropy tagger (Ratnaparkhi 1996) and chunked with the partial parser CASS (Abney 1996). The results are very similar to ours (best accuracy 66.6%), suggesting that a more accurate tagger than the one used on our corpus might in fact be sufficient to overcome the fact that no full parse is available.
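The overlap definition above is compact enough to implement directly. The sketch below is one consistent reading of it (for each noun type occurring in both multisets, keep the larger of its two counts), chosen because it reproduces the worked example; the function name is ours.

```python
from collections import Counter

def caus_score(subjects, objects):
    """Ratio of the overlap multiset's cardinality to the summed
    cardinalities of the subject and object multisets."""
    subj, obj = Counter(subjects), Counter(objects)
    # Overlap: for each noun type seen as both subject and object, keep
    # the larger of the two counts, so overlap({a,a,a,b}, {a}) = {a,a,a}.
    overlap = sum(max(subj[noun], obj[noun]) for noun in subj.keys() & obj.keys())
    total = sum(subj.values()) + sum(obj.values())
    return overlap / total if total else 0.0

print(caus_score(["a", "a", "a", "b"], ["a"]))  # 0.6, as in the worked example
```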

° anim: A problem with a feature like animacy is that it requires either manual determination of the animacy of extracted subjects, or reference to an on-line resource, such as WordNet, for determining animacy. To approximate animacy with a feature that can be extracted automatically, and without reference to a resource external to the corpus, we take advantage of the well-attested animacy hierarchy, according to which pronouns are the most animate (Silverstein 1976; Dixon 1994). The hypothesis is that the words I, we, you, she, he, and they most often refer to animate entities. This hypothesis was confirmed by extracting 100 occurrences of the pronoun they, which can be either animate or inanimate, from our 65-million-word corpus. The occurrences immediately preceded a verb. After eliminating repetitions, 94 occurrences were left, which were classified by hand, yielding 71 animate pronouns, 11 inanimate pronouns, and 12 unclassified occurrences (for lack of sufficient context to recover the antecedent of the pronoun with certainty). Thus, at least 76% of usages of they were animate; we assume the percentage of animate usages of the other pronouns to be even higher. Since the hypothesis was confirmed, we count pronouns (other than it) in subject position (Kariaeva [1999]; cf. Aone and McKee [1996]). The values for the feature were determined by automatically extracting all subject/verb tuples including our 59 example verbs from the parsed corpus, and computing the ratio of occurrences of pronoun subjects to all subjects for each verb.
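The anim count then reduces to a simple ratio over extracted subjects. A minimal sketch, assuming subjects are available as lowercased head words (the pronoun list follows the text; it is excluded):

```python
# Pronouns treated as animate, per the text; 'it' is deliberately excluded.
ANIMATE_PRONOUNS = {"i", "we", "you", "she", "he", "they"}

def anim_score(subjects):
    """Ratio of pronoun subjects (excluding 'it') to all subjects of a verb."""
    if not subjects:
        return 0.0
    pronouns = sum(1 for s in subjects if s in ANIMATE_PRONOUNS)
    return pronouns / len(subjects)

print(anim_score(["they", "investors", "she", "prices"]))  # 0.5
```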

Finally, as indicated in Table 5, the verbs are designated as belonging to "group 1" or "group 2". All the verbs are treated equally in our data analysis and in the machine learning experiments, but this designation does indicate a difference in details of the counting procedures described above. The verbs in group 1 had been used in an earlier study in which it was important to minimize noisy data (Stevenson and Merlo 1997a), so they generally underwent greater manual intervention in the counts. In adding group 2 for the classification experiment, we chose to minimize the intervention, in order to demonstrate that the classification process is robust enough to withstand the resulting noise in the data.

For group 2, the transitivity, voice, and VBN counts were done automatically, without any manual intervention. For group 1, these three counts were done automatically by regular expression patterns, and then subjected to correction, partly by hand and partly automatically, by one of the authors. For transitivity, the adjustments vary for the individual verbs. Most of the reassignments from a transitive to an intransitive labelling occurred when the following noun was not the direct object, but rather a measure phrase or a date. Most of the reassignments from intransitive to transitive occurred when a particle or a preposition following the verb did not introduce a prepositional phrase, but instead indicated a passive form (by) or was part of a phrasal verb. Some verbs were mostly used adjectivally, in which case they were excluded from the transitivity counts. For voice, the required adjustments included cases of coordination of the past participle, when the verb was preceded by a conjunction or a comma.


Table 6
Aggregated relative frequency data for the five features. E = unergatives, A = unaccusatives, O = object-drops.

                trans         pass          vbn           caus          anim
Class   N       Mean  SD      Mean  SD      Mean  SD      Mean  SD      Mean  SD
E       20      0.23  0.23    0.07  0.12    0.21  0.26    0.00  0.00    0.25  0.24
A       19      0.40  0.24    0.33  0.27    0.65  0.27    0.12  0.14    0.07  0.09
O       20      0.62  0.25    0.31  0.26    0.65  0.23    0.04  0.07    0.15  0.14

These were collected and classified by hand as passive or active, based on intuition. Similarly, partial adjustments to the VBN counts were made by hand.

For the causativity feature, subjects and objects were determined by manual inspection of the corpus for verbs belonging to group 1, while they were extracted automatically from the parsed corpus for group 2. The group 1 verbs were sampled in three ways, depending on total frequency. For verbs with less than 150 occurrences, all instances of the verbs were used for subject/object extraction. For verbs whose total frequency was greater than 150, but whose VBD frequency was in the range 100–200, we extracted subjects and objects of the VBD occurrences only. For higher frequency verbs, we used only the first 100 VBD occurrences.6 The same script for computing the overlap of the extracted subjects and objects was then used on the resulting subject/verb and verb/object tuples, for both group 1 and group 2 verbs.

The animacy feature was calculated over subject/verb tuples extracted automatically, for both groups of verbs, from the parsed corpus.

3.2 Data Analysis

The data collection described above yields the following data points in total: trans 27,403; pass 20,481; vbn 36,297; caus 11,307; anim 7,542. (Different features yield different totals because they were sampled independently, and the search patterns to extract some features are more imprecise than others.) The aggregate means by class of the normalized frequencies for all verbs are shown in Table 6; item by item distributions are provided in Appendix A, and raw counts are available from the authors. Note that aggregate means are shown for illustration purposes only—all machine learning experiments are performed on the individual normalized frequencies for each verb, as given in Appendix A.

The observed distributions of each feature are indeed roughly as expected according to the description in Section 2. Unergatives show a very low relative frequency of the trans feature, followed by unaccusatives, then object-drop verbs. Unaccusative verbs show a high frequency of the caus feature and a low frequency of the anim feature, compared to the other classes. Somewhat unexpectedly, object-drop verbs exhibit a non-zero mean caus value (almost half the verbs have a caus value greater than zero), leading to a three-way causative distinction among the verb classes. We suspect that the approximation that we used for causative use—the overlap between subjects

6 For this last set of high-frequency verbs (exploded, jumped, opened, played, rushed), we used the first 100 occurrences as the simplest way to collect the sample. In response to an anonymous reviewer's concern, we later verified that these counts were not different from counts obtained by random sampling of 100 VBD occurrences. A paired t-test of the two sets of counts (first-100 sampling and random sampling) indicates that the two sets of counts are not statistically different (t = 1.283, DF = 4, p = 0.2687).


Table 7
Manually (Man) and automatically (Aut) calculated features for a random sample of verbs. T = trans, P = pass, V = vbn, C = caus, A = anim.

       Unergative                   Unaccusative                 Object-Drop
       hopped         scurried      folded         stabilized    inherited      swallowed
       Man    Aut     Man    Aut    Man    Aut     Man    Aut    Man    Aut     Man    Aut
T      0.21   0.21    0.00   0.00   0.71   0.23    0.24   0.18   1.00   0.64    0.96   0.35
P      0.00   0.00    0.00   0.00   0.44   0.33    0.19   0.13   0.39   0.13    0.54   0.44
V      0.03   0.00    0.10   0.00   0.56   0.73    0.71   0.92   0.56   0.60    0.64   0.79
C      0.00   0.00    0.00   0.00   0.54   0.00    0.24   0.35   0.00   0.06    0.00   0.04
A      0.93   1.00    0.90   0.14   0.23   0.00    0.02   0.00   0.58   0.32    0.35   0.22

and objects for a verb—also captures a "reciprocity" effect for some object-drop verbs (such as call), in which subjects and objects can be similar types of entities. Finally, although expected to be redundant indicators of transitivity, pass and vbn, unlike trans, have very similar values for unaccusative and object-drop verbs, indicating that their distributions are sensitive to factors we have not yet investigated.

One issue we must address is how precisely the automatic counts reflect the actual linguistic behaviour of the verbs. That is, we must be assured that the patterns we note in the data in Table 6 are accurate reflections of the differential behaviour of the verb classes, and not an artifact of the way in which we estimate the features, or a result of inaccuracies in the counts. In order to evaluate the accuracy of our feature counts, we selected two verbs from each class and determined the "true" value of each feature for each of those six verbs through manual counting. The six verbs were randomly selected from the group 2 subset of the verbs, since counts for group 2 verbs (as explained above) had not undergone manual correction. This allows us to determine the accuracy of the fully automatic counting procedures. The selected verbs (and their frequencies) are: hopped (29), scurried (21), folded (189), stabilized (286), inherited (357), swallowed (152). For verbs that had a frequency of over 100 in the "-ed" form, we performed the manual counts on the first 100 occurrences.

Table 7 shows the results of the manual counts, reported as proportions to facilitate comparison to the normalized automatic counts, shown in adjoining columns. We observe first that, overall, most errors in the automatic counts occur in the unaccusative and object-drop verbs. While tagging errors affect the vbn feature for all of the verbs somewhat, we note that trans and pass are consistently underestimated for unaccusative and object-drop verbs. These errors make the unaccusative and object-drop feature values more similar to each other, and therefore potentially harder to distinguish. Furthermore, because the trans and pass values are underestimated by the automatic counts, and therefore lower in value, they are also closer to the values for the unergative verbs. For the caus feature, we predict the highest values for the unaccusative verbs, and while that prediction is confirmed, the automatic counts for that class also show the most errors. Finally, although the general pattern of higher values for the anim feature of unergatives and object-drop verbs is preserved in the automatic counts, the feature is underestimated for almost all the verbs, again making the values for that feature closer across the classes than they are in reality.

We conclude that although there are inaccuracies in all the counts, the general patterns expected based on our analysis of the verb classes hold in both the manual and automatic counts. Errors in the estimating and counting procedures are therefore


not likely to be responsible for the pattern of data in Table 6 above, which generally matches our predictions. Furthermore, the errors, at least for this random sample of verbs, occur in a direction that makes our task of distinguishing the classes more difficult, and indicates that developing more accurate search patterns may possibly sharpen the class distinctions and improve the classification performance.

4. Experiments in Classification

In this section, we turn to our computational experiments that investigate whether the statistical indicators of thematic properties that we have developed can in fact be used to classify verbs. Recall that the task we have set ourselves is that of automatically learning the best class for a set of usages of a verb, as opposed to classifying individual occurrences of the verb. The frequency distributions of our features yield a vector for each verb that represents the estimated values for the verb on each dimension, across the entire corpus.

Vector template: [verb-name, trans, pass, vbn, caus, anim, class]

Example: [opened, 69, 09, 21, 16, 36, unacc]

The resulting set of 59 vectors constitutes the data for our machine learning experiments. We use this data to train an automatic classifier to determine, given the feature values for a new verb (not from the training set), which of the three major classes of English optionally intransitive verbs it belongs to.

4.1 Experimental Methodology

In pilot experiments on a subset of the features, we investigated a number of supervised machine learning methods that produce automatic classifiers (decision tree induction, rule learning, and two types of neural networks), as well as hierarchical clustering; see Stevenson et al. (1999) for more detail. Because we achieved approximately the same level of performance in all cases, we narrowed our further experimentation to the publicly available version of the C5.0 machine learning system (http://www.rulequest.com), a newer version of C4.5 (Quinlan 1992), due to its ease of use and wide availability. The C5.0 system generates both decision trees and corresponding rule sets from a training set of known classifications. In our experiments, we found little to no difference in performance between the trees and rule sets, and report only the rule set results.

In the experiments below, we follow two methodologies in training and testing, each of which tests a subset of cases held out from the training data. Thus, in all cases, the results we report are on test data that was never seen in training.7

The first training and testing methodology we follow is 10-fold cross-validation. In this approach, the system randomly divides the data into ten parts, and runs ten times on a different 90% training data / 10% test data split, yielding an average accuracy and standard error across the ten test sets. This training methodology is very useful for

7 One anonymous reviewer raised the concern that we do not test on verbs that were unseen by the authors prior to finalizing the specific features to count. However, this does not reduce the generality of our results. The features we use are motivated by linguistic theory and derived from the set of thematic properties that discriminate the verb classes. It is therefore very unlikely that they are skewed to the particular verbs we have chosen. Furthermore, our cross-validation experiments, described in the next subsection, show that our results hold across a very large number of randomly selected subsets of this sample of verbs.


our application, as it yields performance measures across a large number of training data/test data sets, avoiding the problems of outliers in a single random selection from a relatively small data set such as ours.

The second methodology is a single hold-out training and testing approach. Here the system is run N times, where N is the size of the data set (i.e., the 59 verbs in our case), each time holding out a single data vector as the test case and using the remaining N − 1 vectors as the training set. The single hold-out methodology yields an overall accuracy rate (when the results are averaged across all N trials), but also—unlike cross-validation—gives us classification results on each individual data vector. This property enables us to analyze differential performance on the individual verbs, and across the different verb classes.
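The single hold-out loop itself is straightforward. The sketch below illustrates the protocol with a toy nearest-centroid classifier standing in for C5.0; the data, classifier choice, and all names are ours, purely for illustration.

```python
import math

def nearest_centroid_predict(train, test_vec):
    """train: list of (feature_vector, class_label) pairs.
    Predicts the label of the class whose mean vector is closest."""
    by_class = {}
    for vec, label in train:
        by_class.setdefault(label, []).append(vec)
    best_label, best_dist = None, float("inf")
    for label, vecs in by_class.items():
        centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
        d = math.dist(centroid, test_vec)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

def single_hold_out_accuracy(data):
    """Hold out each vector in turn, train on the rest, score the prediction."""
    correct = sum(
        nearest_centroid_predict(data[:i] + data[i + 1:], vec) == label
        for i, (vec, label) in enumerate(data)
    )
    return correct / len(data)

# Made-up, well-separated two-feature vectors, just to exercise the loop.
toy = [([0.2, 0.0], "unerg"), ([0.25, 0.05], "unerg"),
       ([0.6, 0.3], "objdrop"), ([0.65, 0.35], "objdrop")]
print(single_hold_out_accuracy(toy))  # 1.0 on this separable toy data
```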

Under both training and testing methodologies, the baseline (chance) performance in this task—a three-way classification—is 33.9%. In the single hold-out methodology, there are 59 test cases, with 20, 19, and 20 verbs each from the unergative, unaccusative, and object-drop classes, respectively. Chance performance of picking a single class label as a default and assigning it to all cases would yield at most 20 out of the 59 cases correct, or 33.9%. For the cross-validation methodology, the determination of a baseline is slightly more complex, as we are testing on a random selection of 10% of the full data set in each run. The 33.9% figure represents the expected relative proportion of a test set that would be labelled correctly by assignment of a default class label to the entire test set. Although the precise make-up of the test cases varies, on average the test set will represent the class membership proportions of the entire set of verbs. Thus, as with the single hold-out approach, chance accuracy corresponds to a maximum of 20/59, or 33.9%, of the test set being labelled correctly.

The theoretical maximum accuracy for the task is of course 100%, although in Section 5 we discuss some classification results from human experts that indicate that a more realistic expectation is much lower (around 87%).

4.2 Results Using 10-Fold Cross-Validation

We first report the results of experiments using a training methodology of 10-fold cross-validation repeated 50 times. This means that the 10-fold cross-validation procedure is repeated for 50 different random divisions of the data. The numbers reported are the averages of the results over all the trials. That is, the average accuracy and standard error from each random division of the data (a single cross-validation run, including 10 training and test sets) are averaged across the 50 different random divisions. This large number of experimental trials gives us a very tight bound on the mean accuracy reported, enabling us to determine with high confidence the statistical significance of differences in results.
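Procedurally, repeated 10-fold cross-validation amounts to reshuffling, splitting into ten folds, and averaging. A stdlib-only sketch of the bookkeeping, where the `evaluate` callback stands in for a C5.0 train-and-test run and all names are ours:

```python
import random
from statistics import mean, stdev

def ten_fold_splits(items, n_folds=10, seed=0):
    """Yield (train, test) pairs for one random division into n_folds parts."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def repeated_cv(items, evaluate, repeats=50):
    """Average accuracy (and spread) over `repeats` random divisions."""
    per_division = [
        mean(evaluate(train, test) for train, test in ten_fold_splits(items, seed=r))
        for r in range(repeats)
    ]
    return mean(per_division), stdev(per_division)
```

With 59 verbs, each division yields nine test folds of six verbs and one of five.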

Table 8 shows that performance of classification using individual features varies greatly, from little above the baseline to almost 22% above the baseline—or a reduction of a third of the error rate, a very good result for a single feature. (All reported accuracies in Table 8 are statistically distinct at the p < .01 level, using an ANOVA [df = 249, F = 33472] with a Tukey-Kramer post test.)

The first line of Table 9 shows that the combination of all features achieves an accuracy of 69.8%, which is 35.9% over the baseline, for a reduction in the error rate of 54%. This is a rather considerable result, given the very low baseline (33.9%). Moreover, recall that our training and testing sets are always disjoint (cf. Lapata and Brew [1999]; Siegel [1999]); in other words, we are predicting the classification of verbs that were never seen in the training corpus, the hardest situation for a classification algorithm.

The second through sixth lines of Table 9 show the accuracy achieved on each subset of features that results from removing a single feature. This allows us to evaluate


Table 8
Percent accuracy and standard error of the verb classification task, using each feature individually, under a training methodology of 10-fold cross-validation repeated 50 times.

Feature   Accuracy   SE
caus      55.7       .1
vbn       52.5       .5
pass      50.2       .5
trans     47.1       .4
anim      35.3       .5

Table 9
Percent accuracy and standard error of the verb classification task, using features in combination, under a training methodology of 10-fold cross-validation repeated 50 times.

     Features Used              Feature Not Used   Accuracy   SE
1    trans pass vbn caus anim                      69.8       .5
2    trans vbn caus anim        pass               69.8       .5
3    trans pass vbn anim        caus               67.3       .6
4    trans pass caus anim       vbn                66.5       .5
5    trans pass vbn caus        anim               63.2       .6
6    pass vbn caus anim         trans              61.6       .6

the contribution of each feature to the performance of the classification process, by comparing the performance of the subset without it to the performance using the full set of features. We see that the removal of pass (second line) has no effect on the results, while removal of the remaining features yields a 2–8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant at the p < .05 level, except for between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 3752] with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted by the individual feature behavior. For example, caus, which is the best individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, anim and trans, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.

The general behaviour in classification based on individual features, and on size 4 and size 5 subsets of features, is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance; Table 10 summarizes this information. In the first data column, the table illustrates the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size n subsets in which each feature occurs. For example, the second data cell in the second row (54.9%) indicates the average accuracy of all subsets of size 2 that contain the feature vbn. The last row of the table indicates the average accuracy, for each feature, of all subsets containing that feature.


Table 10
Average percent accuracy of feature subsets, by subset size and by sets of each size including each feature.

Subset   Mean Accuracy    Mean Accuracy of Subsets that Include Each Feature
Size     by Subset Size   vbn     pass    trans   anim    caus
1        48.2             52.5    50.2    47.1    35.3    55.7
2        55.1             54.9    52.8    56.4    58.0    57.6
3        60.5             60.1    58.5    62.3    61.1    60.5
4        65.7             65.5    64.7    66.7    66.3    65.3
5        69.8             69.8    69.8    69.8    69.8    69.8

Mean Acc/Feature          60.6    59.2    60.5    58.1    61.8

The first observation—that more features perform better—is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that, on average, larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation—that the performance of individual features is not always a predictor of their performance in combination—is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above average performance for all subsets of size 2 or greater.

4.3 Results Using Single Hold-Out Methodology

One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system, but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case, and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, which we turn to in Section 4.4.

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance—in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being


Table 11
Percent accuracy of the verb classification task, using features in combination, under a single hold-out training methodology.

     Features Used              Feature Not Used   Accuracy on All Verbs
1    trans pass vbn caus anim                      69.5
2    trans vbn caus anim        pass               71.2
3    trans pass vbn anim        caus               62.7
4    trans pass caus anim       vbn                61.0
5    trans pass vbn caus        anim               61.0
6    pass vbn caus anim         trans              64.4

Table 12
F score of classification within each class, under a single hold-out training methodology.

     Features Used              Feature Not Used   F (%) Unergs   F (%) Unaccs   F (%) Objdrops
1    trans pass vbn caus anim                      73.9           68.6           64.9
2    trans vbn caus anim        pass               76.2           75.7           61.6
3    trans pass vbn anim        caus               65.1           60.0           62.8
4    trans pass caus anim       vbn                66.7           65.0           51.3
5    trans pass vbn caus        anim               72.7           47.0           60.0
6    pass vbn caus anim         trans              78.1           51.5           61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5–8% negative effect on performance, again similar to the cross-validation results. (Although note that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9%) than the unaccusative and object-drop verbs (F scores of 68.6% and 64.9%, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs; that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results, there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence, here we use the balanced F score, which calculates an overall measure of performance as 2PR / (P + R), in which P (precision) is truePositives / (truePositives + falsePositives), and R (recall) is truePositives / (truePositives + falseNegatives).
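The footnote's formulas translate directly into code; a minimal sketch:

```python
def f_score(tp, fp, fn):
    """Balanced F score for one class: F = 2PR / (P + R), where
    P = tp / (tp + fp) and R = tp / (tp + fn)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For the accuracy measure used earlier, fp is zero for every verb, so the F score collapses to the simple proportion of correctly classified verbs, as the footnote notes.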


Computational Linguistics Volume 27 Number 3

frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8), but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.
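The frequency grouping used in this analysis is easy to make concrete. The (frequency, was_error) pairs below are invented for illustration; only the band boundaries come from the text.

```python
import math

def frequency_band(freq):
    """Assign a verb to a log-frequency band: < 2, 2-3, or > 3
    (i.e., raw frequency below 100, 100-1000, or above 1000)."""
    logf = math.log10(freq)
    if logf < 2:
        return "low"
    return "mid" if logf <= 3 else "high"

def error_rate_by_band(verbs):
    """verbs: (frequency, was_error) pairs; return the proportion of
    misclassified verbs in each frequency band."""
    totals, errors = {}, {}
    for freq, was_error in verbs:
        band = frequency_band(freq)
        totals[band] = totals.get(band, 0) + 1
        errors[band] = errors.get(band, 0) + int(was_error)
    return {band: errors[band] / totals[band] for band in totals}

# Invented data: four verbs with corpus frequencies and error flags.
rates = error_rate_by_band([(50, True), (50, False), (500, False), (5000, True)])
```

Run over the real 59 verbs, this yields exactly the 29%/28%/40% error proportions reported above.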

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification

We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans pass vbn caus anim) with the class labels assigned using each size 4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature, in going from the four-feature to the five-feature set. Thus we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2, and summarized here in Table 13.

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors, but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently "uncontrolled", and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.



Table 13
Expected class discriminations for each feature.

Feature          Expected Frequency Pattern

Transitivity     Unerg < Unacc < ObjDrop
Causativity      Unerg, ObjDrop < Unacc
Animacy          Unacc < Unerg, ObjDrop
Passive Voice    Unerg < Unacc < ObjDrop
VBN Tag          Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features compared to two of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops.

                            Assigned Class

              All features      w/o caus         w/o anim

               E    A    O      E    A    O      E    A    O

Actual    E    -    1    2      -    4    2      -    2    2
Class     A    4    -    3      5    -    2      5    -    6
          O    5    3    -      4    5    -      3    5    -

Table 15
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops.

                                  Assigned Class

              All features    w/o trans       w/o pass        w/o vbn

               E   A   O      E   A   O       E   A   O       E   A   O

Actual    E    -   1   2      -   2   2       -   1   3       -   1   5
Class     A    4   -   3      3   -   7       1   -   4       2   -   4
          O    5   3   -      2   5   -       5   3   -       4   6   -

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes, in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
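This confusability measure is simple to compute from a confusion matrix. As an illustration, the counts below are the off-diagonal cells of the all-features panel of Table 14; the dictionary layout is our own.

```python
def confusability(confusion, class_a, class_b):
    """Errors confusing two classes: verbs of class_a labelled class_b,
    plus verbs of class_b labelled class_a (the two cells opposite the
    diagonal of the confusion matrix)."""
    return confusion[class_a][class_b] + confusion[class_b][class_a]

# Error counts for the all-features experiment (Table 14), indexed as
# confusion[actual][assigned]; diagonal cells (correct labels) set to 0.
all_features = {
    "E": {"E": 0, "A": 1, "O": 2},
    "A": {"E": 4, "A": 0, "O": 3},
    "O": {"E": 5, "A": 3, "O": 0},
}
```

With all five features, this gives confusabilities of 5 (E-A), 6 (A-O), and 7 (E-O) errors, the figures discussed in the analysis that follows.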

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features



we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature, compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives, and between unaccusatives and object-drops, is 9 and 7 errors, respectively; but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature, compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5, and of unaccusatives and object-drops from 11 errors to 6. Again, confusability of unergatives and object-drops is worse, with an increase in errors of 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated to trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans (that is, also making a three-way distinction among the classes), although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method to evaluating our approach to verb classification.



Table 16
Percent agreement (Agr) and pair-wise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

         Program         E1             E2             E3

         Agr     K       Agr    K      Agr     K      Agr    K

E1       59      .36
E2       68      .50     75     .59
E3       66      .49     70     .53    77      .66
Levin    69.5    .54     71     .56    86.5    .80    83     .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult, and not likely to be performed at 100% accuracy even by experts; it is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results (provided by the non-forced-choice task), where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from Chicago University Press. The verbs were to be classified into the three target classes (unergative, unaccusative, and object-drop), which were described in the instructions.10 (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject, and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training (which yields an accuracy of 69.5%), because those results provide the classification for each of the individual verbs.



an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55–57) (using the z distribution to determine significance; p < .0001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance, to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
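For illustration, here is a textbook two-rater Kappa: observed agreement corrected by the chance agreement implied by each rater's label proportions. Note that the paper computes K following Klauer (1987), whose estimator and significance test may differ in detail from this sketch.

```python
def kappa(labels_a, labels_b):
    """K = (Po - Pe) / (1 - Pe): observed agreement Po, corrected by the
    chance agreement Pe implied by each rater's label proportions."""
    n = len(labels_a)
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_exp = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in set(labels_a) | set(labels_b)
    )
    return (p_obs - p_exp) / (1 - p_exp)
```

Because Pe depends on each rater's label proportions, two pairs of raters with identical percent agreement can receive different Kappa values, which is exactly the point made above.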

Evaluating the experts' performance, summarized in Table 16, we can remark two things, which confirm our expectations. First, the task is difficult; i.e., it is not performed at 100% (or close) even by trained experts when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .60) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
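The majority-vote combination is straightforward to implement; the verb judgments below are hypothetical examples, not the experts' actual responses.

```python
from collections import Counter

def majority_label(expert_labels, fallback):
    """Label assigned by at least two of the three experts; when all
    three disagree, fall back to the most accurate expert's label,
    as in the combination described in the text."""
    label, count = Counter(expert_labels).most_common(1)[0]
    return label if count >= 2 else fallback

# Two hypothetical verbs: one with a 2-1 majority, one with no majority.
agreed = majority_label(["unerg", "unerg", "unacc"], fallback="objdrop")
tied = majority_label(["unerg", "unacc", "objdrop"], fallback="objdrop")
```

Applied to all 59 verbs, the fallback is needed only for the three verbs on which the experts split three ways.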

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task (86.5%) as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than one of the human experts, in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.

One interesting question is whether experts and program disagree on the same verbs, and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard. However, in 9 cases, at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3), and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This



clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1000), and those with log frequency over 3 (i.e., frequency over 1000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops); the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops); and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification, nor the accuracy of the expert that had the best agreement with Levin, was linearly affected by frequency. For the majority vote: verbs with frequency less than 100 yield an accuracy of 92%, K = .84; verbs with frequency between 100 and 1000, accuracy 80%, K = .69; and verbs with frequency over 1000, accuracy 90%, K = .82. For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5%, K = .74; verbs with frequency between 100 and 1000, accuracy 84%, K = .76; and verbs with frequency over 1000, accuracy 90%, K = .82.

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs, taken from Levin's (1993) electronic index as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = .64) with each other, and 86% (K = .80) and 69% (K = .57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = .56] with each other, and 68% [K = .55] and 60.5% [K = .46] with the gold standard, respectively.) These results show that the forced-



choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for the computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally in syntactic alternations, giving rise to differences in subcategorization frames, or in the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus, and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs (their argument structure) are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989] and Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996), and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a



"change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples, meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998; Siegel 1999). Our work, in particular, underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993], Sanfilippo and Poznanski [1992], Manning [1993], Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary, reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed, which can work



on a rough collection of texts, and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work, and the work of others, support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition, or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly, as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE, and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes—more similar to our situation, in that subcategorization alone cannot distinguish the classes—they achieve 83.9% accuracy compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multi-lingual texts. Like ours, their work goes beyond statistics over subcategorizations to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.
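The error-rate reductions quoted here follow arithmetically from the accuracy and baseline figures. As a quick check (our own sketch, not code from any of the cited systems):

```python
def error_rate_reduction(accuracy, baseline):
    """Proportion of the baseline error eliminated by a model,
    with accuracy and baseline given as percentages."""
    baseline_error = 100.0 - baseline
    model_error = 100.0 - accuracy
    return (baseline_error - model_error) / baseline_error

# Lapata and Brew (1999), verbs disambiguated by subcategorization frame:
print(round(error_rate_reduction(91.8, 65.7), 2))  # -> 0.76
# Verbs whose subcategorization is shared across classes:
print(round(error_rate_reduction(83.9, 61.3), 2))  # -> 0.58
```

Both reductions reproduce the figures reported in the text.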


Merlo and Stevenson Statistical Verb Classification

Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45% with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
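The F-score we compute for Schulte im Walde's results is the standard balanced F-measure, the harmonic mean of precision and recall. The calculation, shown for transparency:

```python
def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Schulte im Walde (2000), best reported configuration (P = 61%, R = 36%):
print(round(100 * f_score(0.61, 0.36)))  # -> 45
```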

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.
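As an illustration of the underlying idea (not McCarthy's actual measure, which uses probability distributions over WordNet classes), the simpler head-word variant can be sketched as a similarity between the head-word distributions observed in two argument slots: a verb whose intransitive-subject and transitive-object slots attract similar heads is a candidate alternator. The counts and the cosine measure below are invented for the example:

```python
from collections import Counter
from math import sqrt

def cosine(c1, c2):
    """Cosine similarity between two head-word count distributions."""
    dot = sum(c1[w] * c2[w] for w in c1)  # Counter returns 0 for absent words
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

# Hypothetical head-word counts for two slots of "melt" (a causative alternator):
intrans_subj = Counter({"butter": 5, "ice": 4, "snow": 2})
trans_obj = Counter({"butter": 3, "ice": 2, "chocolate": 1})
# ...and for "play", which has no causative alternation:
play_intrans_subj = Counter({"boy": 4, "girl": 3, "team": 2})
play_trans_obj = Counter({"soccer": 5, "music": 2})

# Alternating slots share heads; non-alternating slots do not:
print(cosine(intrans_subj, trans_obj) > cosine(play_intrans_subj, play_trans_obj))  # -> True
```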

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words—i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and Lapata and Brew (1999) in attempting to learn semantic notions from distributions of

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.



indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators—trans, vbn, pass—are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators—caus and anim—can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification—using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions—frequencies of the transitive use and animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature) or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification—i.e., sets of verbs that have the same multi-way classifications according to Levin (1993)—can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction or of the words in the construction). Thus, the kind of corpus investigation we are advocating—founded on in-depth linguistic analysis—holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated—in our approach, it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of



the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered) using only five features that are readily extractable from automatically annotated corpora.
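Both reductions follow from the accuracy (69.8%), baseline (33.9%), and expert upper bound (86.5%) reported earlier; the second measures error relative to the expert ceiling rather than perfect accuracy. A sketch of the arithmetic:

```python
def reduction(accuracy, baseline, ceiling=100.0):
    """Error-rate reduction relative to a chance baseline, optionally
    measured against an upper bound (ceiling) instead of 100% accuracy."""
    return (accuracy - baseline) / (ceiling - baseline)

print(round(100 * reduction(69.8, 33.9)))        # -> 54
print(round(100 * reduction(69.8, 33.9, 86.5)))  # -> 68
```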

Acknowledgments
We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM), the United States NSF (grants 9702331 and 9818322 to SS), the Canadian NSERC (grant to SS), the University of Toronto, and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15, University of Sussex, Brighton.
Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.
Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.
Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.
Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.
Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.
Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.
Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.
Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.
Carletta, Jean. 1996. Assessing agreement on classification tasks: The Kappa statistic. Computational Linguistics, 22(2):249–254.
Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.
Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.
Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal. Université de Montréal.
Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.
Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.
Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.
Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.
Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.
Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.
Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.
Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.
Jakobson, Roman. 1971. Signe Zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211–219.
Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.
Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.
Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal. Université de Montréal.
Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.
Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.
Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.
Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.
Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.
Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.
McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.
McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal. Université de Montréal.
Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.
Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.
Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.
Moravcsik, Edith and Jessica Wirth. 1983. Markedness—an Overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.
Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.
Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1–2):217–222.
Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.
Pinker, Steven. 1989. Learnability and Cognition: the Acquisition of Argument Structure. MIT Press, Cambridge, MA.
Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.
Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.
Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.
Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.
Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.
Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.
Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.
Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.
Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD. University of Maryland.
Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.
Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.
Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.
Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.
Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.
Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.
Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.
Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.
Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.



Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs
            Freq   vbn  pass  trans  caus  anim

Min Value      8  0.00  0.00   0.00  0.00  0.00
Max Value   4088  1.00  0.39   0.74  0.00  1.00

floated      176  0.43  0.26   0.74  0.00  0.17
hurried       86  0.40  0.31   0.50  0.00  0.37
jumped      4088  0.09  0.00   0.03  0.00  0.20
leaped       225  0.09  0.00   0.05  0.00  0.13
marched      238  0.10  0.01   0.09  0.00  0.12
paraded       33  0.73  0.39   0.46  0.00  0.50
raced        123  0.01  0.00   0.06  0.00  0.15
rushed       467  0.22  0.12   0.20  0.00  0.10
vaulted       54  0.00  0.00   0.41  0.00  0.03
wandered      67  0.02  0.00   0.03  0.00  0.32
galloped      12  1.00  0.00   0.00  0.00  0.00
glided        14  0.00  0.00   0.08  0.00  0.50
hiked         25  0.28  0.12   0.29  0.00  0.40
hopped        29  0.00  0.00   0.21  0.00  1.00
jogged         8  0.29  0.00   0.29  0.00  0.33
scooted       10  0.00  0.00   0.43  0.00  0.00
scurried      21  0.00  0.00   0.00  0.00  0.14
skipped       82  0.22  0.02   0.64  0.00  0.16
tiptoed       12  0.17  0.00   0.00  0.00  0.00
trotted       37  0.19  0.17   0.07  0.00  0.18

Unaccusative Verbs
            Freq   vbn  pass  trans  caus  anim

Min Value     13  0.16  0.00   0.02  0.00  0.00
Max Value   5543  0.95  0.80   0.76  0.41  0.36

boiled        58  0.92  0.70   0.42  0.00  0.00
cracked      175  0.61  0.19   0.76  0.02  0.14
dissolved    226  0.51  0.58   0.71  0.05  0.11
exploded     409  0.34  0.02   0.66  0.37  0.04
flooded      235  0.47  0.57   0.44  0.04  0.03
fractured     55  0.95  0.76   0.51  0.00  0.00
hardened     123  0.92  0.55   0.56  0.12  0.00
melted        70  0.80  0.44   0.02  0.00  0.19
opened      3412  0.21  0.09   0.69  0.16  0.36
solidified    34  0.65  0.21   0.68  0.00  0.12
collapsed    950  0.16  0.00   0.16  0.01  0.02
cooled       232  0.85  0.21   0.29  0.13  0.11
folded       189  0.73  0.33   0.23  0.00  0.00
widened     1155  0.18  0.02   0.13  0.41  0.01
changed     5543  0.73  0.23   0.47  0.22  0.08
cleared     1145  0.58  0.40   0.50  0.31  0.06
divided     1539  0.93  0.80   0.17  0.10  0.05
simmered      13  0.83  0.00   0.09  0.00  0.00
stabilized   286  0.92  0.13   0.18  0.35  0.00



Object-Drop Verbs
            Freq   vbn  pass  trans  caus  anim

Min Value     39  0.10  0.04   0.21  0.00  0.00
Max Value  15063  0.95  0.99   1.00  0.24  0.42

carved       185  0.85  0.66   0.98  0.00  0.00
danced        88  0.22  0.14   0.37  0.00  0.00
kicked       308  0.30  0.18   0.97  0.00  0.33
knitted       39  0.95  0.99   0.93  0.00  0.00
painted      506  0.72  0.18   0.71  0.00  0.38
played      2689  0.38  0.16   0.24  0.00  0.00
reaped       172  0.56  0.05   0.90  0.00  0.22
typed         57  0.81  0.74   0.81  0.00  0.00
washed       137  0.79  0.60   1.00  0.00  0.00
yelled        74  0.10  0.04   0.38  0.00  0.00
borrowed    1188  0.77  0.15   0.60  0.13  0.19
inherited    357  0.60  0.13   0.64  0.06  0.32
organized   1504  0.85  0.38   0.65  0.18  0.07
rented       232  0.72  0.22   0.61  0.00  0.42
sketched      44  0.67  0.17   0.44  0.00  0.20
cleaned      160  0.83  0.47   0.21  0.05  0.21
packed       376  0.84  0.12   0.40  0.05  0.19
studied      901  0.66  0.17   0.57  0.05  0.11
swallowed    152  0.79  0.44   0.35  0.04  0.22
called     15063  0.56  0.22   0.72  0.24  0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.

Accuracy  SE   Features                    Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim    57.3      0.5  trans caus
69.8      0.5  trans vbn caus anim         57.3      0.5  pass vbn anim
67.3      0.6  trans pass vbn anim         56.7      0.5  pass caus anim
66.7      0.5  trans vbn anim              55.7      0.5  vbn caus
66.5      0.5  trans pass caus anim        55.7      0.1  caus
64.4      0.5  trans vbn caus              55.4      0.4  pass caus
63.2      0.6  trans pass vbn caus         55.0      0.6  trans pass vbn
63.0      0.5  trans pass anim             54.7      0.4  trans pass
62.9      0.4  trans caus anim             54.2      0.5  trans vbn
62.1      0.5  caus anim                   52.5      0.5  vbn
61.7      0.5  trans pass caus             50.9      0.5  pass anim
61.6      0.6  pass vbn caus anim          50.2      0.6  pass vbn
60.1      0.4  vbn caus anim               50.2      0.5  pass
59.5      0.6  trans anim                  47.1      0.4  trans
59.4      0.5  vbn anim                    35.3      0.5  anim
57.4      0.6  pass vbn caus
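The overlap test can be carried out mechanically. A small sketch using two rows of the table (taking the standard errors as 0.5 and 0.6, with t = 2.01 for df = 49, as in the text):

```python
def confidence_interval(accuracy, se, t=2.01):
    """95% confidence range: accuracy +/- t * standard error."""
    return (accuracy - t * se, accuracy + t * se)

def significantly_different(a, se_a, b, se_b):
    """True if the two 95% confidence ranges do not overlap."""
    lo_a, hi_a = confidence_interval(a, se_a)
    lo_b, hi_b = confidence_interval(b, se_b)
    return hi_a < lo_b or hi_b < lo_a

# Full feature set vs. the subset lacking caus:
print(significantly_different(69.8, 0.5, 67.3, 0.6))  # -> True
# The two top-scoring subsets:
print(significantly_different(69.8, 0.5, 69.8, 0.5))  # -> False
```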




we show how the analysis of thematic distinctions enables us to determine lexical properties that we hypothesize will exhibit useful, detectable frequency differences in our corpora, and thus serve as the machine learning features for our classification experiments.

2.1 The Argument Structure Distinctions
The verb classes are exemplified below, in sentences repeated from Table 1 for ease of exposition.

Unergative (1a) The horse raced past the barn

(1b) The jockey raced the horse past the barn

Unaccusative (2a) The butter melted in the pan

(2b) The cook melted the butter in the pan

Object-Drop (3a) The boy played

(3b) The boy played soccer

The example sentences illustrate that all three classes participate in a diathesis alternation that relates a transitive and intransitive form of the verb. However, according to Levin (1993), each class exhibits a different type of diathesis alternation, which is determined by the particular semantic relations of the arguments to the verb. We make these distinctions explicit by drawing on a standard notion of thematic role, as each class has a distinct pattern of thematic assignments (i.e., different argument structures).

We assume here that a thematic role is a label taken from a fixed inventory of grammaticalized semantic relations; for example, an Agent is the doer of an action, and a Theme is the entity undergoing an event (Gruber 1965). While admitting that such notions as Agent and Theme lack formal definitions (in our work, and in the literature more widely), the distinctions are clear enough to discriminate our three verb classes. For our purposes, these roles can simply be thought of as semantic labels which are non-decomposable, but there is nothing in our approach that rests on this assumption. Thus, our approach would also be compatible with a feature-based definition of participant roles, as long as the features capture such general distinctions as, for example, the doer of an action and the entity acted upon (Dowty 1991).

Note that, in our focus on verb class distinctions, we have not considered finer-grained features that rely on more specific semantic features, such as, for example, that the subject of the intransitive melt must be something that can change from solid to liquid. While this type of feature may be important for semantic distinctions among individual verbs, it thus far seems irrelevant to the level of verb classification that we adopt, which groups verbs more broadly according to syntactic and (somewhat coarser-grained) semantic properties.

Our analysis of thematic assignment—which was summarized in Table 2, repeated here as Table 3—is elaborated here for each verb class. The sentences in (1) above illustrate the relevant alternants of an unergative verb, race. Unergatives are intransitive action verbs whose transitive form, as in (1b), can be the causative counterpart of the intransitive form (1a). The type of causative alternation that unergatives participate in is the "induced action alternation," according to Levin (1993). For our thematic analysis, we note that the subject of an intransitive activity verb is specified to be an Agent. The subject of the transitive form is indicated by the label Agent of Causation, which indicates that the thematic role assigned to the subject is marked as the role which is

Computational Linguistics Volume 27 Number 3

Table 3
Summary of thematic assignments

                                Transitive                    Intransitive
Classes         Subject                 Object        Subject
Unergative      Agent (of Causation)    Agent         Agent
Unaccusative    Agent (of Causation)    Theme         Theme
Object-Drop     Agent                   Theme         Agent

introduced with the causing event. In a causative alternation, the semantic argument of the subject of the intransitive surfaces as the object of the transitive (Brousseau and Ritter 1991; Hale and Keyser 1993; Levin 1993; Levin and Rappaport Hovav 1995). For unergatives, this argument is an Agent, and thus the alternation yields an object in the transitive form that receives an Agent thematic role (Cruse 1972). These thematic assignments are shown in the first row of Table 3.

The sentences in (2) illustrate the corresponding forms of an unaccusative verb, melt. Unaccusatives are intransitive change-of-state verbs, as in (2a); the transitive counterpart for these verbs also exhibits a causative alternation, as in (2b). This is the "causative/inchoative alternation" (Levin 1993). Like unergatives, the subject of a transitive unaccusative is marked as the Agent of Causation. Unlike unergatives, though, the alternating argument of an unaccusative (the subject of the intransitive form that becomes the object of the transitive) is an entity undergoing a change of state without active participation, and is therefore a Theme. The resulting pattern of thematic assignments is indicated in the second row of Table 3.

The sentences in (3) use an object-drop verb, play. These are activity verbs that exhibit a non-causative diathesis alternation in which the object is simply optional. This is dubbed "the unexpressed object alternation" (Levin 1993), and has several subtypes that we do not distinguish here. The thematic assignment for these verbs is simply Agent for the subject (in both transitive and intransitive forms) and Theme for the optional object; see the last row of Table 3.

For further details and support of this analysis, please see the discussion in Stevenson and Merlo (1997b) and Merlo and Stevenson (2000b). For our purposes here, the important fact to note is that each of the three classes can be uniquely identified by the pattern of thematic assignments across the two alternants of the verbs.

2.2 Features for Automatic Classification

Our next task, then, is to derive from these thematic patterns useful features for automatically classifying the verbs. In what follows, we refer to the columns of Table 3 to explain how we expect the thematic distinctions to give rise to distributional properties which, when appropriately approximated through corpus counts, will discriminate across the three classes.

Transitivity. Consider the first two columns of thematic roles in Table 3, which illustrate the role assignment in the transitive construction. The Prague school's notion of linguistic markedness (Jakobson 1971; Trubetzkoy 1939) enables us to establish a scale of markedness of these thematic assignments and make a principled prediction about their frequency of occurrence. Typical tests to determine the unmarked element of a pair or scale are simplicity (the unmarked element is simpler), distribution (the unmarked member is more widely attested across languages), and frequency (the unmarked member is more frequent) (Greenberg 1966; Moravcsik and Wirth 1983). The claim of markedness theory is that once an element has been identified by one test as the unmarked element of a scale, then all other tests will be correlated. The three thematic assignments appear to be ranked on a scale by the simplicity and distribution tests, as we describe below. From this we can conclude that frequency, as a third correlated test, should also be ranked by the same scale, and we can therefore make predictions about the expected frequencies of the three thematic assignments.

First, we note that the specification of an Agent of Causation for transitive unergatives (such as race) and unaccusatives (such as melt) indicates a causative construction. Causative constructions relate two events, the causing event and the core event described by the intransitive verb; the Agent of Causation is the Agent of the causing event. This double event structure can be considered as more complex than the single event that is found in a transitive object-drop verb (such as play) (Stevenson and Merlo 1997b). The simplicity test thus indicates that the causative unergatives and unaccusatives are marked in comparison to the transitive object-drop verbs.

We further observe that the causative transitive of an unergative verb has an Agent thematic role in object position, which is subordinated to the Agent of Causation in subject position, yielding an unusual "double agentive" thematic structure. This lexical causativization of unergatives (in contrast to analytic causativization) is a distributionally rarer phenomenon, found in fewer languages, than lexical causatives of unaccusatives. In asking native speakers about our verbs, we have found that lexical causatives of unergative verbs are not attested in Italian, French, German, Portuguese, Gungbe (Kwa family), and Czech. On the other hand, the lexical causatives are possible for unaccusative verbs (i.e., where the object is a Theme) in all these languages. Vietnamese appears to allow a very restricted form of causativization of unergatives, limited to only those cases that have a comitative reading. The typological distribution test thus indicates that unergatives are more marked than unaccusatives in the transitive form.

From these observations, we can conclude that unergatives (such as race) have the most marked transitive argument structure, unaccusatives (such as melt) have an intermediately marked transitive argument structure, and object-drops (such as play) have the least marked transitive argument structure of the three. Under the assumptions of markedness theory outlined above, we then predict that unergatives are the least frequent in the transitive, that unaccusatives have intermediate frequency in the transitive, and that object-drop verbs are the most frequent in the transitive.

Causativity. Due to the causative alternation of unergatives and unaccusatives, the thematic role of the subject of the intransitive is identical to that of the object of the transitive, as shown in the second and third columns of thematic roles in Table 3. Given the identity of thematic role mapped to subject and object positions across the two alternants, we expect to observe the same noun occurring at times as subject of the verb and at other times as object of the verb. In contrast, for object-drop verbs, the thematic role of the subject of the intransitive is identical to that of the subject of the transitive, not the object of the transitive. We therefore expect that it will be less common for the same noun to occur in subject and object position across instances of the same object-drop verb.

Thus, we hypothesize that this pattern of thematic role assignments will be reflected in a differential amount of usage across the classes of the same nouns as subjects and objects for a given verb. Generally, we would expect that causative verbs (in our case, the unergative and unaccusative verbs) would have a greater degree of overlap of nouns in subject and object position than non-causative transitive verbs (in our case, the object-drop verbs). However, since the causative is a transitive use, and the transitive use of unergatives is expected to be rare (see above), we do not expect unergatives to exhibit a high degree of detectable overlap in a corpus. Thus, this overlap of subjects and objects should primarily distinguish unaccusatives (predicted to have high overlap of subjects and objects) from the other two classes (each of which is predicted to have low [detectable] overlap of subjects and objects).

Animacy. Finally, considering the roles in the first and last columns of thematic assignments in Table 3, we observe that unergative and object-drop verbs assign an agentive role to their subject in both the transitive and intransitive, while unaccusatives assign an agentive role to their subject only in the transitive. Under the assumption that the intransitive use of unaccusatives is not rare, we then expect that unaccusatives will occur less often overall with an agentive subject than will the other two verb classes. (The assumption that unaccusatives are not rare in the intransitive is based on the linguistic complexity of the causative transitive alternant, and is borne out in our corpus analysis.) On the further assumption that Agents tend to be animate entities more so than Themes are, we expect that unaccusatives will occur less frequently with an animate subject compared to unergative and object-drop verbs. Note the importance of our use of frequency distributions: the claim is not that only Agents can be animate, but rather that nouns that receive an Agent role will more often be animate than nouns that receive a Theme role.

Additional Features. The above interactions between thematic roles and the syntactic expressions of arguments thus lead to three features whose distributional properties appear promising for distinguishing unergative, unaccusative, and object-drop verbs: transitivity, causativity, and animacy of subject. We also investigate two additional syntactic features: the use of the passive or active voice, and the use of the past participle or simple past part-of-speech (POS) tag (VBN or VBD in the Penn Treebank style). These features are related to the transitive/intransitive alternation, since a passive use implies a transitive use of the verb, as well as to the use of a past participle form of the verb.2

Table 4 summarizes the features we derive from the thematic properties and our expectations concerning their frequency of use. We hypothesize that these five features will exhibit distributional differences in the observed usages of the verbs that can be used for classification. In the next section, we describe the actual corpus counts that we develop to approximate the features we have identified. (Notice that the counts will be imperfect approximations to the thematic knowledge, beyond the inevitable errors due to automatic extraction from large automatically annotated corpora. Even when the counts are precise, they only constitute an approximation to the actual thematic notions, since the features we are using are not logically implied by the knowledge we want to capture, but only statistically correlated with it.)

3. Data Collection and Analysis

Clearly, some of the features we've proposed are difficult (e.g., the passive use) or impossible (e.g., animate subject use) to automatically extract with high accuracy from a

2 For our sample verbs, the statistical correlation between the transitive and passive features is highly significant (N = 59, R = .44, p < .001), as is the correlation between the transitive and past participle features (N = 59, R = .36, p < .005). (Since, as explained in the next section, our features are expressed as proportions, e.g., percent transitive use out of detected transitive and intransitive use, correlations of intransitivity with passive or past participle use have the same magnitude but are negative.)

Table 4
The features and expected behavior

Feature         Expected Frequency Pattern   Explanation
Transitivity    Unerg < Unacc < ObjDrop      Unaccusatives and unergatives have a causative
                                             transitive, hence lower transitive use. Furthermore,
                                             unergatives have an agentive object, hence very low
                                             transitive use.
Causativity     Unerg, ObjDrop < Unacc       Object-drop verbs do not have a causal agent, hence
                                             low "causative" use. Unergatives are rare in the
                                             transitive, hence low causative use.
Animacy         Unacc < Unerg, ObjDrop       Unaccusatives have a Theme subject in the
                                             intransitive, hence lower use of animate subjects.
Passive Voice   Unerg < Unacc, ObjDrop       Passive implies transitive use, hence correlated
                                             with the transitive feature.
VBN Tag         Unerg < Unacc, ObjDrop       Passive implies past participle use (VBN), hence
                                             correlated with transitive (and passive).

large corpus, given the current state of annotation. However, we do assume that currently available corpora, such as the Wall Street Journal (WSJ), provide a representative and large enough sample of language from which to gather corpus counts that can approximate the distributional patterns of the verb class alternations. Our work draws on two text corpora: one, an automatically tagged combined corpus of 65 million words (primarily WSJ); the second, an automatically parsed corpus of 29 million words (a subset of the WSJ text from the first corpus). Using these corpora, we develop counting procedures that yield relative frequency distributions for approximations to the five linguistic features we have determined, over a sample of verbs from our three classes.

3.1 Materials and Method

We chose a set of 20 verbs from each class, based primarily on the classification in Levin (1993).3 The complete list of verbs appears in Table 5; the group 1/group 2 designation is explained below in the section on counting. As indicated in the table, unergatives are manner-of-motion verbs (from the "run" class in Levin); unaccusatives are change-of-state verbs (from several of the classes in Levin's change-of-state super-class); while object-drop verbs were taken from a variety of classes in Levin's classification, all of which undergo the unexpressed object alternation. The most frequently used classes are verbs of change of possession, image-creation verbs, and verbs of creation and transformation. The selection of verbs was based partly on our intuitive judgment that the verbs were likely to be used with sufficient frequency in the WSJ. Also, each

3 We used an equal number of verbs from each class in order to have a balanced group of items. One potential disadvantage of this decision is that each verb class is represented equally, even though the classes may not be equally frequent in the corpora. Although we lose the relative frequency information among the classes that could provide a better bias for assigning a default classification (i.e., the most frequent one), we have the advantage that our classifier will be equally informed (in terms of number of exemplars) about each class.

Note that there are only 19 unaccusative verbs because ripped, which was initially counted in the unaccusatives, was then excluded from the analysis, as it occurred mostly in a very different usage in the corpus (as verb particle in ripped off) from the intended optionally intransitive usage.

Table 5
Verbs used in the experiments

Class Name     Description                      Selected Verbs
Unergative     manner of motion                 jumped, rushed, marched, leaped, floated, raced,
                                                hurried, wandered, vaulted, paraded (group 1);
                                                galloped, glided, hiked, hopped, jogged, scooted,
                                                scurried, skipped, tiptoed, trotted (group 2)
Unaccusative   change of state                  opened, exploded, flooded, dissolved, cracked,
                                                hardened, boiled, melted, fractured, solidified
                                                (group 1); collapsed, cooled, folded, widened,
                                                changed, cleared, divided, simmered, stabilized
                                                (group 2)
Object-Drop    unexpressed object alternation   played, painted, kicked, carved, reaped, washed,
                                                danced, yelled, typed, knitted (group 1); borrowed,
                                                inherited, organized, rented, sketched, cleaned,
                                                packed, studied, swallowed, called (group 2)

verb presents the same form in the simple past and in the past participle (the regular "-ed" form). In order to simplify the counting procedure, we included only the "-ed" form of the verb, on the assumption that counts on this single verb form would approximate the distribution of the features across all forms of the verb. Additionally, as far as we were able given the preceding constraints, we selected verbs that could occur in the transitive and in the passive. Finally, we aimed for a frequency cut-off of 10 occurrences or more for each verb, although for unergatives we had to use one verb (jogged) that only occurred 8 times, in order to have 20 verbs that satisfied the other criteria above.

In performing this kind of corpus analysis, one has to recognize the fact that current corpus annotations do not distinguish verb senses. In these counts, we did not distinguish a core sense of the verb from an extended use of the verb. So, for instance, the sentence Consumer spending jumped 1.7% in February after a sharp drop the month before (WSJ 1987) is counted as an occurrence of the manner-of-motion verb jump in its intransitive form. This particular sense extension has a transitive alternant, but not a causative transitive (i.e., Consumer spending jumped the barrier, but not Low taxes jumped consumer spending). Thus, while the possible subcategorizations remain the same, rates of transitivity and causativity may be different than for the literal manner-of-motion sense. This is an unavoidable result of using simple automatic extraction methods, given the current state of annotation of corpora.

For each occurrence of each verb, we counted whether it was in a transitive or intransitive use (trans), in a passive or active use (pass), in a past participle or simple past use (vbn), in a causative or non-causative use (caus), and with an animate subject or not (anim).4 Note that, except for the vbn feature, for which we simply extract the POS tag from the corpus, all other counts are approximations to the actual linguistic behaviour of the verb, as we describe in detail below.

4 One additional feature was recorded, the log frequency of the verb in the 65 million word corpus, motivated by the conjecture that the frequency of a verb may help in predicting its class. In our machine learning experiments, however, this conjecture was not borne out, as the frequency feature did not improve performance. This is the case for experiments on all of the verbs, as well as for separate experiments on the group 1 verbs (which were matched across the classes for frequency) and the group 2 verbs (which were not). We therefore limit discussion here to the thematically-motivated features.

The first three counts (trans, pass, vbn) were performed on the tagged ACL/DCI corpus available from the Linguistic Data Consortium, which includes the Brown Corpus (of one million words) and years 1987-1989 of the Wall Street Journal, a combined corpus in excess of 65 million words. The counts for these features proceeded as follows:

• trans: A number, a pronoun, a determiner, an adjective, or a noun were considered to be indication of a potential object of the verb. A verb occurrence preceded by forms of the verb be, or immediately followed by a potential object, was counted as transitive; otherwise, the occurrence was counted as intransitive (specifically, if the verb was followed by a punctuation sign, such as a comma, colon, or full stop, or by a conjunction, a particle, a date, or a preposition).

• pass: A main verb (i.e., tagged VBD) was counted as active. A token with tag VBN was also counted as active if the closest preceding auxiliary was have, while it was counted as passive if the closest preceding auxiliary was be.

• vbn: The counts for VBN/VBD were simply done based on the POS label within the tagged corpus.

Each of the above three counts was normalized over all occurrences of the "-ed" form of the verb, yielding a single relative frequency measure for each verb for that feature; i.e., percent transitive (versus intransitive) use, percent active (versus passive) use, and percent VBN (versus VBD) use, respectively.
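As a concrete illustration, the three tag-based heuristics and the normalization can be sketched as follows. The tuple representation of a verb occurrence (closest preceding auxiliary, the verb's own tag, the tag of the following token) is our simplification for exposition, not the authors' actual extraction code, and the tag sets are abbreviated.

```python
# Abbreviated sets for the heuristics described in the text (an assumption;
# the original regular-expression patterns are not published here).
OBJECT_TAGS = {"CD", "PRP", "DT", "JJ", "NN", "NNS", "NNP"}  # number, pronoun, det, adj, noun
BE_FORMS = {"be", "is", "are", "was", "were", "been", "being"}

def count_features(occurrences):
    """occurrences: list of (prev_aux, verb_tag, next_tag) for one verb's
    '-ed' tokens. Returns normalized (trans, pass, vbn) proportions."""
    n = len(occurrences)
    trans = passive = vbn = 0
    for prev_aux, verb_tag, next_tag in occurrences:
        # trans: preceded by a form of "be", or followed by a potential object
        if prev_aux in BE_FORMS or next_tag in OBJECT_TAGS:
            trans += 1
        # vbn: read directly off the POS tag
        if verb_tag == "VBN":
            vbn += 1
            # pass: a VBN whose closest preceding auxiliary is "be"
            if prev_aux in BE_FORMS:
                passive += 1
    # normalize over all occurrences of the "-ed" form
    return trans / n, passive / n, vbn / n
```

For example, an occurrence like "was opened" counts toward both trans and pass, while "had opened the door" counts toward trans and vbn only.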

The last two counts (caus and anim) were performed on a parsed version of the 1988 year of the Wall Street Journal, so that we could extract subjects and objects of the verbs more accurately. This corpus of 29 million words was provided to us by Michael Collins, and was automatically parsed with the parser described in Collins (1997).5 The counts and their justification are described here:

• caus: As discussed above, the object of a causative transitive is the same semantic argument of the verb as the subject of the intransitive. The causative feature was approximated by the following steps, intended to capture the degree to which the subject of a verb can also occur as its object. Specifically, for each verb occurrence, the subject and object (if there was one) were extracted from the parsed corpus. The observed subjects across all occurrences of the verb were placed into one multiset of nouns, and the observed objects into a second multiset of nouns. (A multiset, or bag, was used so that our representation indicated the number of times each noun was used as either subject or object.) Then the proportion of overlap between the two multisets was calculated. We define overlap as the largest multiset of elements belonging to both the

5 Readers might be concerned about the portability of this method to languages for which no large parsed corpus is available. It is possible that using a fully parsed corpus is not necessary. Our results were replicated in English without the need for a fully parsed corpus (Anoop Sarkar, p.c., citing a project report by Wootiporn Tripasai). Our method was applied to 23 million words of the WSJ that were automatically tagged with Ratnaparkhi's maximum entropy tagger (Ratnaparkhi 1996) and chunked with the partial parser CASS (Abney 1996). The results are very similar to ours (best accuracy 66.6%), suggesting that a more accurate tagger than the one used on our corpus might in fact be sufficient to overcome the fact that no full parse is available.

subject and the object multisets; e.g., the overlap between {a, a, a, b} and {a} is {a, a, a}. The proportion is the ratio between the cardinality of the overlap multiset and the sum of the cardinalities of the subject and object multisets. For example, for the simple sets of characters above, the ratio would be 3/5, yielding a value of 60% for the caus feature.
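Under one reading of this definition (for each noun attested in both multisets, take the larger of its two counts; this reproduces the {a, a, a, b} versus {a} example above), the overlap computation can be sketched as:

```python
from collections import Counter

def caus_score(subjects, objects):
    """Proportion of multiset overlap between a verb's observed subject and
    object nouns. The max-count-per-shared-noun interpretation is our reading
    of the definition, not necessarily the authors' exact script."""
    subj, obj = Counter(subjects), Counter(objects)
    # overlap multiset: nouns attested in both, at the larger of the two counts
    overlap = sum(max(subj[n], obj[n]) for n in subj if n in obj)
    return overlap / (sum(subj.values()) + sum(obj.values()))
```

With subjects {a, a, a, b} and objects {a}, this yields 3/5 = 0.6, matching the worked example; a verb whose subjects and objects never coincide scores 0.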

• anim: A problem with a feature like animacy is that it requires either manual determination of the animacy of extracted subjects, or reference to an on-line resource such as WordNet for determining animacy. To approximate animacy with a feature that can be extracted automatically, and without reference to a resource external to the corpus, we take advantage of the well-attested animacy hierarchy, according to which pronouns are the most animate (Silverstein 1976; Dixon 1994). The hypothesis is that the words I, we, you, she, he, and they most often refer to animate entities. This hypothesis was confirmed by extracting 100 occurrences of the pronoun they, which can be either animate or inanimate, from our 65 million word corpus. The occurrences immediately preceded a verb. After eliminating repetitions, 94 occurrences were left, which were classified by hand, yielding 71 animate pronouns, 11 inanimate pronouns, and 12 unclassified occurrences (for lack of sufficient context to recover the antecedent of the pronoun with certainty). Thus, at least 76% of usages of they were animate; we assume the percentage of animate usages of the other pronouns to be even higher. Since the hypothesis was confirmed, we count pronouns (other than it) in subject position (Kariaeva [1999]; cf. Aone and McKee [1996]). The values for the feature were determined by automatically extracting all subject/verb tuples including our 59 example verbs from the parsed corpus, and computing the ratio of occurrences of pronoun subjects to all subjects for each verb.
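A minimal sketch of this ratio, using the pronoun list from the text (the function name is ours):

```python
# Pronouns treated as animate in the text (everything but "it").
ANIMATE_PRONOUNS = {"i", "we", "you", "she", "he", "they"}

def anim_score(subjects):
    """Ratio of pronoun subjects (other than 'it') to all subjects of a verb,
    as a proxy for animate subject use."""
    pron = sum(1 for s in subjects if s.lower() in ANIMATE_PRONOUNS)
    return pron / len(subjects)
```

For instance, over subject tokens ["They", "growth", "he", "rates"] the score is 0.5, while "it" subjects contribute nothing.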

Finally, as indicated in Table 5, the verbs are designated as belonging to "group 1" or "group 2". All the verbs are treated equally in our data analysis and in the machine learning experiments, but this designation does indicate a difference in details of the counting procedures described above. The verbs in group 1 had been used in an earlier study in which it was important to minimize noisy data (Stevenson and Merlo 1997a), so they generally underwent greater manual intervention in the counts. In adding group 2 for the classification experiment, we chose to minimize the intervention, in order to demonstrate that the classification process is robust enough to withstand the resulting noise in the data.

For group 2, the transitivity, voice, and VBN counts were done automatically, without any manual intervention. For group 1, these three counts were done automatically by regular expression patterns, and then subjected to correction, partly by hand and partly automatically, by one of the authors. For transitivity, the adjustments vary for the individual verbs. Most of the reassignments from a transitive to an intransitive labelling occurred when the following noun was not the direct object, but rather a measure phrase or a date. Most of the reassignments from intransitive to transitive occurred when a particle or a preposition following the verb did not introduce a prepositional phrase, but instead indicated a passive form (by) or was part of a phrasal verb. Some verbs were mostly used adjectivally, in which case they were excluded from the transitivity counts. For voice, the required adjustments included cases of coordination of the past participle, when the verb was preceded by a conjunction or a comma.

Table 6
Aggregated relative frequency data for the five features. E = unergatives, A = unaccusatives,
O = object-drops.

           trans         pass          vbn           caus          anim
Class  N   Mean   SD    Mean   SD     Mean   SD     Mean   SD     Mean   SD
E      20  0.23   0.23  0.07   0.12   0.21   0.26   0.00   0.00   0.25   0.24
A      19  0.40   0.24  0.33   0.27   0.65   0.27   0.12   0.14   0.07   0.09
O      20  0.62   0.25  0.31   0.26   0.65   0.23   0.04   0.07   0.15   0.14

These were collected and classified by hand as passive or active, based on intuition. Similarly, partial adjustments to the VBN counts were made by hand.

For the causativity feature, subjects and objects were determined by manual inspection of the corpus for verbs belonging to group 1, while they were extracted automatically from the parsed corpus for group 2. The group 1 verbs were sampled in three ways, depending on total frequency. For verbs with fewer than 150 occurrences, all instances of the verbs were used for subject/object extraction. For verbs whose total frequency was greater than 150, but whose VBD frequency was in the range 100-200, we extracted subjects and objects of the VBD occurrences only. For higher frequency verbs, we used only the first 100 VBD occurrences.6 The same script for computing the overlap of the extracted subjects and objects was then used on the resulting subject/verb and verb/object tuples for both group 1 and group 2 verbs.

The animacy feature was calculated over subject/verb tuples, extracted automatically for both groups of verbs from the parsed corpus.

3.2 Data Analysis

The data collection described above yields the following data points in total: trans 27,403; pass 20,481; vbn 36,297; caus 11,307; anim 7,542. (Different features yield different totals because they were sampled independently, and the search patterns to extract some features are more imprecise than others.) The aggregate means by class of the normalized frequencies for all verbs are shown in Table 6; item-by-item distributions are provided in Appendix A, and raw counts are available from the authors. Note that aggregate means are shown for illustration purposes only; all machine learning experiments are performed on the individual normalized frequencies for each verb, as given in Appendix A.
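The aggregation behind figures like those in Table 6 is straightforward; a sketch, where the class labels and values are placeholders rather than the Appendix A data:

```python
from statistics import mean, stdev

def aggregate(per_verb):
    """per_verb: list of (class, value) pairs for one feature, where value is a
    verb's normalized frequency. Returns {class: (N, mean, sample SD)}."""
    by_class = {}
    for cls, value in per_verb:
        by_class.setdefault(cls, []).append(value)
    return {cls: (len(vs), mean(vs), stdev(vs)) for cls, vs in by_class.items()}
```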

The observed distributions of each feature are indeed roughly as expected, according to the description in Section 2. Unergatives show a very low relative frequency of the trans feature, followed by unaccusatives, then object-drop verbs. Unaccusative verbs show a high frequency of the caus feature and a low frequency of the anim feature, compared to the other classes. Somewhat unexpectedly, object-drop verbs exhibit a non-zero mean caus value (almost half the verbs have a caus value greater than zero), leading to a three-way causative distinction among the verb classes. We suspect that the approximation that we used for causative use, the overlap between subjects

6 For this last set of high-frequency verbs (exploded, jumped, opened, played, rushed), we used the first 100 occurrences as the simplest way to collect the sample. In response to an anonymous reviewer's concern, we later verified that these counts were not different from counts obtained by random sampling of 100 VBD occurrences. A paired t-test of the two sets of counts (first-100 sampling and random sampling) indicates that the two sets of counts are not statistically different (t = 1.283, DF = 4, p = 0.2687).

Table 7
Manually (Man) and automatically (Aut) calculated features for a random sample of verbs.
T = trans, P = pass, V = vbn, C = caus, A = anim.

      Unergative                 Unaccusative               Object-Drop
      hopped       scurried     folded       stabilized    inherited    swallowed
      Man   Aut    Man   Aut    Man   Aut    Man   Aut     Man   Aut    Man   Aut
T     0.21  0.21   0.00  0.00   0.71  0.23   0.24  0.18    1.00  0.64   0.96  0.35
P     0.00  0.00   0.00  0.00   0.44  0.33   0.19  0.13    0.39  0.13   0.54  0.44
V     0.03  0.00   0.10  0.00   0.56  0.73   0.71  0.92    0.56  0.60   0.64  0.79
C     0.00  0.00   0.00  0.00   0.54  0.00   0.24  0.35    0.00  0.06   0.00  0.04
A     0.93  1.00   0.90  0.14   0.23  0.00   0.02  0.00    0.58  0.32   0.35  0.22

and objects for a verb, also captures a "reciprocity" effect for some object-drop verbs (such as call), in which subjects and objects can be similar types of entities. Finally, although expected to be redundant indicators of transitivity, pass and vbn, unlike trans, have very similar values for unaccusative and object-drop verbs, indicating that their distributions are sensitive to factors we have not yet investigated.

One issue we must address is how precisely the automatic counts reflect the actual linguistic behaviour of the verbs. That is, we must be assured that the patterns we note in the data in Table 6 are accurate reflections of the differential behaviour of the verb classes, and not an artifact of the way in which we estimate the features, or a result of inaccuracies in the counts. In order to evaluate the accuracy of our feature counts, we selected two verbs from each class and determined the "true" value of each feature for each of those six verbs through manual counting. The six verbs were randomly selected from the group 2 subset of the verbs, since counts for group 2 verbs (as explained above) had not undergone manual correction. This allows us to determine the accuracy of the fully automatic counting procedures. The selected verbs (and their frequencies) are: hopped (29), scurried (21), folded (189), stabilized (286), inherited (357), swallowed (152). For verbs that had a frequency of over 100 in the "-ed" form, we performed the manual counts on the first 100 occurrences.

Table 7 shows the results of the manual counts, reported as proportions to facilitate comparison to the normalized automatic counts shown in adjoining columns. We observe first that, overall, most errors in the automatic counts occur in the unaccusative and object-drop verbs. While tagging errors affect the vbn feature for all of the verbs somewhat, we note that trans and pass are consistently underestimated for unaccusative and object-drop verbs. These errors make the unaccusative and object-drop feature values more similar to each other, and therefore potentially harder to distinguish. Furthermore, because the trans and pass values are underestimated by the automatic counts, and therefore lower in value, they are also closer to the values for the unergative verbs. For the caus feature, we predict the highest values for the unaccusative verbs, and while that prediction is confirmed, the automatic counts for that class also show the most errors. Finally, although the general pattern of higher values for the anim feature of unergatives and object-drop verbs is preserved in the automatic counts, the feature is underestimated for almost all the verbs, again making the values for that feature closer across the classes than they are in reality.

We conclude that, although there are inaccuracies in all the counts, the general patterns expected based on our analysis of the verb classes hold in both the manual and automatic counts. Errors in the estimating and counting procedures are therefore

not likely to be responsible for the pattern of data in Table 6 above, which generally matches our predictions. Furthermore, the errors, at least for this random sample of verbs, occur in a direction that makes our task of distinguishing the classes more difficult, and indicate that developing more accurate search patterns may possibly sharpen the class distinctions and improve the classification performance.

4. Experiments in Classification

In this section, we turn to our computational experiments, which investigate whether the statistical indicators of thematic properties that we have developed can in fact be used to classify verbs. Recall that the task we have set ourselves is that of automatically learning the best class for a set of usages of a verb, as opposed to classifying individual occurrences of the verb. The frequency distributions of our features yield a vector for each verb that represents the estimated values for the verb on each dimension, across the entire corpus:

Vector template: [verb-name, trans, pass, vbn, caus, anim, class]

Example: [opened, .69, .09, .21, .16, .36, unacc]

The resulting set of 59 vectors constitutes the data for our machine learning experiments. We use this data to train an automatic classifier to determine, given the feature values for a new verb (not from the training set), which of the three major classes of English optionally intransitive verbs it belongs to.

4.1 Experimental Methodology

In pilot experiments on a subset of the features, we investigated a number of supervised machine learning methods that produce automatic classifiers (decision tree induction, rule learning, and two types of neural networks), as well as hierarchical clustering; see Stevenson et al. (1999) for more detail. Because we achieved approximately the same level of performance in all cases, we narrowed our further experimentation to the publicly available version of the C5.0 machine learning system (http://www.rulequest.com), a newer version of C4.5 (Quinlan 1992), due to its ease of use and wide availability. The C5.0 system generates both decision trees and corresponding rule sets from a training set of known classifications. In our experiments, we found little to no difference in performance between the trees and the rule sets, and report only the rule set results.
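C5.0 itself is a proprietary tool, so as a purely illustrative stand-in (with invented feature vectors, not the data described here), the following sketch shows the flavor of threshold-based decision tree induction over normalized counts: a one-level tree ("stump") that picks the single feature and threshold maximizing training accuracy.

```python
# Minimal one-level decision tree over verb feature vectors.
# The vectors and the exhaustive threshold search are illustrative only;
# they are not the paper's data and not the C5.0 algorithm.

def best_stump(data):
    """data: list of (features_dict, class_label) pairs. Returns the
    single (feature, threshold) split with highest training accuracy."""
    best = None
    for f in data[0][0].keys():
        for t in sorted({feats[f] for feats, _ in data}):
            # labels falling on each side of the candidate threshold
            lo = [c for feats, c in data if feats[f] <= t]
            hi = [c for feats, c in data if feats[f] > t]
            correct = 0
            for side in (lo, hi):
                if side:  # majority label on this side is "predicted"
                    correct += max(side.count(c) for c in set(side))
            if best is None or correct > best[0]:
                best = (correct, f, t)
    return best[1], best[2]

toy = [
    ({"trans": 0.2, "caus": 0.1}, "unerg"),
    ({"trans": 0.3, "caus": 0.2}, "unerg"),
    ({"trans": 0.7, "caus": 0.6}, "unacc"),
    ({"trans": 0.6, "caus": 0.7}, "unacc"),
]
```

On this toy data the learner selects a split on trans at 0.3, separating the two invented classes perfectly; real rule induction (as in C5.0) recurses on such splits and prunes the result.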

In the experiments below, we follow two methodologies in training and testing, each of which tests a subset of cases held out from the training data. Thus, in all cases, the results we report are on test data that was never seen in training.7

The first training and testing methodology we follow is 10-fold cross-validation. In this approach, the system randomly divides the data into ten parts, and runs ten times on a different 90%-training-data/10%-test-data split, yielding an average accuracy and standard error across the ten test sets. This training methodology is very useful for

7 One anonymous reviewer raised the concern that we do not test on verbs that were unseen by the authors prior to finalizing the specific features to count. However, this does not reduce the generality of our results. The features we use are motivated by linguistic theory, and derived from the set of thematic properties that discriminate the verb classes. It is therefore very unlikely that they are skewed to the particular verbs we have chosen. Furthermore, our cross-validation experiments, described in the next subsection, show that our results hold across a very large number of randomly selected subsets of this sample of verbs.


Computational Linguistics    Volume 27, Number 3

our application, as it yields performance measures across a large number of training-data/test-data sets, avoiding the problems of outliers in a single random selection from a relatively small data set such as ours.
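The fold construction behind 10-fold cross-validation can be sketched in a few lines (a generic pure-Python sketch of the splitting scheme, not the C5.0 tooling's own randomization):

```python
import random

def ten_fold_splits(items, seed=0):
    """Yield (train, test) pairs for 10-fold cross-validation:
    shuffle once, then rotate each tenth out as the test set."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::10] for i in range(10)]  # 10 near-equal folds
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# With 59 verbs, every verb lands in exactly one test fold.
splits = list(ten_fold_splits(range(59)))
```

Averaging the per-fold accuracies of one such run gives a single cross-validation estimate; the paper repeats the whole procedure over 50 random divisions.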

The second methodology is a single hold-out training and testing approach. Here the system is run N times, where N is the size of the data set (i.e., the 59 verbs in our case), each time holding out a single data vector as the test case and using the remaining N-1 vectors as the training set. The single hold-out methodology yields an overall accuracy rate (when the results are averaged across all N trials), but also, unlike cross-validation, gives us classification results on each individual data vector. This property enables us to analyze differential performance on the individual verbs, and across the different verb classes.
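The single hold-out (leave-one-out) loop itself is straightforward; in this sketch a 1-nearest-neighbour learner stands in for the actual C5.0 classifier, and the vectors are invented:

```python
def leave_one_out(data, train_and_classify):
    """data: list of (vector, label) pairs. Hold out each verb in turn,
    train on the remaining N-1, and record the predicted label."""
    return [train_and_classify(data[:i] + data[i + 1:], vec)
            for i, (vec, _) in enumerate(data)]

# Hypothetical stand-in learner: 1-nearest neighbour by squared distance.
def nearest_neighbor(train, vec):
    dist = lambda item: sum((a - b) ** 2 for a, b in zip(item[0], vec))
    return min(train, key=dist)[1]

# Toy data mirroring the class sizes (20 unergative, 19 unaccusative,
# 20 object-drop); the feature values are made up.
data = ([((0.1, 0.0), "unerg")] * 20 +
        [((0.5, 0.6), "unacc")] * 19 +
        [((0.9, 0.2), "objdrop")] * 20)
preds = leave_one_out(data, nearest_neighbor)
accuracy = sum(p == y for p, (_, y) in zip(preds, data)) / len(data)
```

Because every prediction is recorded per verb, the same loop also yields the per-class breakdowns used later in Table 12.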

Under both training and testing methodologies, the baseline (chance) performance in this task, a three-way classification, is 33.9%. In the single hold-out methodology, there are 59 test cases, with 20, 19, and 20 verbs each from the unergative, unaccusative, and object-drop classes, respectively. Chance performance of picking a single class label as a default and assigning it to all cases would yield at most 20 out of the 59 cases correct, or 33.9%. For the cross-validation methodology, the determination of a baseline is slightly more complex, as we are testing on a random selection of 10% of the full data set in each run. The 33.9% figure represents the expected relative proportion of a test set that would be labelled correctly by assignment of a default class label to the entire test set. Although the precise make-up of the test cases varies, on average the test set will represent the class membership proportions of the entire set of verbs. Thus, as with the single hold-out approach, chance accuracy corresponds to a maximum of 20/59, or 33.9%, of the test set being labelled correctly.
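The baseline arithmetic is simply the largest class proportion:

```python
# Chance baseline: assign every test verb the most frequent class label.
class_sizes = {"unerg": 20, "unacc": 19, "objdrop": 20}
total = sum(class_sizes.values())            # 59 verbs in all
baseline = max(class_sizes.values()) / total # best single-label default
assert round(100 * baseline, 1) == 33.9      # the 33.9% chance baseline
```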

The theoretical maximum accuracy for the task is, of course, 100%, although in Section 5 we discuss some classification results from human experts that indicate that a more realistic expectation is much lower (around 87%).

4.2 Results Using 10-Fold Cross-Validation

We first report the results of experiments using a training methodology of 10-fold cross-validation, repeated 50 times. This means that the 10-fold cross-validation procedure is repeated for 50 different random divisions of the data. The numbers reported are the averages of the results over all the trials. That is, the average accuracy and standard error from each random division of the data (a single cross-validation run, including 10 training and test sets) are averaged across the 50 different random divisions. This large number of experimental trials gives us a very tight bound on the mean accuracy reported, enabling us to determine with high confidence the statistical significance of differences in results.

Table 8 shows that the performance of classification using individual features varies greatly, from little above the baseline to almost 22% above the baseline, the latter a reduction of a third of the error rate, a very good result for a single feature. (All reported accuracies in Table 8 are statistically distinct at the p < .01 level, using an ANOVA [df = 249, F = 33472] with a Tukey-Kramer post test.)

The first line of Table 9 shows that the combination of all features achieves an accuracy of 69.8%, which is 35.9% over the baseline, for a reduction in the error rate of 54%. This is a rather considerable result, given the very low baseline (33.9%). Moreover, recall that our training and testing sets are always disjoint (cf. Lapata and Brew [1999]; Siegel [1999]); in other words, we are predicting the classification of verbs that were never seen in the training corpus, the hardest situation for a classification algorithm.

The second through sixth lines of Table 9 show the accuracy achieved on each subset of features that results from removing a single feature. This allows us to evaluate


Table 8
Percent accuracy and standard error of the verb classification task, using each feature individually, under a training methodology of 10-fold cross-validation repeated 50 times.

Feature   Accuracy   SE
caus      55.7       .1
vbn       52.5       .5
pass      50.2       .5
trans     47.1       .4
anim      35.3       .5

Table 9
Percent accuracy and standard error of the verb classification task, using features in combination, under a training methodology of 10-fold cross-validation repeated 50 times.

     Features Used                Feature Not Used   Accuracy   SE
1    trans pass vbn caus anim     (none)             69.8       .5
2    trans vbn caus anim          pass               69.8       .5
3    trans pass vbn anim          caus               67.3       .6
4    trans pass caus anim         vbn                66.5       .5
5    trans pass vbn caus          anim               63.2       .6
6    pass vbn caus anim           trans              61.6       .6

the contribution of each feature to the performance of the classification process, by comparing the performance of the subset without it to the performance using the full set of features. We see that the removal of pass (second line) has no effect on the results, while removal of any of the remaining features yields a 2-8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant at the p < .05 level, except for those between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 3752] with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted from the individual feature behavior. For example, caus, which is the best feature individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, anim and trans, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.

The general behaviour in classification based on individual features, and on size 4 and size 5 subsets of features, is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance; Table 10 summarizes this information. In the first data column, the table gives the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size n subsets in which each feature occurs. For example, the second data cell in the second row (54.9) indicates the average accuracy of all subsets of size 2 that contain the feature vbn. The last row of the table indicates, for each feature, the average accuracy of all subsets containing that feature.


Table 10
Average percent accuracy of feature subsets, by subset size, and by sets of each size including each feature.

Subset   Mean Accuracy    Mean Accuracy of Subsets that Include Each Feature
Size     by Subset Size   vbn     pass    trans   anim    caus
1        48.2             52.5    50.2    47.1    35.3    55.7
2        55.1             54.9    52.8    56.4    58.0    57.6
3        60.5             60.1    58.5    62.3    61.1    60.5
4        65.7             65.5    64.7    66.7    66.3    65.3
5        69.8             69.8    69.8    69.8    69.8    69.8

Mean Acc/Feature          60.6    59.2    60.5    58.1    61.8

The first observation, that more features perform better, is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that, on average, larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, each individual feature performs better in a bigger set than in a smaller set, without exception. The second observation, that the performance of individual features is not always a predictor of their performance in combination, is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above average performance for all subsets of size 2 or greater.

4.3 Results Using Single Hold-Out Methodology

One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system, but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case, and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case, and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and the verb class distinctions, which we turn to in Section 4.4.

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance; in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being


Table 11
Percent accuracy of the verb classification task, using features in combination, under a single hold-out training methodology.

     Features Used                Feature Not Used   Accuracy on All Verbs
1    trans pass vbn caus anim     (none)             69.5
2    trans vbn caus anim          pass               71.2
3    trans pass vbn anim          caus               62.7
4    trans pass caus anim         vbn                61.0
5    trans pass vbn caus          anim               61.0
6    pass vbn caus anim           trans              64.4

Table 12
F score of classification within each class, under a single hold-out training methodology.

     Features Used                Feature Not Used   F (%)    F (%)    F (%)
                                                     Unergs   Unaccs   Objdrops
1    trans pass vbn caus anim     (none)             73.9     68.6     64.9
2    trans vbn caus anim          pass               76.2     75.7     61.6
3    trans pass vbn anim          caus               65.1     60.0     62.8
4    trans pass caus anim         vbn                66.7     65.0     51.3
5    trans pass vbn caus          anim               72.7     47.0     60.0
6    pass vbn caus anim           trans              78.1     51.5     61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5-8% negative effect on performance, again similar to the cross-validation results. (Note, though, that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9) than the unaccusative and object-drop verbs (F scores of 68.6 and 64.9, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs; that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results, there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence, here we use the balanced F score, which calculates an overall measure of performance as 2PR/(P + R), in which P (precision) is truePositives / (truePositives + falsePositives), and R (recall) is truePositives / (truePositives + falseNegatives).
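The balanced F score of footnote 8 can be computed directly from per-class counts; the counts in the example below are hypothetical, not the paper's actual tallies:

```python
def f_score(tp, fp, fn):
    """Balanced F measure from a class's true positives, false positives,
    and false negatives: F = 2PR / (P + R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 20 verbs in a class, 17 found, 6 spurious labels.
score = f_score(tp=17, fp=6, fn=3)
```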


frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8), but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to the exploration of other linguistic features, another area of future work is to develop better search patterns, for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification

We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans, pass, vbn, caus, anim) with the class labels assigned using each size 4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature, in going from the four-feature to the five-feature set. Thus, we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2, and summarized here in Table 13.

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label, for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in the classification of unergatives is the verb floated, which we conjecture is due not to counting errors, but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb, in that the action is inherently "uncontrolled", and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.


Table 13
Expected class discriminations for each feature.

Feature          Expected Frequency Pattern
Transitivity     Unerg < Unacc < ObjDrop
Causativity      Unerg, ObjDrop < Unacc
Animacy          Unacc < Unerg, ObjDrop
Passive Voice    Unerg < Unacc < ObjDrop
VBN Tag          Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating the number of errors in classification, by verb class, for the full set of five features, compared to two of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                    Assigned Class
                 All features    w/o caus      w/o anim
                 E   A   O       E   A   O     E   A   O
Actual   E       -   1   2       -   4   2     -   2   2
Class    A       4   -   3       5   -   2     5   -   6
         O       5   3   -       4   5   -     3   5   -

Table 15
Confusion matrix indicating the number of errors in classification, by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                        Assigned Class
                 All features   w/o trans   w/o pass   w/o vbn
                 E  A  O        E  A  O     E  A  O    E  A  O
Actual   E       -  1  2        -  2  2     -  1  3    -  1  5
Class    A       4  -  3        3  -  7     1  -  4    2  -  4
         O       5  3  -        2  5  -     5  3  -    4  6  -

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite each other across the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes, in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
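Concretely, the confusability of a pair of classes is the sum of the two off-diagonal cells; a small sketch, using the error cells of the "all features" panel of Table 14:

```python
def confusability(conf, c1, c2):
    """Errors between two classes: cell conf[c1][c2] plus conf[c2][c1],
    where conf[actual][assigned] holds misclassification counts."""
    return conf[c1][c2] + conf[c2][c1]

# Off-diagonal error cells of Table 14, all-features panel
# (E = unergative, A = unaccusative, O = object-drop).
all_feats = {"E": {"A": 1, "O": 2},
             "A": {"E": 4, "O": 3},
             "O": {"E": 5, "A": 3}}
```

For example, unergative/unaccusative confusability here is 1 + 4 = 5 errors, and unaccusative/object-drop confusability is 3 + 3 = 6, matching the figures discussed below for the full feature set.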

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features


we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature, compared to the errors with the caus feature added to the set). We see that, without the caus feature, the confusability between unaccusatives and unergatives, and between unaccusatives and object-drops, is 9 and 7 errors, respectively; but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature, compared to the errors with the anim feature added to the set), we see an even larger improvement in the discriminability of unaccusatives when the anim feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5, and of unaccusatives and object-drops from 11 errors to 6. Again, the confusability of unergatives and object-drops is worse, with an increase in errors from 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated with trans. To make a three-way distinction among the verb classes, we would expect the confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First, consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and that the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans, that is, also making a three-way distinction among the classes, although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method in evaluating our approach to verb classification.


Table 16
Percent agreement (Agr) and pair-wise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

         Program        E1            E2            E3
         Agr    K       Agr    K     Agr    K      Agr    K
E1       59     .36
E2       68     .50     75     .59
E3       66     .49     70     .53   77     .66
Levin    69.5   .54     71     .56   86.5   .80    83     .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on the class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult: it is not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments, which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results, provided by the non-forced-choice task, where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from Chicago University Press. The verbs were to be classified into the three target classes, unergative, unaccusative, and object-drop, which were described in the instructions.10 (All materials and instructions are available at http://www.latl.unige.ch/latl/personal/paola.html.)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training, which yields an accuracy of 69.5%, because those results provide the classification for each of the individual verbs.


an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55-57) (using the z distribution to determine significance; p < .001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of the categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0, for no agreement above chance, to 1, for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
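For two raters, the agreement-over-chance idea can be sketched with the standard Cohen-style computation (the paper follows Klauer's [1987] formulation, which may differ in detail; the judgments below are invented):

```python
def kappa(pairs):
    """pairs: list of (label_a, label_b) judgments on the same items.
    K = (observed agreement - chance agreement) / (1 - chance agreement),
    with chance agreement computed from each rater's label proportions."""
    n = len(pairs)
    p_obs = sum(a == b for a, b in pairs) / n
    labels = {label for pair in pairs for label in pair}
    p_chance = sum(
        (sum(a == l for a, _ in pairs) / n) *  # rater A's proportion of l
        (sum(b == l for _, b in pairs) / n)    # rater B's proportion of l
        for l in labels)
    return (p_obs - p_chance) / (1 - p_chance)

# Toy judgments: two raters, four verbs, agreement on three of four.
judgments = [("unerg", "unerg"), ("unacc", "unacc"),
             ("objdrop", "unacc"), ("unerg", "unerg")]
```

On this toy data, observed agreement is .75 and chance agreement .375, giving K = .6; note how raters with skewed label proportions get less credit for the same raw agreement.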

Evaluating the experts' performance, summarized in Table 16, we can remark on two things, which confirm our expectations. First, the task is difficult, i.e., not performed at 100% (or close) even by trained experts when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to the comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, with K ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .60) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
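The majority vote with its back-off can be sketched as follows (the labels are hypothetical; the back-off corresponds to falling back on the most accurate expert's label):

```python
from collections import Counter

def majority_label(votes, fallback):
    """Return the label given by at least two of the three experts;
    with a three-way split, fall back to the supplied label."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else fallback
```

For example, majority_label(["unerg", "unerg", "unacc"], "unacc") yields "unerg", while a three-way split such as ["unerg", "unacc", "objdrop"] returns the fallback label.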

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task (86.5%) as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than one of the human experts in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that for all experts, the lowest agreement score is with the program.
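The error rate reduction figure can be reproduced from the reported accuracies, as a worked check. We assume here that the reduction is computed relative to the expert-based upper bound (and, for the figure quoted in the conclusions, relative to perfect accuracy); the function name is ours.

```python
def error_reduction(accuracy, baseline, upper_bound=1.0):
    """Fraction of the achievable error (up to `upper_bound`) that the
    classifier eliminates relative to the baseline (chance) error."""
    return 1 - (upper_bound - accuracy) / (upper_bound - baseline)

# Program accuracy vs. chance, with the 86.5% expert upper bound: ~68%.
over_expert_bound = error_reduction(0.695, 0.339, 0.865)
# Against a perfect-accuracy bound (as in the conclusions): ~54%.
over_perfect = error_reduction(0.698, 0.339)
```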

One interesting question is whether experts and program disagree on the same verbs, and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard. However, in 9 cases at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3) and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1,000), and those with log frequency over 3 (i.e., frequency over 1,000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drop); the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops); and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote, verbs with frequency less than 100 yield an accuracy of 92% (K = .84); verbs with frequency between 100 and 1,000, accuracy 80% (K = .69); and for verbs with frequency over 1,000, accuracy 90% (K = .82). For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5% (K = .74); verbs with frequency between 100 and 1,000, accuracy 84% (K = .76); and verbs with frequency over 1,000, accuracy 90% (K = .82).
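The log-frequency binning above amounts to a simple threshold on base-10 log counts. A minimal sketch, assuming the paper's stated cut-offs (the function name and the handling of the exact boundary values 100 and 1,000, which the text does not specify, are ours):

```python
import math

def frequency_group(freq):
    """Bin a verb by log10 corpus frequency:
    < 2 -> low (freq < 100); 2-3 -> mid (100-1,000); > 3 -> high (> 1,000)."""
    log_freq = math.log10(freq)
    if log_freq < 2:
        return "low"
    elif log_freq <= 3:
        return "mid"
    return "high"
```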

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs, taken from Levin's (1993) electronic index as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = 0.64) with each other, and 86% (K = 0.80) and 69% (K = 0.57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = 0.56] with each other, and 68% [K = 0.55] and 60.5% [K = 0.46] with the gold standard, respectively.) These results show that the forced-choice and non-forced-choice task are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally in syntactic alternations, giving rise to differences in subcategorization frames, or the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs (their argument structure) are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989], Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a "change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998; Siegel 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993], Sanfilippo and Poznanski [1992], Manning [1993], Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary, reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed, which can work on a rough collection of texts, and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition, or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE, and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy, compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multilingual texts. Like ours, their work goes beyond statistics over subcategorizations, to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.


Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45%, with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
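The F-score we derive from Schulte im Walde's reported precision and recall is the standard weighted harmonic mean, computed here as a worked check (a textbook formula, not code from either paper):

```python
def f_score(precision, recall, beta=1.0):
    """F-measure: weighted harmonic mean of precision and recall.
    beta = 1 weights the two equally (the "balanced" case)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision 61% and recall 36% give F ~ 45%, as reported in the text.
f = f_score(0.61, 0.36)
```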

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition, we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and Lapata and Brew (1999) in attempting to learn semantic notions from distributions of indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features, vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature), or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus, her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful, and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets, introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification (i.e., sets of verbs that have the same multi-way classifications according to Levin [1993]) can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form, in themselves, a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens, by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999], Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction, or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study, in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task, with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach, it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.

Acknowledgments
We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM); the United States NSF (grants 9702331 and 9818322 to SS); the Canadian NSERC (grant to SS); the University of Toronto; and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions, and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8-15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191-202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117-142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85-116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3-20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243-262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356-363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53-64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: The Kappa statistic. Computational Linguistics, 22(2):249-254.

Collins, Michael John. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16-23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520-528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293-299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr Bonnie 1997 Large-scale dictionaryconstruction for foreign language tutoringand interlingual machine translationMachine Translation 12(4)1ndash55

Dorr Bonnie Joe Garman and AmyWeinberg 1995 From syntactic encodingsto thematic roles Building lexical entriesfor interlingual MT Journal of MachineTranslation 9(3)71ndash100


Merlo and Stevenson          Statistical Verb Classification

Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322-327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547-619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53-110.

Jakobson, Roman. 1971. Signe Zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211-219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126-1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680-686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397-404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266-274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1-62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235-242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256-263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493-1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134-142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659-1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161-188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: An Overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1-13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65-89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1-2):217-222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157-189.


Computational Linguistics Volume 27 Number 3

Pinker, Steven. 1989. Learnability and Cognition: the Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133-142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1-2):127-160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49-56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80-87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747-753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112-119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112-171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237-265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401-430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715-720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1-2):349-399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15-21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177-184, Vancouver, Canada.



Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs

            Freq   vbn  pass  trans  caus  anim
Min Value      8  0.00  0.00   0.00  0.00  0.00
Max Value   4088  1.00  0.39   0.74  0.00  1.00

floated      176  0.43  0.26   0.74  0.00  0.17
hurried       86  0.40  0.31   0.50  0.00  0.37
jumped      4088  0.09  0.00   0.03  0.00  0.20
leaped       225  0.09  0.00   0.05  0.00  0.13
marched      238  0.10  0.01   0.09  0.00  0.12
paraded       33  0.73  0.39   0.46  0.00  0.50
raced        123  0.01  0.00   0.06  0.00  0.15
rushed       467  0.22  0.12   0.20  0.00  0.10
vaulted       54  0.00  0.00   0.41  0.00  0.03
wandered      67  0.02  0.00   0.03  0.00  0.32
galloped      12  1.00  0.00   0.00  0.00  0.00
glided        14  0.00  0.00   0.08  0.00  0.50
hiked         25  0.28  0.12   0.29  0.00  0.40
hopped        29  0.00  0.00   0.21  0.00  1.00
jogged         8  0.29  0.00   0.29  0.00  0.33
scooted       10  0.00  0.00   0.43  0.00  0.00
scurried      21  0.00  0.00   0.00  0.00  0.14
skipped       82  0.22  0.02   0.64  0.00  0.16
tiptoed       12  0.17  0.00   0.00  0.00  0.00
trotted       37  0.19  0.17   0.07  0.00  0.18

Unaccusative Verbs

             Freq   vbn  pass  trans  caus  anim
Min Value      13  0.16  0.00   0.02  0.00  0.00
Max Value    5543  0.95  0.80   0.76  0.41  0.36

boiled         58  0.92  0.70   0.42  0.00  0.00
cracked       175  0.61  0.19   0.76  0.02  0.14
dissolved     226  0.51  0.58   0.71  0.05  0.11
exploded      409  0.34  0.02   0.66  0.37  0.04
flooded       235  0.47  0.57   0.44  0.04  0.03
fractured      55  0.95  0.76   0.51  0.00  0.00
hardened      123  0.92  0.55   0.56  0.12  0.00
melted         70  0.80  0.44   0.02  0.00  0.19
opened       3412  0.21  0.09   0.69  0.16  0.36
solidified     34  0.65  0.21   0.68  0.00  0.12
collapsed     950  0.16  0.00   0.16  0.01  0.02
cooled        232  0.85  0.21   0.29  0.13  0.11
folded        189  0.73  0.33   0.23  0.00  0.00
widened      1155  0.18  0.02   0.13  0.41  0.01
changed      5543  0.73  0.23   0.47  0.22  0.08
cleared      1145  0.58  0.40   0.50  0.31  0.06
divided      1539  0.93  0.80   0.17  0.10  0.05
simmered       13  0.83  0.00   0.09  0.00  0.00
stabilized    286  0.92  0.13   0.18  0.35  0.00



Object-Drop Verbs

             Freq   vbn  pass  trans  caus  anim
Min Value      39  0.10  0.04   0.21  0.00  0.00
Max Value   15063  0.95  0.99   1.00  0.24  0.42

carved        185  0.85  0.66   0.98  0.00  0.00
danced         88  0.22  0.14   0.37  0.00  0.00
kicked        308  0.30  0.18   0.97  0.00  0.33
knitted        39  0.95  0.99   0.93  0.00  0.00
painted       506  0.72  0.18   0.71  0.00  0.38
played       2689  0.38  0.16   0.24  0.00  0.00
reaped        172  0.56  0.05   0.90  0.00  0.22
typed          57  0.81  0.74   0.81  0.00  0.00
washed        137  0.79  0.60   1.00  0.00  0.00
yelled         74  0.10  0.04   0.38  0.00  0.00
borrowed     1188  0.77  0.15   0.60  0.13  0.19
inherited     357  0.60  0.13   0.64  0.06  0.32
organized    1504  0.85  0.38   0.65  0.18  0.07
rented        232  0.72  0.22   0.61  0.00  0.42
sketched       44  0.67  0.17   0.44  0.00  0.20
cleaned       160  0.83  0.47   0.21  0.05  0.21
packed        376  0.84  0.12   0.40  0.05  0.19
studied       901  0.66  0.17   0.57  0.05  0.11
swallowed     152  0.79  0.44   0.35  0.04  0.22
called      15063  0.56  0.22   0.72  0.24  0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.
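The overlap test described above can be sketched directly; the critical value 2.01 (df = 49) is taken from the text, while the function names are ours:

```python
T_CRIT = 2.01  # two-tailed t value for df = 49 at the 95% level, as given in the text

def ci95(accuracy, se):
    """95% confidence interval: accuracy +/- 2.01 * standard error."""
    return accuracy - T_CRIT * se, accuracy + T_CRIT * se

def significantly_different(acc1, se1, acc2, se2):
    """Two results differ at p < .05 iff their 95% intervals do not overlap."""
    lo1, hi1 = ci95(acc1, se1)
    lo2, hi2 = ci95(acc2, se2)
    return hi1 < lo2 or hi2 < lo1
```

For example, the best result (69.8, SE 0.5) differs significantly from anim alone (35.3, SE 0.5), but not from another subset with the same accuracy and standard error.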

Accuracy  SE   Features                    Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim    57.3      0.5  trans caus
69.8      0.5  trans vbn caus anim         57.3      0.5  pass vbn anim
67.3      0.6  trans pass vbn anim         56.7      0.5  pass caus anim
66.7      0.5  trans vbn anim              55.7      0.5  vbn caus
66.5      0.5  trans pass caus anim        55.7      0.1  caus
64.4      0.5  trans vbn caus              55.4      0.4  pass caus
63.2      0.6  trans pass vbn caus         55.0      0.6  trans pass vbn
63.0      0.5  trans pass anim             54.7      0.4  trans pass
62.9      0.4  trans caus anim             54.2      0.5  trans vbn
62.1      0.5  caus anim                   52.5      0.5  vbn
61.7      0.5  trans pass caus             50.9      0.5  pass anim
61.6      0.6  pass vbn caus anim          50.2      0.6  pass vbn
60.1      0.4  vbn caus anim               50.2      0.5  pass
59.5      0.6  trans anim                  47.1      0.4  trans
59.4      0.5  vbn anim                    35.3      0.5  anim
57.4      0.6  pass vbn caus




Table 3
Summary of thematic assignments

                   Transitive                     Intransitive
Classes            Subject               Object   Subject

Unergative         Agent (of Causation)  Agent    Agent
Unaccusative       Agent (of Causation)  Theme    Theme
Object-Drop        Agent                 Theme    Agent

introduced with the causing event. In a causative alternation, the semantic argument of the subject of the intransitive surfaces as the object of the transitive (Brousseau and Ritter 1991; Hale and Keyser 1993; Levin 1993; Levin and Rappaport Hovav 1995). For unergatives, this argument is an Agent, and thus the alternation yields an object in the transitive form that receives an Agent thematic role (Cruse 1972). These thematic assignments are shown in the first row of Table 3.

The sentences in (2) illustrate the corresponding forms of an unaccusative verb, melt. Unaccusatives are intransitive change-of-state verbs, as in (2a); the transitive counterpart for these verbs also exhibits a causative alternation, as in (2b). This is the "causative/inchoative alternation" (Levin 1993). Like unergatives, the subject of a transitive unaccusative is marked as the Agent of Causation. Unlike unergatives, though, the alternating argument of an unaccusative (the subject of the intransitive form that becomes the object of the transitive) is an entity undergoing a change of state without active participation, and is therefore a Theme. The resulting pattern of thematic assignments is indicated in the second row of Table 3.

The sentences in (3) use an object-drop verb, play. These are activity verbs that exhibit a non-causative diathesis alternation in which the object is simply optional. This is dubbed "the unexpressed object alternation" (Levin 1993) and has several subtypes that we do not distinguish here. The thematic assignment for these verbs is simply Agent for the subject (in both transitive and intransitive forms) and Theme for the optional object; see the last row of Table 3.

For further details and support of this analysis, please see the discussion in Stevenson and Merlo (1997b) and Merlo and Stevenson (2000b). For our purposes here, the important fact to note is that each of the three classes can be uniquely identified by the pattern of thematic assignments across the two alternants of the verbs.

2.2 Features for Automatic Classification
Our next task, then, is to derive from these thematic patterns useful features for automatically classifying the verbs. In what follows, we refer to the columns of Table 3 to explain how we expect the thematic distinctions to give rise to distributional properties which, when appropriately approximated through corpus counts, will discriminate across the three classes.

Transitivity. Consider the first two columns of thematic roles in Table 3, which illustrate the role assignment in the transitive construction. The Prague school's notion of linguistic markedness (Jakobson 1971; Trubetzkoy 1939) enables us to establish a scale of markedness of these thematic assignments, and make a principled prediction about their frequency of occurrence. Typical tests to determine the unmarked element of a pair or scale are simplicity (the unmarked element is simpler), distribution (the unmarked member is more widely attested across languages), and frequency (the unmarked member is more frequent) (Greenberg 1966; Moravcsik and Wirth 1983). The claim of markedness theory is that once an element has been identified by one test as the unmarked element of a scale, then all other tests will be correlated. The three thematic assignments appear to be ranked on a scale by the simplicity and distribution tests, as we describe below. From this we can conclude that frequency, as a third correlated test, should also be ranked by the same scale, and we can therefore make predictions about the expected frequencies of the three thematic assignments.

First, we note that the specification of an Agent of Causation for transitive unergatives (such as race) and unaccusatives (such as melt) indicates a causative construction. Causative constructions relate two events, the causing event and the core event described by the intransitive verb; the Agent of Causation is the Agent of the causing event. This double event structure can be considered as more complex than the single event that is found in a transitive object-drop verb (such as play) (Stevenson and Merlo 1997b). The simplicity test thus indicates that the causative unergatives and unaccusatives are marked in comparison to the transitive object-drop verbs.

We further observe that the causative transitive of an unergative verb has an Agent thematic role in object position, which is subordinated to the Agent of Causation in subject position, yielding an unusual "double agentive" thematic structure. This lexical causativization of unergatives (in contrast to analytic causativization) is a distributionally rarer phenomenon, found in fewer languages, than lexical causatives of unaccusatives. In asking native speakers about our verbs, we have found that lexical causatives of unergative verbs are not attested in Italian, French, German, Portuguese, Gungbe (Kwa family), and Czech. On the other hand, the lexical causatives are possible for unaccusative verbs (i.e., where the object is a Theme) in all these languages. Vietnamese appears to allow a very restricted form of causativization of unergatives, limited to only those cases that have a comitative reading. The typological distribution test thus indicates that unergatives are more marked than unaccusatives in the transitive form.

From these observations, we can conclude that unergatives (such as race) have the most marked transitive argument structure, unaccusatives (such as melt) have an intermediately marked transitive argument structure, and object-drops (such as play) have the least marked transitive argument structure of the three. Under the assumptions of markedness theory outlined above, we then predict that unergatives are the least frequent in the transitive, that unaccusatives have intermediate frequency in the transitive, and that object-drop verbs are the most frequent in the transitive.

Causativity. Due to the causative alternation of unergatives and unaccusatives, the thematic role of the subject of the intransitive is identical to that of the object of the transitive, as shown in the second and third columns of thematic roles in Table 3. Given the identity of thematic role mapped to subject and object positions across the two alternants, we expect to observe the same noun occurring at times as subject of the verb and at other times as object of the verb. In contrast, for object-drop verbs, the thematic role of the subject of the intransitive is identical to that of the subject of the transitive, not the object of the transitive. We therefore expect that it will be less common for the same noun to occur in subject and object position across instances of the same object-drop verb.

Thus, we hypothesize that this pattern of thematic role assignments will be reflected in a differential amount of usage across the classes of the same nouns as subjects and objects for a given verb. Generally, we would expect that causative verbs (in our case, the unergative and unaccusative verbs) would have a greater degree of overlap of nouns in subject and object position than non-causative transitive verbs (in our case, the object-drop verbs). However, since the causative is a transitive use, and the transitive use of unergatives is expected to be rare (see above), we do not expect unergatives to exhibit a high degree of detectable overlap in a corpus. Thus this overlap of subjects and objects should primarily distinguish unaccusatives (predicted to have high overlap of subjects and objects) from the other two classes (each of which is predicted to have low [detectable] overlap of subjects and objects).

Animacy. Finally, considering the roles in the first and last columns of thematic assignments in Table 3, we observe that unergative and object-drop verbs assign an agentive role to their subject in both the transitive and intransitive, while unaccusatives assign an agentive role to their subject only in the transitive. Under the assumption that the intransitive use of unaccusatives is not rare, we then expect that unaccusatives will occur less often overall with an agentive subject than will the other two verb classes. (The assumption that unaccusatives are not rare in the intransitive is based on the linguistic complexity of the causative transitive alternant, and is borne out in our corpus analysis.) On the further assumption that Agents tend to be animate entities more so than Themes are, we expect that unaccusatives will occur less frequently with an animate subject compared to unergative and object-drop verbs. Note the importance of our use of frequency distributions: the claim is not that only Agents can be animate, but rather that nouns that receive an Agent role will more often be animate than nouns that receive a Theme role.

Additional Features. The above interactions between thematic roles and the syntactic expressions of arguments thus lead to three features whose distributional properties appear promising for distinguishing unergative, unaccusative, and object-drop verbs: transitivity, causativity, and animacy of subject. We also investigate two additional syntactic features: the use of the passive or active voice, and the use of the past participle or simple past part-of-speech (POS) tag (VBN or VBD in the Penn Treebank style). These features are related to the transitive/intransitive alternation, since a passive use implies a transitive use of the verb, as well as to the use of a past participle form of the verb.2

Table 4 summarizes the features we derive from the thematic properties and our expectations concerning their frequency of use. We hypothesize that these five features will exhibit distributional differences in the observed usages of the verbs that can be used for classification. In the next section, we describe the actual corpus counts that we develop to approximate the features we have identified. (Notice that the counts will be imperfect approximations to the thematic knowledge, beyond the inevitable errors due to automatic extraction from large automatically annotated corpora. Even when the counts are precise, they only constitute an approximation to the actual thematic notions, since the features we are using are not logically implied by the knowledge we want to capture, but only statistically correlated with it.)

3. Data Collection and Analysis

Clearly, some of the features we've proposed are difficult (e.g., the passive use) or impossible (e.g., animate subject use) to automatically extract with high accuracy from a

2 For our sample verbs, the statistical correlation between the transitive and passive features is highly significant (N = 59, R = .44, p < .001), as is the correlation between the transitive and past participle features (N = 59, R = .36, p < .005). (Since, as explained in the next section, our features are expressed as proportions, e.g., percent transitive use out of detected transitive and intransitive use, correlations of intransitivity with passive or past participle use have the same magnitude but are negative.)



Table 4
The features and expected behavior

Feature        Expected Frequency Pattern   Explanation

Transitivity   Unerg < Unacc < ObjDrop      Unaccusatives and unergatives have a causative transitive, hence lower transitive use. Furthermore, unergatives have an agentive object, hence very low transitive use.

Causativity    Unerg, ObjDrop < Unacc       Object-drop verbs do not have a causal agent, hence low "causative" use. Unergatives are rare in the transitive, hence low causative use.

Animacy        Unacc < Unerg, ObjDrop       Unaccusatives have a Theme subject in the intransitive, hence lower use of animate subjects.

Passive Voice  Unerg < Unacc < ObjDrop      Passive implies transitive use, hence correlated with transitive feature.

VBN Tag        Unerg < Unacc < ObjDrop      Passive implies past participle use (VBN), hence correlated with transitive (and passive).

large corpus, given the current state of annotation. However, we do assume that currently available corpora, such as the Wall Street Journal (WSJ), provide a representative and large enough sample of language from which to gather corpus counts that can approximate the distributional patterns of the verb class alternations. Our work draws on two text corpora: one, an automatically tagged combined corpus of 65 million words (primarily WSJ); the second, an automatically parsed corpus of 29 million words (a subset of the WSJ text from the first corpus). Using these corpora, we develop counting procedures that yield relative frequency distributions, for approximations to the five linguistic features we have determined, over a sample of verbs from our three classes.

3.1 Materials and Method
We chose a set of 20 verbs from each class, based primarily on the classification in Levin (1993).3 The complete list of verbs appears in Table 5; the group 1/group 2 designation is explained below in the section on counting. As indicated in the table, unergatives are manner-of-motion verbs (from the "run" class in Levin), unaccusatives are change-of-state verbs (from several of the classes in Levin's change-of-state super-class), while object-drop verbs were taken from a variety of classes in Levin's classification, all of which undergo the unexpressed object alternation. The most frequently used classes are verbs of change of possession, image-creation verbs, and verbs of creation and transformation. The selection of verbs was based partly on our intuitive judgment that the verbs were likely to be used with sufficient frequency in the WSJ. Also, each

3 We used an equal number of verbs from each class in order to have a balanced group of items. One potential disadvantage of this decision is that each verb class is represented equally, even though they may not be equally frequent in the corpora. Although we lose the relative frequency information among the classes that could provide a better bias for assigning a default classification (i.e., the most frequent one), we have the advantage that our classifier will be equally informed (in terms of number of exemplars) about each class.
Note that there are only 19 unaccusative verbs because ripped, which was initially counted in the unaccusatives, was then excluded from the analysis, as it occurred mostly in a very different usage in the corpus (as verb particle in ripped off) from the intended optionally intransitive usage.



Table 5
Verbs used in the experiments

Class         Description                     Selected Verbs

Unergative    manner of motion                jumped, rushed, marched, leaped, floated, raced, hurried, wandered, vaulted, paraded (group 1); galloped, glided, hiked, hopped, jogged, scooted, scurried, skipped, tiptoed, trotted (group 2)

Unaccusative  change of state                 opened, exploded, flooded, dissolved, cracked, hardened, boiled, melted, fractured, solidified (group 1); collapsed, cooled, folded, widened, changed, cleared, divided, simmered, stabilized (group 2)

Object-Drop   unexpressed object alternation  played, painted, kicked, carved, reaped, washed, danced, yelled, typed, knitted (group 1); borrowed, inherited, organized, rented, sketched, cleaned, packed, studied, swallowed, called (group 2)

verb presents the same form in the simple past and in the past participle (the regular "-ed" form). In order to simplify the counting procedure, we included only the "-ed" form of the verb, on the assumption that counts on this single verb form would approximate the distribution of the features across all forms of the verb. Additionally, as far as we were able given the preceding constraints, we selected verbs that could occur in the transitive and in the passive. Finally, we aimed for a frequency cut-off of 10 occurrences or more for each verb, although for unergatives we had to use one verb (jogged) that occurred only 8 times, in order to have 20 verbs that satisfied the other criteria above.

In performing this kind of corpus analysis, one has to recognize the fact that current corpus annotations do not distinguish verb senses. In these counts, we did not distinguish a core sense of the verb from an extended use of the verb. So, for instance, the sentence Consumer spending jumped 1.7% in February after a sharp drop the month before (WSJ 1987) is counted as an occurrence of the manner-of-motion verb jump in its intransitive form. This particular sense extension has a transitive alternant, but not a causative transitive (i.e., Consumer spending jumped the barrier but not Low taxes jumped consumer spending). Thus, while the possible subcategorizations remain the same, rates of transitivity and causativity may be different than for the literal manner-of-motion sense. This is an unavoidable result of using simple automatic extraction methods, given the current state of annotation of corpora.

For each occurrence of each verb, we counted whether it was in a transitive or intransitive use (trans), in a passive or active use (pass), in a past participle or simple past use (vbn), in a causative or non-causative use (caus), and with an animate subject or not (anim).4 Note that, except for the vbn feature, for which we simply extract the POS tag from the corpus, all other counts are approximations to the actual linguistic behaviour of the verb, as we describe in detail below.

4 One additional feature was recorded: the log frequency of the verb in the 65 million word corpus, motivated by the conjecture that the frequency of a verb may help in predicting its class. In our machine learning experiments, however, this conjecture was not borne out, as the frequency feature did not improve performance. This is the case for experiments on all of the verbs, as well as for separate experiments on the group 1 verbs (which were matched across the classes for frequency) and the group 2 verbs (which were not). We therefore limit discussion here to the thematically-motivated features.



The first three counts (trans, pass, vbn) were performed on the tagged ACL/DCI corpus available from the Linguistic Data Consortium, which includes the Brown Corpus (of one million words) and years 1987-1989 of the Wall Street Journal, a combined corpus in excess of 65 million words. The counts for these features proceeded as follows:

• trans: A number, a pronoun, a determiner, an adjective, or a noun were considered to be indication of a potential object of the verb. A verb occurrence preceded by forms of the verb be, or immediately followed by a potential object, was counted as transitive; otherwise the occurrence was counted as intransitive (specifically, if the verb was followed by a punctuation sign (commas, colons, full stops), or by a conjunction, a particle, a date, or a preposition).

• pass: A main verb (i.e., tagged VBD) was counted as active. A token with tag VBN was also counted as active if the closest preceding auxiliary was have, while it was counted as passive if the closest preceding auxiliary was be.

• vbn: The counts for VBN/VBD were simply done based on the POS label within the tagged corpus.

Each of the above three counts was normalized over all occurrences of the "-ed" form of the verb, yielding a single relative frequency measure for each verb for that feature; i.e., percent transitive (versus intransitive) use, percent passive (versus active) use, and percent VBN (versus VBD) use, respectively.
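As an illustration only, the three tag-based counts can be sketched as follows, assuming the corpus is represented as a list of (word, POS-tag) pairs in the Penn Treebank tagset. The tag sets and the backward search for the closest auxiliary are simplifications of the procedure described above, and all names are ours:

```python
# Penn Treebank tags signalling a potential object (number, pronoun,
# determiner, adjective, noun); an assumed simplification of the full list.
OBJECT_TAGS = {"CD", "PRP", "DT", "JJ", "NN", "NNS", "NNP", "NNPS"}
BE = {"am", "is", "are", "was", "were", "be", "been", "being"}
HAVE = {"have", "has", "had", "having"}

def count_features(tagged, verb):
    """Return normalized trans, pass, vbn proportions over all '-ed' tokens of `verb`."""
    n = trans = passive = vbn = 0
    for i, (word, tag) in enumerate(tagged):
        if word != verb or tag not in ("VBD", "VBN"):
            continue
        n += 1
        is_vbn = tag == "VBN"
        vbn += is_vbn
        # pass: a VBN token whose closest preceding auxiliary is a form of
        # 'be' is passive; a VBD token, or a preceding 'have', means active
        prev_aux = next((w.lower() for w, _ in reversed(tagged[:i])
                         if w.lower() in BE | HAVE), None)
        is_passive = is_vbn and prev_aux in BE
        passive += is_passive
        # trans: a passive use, or a potential object right after the verb
        next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else "."
        trans += is_passive or next_tag in OBJECT_TAGS
    if n == 0:
        return {}
    return {"trans": trans / n, "pass": passive / n, "vbn": vbn / n}
```

On a toy corpus containing "The chef melted the butter. The butter was melted. The butter melted.", this yields trans 2/3, pass 1/3, vbn 1/3 for melted.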

The last two counts (caus and anim) were performed on a parsed version of the 1988 year of the Wall Street Journal, so that we could extract subjects and objects of the verbs more accurately. This corpus of 29 million words was provided to us by Michael Collins, and was automatically parsed with the parser described in Collins (1997).5 The counts and their justification are described here:

• caus: As discussed above, the object of a causative transitive is the same semantic argument of the verb as the subject of the intransitive. The causative feature was approximated by the following steps, intended to capture the degree to which the subject of a verb can also occur as its object. Specifically, for each verb occurrence, the subject and object (if there was one) were extracted from the parsed corpus. The observed subjects across all occurrences of the verb were placed into one multiset of nouns, and the observed objects into a second multiset of nouns. (A multiset, or bag, was used so that our representation indicated the number of times each noun was used as either subject or object.) Then the proportion of overlap between the two multisets was calculated. We define overlap as the largest multiset of elements belonging to both the

5 Readers might be concerned about the portability of this method to languages for which no large parsed corpus is available. It is possible that using a fully parsed corpus is not necessary. Our results were replicated in English without the need for a fully parsed corpus (Anoop Sarkar, p.c., citing a project report by Wootiporn Tripasai). Our method was applied to 23 million words of the WSJ that were automatically tagged with Ratnaparkhi's maximum entropy tagger (Ratnaparkhi 1996) and chunked with the partial parser CASS (Abney 1996). The results are very similar to ours (best accuracy 66.6%), suggesting that a more accurate tagger than the one used on our corpus might in fact be sufficient to overcome the fact that no full parse is available.


Computational Linguistics Volume 27 Number 3

subject and the object multisets; e.g., the overlap between {a, a, a, b} and {a} is {a, a, a}. The proportion is the ratio between the cardinality of the overlap multiset and the sum of the cardinality of the subject and object multisets. For example, for the simple sets of characters above, the ratio would be 3/5, yielding a value of 60% for the caus feature.

° anim: A problem with a feature like animacy is that it requires either manual determination of the animacy of extracted subjects, or reference to an on-line resource such as WordNet for determining animacy. To approximate animacy with a feature that can be extracted automatically, and without reference to a resource external to the corpus, we take advantage of the well-attested animacy hierarchy, according to which pronouns are the most animate (Silverstein 1976; Dixon 1994). The hypothesis is that the words I, we, you, she, he, and they most often refer to animate entities. This hypothesis was confirmed by extracting 100 occurrences of the pronoun they (which can be either animate or inanimate) from our 65 million word corpus. The occurrences immediately preceded a verb. After eliminating repetitions, 94 occurrences were left, which were classified by hand, yielding 71 animate pronouns, 11 inanimate pronouns, and 12 unclassified occurrences (for lack of sufficient context to recover the antecedent of the pronoun with certainty). Thus at least 76% of usages of they were animate; we assume the percentage of animate usages of the other pronouns to be even higher. Since the hypothesis was confirmed, we count pronouns (other than it) in subject position (Kariaeva [1999]; cf. Aone and McKee [1996]). The values for the feature were determined by automatically extracting all subject/verb tuples including our 59 example verbs from the parsed corpus, and computing the ratio of occurrences of pronoun subjects to all subjects for each verb.
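The two parsed-corpus features can be sketched as follows; this is a minimal illustration of the definitions above, where the inputs stand in for the nouns of the extracted subject/verb and verb/object tuples:

```python
from collections import Counter

def caus_score(subjects, objects):
    """Proportion of overlap between a verb's subject and object
    multisets: for each noun type occurring in both, the larger of
    its two counts, divided by the summed sizes of the multisets."""
    subj, obj = Counter(subjects), Counter(objects)
    overlap = sum(max(subj[w], obj[w]) for w in set(subj) & set(obj))
    total = sum(subj.values()) + sum(obj.values())
    return overlap / total if total else 0.0

ANIMATE_PRONOUNS = {"i", "we", "you", "she", "he", "they"}

def anim_score(subjects):
    """Ratio of pronoun subjects (other than "it") to all subjects."""
    if not subjects:
        return 0.0
    return sum(s.lower() in ANIMATE_PRONOUNS for s in subjects) / len(subjects)
```

On the worked example above, caus_score(["a", "a", "a", "b"], ["a"]) returns 0.6, matching the 3/5 ratio in the text.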

Finally, as indicated in Table 5, the verbs are designated as belonging to "group 1" or "group 2". All the verbs are treated equally in our data analysis and in the machine learning experiments, but this designation does indicate a difference in details of the counting procedures described above. The verbs in group 1 had been used in an earlier study in which it was important to minimize noisy data (Stevenson and Merlo 1997a), so they generally underwent greater manual intervention in the counts. In adding group 2 for the classification experiment, we chose to minimize the intervention, in order to demonstrate that the classification process is robust enough to withstand the resulting noise in the data.

For group 2, the transitivity, voice, and VBN counts were done automatically, without any manual intervention. For group 1, these three counts were done automatically by regular expression patterns, and then subjected to correction, partly by hand and partly automatically, by one of the authors. For transitivity, the adjustments vary for the individual verbs. Most of the reassignments from a transitive to an intransitive labelling occurred when the following noun was not the direct object, but rather a measure phrase or a date. Most of the reassignments from intransitive to transitive occurred when a particle or a preposition following the verb did not introduce a prepositional phrase, but instead indicated a passive form (by) or was part of a phrasal verb. Some verbs were mostly used adjectivally, in which case they were excluded from the transitivity counts. For voice, the required adjustments included cases of coordination of the past participle, when the verb was preceded by a conjunction or a comma.


Merlo and Stevenson Statistical Verb Classification

Table 6
Aggregated relative frequency data for the five features. E = unergatives, A = unaccusatives, O = object-drops.

            trans        pass         vbn          caus         anim
Class  N    Mean  SD     Mean  SD     Mean  SD     Mean  SD     Mean  SD

E      20   .23   .23    .07   .12    .21   .26    .00   .00    .25   .24
A      19   .40   .24    .33   .27    .65   .27    .12   .14    .07   .09
O      20   .62   .25    .31   .26    .65   .23    .04   .07    .15   .14

These were collected and classified by hand as passive or active, based on intuition. Similarly, partial adjustments to the VBN counts were made by hand.

For the causativity feature, subjects and objects were determined by manual inspection of the corpus for verbs belonging to group 1, while they were extracted automatically from the parsed corpus for group 2. The group 1 verbs were sampled in three ways, depending on total frequency. For verbs with less than 150 occurrences, all instances of the verbs were used for subject/object extraction. For verbs whose total frequency was greater than 150, but whose VBD frequency was in the range 100-200, we extracted subjects and objects of the VBD occurrences only. For higher frequency verbs, we used only the first 100 VBD occurrences.6 The same script for computing the overlap of the extracted subjects and objects was then used on the resulting subject/verb and verb/object tuples, for both group 1 and group 2 verbs.

The animacy feature was calculated over subject/verb tuples, extracted automatically for both groups of verbs from the parsed corpus.

3.2 Data Analysis

The data collection described above yields the following data points in total: trans 27,403; pass 20,481; vbn 36,297; caus 11,307; anim 7,542. (Different features yield different totals because they were sampled independently, and the search patterns to extract some features are more imprecise than others.) The aggregate means by class of the normalized frequencies for all verbs are shown in Table 6; item by item distributions are provided in Appendix A, and raw counts are available from the authors. Note that aggregate means are shown for illustration purposes only: all machine learning experiments are performed on the individual normalized frequencies for each verb, as given in Appendix A.

The observed distributions of each feature are indeed roughly as expected according to the description in Section 2. Unergatives show a very low relative frequency of the trans feature, followed by unaccusatives, then object-drop verbs. Unaccusative verbs show a high frequency of the caus feature, and a low frequency of the anim feature, compared to the other classes. Somewhat unexpectedly, object-drop verbs exhibit a non-zero mean caus value (almost half the verbs have a caus value greater than zero), leading to a three-way causative distinction among the verb classes. We suspect that the approximation that we used for causative use, the overlap between subjects

6 For this last set of high-frequency verbs (exploded, jumped, opened, played, rushed), we used the first 100 occurrences as the simplest way to collect the sample. In response to an anonymous reviewer's concern, we later verified that these counts were not different from counts obtained by random sampling of 100 VBD occurrences. A paired t-test of the two sets of counts (first 100 sampling and random sampling) indicates that the two sets of counts are not statistically different (t = 1.283, DF = 4, p = 0.2687).



Table 7
Manually (Man) and automatically (Aut) calculated features for a random sample of verbs. T = trans, P = pass, V = vbn, C = caus, A = anim.

        Unergative               Unaccusative              Object-Drop
   hopped      scurried     folded      stabilized    inherited   swallowed
   Man   Aut   Man   Aut    Man   Aut   Man   Aut     Man   Aut   Man   Aut

T  .21   .21   .00   .00    .71   .23   .24   .18     1.00  .64   .96   .35
P  .00   .00   .00   .00    .44   .33   .19   .13     .39   .13   .54   .44
V  .03   .00   .10   .00    .56   .73   .71   .92     .56   .60   .64   .79
C  .00   .00   .00   .00    .54   .00   .24   .35     .00   .06   .00   .04
A  .93   1.00  .90   .14    .23   .00   .02   .00     .58   .32   .35   .22

and objects for a verb, also captures a "reciprocity" effect for some object-drop verbs (such as call), in which subjects and objects can be similar types of entities. Finally, although expected to be a redundant indicator of transitivity, pass and vbn, unlike trans, have very similar values for unaccusative and object-drop verbs, indicating that their distributions are sensitive to factors we have not yet investigated.

One issue we must address is how precisely the automatic counts reflect the actual linguistic behaviour of the verbs. That is, we must be assured that the patterns we note in the data in Table 6 are accurate reflections of the differential behaviour of the verb classes, and not an artifact of the way in which we estimate the features, or a result of inaccuracies in the counts. In order to evaluate the accuracy of our feature counts, we selected two verbs from each class, and determined the "true" value of each feature for each of those six verbs through manual counting. The six verbs were randomly selected from the group 2 subset of the verbs, since counts for group 2 verbs (as explained above) had not undergone manual correction. This allows us to determine the accuracy of the fully automatic counting procedures. The selected verbs (and their frequencies) are: hopped (29), scurried (21), folded (189), stabilized (286), inherited (357), swallowed (152). For verbs that had a frequency of over 100 in the "-ed" form, we performed the manual counts on the first 100 occurrences.

Table 7 shows the results of the manual counts, reported as proportions to facilitate comparison to the normalized automatic counts, shown in adjoining columns. We observe first that, overall, most errors in the automatic counts occur in the unaccusative and object-drop verbs. While tagging errors affect the vbn feature for all of the verbs somewhat, we note that trans and pass are consistently underestimated for unaccusative and object-drop verbs. These errors make the unaccusative and object-drop feature values more similar to each other, and therefore potentially harder to distinguish. Furthermore, because the trans and pass values are underestimated by the automatic counts, and therefore lower in value, they are also closer to the values for the unergative verbs. For the caus feature, we predict the highest values for the unaccusative verbs, and while that prediction is confirmed, the automatic counts for that class also show the most errors. Finally, although the general pattern of higher values for the anim feature of unergatives and object-drop verbs is preserved in the automatic counts, the feature is underestimated for almost all the verbs, again making the values for that feature closer across the classes than they are in reality.

We conclude that, although there are inaccuracies in all the counts, the general patterns expected based on our analysis of the verb classes hold in both the manual and automatic counts. Errors in the estimating and counting procedures are therefore



not likely to be responsible for the pattern of data in Table 6 above, which generally matches our predictions. Furthermore, the errors, at least for this random sample of verbs, occur in a direction that makes our task of distinguishing the classes more difficult, and indicates that developing more accurate search patterns may possibly sharpen the class distinctions and improve the classification performance.

4. Experiments in Classification

In this section, we turn to our computational experiments, which investigate whether the statistical indicators of thematic properties that we have developed can in fact be used to classify verbs. Recall that the task we have set ourselves is that of automatically learning the best class for a set of usages of a verb, as opposed to classifying individual occurrences of the verb. The frequency distributions of our features yield a vector for each verb that represents the estimated values for the verb on each dimension, across the entire corpus.

Vector template: [verb-name, trans, pass, vbn, caus, anim, class]

Example: [opened, .69, .09, .21, .16, .36, unacc]

The resulting set of 59 vectors constitutes the data for our machine learning experiments. We use this data to train an automatic classifier to determine, given the feature values for a new verb (not from the training set), which of the three major classes of English optionally intransitive verbs it belongs to.

4.1 Experimental Methodology

In pilot experiments on a subset of the features, we investigated a number of supervised machine learning methods that produce automatic classifiers (decision tree induction, rule learning, and two types of neural networks), as well as hierarchical clustering; see Stevenson et al. (1999) for more detail. Because we achieved approximately the same level of performance in all cases, we narrowed our further experimentation to the publicly available version of the C5.0 machine learning system (http://www.rulequest.com), a newer version of C4.5 (Quinlan 1992), due to its ease of use and wide availability. The C5.0 system generates both decision trees and corresponding rule sets from a training set of known classifications. In our experiments, we found little to no difference in performance between the trees and rule sets, and report only the rule set results.

In the experiments below, we follow two methodologies in training and testing, each of which tests a subset of cases held out from the training data. Thus, in all cases, the results we report are on test data that was never seen in training.7

The first training and testing methodology we follow is 10-fold cross-validation. In this approach, the system randomly divides the data into ten parts, and runs ten times on a different 90%-training-data/10%-test-data split, yielding an average accuracy and standard error across the ten test sets. This training methodology is very useful for

7 One anonymous reviewer raised the concern that we do not test on verbs that were unseen by the authors prior to finalizing the specific features to count. However, this does not reduce the generality of our results. The features we use are motivated by linguistic theory, and derived from the set of thematic properties that discriminate the verb classes. It is therefore very unlikely that they are skewed to the particular verbs we have chosen. Furthermore, our cross-validation experiments, described in the next subsection, show that our results hold across a very large number of randomly selected subsets of this sample of verbs.



our application, as it yields performance measures across a large number of training data/test data sets, avoiding the problems of outliers in a single random selection from a relatively small data set such as ours.

The second methodology is a single hold-out training and testing approach. Here the system is run N times, where N is the size of the data set (i.e., the 59 verbs in our case), each time holding out a single data vector as the test case, and using the remaining N-1 vectors as the training set. The single hold-out methodology yields an overall accuracy rate (when the results are averaged across all N trials), but also, unlike cross-validation, gives us classification results on each individual data vector. This property enables us to analyze differential performance on the individual verbs, and across the different verb classes.

Under both training and testing methodologies, the baseline (chance) performance in this task, a three-way classification, is 33.9%. In the single hold-out methodology, there are 59 test cases, with 20, 19, and 20 verbs each from the unergative, unaccusative, and object-drop classes, respectively. Chance performance of picking a single class label as a default and assigning it to all cases would yield at most 20 out of the 59 cases correct, or 33.9%. For the cross-validation methodology, the determination of a baseline is slightly more complex, as we are testing on a random selection of 10% of the full data set in each run. The 33.9% figure represents the expected relative proportion of a test set that would be labelled correctly by assignment of a default class label to the entire test set. Although the precise make-up of the test cases varies, on average the test set will represent the class membership proportions of the entire set of verbs. Thus, as with the single hold-out approach, chance accuracy corresponds to a maximum of 20/59, or 33.9%, of the test set being labelled correctly.
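The splitting scheme and baseline arithmetic can be sketched as follows. This is only an illustration of the procedure described above; C5.0 performs its own internal random division of the data.

```python
import random

def ten_fold_splits(n, seed=0):
    """Randomly divide n item indices into ten parts; each part serves
    in turn as the test set, the remaining 90% as the training set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]

splits = ten_fold_splits(59)   # ten disjoint train/test divisions

# Chance baseline: assign the largest class (20 of the 59 verbs) to
# every test case.
baseline = 20 / 59             # ~0.339
```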

The theoretical maximum accuracy for the task is, of course, 100%, although in Section 5 we discuss some classification results from human experts that indicate that a more realistic expectation is much lower (around 87%).

4.2 Results Using 10-Fold Cross-Validation

We first report the results of experiments using a training methodology of 10-fold cross-validation, repeated 50 times. This means that the 10-fold cross-validation procedure is repeated for 50 different random divisions of the data. The numbers reported are the averages of the results over all the trials. That is, the average accuracy and standard error from each random division of the data (a single cross-validation run, including 10 training and test sets) are averaged across the 50 different random divisions. This large number of experimental trials gives us a very tight bound on the mean accuracy reported, enabling us to determine with high confidence the statistical significance of differences in results.

Table 8 shows that performance of classification using individual features varies greatly, from little above the baseline to almost 22% above the baseline, or a reduction of a third of the error rate, a very good result for a single feature. (All reported accuracies in Table 8 are statistically distinct at the p < .01 level, using an ANOVA [df = 249, F = 33472] with a Tukey-Kramer post test.)

The first line of Table 9 shows that the combination of all features achieves an accuracy of 69.8%, which is 35.9% over the baseline, for a reduction in the error rate of 54%. This is a rather considerable result, given the very low baseline (33.9%). Moreover, recall that our training and testing sets are always disjoint (cf. Lapata and Brew [1999], Siegel [1999]); in other words, we are predicting the classification of verbs that were never seen in the training corpus, the hardest situation for a classification algorithm.

The second through sixth lines of Table 9 show the accuracy achieved on each subset of features that results from removing a single feature. This allows us to evaluate



Table 8
Percent accuracy and standard error of the verb classification task, using each feature individually, under a training methodology of 10-fold cross-validation repeated 50 times.

Feature   Accuracy   SE

caus      55.7       .1
vbn       52.5       .5
pass      50.2       .5
trans     47.1       .4
anim      35.3       .5

Table 9
Percent accuracy and standard error of the verb classification task, using features in combination, under a training methodology of 10-fold cross-validation repeated 50 times.

   Features Used               Feature Not Used   Accuracy   SE

1  trans pass vbn caus anim    -                  69.8       .5
2  trans vbn caus anim         pass               69.8       .5
3  trans pass vbn anim         caus               67.3       .6
4  trans pass caus anim        vbn                66.5       .5
5  trans pass vbn caus         anim               63.2       .6
6  pass vbn caus anim          trans              61.6       .6

the contribution of each feature to the performance of the classification process, by comparing the performance of the subset without it to the performance using the full set of features. We see that the removal of pass (second line) has no effect on the results, while removal of the remaining features yields a 2-8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant at the p < .05 level, except for between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 3752] with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted by the individual feature behavior. For example, caus, which is the best individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, anim and trans, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.

The general behaviour in classification based on individual features, and on size 4 and size 5 subsets of features, is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance; Table 10 summarizes this information. In the first data column, the table illustrates the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size n subsets in which each feature occurs. For example, the second data cell in the second row (54.9) indicates the average accuracy of all subsets of size 2 that contain the feature vbn. The last row of the table indicates, for each feature, the average accuracy of all subsets containing that feature.



Table 10
Average percent accuracy of feature subsets, by subset size and by sets of each size including each feature.

Subset   Mean Accuracy    Mean Accuracy of Subsets that Include Each Feature
Size     by Subset Size   vbn     pass    trans   anim    caus

1        48.2             52.5    50.2    47.1    35.3    55.7
2        55.1             54.9    52.8    56.4    58.0    57.6
3        60.5             60.1    58.5    62.3    61.1    60.5
4        65.7             65.5    64.7    66.7    66.3    65.3
5        69.8             69.8    69.8    69.8    69.8    69.8

Mean Acc/Feature          60.6    59.2    60.5    58.1    61.8

The first observation, that more features perform better, is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that, on average, larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation, that the performance of individual features is not always a predictor of their performance in combination, is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above average performance for all subsets of size 2 or greater.

4.3 Results Using Single Hold-Out Methodology

One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system, but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case, and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case, and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb, and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, which we turn to in Section 4.4.

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance; in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being



Table 11
Percent accuracy of the verb classification task, using features in combination, under a single hold-out training methodology.

   Features Used               Feature Not Used   Accuracy on All Verbs

1  trans pass vbn caus anim    -                  69.5
2  trans vbn caus anim         pass               71.2
3  trans pass vbn anim         caus               62.7
4  trans pass caus anim        vbn                61.0
5  trans pass vbn caus         anim               61.0
6  pass vbn caus anim          trans              64.4

Table 12
F score of classification within each class, under a single hold-out training methodology.

   Features Used               Feature     F score (%)   F score (%)   F score (%)
                               Not Used    for Unergs    for Unaccs    for Objdrops

1  trans pass vbn caus anim    -           73.9          68.6          64.9
2  trans vbn caus anim         pass        76.2          75.7          61.6
3  trans pass vbn anim         caus        65.1          60.0          62.8
4  trans pass caus anim        vbn         66.7          65.0          51.3
5  trans pass vbn caus         anim        72.7          47.0          60.0
6  pass vbn caus anim          trans       78.1          51.5          61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5-8% negative effect on performance, again similar to the cross-validation results. (Although note that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9%) than the unaccusative and object-drop verbs (F scores of 68.6% and 64.9%, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.
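The balanced F score used in Table 12 (defined in footnote 8) reduces to a short computation:

```python
def f_score(tp, fp, fn):
    """Balanced F score for one class: 2PR / (P + R), where precision
    P = tp / (tp + fp) and recall R = tp / (tp + fn)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, a class with 8 true positives, 2 false positives, and 2 false negatives has P = R = 0.8, hence F = 0.8.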

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs, that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results, there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence, here we use the balanced F score, which calculates an overall measure of performance as 2PR / (P + R), in which P (precision) is truePositives / (truePositives + falsePositives), and R (recall) is truePositives / (truePositives + falseNegatives).



frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8), but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns, for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification

We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans, pass, vbn, caus, anim) with the class labels assigned using each size 4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature, in going from the four-feature to five-feature set. Thus, we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2, and summarized here in Table 13.

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors, but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb, in that the action is inherently "uncontrolled", and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation, which is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.



Table 13
Expected class discriminations for each feature.

Feature         Expected Frequency Pattern

Transitivity    Unerg < Unacc < ObjDrop
Causativity     Unerg, ObjDrop < Unacc
Animacy         Unacc < Unerg, ObjDrop
Passive Voice   Unerg < Unacc < ObjDrop
VBN Tag         Unerg < Unacc < ObjDrop

Table 14Confusion matrix indicating number of errors in classication by verb class for the full set ofve features compared to two of the four-feature sets E = unergatives A = unaccusativesO = object-drops

                          Assigned Class
           All features      w/o caus        w/o anim
            E   A   O        E   A   O       E   A   O
Actual E    –   1   2        –   4   2       –   2   2
Class  A    4   –   3        5   –   2       5   –   6
       O    5   3   –        4   5   –       3   5   –

Table 15
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                               Assigned Class
           All features      w/o trans       w/o pass        w/o vbn
            E   A   O        E   A   O       E   A   O       E   A   O
Actual E    –   1   2        –   2   2       –   1   3       –   1   5
Class  A    4   –   3        3   –   7       1   –   4       2   –   4
       O    5   3   –        2   5   –       5   3   –       4   6   –

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
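The pairwise confusability measure is simple to compute. The following sketch (our own illustration; the dictionary encoding of the matrices is not from the original study) reproduces the comparison of the full feature set against the set without caus, using the error counts of Table 14:

```python
# Off-diagonal error counts from Table 14: cell [actual][assigned] holds the
# number of verbs of class `actual` mislabelled as class `assigned`.
# E = unergative, A = unaccusative, O = object-drop.
all_features = {"E": {"A": 1, "O": 2},
                "A": {"E": 4, "O": 3},
                "O": {"E": 5, "A": 3}}
without_caus = {"E": {"A": 4, "O": 2},
                "A": {"E": 5, "O": 2},
                "O": {"E": 4, "A": 5}}

def confusability(matrix, c1, c2):
    """Total errors between a pair of classes, in both directions."""
    return matrix[c1][c2] + matrix[c2][c1]

# Adding caus reduces unaccusative confusions (9 -> 5 and 7 -> 6)...
assert confusability(without_caus, "A", "E") == 9
assert confusability(all_features, "A", "E") == 5
assert confusability(without_caus, "A", "O") == 7
assert confusability(all_features, "A", "O") == 6
# ...while unergative/object-drop confusability worsens slightly (6 -> 7).
assert confusability(without_caus, "E", "O") == 6
assert confusability(all_features, "E", "O") == 7
```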

Computational Linguistics, Volume 27, Number 3

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature, compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives, and between unaccusatives and object-drops, is 9 and 7 errors, respectively; but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature, compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added: the confusability of unaccusatives and unergatives drops from 7 errors to 5, and of unaccusatives and object-drops from 11 errors to 6. Again, confusability of unergatives and object-drops is worse, with an increase in errors from 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated to trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First, consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans (that is, also making a three-way distinction among the classes), although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method in evaluating our approach to verb classification.



Table 16
Percent agreement (Agr) and pair-wise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

          Program         E1             E2             E3
         Agr     K      Agr     K      Agr     K      Agr     K
E1       59%    .36
E2       68%    .50     75%    .59
E3       66%    .49     70%    .53    77%    .66
Levin    69.5%  .54     71%    .56    86.5%  .80    83%    .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results, provided by the non-forced-choice task, where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from Chicago University Press. The verbs were to be classified into the three target classes (unergative, unaccusative, and object-drop), which were described in the instructions.10 (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html.)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training (which yields an accuracy of 69.5%), because those results provide the classification for each of the individual verbs.



an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55–57) (using the z distribution to determine significance; p < .001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
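Pairwise kappa of this kind can be computed directly from two annotators' label sequences. The sketch below uses the standard two-rater formulation (Klauer's calculation, which we follow in the paper, generalizes this idea); the toy label sequences are invented for illustration:

```python
from collections import Counter

def kappa(labels1, labels2):
    """Two-rater kappa: observed agreement corrected for the agreement
    expected by chance, which is estimated from each rater's own label
    proportions (so it varies with the relative category proportions)."""
    assert len(labels1) == len(labels2)
    n = len(labels1)
    observed = sum(a == b for a, b in zip(labels1, labels2)) / n
    c1, c2 = Counter(labels1), Counter(labels2)
    expected = sum((c1[c] / n) * (c2[c] / n) for c in set(c1) | set(c2))
    return (observed - expected) / (1 - expected)

# Perfect agreement gives 1 (hypothetical E/A/O labels for six verbs):
assert kappa(list("EAOEAO"), list("EAOEAO")) == 1.0
# Partial agreement falls between chance (0) and perfect (1):
assert abs(kappa(["A", "A", "B", "B"], ["A", "A", "B", "A"]) - 0.5) < 1e-9
```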

Evaluating the experts' performance, summarized in Table 16, we can remark two things, which confirm our expectations. First, the task is difficult, i.e., not performed at 100% (or close) even by trained experts when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .60) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
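The majority-vote combination is straightforward to implement. In this sketch (the verb labels are invented, and the fallback parameter stands in for the most accurate expert's judgment), a label is adopted when at least two of the three experts supply it:

```python
from collections import Counter

def majority_label(expert_labels, fallback):
    """Label given by at least two of three experts; when all three
    disagree, fall back to the most accurate expert's label."""
    label, count = Counter(expert_labels).most_common(1)[0]
    return label if count >= 2 else fallback

# Hypothetical judgments for single verbs (E/A/O class labels):
assert majority_label(["A", "A", "O"], fallback="E") == "A"
assert majority_label(["E", "A", "O"], fallback="O") == "O"  # no majority
```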

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task (86.5%) as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than one of the human experts, in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.
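The approximately 68% figure follows from taking the best expert's 86.5% as the effective ceiling and the 34% baseline as chance performance; a quick check of the arithmetic:

```python
# Proportion of the achievable gain (ceiling - chance) actually covered
# by the program's accuracy.
chance, program, ceiling = 0.34, 0.695, 0.865
reduction = (program - chance) / (ceiling - chance)
assert round(reduction, 2) == 0.68
```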

One interesting question is whether experts and program disagree on the same verbs, and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard. However, in 9 cases at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3) and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1,000), and those with log frequency over 3 (i.e., frequency over 1,000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops); the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops); and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote, verbs with frequency less than 100 yield an accuracy of 92% (K = .84); verbs with frequency between 100 and 1,000, accuracy 80% (K = .69); and verbs with frequency over 1,000, accuracy 90% (K = .82). For the "best" expert the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5% (K = .74); verbs with frequency between 100 and 1,000, accuracy 84% (K = .76); and verbs with frequency over 1,000, accuracy 90% (K = .82).
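The grouping by order of magnitude of frequency can be sketched as follows (how ties at exactly 100 or 1,000 were assigned is not stated in the text; the boundary convention below is our assumption):

```python
import math

def frequency_group(freq):
    """Bin a verb by the order of magnitude of its corpus frequency."""
    logf = math.log10(freq)
    if logf < 2:      # frequency below 100
        return "low"
    if logf <= 3:     # frequency between 100 and 1,000
        return "mid"
    return "high"     # frequency above 1,000

assert frequency_group(99) == "low"
assert frequency_group(500) == "mid"
assert frequency_group(5000) == "high"
```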

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs, taken from Levin's (1993) electronic index as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = .64) with each other, and 86% (K = .80) and 69% (K = .57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = .56] with each other, and 68% [K = .55] and 60.5% [K = .46] with the gold standard, respectively.) These results show that the forced-choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally in syntactic alternations, giving rise to differences in subcategorization frames, or the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs (their argument structure) are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989] and Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a "change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples, meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998, 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993]; Sanfilippo and Poznanski [1992]; Manning [1993]; Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed which can work on a rough collection of texts, and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and the work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy, compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multilingual texts. Like ours, their work goes beyond statistics over subcategorizations, to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.



Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45% with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
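The balanced F-score we calculate is the harmonic mean of precision and recall, which can be checked in a couple of lines:

```python
def f_score(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# From Schulte im Walde's (2000) reported 61% precision and 36% recall:
assert round(f_score(0.61, 0.36), 2) == 0.45
```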

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition, we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and of Lapata and Brew (1999) in attempting to learn semantic notions from distributions of

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.

401

Computational Linguistics Volume 27 Number 3

indicators that can be gleaned from a text In our case we are trying to learn argu-ment structure a ner-grained classication than the dichotomic distinctions studiedby Siegel Like Lapata and Brew three of our indicatorsmdashtrans vbn passmdashare basedon the assumption that distributional differences in subcategorization frames are re-lated to underlying verb class distinctions However we also show that other syntacticindicatorsmdashcaus and animmdashcan be devised that tap directly into the argument struc-ture of a verb Unlike Schulte im Walde (2000) we nd the use of these semanticfeatures helpful in classicationmdashusing only trans and its related features vbn andpass we achieve only 55 accuracy in comparison to 698 using the full set of fea-tures This can perhaps be seen as support for our hypothesis that argument structureis the right level of representation for verb class distinctions since it appears that ourfeatures that capture thematic differences are useful in classication while Schulte imWaldersquos selectional restriction features were not

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to the external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature), or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).
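The pronoun-based approximation of subject animacy can be sketched in a few lines. The pronoun inventory and the input list of subject tokens below are illustrative assumptions, not the paper's exact extraction procedure:

```python
def anim_feature(subjects):
    """Proportion of a verb's subject tokens that are animate personal
    pronouns -- a cheap proxy for subject animacy that needs no external
    semantic resource (note: "it" is deliberately excluded as inanimate)."""
    pronouns = {"i", "we", "you", "he", "she", "they",
                "me", "us", "him", "her", "them"}
    subjects = [s.lower() for s in subjects]
    return sum(s in pronouns for s in subjects) / len(subjects)

# Hypothetical subject tokens extracted for two verbs.
print(anim_feature(["he", "crowd", "she", "they", "runner"]))  # 0.6
print(anim_feature(["ice", "price", "she", "door"]))           # 0.25
```

The point of the proxy is distributional: pronoun subjects will not catch every animate subject, but verbs whose subjects are more often Agents should accumulate proportionally more pronoun subjects across a large corpus.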

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification (i.e., sets of verbs that have the same multi-way classifications according to Levin [1993]) can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form, in themselves, a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction, or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.
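The reported error-rate reductions follow directly from the accuracy figures given above (69.8% accuracy, 33.9% chance baseline, 86.5% expert upper bound); a short check of the arithmetic:

```python
baseline, system, upper = 33.9, 69.8, 86.5

err_baseline = 100 - baseline   # 66.1% error at chance
err_system   = 100 - system     # 30.2% error for the classifier
err_upper    = 100 - upper      # 13.5% error for the human expert

# Reduction in error rate relative to chance:
red_chance = (err_baseline - err_system) / err_baseline
# Reduction relative to the expert-based upper bound, i.e., measured
# against the error that is reducible in principle:
red_expert = (err_baseline - err_system) / (err_baseline - err_upper)

print(round(100 * red_chance))  # 54
print(round(100 * red_expert))  # 68
```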

Acknowledgments
We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM); the United States NSF (grants 9702331 and 9818322 to SS); the Canadian NSERC (grant to SS); the University of Toronto; and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions, and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8-15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191-202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117-142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85-116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3-20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243-262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356-363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53-64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: the Kappa statistics. Computational Linguistics, 22(2):249-254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16-23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520-528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293-299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1-55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71-100.

Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322-327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547-619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53-110.

Jakobson, Roman. 1971. Signe Zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211-219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126-1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680-686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397-404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266-274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1-62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235-242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256-263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493-1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134-142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659-1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161-188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: An Overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1-13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65-89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1-2):217-222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157-189.

Pinker, Steven. 1989. Learnability and Cognition: the Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133-142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1-2):127-160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49-56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80-87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747-753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112-119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112-171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237-265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401-430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715-720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1-2):349-399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15-21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177-184, Vancouver, Canada.


Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs
            Freq   vbn    pass   trans  caus   anim

Min Value      8   0.00   0.00   0.00   0.00   0.00
Max Value   4088   1.00   0.39   0.74   0.00   1.00

floated      176   0.43   0.26   0.74   0.00   0.17
hurried       86   0.40   0.31   0.50   0.00   0.37
jumped      4088   0.09   0.00   0.03   0.00   0.20
leaped       225   0.09   0.00   0.05   0.00   0.13
marched      238   0.10   0.01   0.09   0.00   0.12
paraded       33   0.73   0.39   0.46   0.00   0.50
raced        123   0.01   0.00   0.06   0.00   0.15
rushed       467   0.22   0.12   0.20   0.00   0.10
vaulted       54   0.00   0.00   0.41   0.00   0.03
wandered      67   0.02   0.00   0.03   0.00   0.32
galloped      12   1.00   0.00   0.00   0.00   0.00
glided        14   0.00   0.00   0.08   0.00   0.50
hiked         25   0.28   0.12   0.29   0.00   0.40
hopped        29   0.00   0.00   0.21   0.00   1.00
jogged         8   0.29   0.00   0.29   0.00   0.33
scooted       10   0.00   0.00   0.43   0.00   0.00
scurried      21   0.00   0.00   0.00   0.00   0.14
skipped       82   0.22   0.02   0.64   0.00   0.16
tiptoed       12   0.17   0.00   0.00   0.00   0.00
trotted       37   0.19   0.17   0.07   0.00   0.18

Unaccusative Verbs
            Freq   vbn    pass   trans  caus   anim

Min Value     13   0.16   0.00   0.02   0.00   0.00
Max Value   5543   0.95   0.80   0.76   0.41   0.36

boiled        58   0.92   0.70   0.42   0.00   0.00
cracked      175   0.61   0.19   0.76   0.02   0.14
dissolved    226   0.51   0.58   0.71   0.05   0.11
exploded    409   0.34   0.02   0.66   0.37   0.04
flooded      235   0.47   0.57   0.44   0.04   0.03
fractured     55   0.95   0.76   0.51   0.00   0.00
hardened     123   0.92   0.55   0.56   0.12   0.00
melted        70   0.80   0.44   0.02   0.00   0.19
opened      3412   0.21   0.09   0.69   0.16   0.36
solidified    34   0.65   0.21   0.68   0.00   0.12
collapsed    950   0.16   0.00   0.16   0.01   0.02
cooled       232   0.85   0.21   0.29   0.13   0.11
folded       189   0.73   0.33   0.23   0.00   0.00
widened     1155   0.18   0.02   0.13   0.41   0.01
changed     5543   0.73   0.23   0.47   0.22   0.08
cleared     1145   0.58   0.40   0.50   0.31   0.06
divided     1539   0.93   0.80   0.17   0.10   0.05
simmered      13   0.83   0.00   0.09   0.00   0.00
stabilized   286   0.92   0.13   0.18   0.35   0.00



Object-Drop Verbs
            Freq   vbn    pass   trans  caus   anim

Min Value     39   0.10   0.04   0.21   0.00   0.00
Max Value  15063   0.95   0.99   1.00   0.24   0.42

carved       185   0.85   0.66   0.98   0.00   0.00
danced        88   0.22   0.14   0.37   0.00   0.00
kicked       308   0.30   0.18   0.97   0.00   0.33
knitted       39   0.95   0.99   0.93   0.00   0.00
painted      506   0.72   0.18   0.71   0.00   0.38
played      2689   0.38   0.16   0.24   0.00   0.00
reaped       172   0.56   0.05   0.90   0.00   0.22
typed         57   0.81   0.74   0.81   0.00   0.00
washed       137   0.79   0.60   1.00   0.00   0.00
yelled        74   0.10   0.04   0.38   0.00   0.00
borrowed    1188   0.77   0.15   0.60   0.13   0.19
inherited    357   0.60   0.13   0.64   0.06   0.32
organized   1504   0.85   0.38   0.65   0.18   0.07
rented       232   0.72   0.22   0.61   0.00   0.42
sketched      44   0.67   0.17   0.44   0.00   0.20
cleaned      160   0.83   0.47   0.21   0.05   0.21
packed       376   0.84   0.12   0.40   0.05   0.19
studied      901   0.66   0.17   0.57   0.05   0.11
swallowed    152   0.79   0.44   0.35   0.04   0.22
called     15063   0.56   0.22   0.72   0.24   0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.
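The interval-overlap procedure just described can be mechanized directly; a small sketch, using accuracy/SE pairs from the table below:

```python
def ci95(acc, se, t=2.01):
    """95% confidence interval: accuracy +/- t * SE (t = 2.01 for df = 49)."""
    return (acc - t * se, acc + t * se)

def significantly_different(a, b):
    """True iff the two 95% confidence intervals do not overlap,
    i.e., the difference is significant at the p < .05 level."""
    lo_a, hi_a = ci95(*a)
    lo_b, hi_b = ci95(*b)
    return hi_a < lo_b or hi_b < lo_a

full_set   = (69.8, 0.5)   # trans pass vbn caus anim
trans_anim = (59.5, 0.6)   # trans anim
vbn_anim   = (59.4, 0.5)   # vbn anim

print(significantly_different(full_set, trans_anim))   # True
print(significantly_different(trans_anim, vbn_anim))   # False
```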

Accuracy  SE   Features                    Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim    57.3      0.5  trans caus
69.8      0.5  trans vbn caus anim         57.3      0.5  pass vbn anim
67.3      0.6  trans pass vbn anim         56.7      0.5  pass caus anim
66.7      0.5  trans vbn anim              55.7      0.5  vbn caus
66.5      0.5  trans pass caus anim        55.7      0.1  caus
64.4      0.5  trans vbn caus              55.4      0.4  pass caus
63.2      0.6  trans pass vbn caus         55.0      0.6  trans pass vbn
63.0      0.5  trans pass anim             54.7      0.4  trans pass
62.9      0.4  trans caus anim             54.2      0.5  trans vbn
62.1      0.5  caus anim                   52.5      0.5  vbn
61.7      0.5  trans pass caus             50.9      0.5  pass anim
61.6      0.6  pass vbn caus anim          50.2      0.6  pass vbn
60.1      0.4  vbn caus anim               50.2      0.5  pass
59.5      0.6  trans anim                  47.1      0.4  trans
59.4      0.5  vbn anim                    35.3      0.5  anim
57.4      0.6  pass vbn caus




marked member is more frequent (Greenberg 1966; Moravcsik and Wirth 1983). The claim of markedness theory is that once an element has been identified by one test as the unmarked element of a scale, then all other tests will be correlated. The three thematic assignments appear to be ranked on a scale by the simplicity and distribution tests, as we describe below. From this, we can conclude that frequency, as a third correlated test, should also be ranked by the same scale, and we can therefore make predictions about the expected frequencies of the three thematic assignments.

First, we note that the specification of an Agent of Causation for transitive unergatives (such as race) and unaccusatives (such as melt) indicates a causative construction. Causative constructions relate two events, the causing event and the core event described by the intransitive verb; the Agent of Causation is the Agent of the causing event. This double event structure can be considered as more complex than the single event that is found in a transitive object-drop verb (such as play) (Stevenson and Merlo 1997b). The simplicity test thus indicates that the causative unergatives and unaccusatives are marked in comparison to the transitive object-drop verbs.

We further observe that the causative transitive of an unergative verb has an Agent thematic role in object position, which is subordinated to the Agent of Causation in subject position, yielding an unusual "double agentive" thematic structure. This lexical causativization of unergatives (in contrast to analytic causativization) is a distributionally rarer phenomenon, found in fewer languages, than lexical causatives of unaccusatives. In asking native speakers about our verbs, we have found that lexical causatives of unergative verbs are not attested in Italian, French, German, Portuguese, Gungbe (Kwa family), and Czech. On the other hand, the lexical causatives are possible for unaccusative verbs (i.e., where the object is a Theme) in all these languages. Vietnamese appears to allow a very restricted form of causativization of unergatives, limited to only those cases that have a comitative reading. The typological distribution test thus indicates that unergatives are more marked than unaccusatives in the transitive form.

From these observations, we can conclude that unergatives (such as race) have the most marked transitive argument structure, unaccusatives (such as melt) have an intermediately marked transitive argument structure, and object-drops (such as play) have the least marked transitive argument structure of the three. Under the assumptions of markedness theory outlined above, we then predict that unergatives are the least frequent in the transitive, that unaccusatives have intermediate frequency in the transitive, and that object-drop verbs are the most frequent in the transitive.

Causativity. Due to the causative alternation of unergatives and unaccusatives, the thematic role of the subject of the intransitive is identical to that of the object of the transitive, as shown in the second and third columns of thematic roles in Table 3. Given the identity of thematic role mapped to subject and object positions across the two alternants, we expect to observe the same noun occurring at times as subject of the verb, and at other times as object of the verb. In contrast, for object-drop verbs, the thematic role of the subject of the intransitive is identical to that of the subject of the transitive, not the object of the transitive. We therefore expect that it will be less common for the same noun to occur in subject and object position across instances of the same object-drop verb.

Thus, we hypothesize that this pattern of thematic role assignments will be reflected in a differential amount of usage, across the classes, of the same nouns as subjects and objects for a given verb. Generally, we would expect that causative verbs (in our case, the unergative and unaccusative verbs) would have a greater degree of overlap of nouns in subject and object position than non-causative transitive verbs (in our case, the object-drop verbs). However, since the causative is a transitive use, and the transitive use of unergatives is expected to be rare (see above), we do not expect unergatives to exhibit a high degree of detectable overlap in a corpus. Thus, this overlap of subjects and objects should primarily distinguish unaccusatives (predicted to have high overlap of subjects and objects) from the other two classes (each of which is predicted to have low [detectable] overlap of subjects and objects).
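The subject/object overlap idea can be rendered as a simple count. This is an assumed formulation for illustration (multiset overlap normalized by the total number of tokens), not necessarily the exact formula used for the caus feature:

```python
from collections import Counter

def caus_feature(subjects, objects):
    """Degree to which the same nouns occur both as subject and as object
    of a verb across a corpus -- an approximation of causative use.
    (Illustrative formula: normalized multiset overlap.)"""
    subj, obj = Counter(subjects), Counter(objects)
    overlap = sum(min(subj[w], obj[w]) for w in set(subj) & set(obj))
    return overlap / (sum(subj.values()) + sum(obj.values()))

# An unaccusative like "melt": Themes show up in both positions.
print(caus_feature(["ice", "snow", "ice"], ["ice", "butter"]))  # 0.2
# An object-drop verb like "play": little overlap expected.
print(caus_feature(["child", "team"], ["music", "game"]))       # 0.0
```

As the text predicts, the measure is paradigmatic: it compares noun distributions across occurrences of a verb, rather than inspecting any single sentence.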

Animacy. Finally, considering the roles in the first and last columns of thematic assignments in Table 3, we observe that unergative and object-drop verbs assign an agentive role to their subject in both the transitive and intransitive, while unaccusatives assign an agentive role to their subject only in the transitive. Under the assumption that the intransitive use of unaccusatives is not rare, we then expect that unaccusatives will occur less often overall with an agentive subject than will the other two verb classes. (The assumption that unaccusatives are not rare in the intransitive is based on the linguistic complexity of the causative transitive alternant, and is borne out in our corpus analysis.) On the further assumption that Agents tend to be animate entities more so than Themes are, we expect that unaccusatives will occur less frequently with an animate subject compared to unergative and object-drop verbs. Note the importance of our use of frequency distributions: the claim is not that only Agents can be animate, but rather that nouns that receive an Agent role will more often be animate than nouns that receive a Theme role.

Additional Features. The above interactions between thematic roles and the syntactic expressions of arguments thus lead to three features whose distributional properties appear promising for distinguishing unergative, unaccusative, and object-drop verbs: transitivity, causativity, and animacy of subject. We also investigate two additional syntactic features: the use of the passive or active voice, and the use of the past participle or simple past part-of-speech (POS) tag (VBN or VBD in the Penn Treebank style). These features are related to the transitive/intransitive alternation, since a passive use implies a transitive use of the verb, as well as to the use of a past participle form of the verb.2

Table 4 summarizes the features we derive from the thematic properties, and our expectations concerning their frequency of use. We hypothesize that these five features will exhibit distributional differences in the observed usages of the verbs that can be used for classification. In the next section, we describe the actual corpus counts that we develop to approximate the features we have identified. (Notice that the counts will be imperfect approximations to the thematic knowledge, beyond the inevitable errors due to automatic extraction from large automatically annotated corpora. Even when the counts are precise, they only constitute an approximation to the actual thematic notions, since the features we are using are not logically implied by the knowledge we want to capture, but only statistically correlated.)

3. Data Collection and Analysis

Clearly, some of the features we've proposed are difficult (e.g., the passive use) or impossible (e.g., animate subject use) to automatically extract with high accuracy from a large corpus, given the current state of annotation. However, we do assume that currently available corpora, such as the Wall Street Journal (WSJ), provide a representative and large enough sample of language from which to gather corpus counts that can approximate the distributional patterns of the verb class alternations. Our work draws on two text corpora: one, an automatically tagged combined corpus of 65 million words (primarily WSJ); the second, an automatically parsed corpus of 29 million words (a subset of the WSJ text from the first corpus). Using these corpora, we develop counting procedures that yield relative frequency distributions for approximations to the five linguistic features we have determined, over a sample of verbs from our three classes.

2 For our sample verbs, the statistical correlation between the transitive and passive features is highly significant (N = 59, R = .44, p < .01), as is the correlation between the transitive and past participle features (N = 59, R = .36, p < .05). (Since, as explained in the next section, our features are expressed as proportions (e.g., percent transitive use out of detected transitive and intransitive use), correlations of intransitivity with passive or past participle use have the same magnitude, but are negative.)

Table 4
The features and expected behavior.

Feature        Expected Frequency Pattern   Explanation

Transitivity   Unerg < Unacc < ObjDrop      Unaccusatives and unergatives have a causative
                                            transitive, hence lower transitive use. Furthermore,
                                            unergatives have an agentive object, hence very low
                                            transitive use.

Causativity    Unerg, ObjDrop < Unacc       Object-drop verbs do not have a causal agent, hence
                                            low "causative" use. Unergatives are rare in the
                                            transitive, hence low causative use.

Animacy        Unacc < Unerg, ObjDrop       Unaccusatives have a Theme subject in the
                                            intransitive, hence lower use of animate subjects.

Passive Voice  Unerg < Unacc < ObjDrop      Passive implies transitive use, hence correlated
                                            with transitive feature.

VBN Tag        Unerg < Unacc < ObjDrop      Passive implies past participle use (VBN), hence
                                            correlated with transitive (and passive).

3.1 Materials and Method
We chose a set of 20 verbs from each class, based primarily on the classification in Levin (1993).3 The complete list of verbs appears in Table 5; the group 1/group 2 designation is explained below in the section on counting. As indicated in the table, unergatives are manner-of-motion verbs (from the "run" class in Levin); unaccusatives are change-of-state verbs (from several of the classes in Levin's change-of-state super-class); while object-drop verbs were taken from a variety of classes in Levin's classification, all of which undergo the unexpressed object alternation. The most frequently used classes are verbs of change of possession, image-creation verbs, and verbs of creation and transformation. The selection of verbs was based partly on our intuitive judgment that the verbs were likely to be used with sufficient frequency in the WSJ. Also, each

3 We used an equal number of verbs from each class in order to have a balanced group of items. One potential disadvantage of this decision is that each verb class is represented equally, even though they may not be equally frequent in the corpora. Although we lose the relative frequency information among the classes that could provide a better bias for assigning a default classification (i.e., the most frequent one), we have the advantage that our classifier will be equally informed (in terms of number of exemplars) about each class.

Note that there are only 19 unaccusative verbs because ripped, which was initially counted in the unaccusatives, was then excluded from the analysis, as it occurred mostly in a very different usage in the corpus (as a verb particle in ripped off) from the intended optionally intransitive usage.


Computational Linguistics, Volume 27, Number 3

Table 5
Verbs used in the experiments

Class Name     Description                      Selected Verbs

Unergative     manner of motion                 jumped, rushed, marched, leaped, floated, raced, hurried, wandered, vaulted, paraded (group 1); galloped, glided, hiked, hopped, jogged, scooted, scurried, skipped, tiptoed, trotted (group 2)

Unaccusative   change of state                  opened, exploded, flooded, dissolved, cracked, hardened, boiled, melted, fractured, solidified (group 1); collapsed, cooled, folded, widened, changed, cleared, divided, simmered, stabilized (group 2)

Object-Drop    unexpressed object alternation   played, painted, kicked, carved, reaped, washed, danced, yelled, typed, knitted (group 1); borrowed, inherited, organized, rented, sketched, cleaned, packed, studied, swallowed, called (group 2)

verb presents the same form in the simple past and in the past participle (the regular "-ed" form). In order to simplify the counting procedure, we included only the "-ed" form of the verb, on the assumption that counts on this single verb form would approximate the distribution of the features across all forms of the verb. Additionally, as far as we were able given the preceding constraints, we selected verbs that could occur in the transitive and in the passive. Finally, we aimed for a frequency cut-off of 10 occurrences or more for each verb, although for unergatives we had to use one verb (jogged) that occurred only 8 times, in order to have 20 verbs that satisfied the other criteria above.

In performing this kind of corpus analysis, one has to recognize the fact that current corpus annotations do not distinguish verb senses. In these counts, we did not distinguish a core sense of the verb from an extended use of the verb. So, for instance, the sentence Consumer spending jumped 1.7% in February after a sharp drop the month before (WSJ 1987) is counted as an occurrence of the manner-of-motion verb jump in its intransitive form. This particular sense extension has a transitive alternant, but not a causative transitive (i.e., Consumer spending jumped the barrier, but not Low taxes jumped consumer spending). Thus, while the possible subcategorizations remain the same, rates of transitivity and causativity may be different than for the literal manner-of-motion sense. This is an unavoidable result of using simple automatic extraction methods, given the current state of annotation of corpora.

For each occurrence of each verb, we counted whether it was in a transitive or intransitive use (trans), in a passive or active use (pass), in a past participle or simple past use (vbn), in a causative or non-causative use (caus), and with an animate subject or not (anim).4 Note that, except for the vbn feature, for which we simply extract the POS tag from the corpus, all other counts are approximations to the actual linguistic behaviour of the verb, as we describe in detail below.

4 One additional feature was recorded, the log frequency of the verb in the 65 million word corpus, motivated by the conjecture that the frequency of a verb may help in predicting its class. In our machine learning experiments, however, this conjecture was not borne out, as the frequency feature did not improve performance. This is the case for experiments on all of the verbs, as well as for separate experiments on the group 1 verbs (which were matched across the classes for frequency) and the group 2 verbs (which were not). We therefore limit discussion here to the thematically-motivated features.


The first three counts (trans, pass, vbn) were performed on the tagged ACL/DCI corpus available from the Linguistic Data Consortium, which includes the Brown Corpus (of one million words) and years 1987–1989 of the Wall Street Journal, a combined corpus in excess of 65 million words. The counts for these features proceeded as follows:

• trans: A number, a pronoun, a determiner, an adjective, or a noun were considered to be indication of a potential object of the verb. A verb occurrence preceded by forms of the verb be, or immediately followed by a potential object, was counted as transitive; otherwise, the occurrence was counted as intransitive (specifically, if the verb was followed by a punctuation sign (commas, colons, full stops) or by a conjunction, a particle, a date, or a preposition).

• pass: A main verb (i.e., tagged VBD) was counted as active. A token with tag VBN was also counted as active if the closest preceding auxiliary was have, while it was counted as passive if the closest preceding auxiliary was be.

• vbn: The counts for VBN/VBD were simply done based on the POS label within the tagged corpus.

Each of the above three counts was normalized over all occurrences of the "-ed" form of the verb, yielding a single relative frequency measure for each verb for that feature: i.e., percent transitive (versus intransitive) use, percent active (versus passive) use, and percent VBN (versus VBD) use, respectively.
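As a rough illustration of the per-occurrence decision and the normalization step, the trans count could be sketched as follows. This is an illustrative reconstruction, not the authors' actual scripts: the tag set is Penn Treebank, and all function names are hypothetical.

```python
# Illustrative sketch of the tag-based transitivity heuristic.
# Tags signalling a potential object: number, pronoun, determiner, adjective, noun.
OBJECT_TAGS = {"CD", "PRP", "DT", "JJ", "NN", "NNS", "NNP"}

def is_transitive(tagged_tokens, i):
    """Simplified heuristic: the verb at index i is counted transitive if the
    next token looks like a potential object. (The full heuristic in the text
    also counts be-passives as transitive; that clause is omitted here.)"""
    if i + 1 >= len(tagged_tokens):
        return False
    return tagged_tokens[i + 1][1] in OBJECT_TAGS

def relative_frequency(flags):
    """Normalize per-occurrence boolean decisions into a single relative
    frequency value for the verb (e.g., percent transitive use)."""
    return sum(flags) / len(flags) if flags else 0.0

# Toy example: two occurrences of "opened", one transitive, one intransitive.
sent1 = [("she", "PRP"), ("opened", "VBD"), ("the", "DT"), ("door", "NN")]
sent2 = [("the", "DT"), ("door", "NN"), ("opened", "VBD"), (".", ".")]
flags = [is_transitive(sent1, 1), is_transitive(sent2, 2)]
print(relative_frequency(flags))  # 0.5
```

The same normalization applies to the pass and vbn decisions, each computed independently over all "-ed" occurrences of the verb.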

The last two counts (caus and anim) were performed on a parsed version of the 1988 year of the Wall Street Journal, so that we could extract subjects and objects of the verbs more accurately. This corpus of 29 million words was provided to us by Michael Collins, and was automatically parsed with the parser described in Collins (1997).5 The counts and their justification are described here.

• caus: As discussed above, the object of a causative transitive is the same semantic argument of the verb as the subject of the intransitive. The causative feature was approximated by the following steps, intended to capture the degree to which the subject of a verb can also occur as its object. Specifically, for each verb occurrence, the subject and object (if there was one) were extracted from the parsed corpus. The observed subjects across all occurrences of the verb were placed into one multiset of nouns, and the observed objects into a second multiset of nouns. (A multiset, or bag, was used so that our representation indicated the number of times each noun was used as either subject or object.) Then the proportion of overlap between the two multisets was calculated. We define overlap as the largest multiset of elements belonging to both the

5 Readers might be concerned about the portability of this method to languages for which no large parsed corpus is available. It is possible that using a fully parsed corpus is not necessary. Our results were replicated in English without the need for a fully parsed corpus (Anoop Sarkar, p.c., citing a project report by Wootiporn Tripasai). Our method was applied to 23 million words of the WSJ that were automatically tagged with Ratnaparkhi's maximum entropy tagger (Ratnaparkhi 1996) and chunked with the partial parser CASS (Abney 1996). The results are very similar to ours (best accuracy 66.6%), suggesting that a more accurate tagger than the one used on our corpus might in fact be sufficient to overcome the fact that no full parse is available.


subject and the object multisets; e.g., the overlap between {a, a, a, b} and {a} is {a, a, a}. The proportion is the ratio between the cardinality of the overlap multiset and the sum of the cardinalities of the subject and object multisets. For example, for the simple sets of characters above, the ratio would be 3/5, yielding a value of .60 for the caus feature.
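On one reading of this definition (each noun type occurring in both multisets contributes its larger count, so that the overlap of {a, a, a, b} and {a} is {a, a, a}), the computation can be sketched in a few lines. This is an illustrative reconstruction, not the authors' script:

```python
from collections import Counter

def caus_score(subjects, objects):
    """Approximate the causative feature: the proportion of overlap between a
    verb's subject and object multisets, where each noun type appearing in
    both multisets contributes its larger count to the overlap."""
    s, o = Counter(subjects), Counter(objects)
    overlap = sum(max(s[n], o[n]) for n in set(s) & set(o))
    total = len(subjects) + len(objects)
    return overlap / total if total else 0.0

# The worked example from the text: overlap {a, a, a}, ratio 3/5 = .60
print(caus_score(["a", "a", "a", "b"], ["a"]))  # 0.6
```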

• anim: A problem with a feature like animacy is that it requires either manual determination of the animacy of extracted subjects, or reference to an on-line resource such as WordNet for determining animacy. To approximate animacy with a feature that can be extracted automatically, and without reference to a resource external to the corpus, we take advantage of the well-attested animacy hierarchy, according to which pronouns are the most animate (Silverstein 1976; Dixon 1994). The hypothesis is that the words I, we, you, she, he, and they most often refer to animate entities. This hypothesis was confirmed by extracting 100 occurrences of the pronoun they (which can be either animate or inanimate) from our 65 million word corpus. The occurrences immediately preceded a verb. After eliminating repetitions, 94 occurrences were left, which were classified by hand, yielding 71 animate pronouns, 11 inanimate pronouns, and 12 unclassified occurrences (for lack of sufficient context to recover the antecedent of the pronoun with certainty). Thus, at least 76% of usages of they were animate; we assume the percentage of animate usages of the other pronouns to be even higher. Since the hypothesis was confirmed, we count pronouns (other than it) in subject position (Kariaeva [1999]; cf. Aone and McKee [1996]). The values for the feature were determined by automatically extracting all subject/verb tuples including our 59 example verbs from the parsed corpus, and computing the ratio of occurrences of pronoun subjects to all subjects for each verb.
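The final ratio computation for anim is straightforward; a minimal sketch (the subject list below is hypothetical, not corpus data):

```python
# Pronouns treated as animate; "it" is deliberately excluded.
ANIMATE_PRONOUNS = {"i", "we", "you", "she", "he", "they"}

def anim_score(subjects):
    """Approximate the animacy feature: the ratio of pronoun subjects
    (other than "it") to all extracted subjects of the verb."""
    if not subjects:
        return 0.0
    pronouns = sum(1 for s in subjects if s.lower() in ANIMATE_PRONOUNS)
    return pronouns / len(subjects)

# Hypothetical extracted subjects for one verb:
print(anim_score(["She", "they", "door", "it"]))  # 0.5
```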

Finally, as indicated in Table 5, the verbs are designated as belonging to "group 1" or "group 2". All the verbs are treated equally in our data analysis and in the machine learning experiments, but this designation does indicate a difference in details of the counting procedures described above. The verbs in group 1 had been used in an earlier study in which it was important to minimize noisy data (Stevenson and Merlo 1997a), so they generally underwent greater manual intervention in the counts. In adding group 2 for the classification experiment, we chose to minimize the intervention, in order to demonstrate that the classification process is robust enough to withstand the resulting noise in the data.

For group 2, the transitivity, voice, and VBN counts were done automatically, without any manual intervention. For group 1, these three counts were done automatically by regular expression patterns, and then subjected to correction, partly by hand and partly automatically, by one of the authors. For transitivity, the adjustments vary for the individual verbs. Most of the reassignments from a transitive to an intransitive labelling occurred when the following noun was not the direct object, but rather a measure phrase or a date. Most of the reassignments from intransitive to transitive occurred when a particle or a preposition following the verb did not introduce a prepositional phrase, but instead indicated a passive form (by) or was part of a phrasal verb. Some verbs were mostly used adjectivally, in which case they were excluded from the transitivity counts. For voice, the required adjustments included cases of coordination of the past participle, when the verb was preceded by a conjunction or a comma.


Table 6
Aggregated relative frequency data for the five features. E = unergatives, A = unaccusatives, O = object-drops.

             trans         pass          vbn           caus          anim
Class   N    Mean   SD     Mean   SD     Mean   SD     Mean   SD     Mean   SD

E       20   0.23   0.23   0.07   0.12   0.21   0.26   0.00   0.00   0.25   0.24
A       19   0.40   0.24   0.33   0.27   0.65   0.27   0.12   0.14   0.07   0.09
O       20   0.62   0.25   0.31   0.26   0.65   0.23   0.04   0.07   0.15   0.14

These were collected and classified by hand as passive or active, based on intuition. Similarly, partial adjustments to the VBN counts were made by hand.

For the causativity feature, subjects and objects were determined by manual inspection of the corpus for verbs belonging to group 1, while they were extracted automatically from the parsed corpus for group 2. The group 1 verbs were sampled in three ways, depending on total frequency. For verbs with less than 150 occurrences, all instances of the verbs were used for subject/object extraction. For verbs whose total frequency was greater than 150, but whose VBD frequency was in the range 100–200, we extracted subjects and objects of the VBD occurrences only. For higher frequency verbs, we used only the first 100 VBD occurrences.6 The same script for computing the overlap of the extracted subjects and objects was then used on the resulting subject/verb and verb/object tuples, for both group 1 and group 2 verbs.

The animacy feature was calculated over subject/verb tuples extracted automatically, for both groups of verbs, from the parsed corpus.

3.2 Data Analysis
The data collection described above yields the following data points in total: trans, 27,403; pass, 20,481; vbn, 36,297; caus, 11,307; anim, 7,542. (Different features yield different totals because they were sampled independently, and the search patterns to extract some features are more imprecise than others.) The aggregate means by class of the normalized frequencies for all verbs are shown in Table 6; item by item distributions are provided in Appendix A, and raw counts are available from the authors. Note that aggregate means are shown for illustration purposes only; all machine learning experiments are performed on the individual normalized frequencies for each verb, as given in Appendix A.

The observed distributions of each feature are indeed roughly as expected according to the description in Section 2. Unergatives show a very low relative frequency of the trans feature, followed by unaccusatives, then object-drop verbs. Unaccusative verbs show a high frequency of the caus feature, and a low frequency of the anim feature, compared to the other classes. Somewhat unexpectedly, object-drop verbs exhibit a non-zero mean caus value (almost half the verbs have a caus value greater than zero), leading to a three-way causative distinction among the verb classes. We suspect that the approximation that we used for causative use, the overlap between subjects

6 For this last set of high-frequency verbs (exploded, jumped, opened, played, rushed), we used the first 100 occurrences as the simplest way to collect the sample. In response to an anonymous reviewer's concern, we later verified that these counts were not different from counts obtained by random sampling of 100 VBD occurrences. A paired t-test of the two sets of counts (first-100 sampling and random sampling) indicates that the two sets of counts are not statistically different (t = 1.283, DF = 4, p = 0.2687).


Table 7
Manually (Man) and automatically (Aut) calculated features for a random sample of verbs. T = trans, P = pass, V = vbn, C = caus, A = anim.

     Unergative                 Unaccusative                Object-Drop
     hopped       scurried     folded       stabilized     inherited    swallowed

     Man   Aut    Man   Aut    Man   Aut    Man   Aut      Man   Aut    Man   Aut

T    0.21  0.21   0.00  0.00   0.71  0.23   0.24  0.18     1.00  0.64   0.96  0.35
P    0.00  0.00   0.00  0.00   0.44  0.33   0.19  0.13     0.39  0.13   0.54  0.44
V    0.03  0.00   0.10  0.00   0.56  0.73   0.71  0.92     0.56  0.60   0.64  0.79
C    0.00  0.00   0.00  0.00   0.54  0.00   0.24  0.35     0.00  0.06   0.00  0.04
A    0.93  1.00   0.90  0.14   0.23  0.00   0.02  0.00     0.58  0.32   0.35  0.22

and objects for a verb, also captures a "reciprocity" effect for some object-drop verbs (such as call), in which subjects and objects can be similar types of entities. Finally, although expected to be a redundant indicator of transitivity, pass and vbn, unlike trans, have very similar values for unaccusative and object-drop verbs, indicating that their distributions are sensitive to factors we have not yet investigated.

One issue we must address is how precisely the automatic counts reflect the actual linguistic behaviour of the verbs. That is, we must be assured that the patterns we note in the data in Table 6 are accurate reflections of the differential behaviour of the verb classes, and not an artifact of the way in which we estimate the features, or a result of inaccuracies in the counts. In order to evaluate the accuracy of our feature counts, we selected two verbs from each class, and determined the "true" value of each feature for each of those six verbs through manual counting. The six verbs were randomly selected from the group 2 subset of the verbs, since counts for group 2 verbs (as explained above) had not undergone manual correction. This allows us to determine the accuracy of the fully automatic counting procedures. The selected verbs (and their frequencies) are: hopped (29), scurried (21), folded (189), stabilized (286), inherited (357), swallowed (152). For verbs that had a frequency of over 100 in the "-ed" form, we performed the manual counts on the first 100 occurrences.

Table 7 shows the results of the manual counts, reported as proportions to facilitate comparison to the normalized automatic counts, shown in adjoining columns. We observe first that, overall, most errors in the automatic counts occur in the unaccusative and object-drop verbs. While tagging errors affect the vbn feature for all of the verbs somewhat, we note that trans and pass are consistently underestimated for unaccusative and object-drop verbs. These errors make the unaccusative and object-drop feature values more similar to each other, and therefore potentially harder to distinguish. Furthermore, because the trans and pass values are underestimated by the automatic counts, and therefore lower in value, they are also closer to the values for the unergative verbs. For the caus feature, we predict the highest values for the unaccusative verbs, and while that prediction is confirmed, the automatic counts for that class also show the most errors. Finally, although the general pattern of higher values for the anim feature of unergatives and object-drop verbs is preserved in the automatic counts, the feature is underestimated for almost all the verbs, again making the values for that feature closer across the classes than they are in reality.

We conclude that, although there are inaccuracies in all the counts, the general patterns expected based on our analysis of the verb classes hold in both the manual and automatic counts. Errors in the estimating and counting procedures are therefore


not likely to be responsible for the pattern of data in Table 6 above, which generally matches our predictions. Furthermore, the errors, at least for this random sample of verbs, occur in a direction that makes our task of distinguishing the classes more difficult, and indicates that developing more accurate search patterns may possibly sharpen the class distinctions and improve the classification performance.

4. Experiments in Classification

In this section, we turn to our computational experiments, which investigate whether the statistical indicators of thematic properties that we have developed can in fact be used to classify verbs. Recall that the task we have set ourselves is that of automatically learning the best class for a set of usages of a verb, as opposed to classifying individual occurrences of the verb. The frequency distributions of our features yield a vector for each verb that represents the estimated values for the verb on each dimension, across the entire corpus:

Vector template: [verb-name, trans, pass, vbn, caus, anim, class]

Example: [opened, .69, .09, .21, .16, .36, unacc]

The resulting set of 59 vectors constitutes the data for our machine learning experiments. We use this data to train an automatic classifier to determine, given the feature values for a new verb (not from the training set), which of the three major classes of English optionally intransitive verbs it belongs to.
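The shape of this task can be illustrated with a toy stand-in classifier over vectors of the same form. The feature values below are illustrative (loosely based on the class means in Table 6, not the actual per-verb data), and nearest-neighbour classification is used only as a simple proxy for the decision-tree learner used in the paper:

```python
import math

# One illustrative training vector per class: [trans, pass, vbn, caus, anim].
TRAINING = [
    ("jumped", [0.23, 0.07, 0.21, 0.00, 0.25], "unerg"),
    ("opened", [0.40, 0.33, 0.65, 0.12, 0.07], "unacc"),
    ("played", [0.62, 0.31, 0.65, 0.04, 0.15], "objdrop"),
]

def classify(vector):
    """Assign the class of the nearest training vector (Euclidean distance);
    a stand-in for the C5.0 rule sets actually used in the experiments."""
    _, _, label = min(TRAINING, key=lambda t: math.dist(t[1], vector))
    return label

# A hypothetical new verb with high transitive use and low causativity:
print(classify([0.60, 0.30, 0.60, 0.05, 0.10]))  # objdrop
```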

4.1 Experimental Methodology
In pilot experiments on a subset of the features, we investigated a number of supervised machine learning methods that produce automatic classifiers (decision tree induction, rule learning, and two types of neural networks), as well as hierarchical clustering; see Stevenson et al. (1999) for more detail. Because we achieved approximately the same level of performance in all cases, we narrowed our further experimentation to the publicly available version of the C5.0 machine learning system (http://www.rulequest.com), a newer version of C4.5 (Quinlan 1992), due to its ease of use and wide availability. The C5.0 system generates both decision trees and corresponding rule sets from a training set of known classifications. In our experiments, we found little to no difference in performance between the trees and rule sets, and report only the rule set results.

In the experiments below, we follow two methodologies in training and testing, each of which tests a subset of cases held out from the training data. Thus, in all cases, the results we report are on test data that was never seen in training.7

The first training and testing methodology we follow is 10-fold cross-validation. In this approach, the system randomly divides the data into ten parts, and runs ten times on a different 90%-training-data/10%-test-data split, yielding an average accuracy and standard error across the ten test sets. This training methodology is very useful for

7 One anonymous reviewer raised the concern that we do not test on verbs that were unseen by the authors prior to finalizing the specific features to count. However, this does not reduce the generality of our results. The features we use are motivated by linguistic theory, and derived from the set of thematic properties that discriminate the verb classes. It is therefore very unlikely that they are skewed to the particular verbs we have chosen. Furthermore, our cross-validation experiments, described in the next subsection, show that our results hold across a very large number of randomly selected subsets of this sample of verbs.


our application, as it yields performance measures across a large number of training data/test data sets, avoiding the problems of outliers in a single random selection from a relatively small data set such as ours.

The second methodology is a single hold-out training and testing approach. Here the system is run N times, where N is the size of the data set (i.e., the 59 verbs in our case), each time holding out a single data vector as the test case, and using the remaining N-1 vectors as the training set. The single hold-out methodology yields an overall accuracy rate (when the results are averaged across all N trials) but also, unlike cross-validation, gives us classification results on each individual data vector. This property enables us to analyze differential performance on the individual verbs, and across the different verb classes.
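The single hold-out (leave-one-out) loop is generic and can be sketched independently of the classifier; `train_and_classify` below stands in for any learner, here a trivial majority-class baseline for illustration (the data items are hypothetical):

```python
def leave_one_out(data, train_and_classify):
    """Run N trials, each holding out one (name, vector, class) item,
    training on the remaining N-1, and recording the assigned class."""
    results = []
    for i, (name, vec, gold) in enumerate(data):
        training = data[:i] + data[i + 1:]
        results.append((name, gold, train_and_classify(training, vec)))
    accuracy = sum(g == p for _, g, p in results) / len(results)
    return results, accuracy

def majority_class(training, vec):
    # Trivial stand-in classifier: ignore the feature vector and
    # return the most frequent class label in the training set.
    labels = [g for _, _, g in training]
    return max(set(labels), key=labels.count)

data = [("v1", [0], "a"), ("v2", [0], "a"), ("v3", [0], "a"), ("v4", [0], "b")]
_, acc = leave_one_out(data, majority_class)
print(acc)  # 0.75
```

Because each trial records the assigned class for one specific item, per-verb and per-class accuracies fall out of the same loop.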

Under both training and testing methodologies, the baseline (chance) performance in this task, a three-way classification, is 33.9%. In the single hold-out methodology, there are 59 test cases, with 20, 19, and 20 verbs each from the unergative, unaccusative, and object-drop classes, respectively. Chance performance of picking a single class label as a default and assigning it to all cases would yield at most 20 out of the 59 cases correct, or 33.9%. For the cross-validation methodology, the determination of a baseline is slightly more complex, as we are testing on a random selection of 10% of the full data set in each run. The 33.9% figure represents the expected relative proportion of a test set that would be labelled correctly by assignment of a default class label to the entire test set. Although the precise make-up of the test cases varies, on average the test set will represent the class membership proportions of the entire set of verbs. Thus, as with the single hold-out approach, chance accuracy corresponds to a maximum of 20/59, or 33.9%, of the test set being labelled correctly.

The theoretical maximum accuracy for the task is, of course, 100%, although in Section 5 we discuss some classification results from human experts that indicate that a more realistic expectation is much lower (around 87%).

4.2 Results Using 10-Fold Cross-Validation
We first report the results of experiments using a training methodology of 10-fold cross-validation, repeated 50 times. This means that the 10-fold cross-validation procedure is repeated for 50 different random divisions of the data. The numbers reported are the averages of the results over all the trials. That is, the average accuracy and standard error from each random division of the data (a single cross-validation run, including 10 training and test sets) are averaged across the 50 different random divisions. This large number of experimental trials gives us a very tight bound on the mean accuracy reported, enabling us to determine with high confidence the statistical significance of differences in results.

Table 8 shows that performance of classification using individual features varies greatly, from little above the baseline to almost 22% above the baseline (a reduction of a third of the error rate), a very good result for a single feature. (All reported accuracies in Table 8 are statistically distinct at the p < .01 level, using an ANOVA [df = 249, F = 334.72] with a Tukey-Kramer post test.)

The first line of Table 9 shows that the combination of all features achieves an accuracy of 69.8%, which is 35.9% over the baseline, for a reduction in the error rate of 54%. This is a rather considerable result, given the very low baseline (33.9%). Moreover, recall that our training and testing sets are always disjoint (cf. Lapata and Brew [1999]; Siegel [1999]); in other words, we are predicting the classification of verbs that were never seen in the training corpus, the hardest situation for a classification algorithm.

The second through sixth lines of Table 9 show the accuracy achieved on each subset of features that results from removing a single feature. This allows us to evaluate


Table 8
Percent accuracy and standard error of the verb classification task, using each feature individually, under a training methodology of 10-fold cross-validation repeated 50 times.

Feature    Accuracy    SE

caus       55.7        .1
vbn        52.5        .5
pass       50.2        .5
trans      47.1        .4
anim       35.3        .5

Table 9
Percent accuracy and standard error of the verb classification task, using features in combination, under a training methodology of 10-fold cross-validation repeated 50 times.

     Features Used                Feature Not Used    Accuracy    SE

1    trans pass vbn caus anim    (none)              69.8        .5
2    trans vbn caus anim         pass                69.8        .5
3    trans pass vbn anim         caus                67.3        .6
4    trans pass caus anim        vbn                 66.5        .5
5    trans pass vbn caus         anim                63.2        .6
6    pass vbn caus anim          trans               61.6        .6

the contribution of each feature to the performance of the classification process, by comparing the performance of the subset without it to the performance using the full set of features. We see that the removal of pass (second line) has no effect on the results, while removal of the remaining features yields a 2–8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant at the p < .05 level, except for between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 37.52] with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted by the individual feature behavior. For example, caus, which is the best individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, anim and trans, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.

The general behaviour in classification, based on individual features and on size 4 and size 5 subsets of features, is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance; Table 10 summarizes this information. In the first data column, the table illustrates the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size n subsets in which each feature occurs. For example, the second data cell in the second row (54.9%) indicates the average accuracy of all subsets of size 2 that contain the feature vbn. The last row of the table indicates, for each feature, the average accuracy of all subsets containing that feature.


Table 10
Average percent accuracy of feature subsets, by subset size and by sets of each size including each feature.

                               Mean Accuracy of Subsets
Subset    Mean Accuracy        that Include Each Feature
Size      by Subset Size       vbn     pass    trans   anim    caus

1         48.2                 52.5    50.2    47.1    35.3    55.7
2         55.1                 54.9    52.8    56.4    58.0    57.6
3         60.5                 60.1    58.5    62.3    61.1    60.5
4         65.7                 65.5    64.7    66.7    66.3    65.3
5         69.8                 69.8    69.8    69.8    69.8    69.8

Mean Acc./Feature              60.6    59.2    60.5    58.1    61.8

The first observation, that more features perform better, is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that, on average, larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation, that the performance of individual features is not always a predictor of their performance in combination, is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above average performance for all subsets of size 2 or greater.

4.3 Results Using Single Hold-Out Methodology
One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system, but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case, and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case, and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb, and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, which we turn to in Section 4.4.

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance; in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being


Merlo and Stevenson   Statistical Verb Classification

Table 11
Percent accuracy of the verb classification task, using features in combination, under a single hold-out training methodology.

       Features Used               Feature Not Used    Accuracy (%) on All Verbs
    1  trans pass vbn caus anim    --                  69.5
    2  trans vbn caus anim         pass                71.2
    3  trans pass vbn anim         caus                62.7
    4  trans pass caus anim        vbn                 61.0
    5  trans pass vbn caus         anim                61.0
    6  pass vbn caus anim          trans               64.4

Table 12
F score (%) of classification within each class, under a single hold-out training methodology.

                                                       F score (%) for:
       Features Used               Feature Not Used    Unergs    Unaccs    Objdrops
    1  trans pass vbn caus anim    --                  73.9      68.6      64.9
    2  trans vbn caus anim         pass                76.2      75.7      61.6
    3  trans pass vbn anim         caus                65.1      60.0      62.8
    4  trans pass caus anim        vbn                 66.7      65.0      51.3
    5  trans pass vbn caus         anim                72.7      47.0      60.0
    6  pass vbn caus anim          trans               78.1      51.5      61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5-8% negative effect on performance, again similar to the cross-validation results. (Although note that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9%) than the unaccusative and object-drop verbs (F scores of 68.6% and 64.9%, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs; that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence, here we use the balanced F score, which calculates an overall measure of performance as 2PR / (P + R), in which P (precision) is truePositives / (truePositives + falsePositives), and R (recall) is truePositives / (truePositives + falseNegatives).
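The balanced F score of footnote 8 can be computed directly from per-class counts. A sketch, checked against the unergative figures that can be read off Table 14 (first panel: 3 of the 20 unergatives misclassified, and 4 + 5 = 9 verbs of the other classes labelled unergative):

```python
def f_score(true_pos, false_pos, false_neg):
    """Balanced F score, 2PR / (P + R), as defined in footnote 8."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Unergatives under all five features: 17 of 20 correctly labelled (3 errors),
# and 9 verbs of the other classes incorrectly labelled unergative.
unerg_f = f_score(true_pos=17, false_pos=9, false_neg=3)  # → 0.739, i.e. the 73.9 in Table 12
```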


Computational Linguistics   Volume 27, Number 3

frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8) but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification

We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans, pass, vbn, caus, anim) with the class labels assigned using each size 4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature in going from the four-feature to the five-feature set. Thus, we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2, and summarized here in Table 13.

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors, but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently "uncontrolled," and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.


Table 13
Expected class discriminations for each feature.

    Feature          Expected Frequency Pattern
    Transitivity     Unerg < Unacc < ObjDrop
    Causativity      Unerg, ObjDrop < Unacc
    Animacy          Unacc < Unerg, ObjDrop
    Passive Voice    Unerg < Unacc < ObjDrop
    VBN Tag          Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating number of errors in classification, by verb class, for the full set of five features compared to two of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops.

                               Assigned Class
                    All features      w/o caus         w/o anim
                    E    A    O       E    A    O      E    A    O
    Actual    E     -    1    2       -    4    2      -    2    2
    Class     A     4    -    3       5    -    2      5    -    6
              O     5    3    -       4    5    -      3    5    -

Table 15
Confusion matrix indicating number of errors in classification, by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops.

                               Assigned Class
                  All features    w/o trans      w/o pass       w/o vbn
                  E   A   O       E   A   O      E   A   O      E   A   O
    Actual  E     -   1   2       -   2   2      -   1   3      -   1   5
    Class   A     4   -   3       3   -   7      1   -   4      2   -   4
            O     5   3   -       2   5   -      5   3   -      4   6   -

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
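The confusability computation just described is easy to make concrete. A short sketch, with the error counts of Table 14 entered by hand (rows are actual classes, columns assigned classes; diagonal cells are omitted because only errors are tabulated):

```python
# Error counts from Table 14: outer key = actual class, inner key = assigned class.
ALL_FEATURES = {'E': {'A': 1, 'O': 2},
                'A': {'E': 4, 'O': 3},
                'O': {'E': 5, 'A': 3}}
WITHOUT_CAUS = {'E': {'A': 4, 'O': 2},
                'A': {'E': 5, 'O': 2},
                'O': {'E': 4, 'A': 5}}

def confusability(matrix, class1, class2):
    """Sum of the two off-diagonal cells for a pair of classes."""
    return matrix[class1][class2] + matrix[class2][class1]
```

Comparing the two panels reproduces the figures cited in the text: adding caus drops E/A confusability from 9 to 5 and A/O from 7 to 6, while E/O worsens slightly from 6 to 7.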

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features


we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature, compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives, and between unaccusatives and object-drops, is 9 and 7 errors, respectively; but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature, compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors from 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated to trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First, consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans (that is, also making a three-way distinction among the classes), although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method for evaluating our approach to verb classification.


Table 16
Percent agreement (Agr) and pairwise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to the gold standard (Levin).

             Program         E1              E2              E3
             Agr     K       Agr     K      Agr     K       Agr     K
    E1       59      .36
    E2       68      .50     75      .59
    E3       66      .49     70      .53    77      .66
    Levin    69.5    .54     71      .56    86.5    .80     83      .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results (provided by the non-forced-choice task), where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index available from Chicago University Press. The verbs were to be classified into the three target classes (unergative, unaccusative, and object-drop), which were described in the instructions.10 (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject, and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training (which yields an accuracy of 69.5%), because those results provide the classification for each of the individual verbs.


an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55-57) (using the z distribution to determine significance; p < .0001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
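As a concrete illustration of chance-corrected agreement, here is a sketch of the standard two-rater (Cohen-style) kappa, which captures the same idea of agreement over chance; the paper follows Klauer's (1987) formulation, which may differ in detail, and the label sequences below are invented, not the experts' actual responses:

```python
from collections import Counter

def kappa(labels1, labels2):
    """Two-rater kappa: observed agreement corrected for the chance agreement
    implied by each rater's own label proportions."""
    n = len(labels1)
    observed = sum(a == b for a, b in zip(labels1, labels2)) / n
    p1, p2 = Counter(labels1), Counter(labels2)
    chance = sum((p1[c] / n) * (p2[c] / n) for c in p1.keys() | p2.keys())
    return (observed - chance) / (1 - chance)

# Invented labels for six verbs, for illustration only.
rater_a = ['unerg', 'unerg', 'unacc', 'unacc', 'objdrop', 'objdrop']
rater_b = ['unerg', 'unerg', 'unacc', 'objdrop', 'objdrop', 'objdrop']
k = kappa(rater_a, rater_b)  # observed 5/6, chance 1/3, so kappa = 0.75
```

Note how the `chance` term depends on each rater's label proportions, which is why two expert pairs with the same percent agreement can receive different kappa values.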

Evaluating the experts' performance, summarized in Table 16, we can remark two things, which confirm our expectations. First, the task is difficult; i.e., it is not performed at 100% (or close) even by trained experts, when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, with K ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .60) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
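The majority-vote combination is straightforward to express. A sketch, with invented expert labels, and the tie-break falling back to a designated most-accurate expert, as in the text:

```python
from collections import Counter

def majority_label(expert_labels, fallback_index=0):
    """Label assigned by at least two experts; when all experts disagree,
    fall back to the label of the most accurate expert (fallback_index)."""
    label, count = Counter(expert_labels).most_common(1)[0]
    return label if count >= 2 else expert_labels[fallback_index]

# Invented examples: three experts' labels for two verbs.
agreed = majority_label(['unacc', 'unacc', 'objdrop'])          # → 'unacc'
no_majority = majority_label(['unerg', 'unacc', 'objdrop'], 1)  # → 'unacc' (fallback to expert 1)
```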

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task (86.5%) as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than one of the human experts, in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.
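The "approximately 68%" figure can be verified by simple arithmetic, using the 34% chance baseline and the 86.5% expert upper bound reported in this paper:

```python
# Fraction of the achievable gain (from chance baseline to expert upper bound)
# that the program's accuracy captures; equivalently, the reduction in error
# rate over chance, relative to the expert error floor.
baseline, program, upper_bound = 34.0, 69.5, 86.5
reduction = (program - baseline) / (upper_bound - baseline)
# 35.5 / 52.5 ≈ 0.676, i.e. the "approximately 68%" reduction reported
```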

One interesting question is whether experts and program disagree on the same verbs, and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard. However, in 9 cases, at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3) and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This


clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1,000), and those with log frequency over 3 (i.e., frequency over 1,000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops); the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops); and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote: verbs with frequency less than 100 yield an accuracy of 92%, K = .84; verbs with frequency between 100 and 1,000, accuracy 80%, K = .69; and verbs with frequency over 1,000, accuracy 90%, K = .82. For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5%, K = .74; verbs with frequency between 100 and 1,000, accuracy 84%, K = .76; and verbs with frequency over 1,000, accuracy 90%, K = .82.

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest-frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs, taken from Levin's (1993) electronic index, as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = .64) with each other, and 86% (K = .80) and 69% (K = .57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = .56] with each other, and 68% [K = .55] and 60.5% [K = .46] with the gold standard, respectively.) These results show that the forced-


choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally, in syntactic alternations, giving rise to differences in subcategorization frames, or in the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs (their argument structure) are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989] and Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a


"change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples, meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998, 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993]; Sanfilippo and Poznanski [1992]; Manning [1993]; Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed, which can work


on a rough collection of texts, and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work, and work of others, support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition, or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly, as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE, and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes) they achieve 83.9% accuracy, compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multilingual texts. Like ours, their work goes beyond statistics over subcategorizations to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.


Merlo and Stevenson, Statistical Verb Classification

Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45% with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
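The F-score we calculate from the reported precision and recall can be reproduced with the standard balanced F-measure; a minimal sketch (the function name and the rounding to a whole percentage are ours):

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (balanced when beta = 1)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Reported precision (61%) and recall (36%) for the best clustering result
print(round(f_score(0.61, 0.36) * 100))  # 45
```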

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and Lapata and Brew (1999) in attempting to learn semantic notions from distributions of

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.


indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to the external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature), or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification (i.e., sets of verbs that have the same multi-way classifications according to Levin [1993]) can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach, it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of


the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.
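The error-rate reductions cited here follow directly from the accuracy, baseline, and upper-bound figures; a minimal sketch of the arithmetic (the helper name and rounding are ours):

```python
def error_rate_reduction(accuracy, baseline, ceiling=100.0):
    """Fraction of the baseline error that is eliminated, where error is
    measured relative to a ceiling (100%, or an expert-based upper bound)."""
    return (accuracy - baseline) / (ceiling - baseline)

accuracy, baseline, expert_upper_bound = 69.8, 33.9, 86.5
print(round(error_rate_reduction(accuracy, baseline) * 100))                      # 54
print(round(error_rate_reduction(accuracy, baseline, expert_upper_bound) * 100))  # 68
```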

Acknowledgments
We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM); the United States NSF (grants 9702331 and 9818322 to SS); the Canadian NSERC (grant to SS); the University of Toronto; and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions, and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: the Kappa statistics. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal, Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.


Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe Zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal, Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal, Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: An Overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1–2):217–222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.


Pinker, Steven. 1989. Learnability and Cognition: the Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD, University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.


Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs
           Freq   vbn    pass   trans  caus   anim

Min Value     8   0.00   0.00   0.00   0.00   0.00
Max Value  4088   1.00   0.39   0.74   0.00   1.00

floated     176   0.43   0.26   0.74   0.00   0.17
hurried      86   0.40   0.31   0.50   0.00   0.37
jumped     4088   0.09   0.00   0.03   0.00   0.20
leaped      225   0.09   0.00   0.05   0.00   0.13
marched     238   0.10   0.01   0.09   0.00   0.12
paraded      33   0.73   0.39   0.46   0.00   0.50
raced       123   0.01   0.00   0.06   0.00   0.15
rushed      467   0.22   0.12   0.20   0.00   0.10
vaulted      54   0.00   0.00   0.41   0.00   0.03
wandered     67   0.02   0.00   0.03   0.00   0.32
galloped     12   1.00   0.00   0.00   0.00   0.00
glided       14   0.00   0.00   0.08   0.00   0.50
hiked        25   0.28   0.12   0.29   0.00   0.40
hopped       29   0.00   0.00   0.21   0.00   1.00
jogged        8   0.29   0.00   0.29   0.00   0.33
scooted      10   0.00   0.00   0.43   0.00   0.00
scurried     21   0.00   0.00   0.00   0.00   0.14
skipped      82   0.22   0.02   0.64   0.00   0.16
tiptoed      12   0.17   0.00   0.00   0.00   0.00
trotted      37   0.19   0.17   0.07   0.00   0.18

Unaccusative Verbs
            Freq   vbn    pass   trans  caus   anim

Min Value     13   0.16   0.00   0.02   0.00   0.00
Max Value   5543   0.95   0.80   0.76   0.41   0.36

boiled        58   0.92   0.70   0.42   0.00   0.00
cracked      175   0.61   0.19   0.76   0.02   0.14
dissolved    226   0.51   0.58   0.71   0.05   0.11
exploded     409   0.34   0.02   0.66   0.37   0.04
flooded      235   0.47   0.57   0.44   0.04   0.03
fractured     55   0.95   0.76   0.51   0.00   0.00
hardened     123   0.92   0.55   0.56   0.12   0.00
melted        70   0.80   0.44   0.02   0.00   0.19
opened      3412   0.21   0.09   0.69   0.16   0.36
solidified    34   0.65   0.21   0.68   0.00   0.12
collapsed    950   0.16   0.00   0.16   0.01   0.02
cooled       232   0.85   0.21   0.29   0.13   0.11
folded       189   0.73   0.33   0.23   0.00   0.00
widened     1155   0.18   0.02   0.13   0.41   0.01
changed     5543   0.73   0.23   0.47   0.22   0.08
cleared     1145   0.58   0.40   0.50   0.31   0.06
divided     1539   0.93   0.80   0.17   0.10   0.05
simmered      13   0.83   0.00   0.09   0.00   0.00
stabilized   286   0.92   0.13   0.18   0.35   0.00


Object-Drop Verbs
            Freq   vbn    pass   trans  caus   anim

Min Value     39   0.10   0.04   0.21   0.00   0.00
Max Value  15063   0.95   0.99   1.00   0.24   0.42

carved       185   0.85   0.66   0.98   0.00   0.00
danced        88   0.22   0.14   0.37   0.00   0.00
kicked       308   0.30   0.18   0.97   0.00   0.33
knitted       39   0.95   0.99   0.93   0.00   0.00
painted      506   0.72   0.18   0.71   0.00   0.38
played      2689   0.38   0.16   0.24   0.00   0.00
reaped       172   0.56   0.05   0.90   0.00   0.22
typed         57   0.81   0.74   0.81   0.00   0.00
washed       137   0.79   0.60   1.00   0.00   0.00
yelled        74   0.10   0.04   0.38   0.00   0.00
borrowed    1188   0.77   0.15   0.60   0.13   0.19
inherited    357   0.60   0.13   0.64   0.06   0.32
organized   1504   0.85   0.38   0.65   0.18   0.07
rented       232   0.72   0.22   0.61   0.00   0.42
sketched      44   0.67   0.17   0.44   0.00   0.20
cleaned      160   0.83   0.47   0.21   0.05   0.21
packed       376   0.84   0.12   0.40   0.05   0.19
studied      901   0.66   0.17   0.57   0.05   0.11
swallowed    152   0.79   0.44   0.35   0.04   0.22
called     15063   0.56   0.22   0.72   0.24   0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.

Accuracy  SE   Features                    Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim    57.3      0.5  trans caus
69.8      0.5  trans vbn caus anim         57.3      0.5  pass vbn anim
67.3      0.6  trans pass vbn anim         56.7      0.5  pass caus anim
66.7      0.5  trans vbn anim              55.7      0.5  vbn caus
66.5      0.5  trans pass caus anim        55.7      0.1  caus
64.4      0.5  trans vbn caus              55.4      0.4  pass caus
63.2      0.6  trans pass vbn caus         55.0      0.6  trans pass vbn
63.0      0.5  trans pass anim             54.7      0.4  trans pass
62.9      0.4  trans caus anim             54.2      0.5  trans vbn
62.1      0.5  caus anim                   52.5      0.5  vbn
61.7      0.5  trans pass caus             50.9      0.5  pass anim
61.6      0.6  pass vbn caus anim          50.2      0.6  pass vbn
60.1      0.4  vbn caus anim               50.2      0.5  pass
59.5      0.6  trans anim                  47.1      0.4  trans
59.4      0.5  vbn anim                    35.3      0.5  anim
57.4      0.6  pass vbn caus
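The overlap test described above can be sketched in a few lines (the function names are ours; 2.01 is the two-tailed t value for df = 49 quoted in the text):

```python
T_CRIT = 2.01  # two-tailed t value for df = 49, alpha = .05

def ci(accuracy, se, t=T_CRIT):
    """95% confidence interval: accuracy plus and minus t times the standard error."""
    return (accuracy - t * se, accuracy + t * se)

def significantly_different(acc1, se1, acc2, se2):
    """Overlap test from Appendix B: non-overlapping intervals imply p < .05."""
    lo1, hi1 = ci(acc1, se1)
    lo2, hi2 = ci(acc2, se2)
    return hi1 < lo2 or hi2 < lo1

# Full feature set vs. trans alone (values from the table above)
print(significantly_different(69.8, 0.5, 47.1, 0.4))  # True
# Two identical top scores trivially overlap
print(significantly_different(69.8, 0.5, 69.8, 0.5))  # False
```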


our case, the object-drop verbs). However, since the causative is a transitive use, and the transitive use of unergatives is expected to be rare (see above), we do not expect unergatives to exhibit a high degree of detectable overlap in a corpus. Thus, this overlap of subjects and objects should primarily distinguish unaccusatives (predicted to have high overlap of subjects and objects) from the other two classes (each of which is predicted to have low [detectable] overlap of subjects and objects).

Animacy. Finally, considering the roles in the first and last columns of thematic assignments in Table 3, we observe that unergative and object-drop verbs assign an agentive role to their subject in both the transitive and intransitive, while unaccusatives assign an agentive role to their subject only in the transitive. Under the assumption that the intransitive use of unaccusatives is not rare, we then expect that unaccusatives will occur less often overall with an agentive subject than will the other two verb classes. (The assumption that unaccusatives are not rare in the intransitive is based on the linguistic complexity of the causative transitive alternant, and is borne out in our corpus analysis.) On the further assumption that Agents tend to be animate entities more so than Themes are, we expect that unaccusatives will occur less frequently with an animate subject compared to unergative and object-drop verbs. Note the importance of our use of frequency distributions: the claim is not that only Agents can be animate, but rather that nouns that receive an Agent role will more often be animate than nouns that receive a Theme role.

Additional Features. The above interactions between thematic roles and the syntactic expressions of arguments thus lead to three features whose distributional properties appear promising for distinguishing unergative, unaccusative, and object-drop verbs: transitivity, causativity, and animacy of subject. We also investigate two additional syntactic features: the use of the passive or active voice, and the use of the past participle or simple past part-of-speech (POS) tag (VBN or VBD in the Penn Treebank style). These features are related to the transitive/intransitive alternation, since a passive use implies a transitive use of the verb, as well as to the use of a past participle form of the verb.2

Table 4 summarizes the features we derive from the thematic properties, and our expectations concerning their frequency of use. We hypothesize that these five features will exhibit distributional differences in the observed usages of the verbs that can be used for classification. In the next section, we describe the actual corpus counts that we develop to approximate the features we have identified. (Notice that the counts will be imperfect approximations to the thematic knowledge, beyond the inevitable errors due to automatic extraction from large automatically annotated corpora. Even when the counts are precise, they only constitute an approximation to the actual thematic notions, since the features we are using are not logically implied by the knowledge we want to capture, but only statistically correlated with it.)

3. Data Collection and Analysis

Clearly, some of the features we've proposed are difficult (e.g., the passive use) or impossible (e.g., animate subject use) to automatically extract with high accuracy from a

2 For our sample verbs, the statistical correlation between the transitive and passive features is highly significant (N = 59, R = .44, p < .001), as is the correlation between the transitive and past participle features (N = 59, R = .36, p < .005). (Since, as explained in the next section, our features are expressed as proportions, e.g., percent transitive use out of detected transitive and intransitive use, correlations of intransitivity with passive or past participle use have the same magnitude but are negative.)


Merlo and Stevenson Statistical Verb Classification

Table 4
The features and expected behavior

Feature        Expected Frequency Pattern   Explanation

Transitivity   Unerg < Unacc < ObjDrop      Unaccusatives and unergatives have a causative transitive, hence lower transitive use. Furthermore, unergatives have an agentive object, hence very low transitive use.

Causativity    Unerg, ObjDrop < Unacc       Object-drop verbs do not have a causal agent, hence low "causative" use. Unergatives are rare in the transitive, hence low causative use.

Animacy        Unacc < Unerg, ObjDrop       Unaccusatives have a Theme subject in the intransitive, hence lower use of animate subjects.

Passive Voice  Unerg < Unacc, ObjDrop       Passive implies transitive use, hence correlated with transitive feature.

VBN Tag        Unerg < Unacc, ObjDrop       Passive implies past participle use (VBN), hence correlated with transitive (and passive).

large corpus, given the current state of annotation. However, we do assume that currently available corpora, such as the Wall Street Journal (WSJ), provide a representative and large enough sample of language from which to gather corpus counts that can approximate the distributional patterns of the verb class alternations. Our work draws on two text corpora: one, an automatically tagged combined corpus of 65 million words (primarily WSJ); the second, an automatically parsed corpus of 29 million words (a subset of the WSJ text from the first corpus). Using these corpora, we develop counting procedures that yield relative frequency distributions for approximations to the five linguistic features we have determined, over a sample of verbs from our three classes.

3.1 Materials and Method
We chose a set of 20 verbs from each class, based primarily on the classification in Levin (1993).3 The complete list of verbs appears in Table 5; the group 1/group 2 designation is explained below in the section on counting. As indicated in the table, unergatives are manner-of-motion verbs (from the "run" class in Levin); unaccusatives are change-of-state verbs (from several of the classes in Levin's change-of-state super-class); while object-drop verbs were taken from a variety of classes in Levin's classification, all of which undergo the unexpressed object alternation. The most frequently used classes are verbs of change of possession, image-creation verbs, and verbs of creation and transformation. The selection of verbs was based partly on our intuitive judgment that the verbs were likely to be used with sufficient frequency in the WSJ. Also, each

3 We used an equal number of verbs from each class in order to have a balanced group of items. One potential disadvantage of this decision is that each verb class is represented equally, even though they may not be equally frequent in the corpora. Although we lose the relative frequency information among the classes that could provide a better bias for assigning a default classification (i.e., the most frequent one), we have the advantage that our classifier will be equally informed (in terms of number of exemplars) about each class.

Note that there are only 19 unaccusative verbs because ripped, which was initially counted in the unaccusatives, was then excluded from the analysis, as it occurred mostly in a very different usage in the corpus (as a verb particle in ripped off) from the intended optionally intransitive usage.


Computational Linguistics Volume 27, Number 3

Table 5
Verbs used in the experiments

Class Name    Description                      Selected Verbs

Unergative    manner of motion                 jumped, rushed, marched, leaped, floated, raced, hurried, wandered, vaulted, paraded (group 1); galloped, glided, hiked, hopped, jogged, scooted, scurried, skipped, tiptoed, trotted (group 2)

Unaccusative  change of state                  opened, exploded, flooded, dissolved, cracked, hardened, boiled, melted, fractured, solidified (group 1); collapsed, cooled, folded, widened, changed, cleared, divided, simmered, stabilized (group 2)

Object-Drop   unexpressed object alternation   played, painted, kicked, carved, reaped, washed, danced, yelled, typed, knitted (group 1); borrowed, inherited, organized, rented, sketched, cleaned, packed, studied, swallowed, called (group 2)

verb presents the same form in the simple past and in the past participle (the regular "-ed" form). In order to simplify the counting procedure, we included only the "-ed" form of the verb, on the assumption that counts on this single verb form would approximate the distribution of the features across all forms of the verb. Additionally, as far as we were able given the preceding constraints, we selected verbs that could occur in the transitive and in the passive. Finally, we aimed for a frequency cut-off of 10 occurrences or more for each verb, although for unergatives we had to use one verb (jogged) that only occurred 8 times, in order to have 20 verbs that satisfied the other criteria above.

In performing this kind of corpus analysis, one has to recognize the fact that current corpus annotations do not distinguish verb senses. In these counts, we did not distinguish a core sense of the verb from an extended use of the verb. So, for instance, the sentence Consumer spending jumped 1.7% in February after a sharp drop the month before (WSJ 1987) is counted as an occurrence of the manner-of-motion verb jump in its intransitive form. This particular sense extension has a transitive alternant, but not a causative transitive (i.e., Consumer spending jumped the barrier, but not Low taxes jumped consumer spending). Thus, while the possible subcategorizations remain the same, rates of transitivity and causativity may be different than for the literal manner-of-motion sense. This is an unavoidable result of using simple automatic extraction methods, given the current state of annotation of corpora.

For each occurrence of each verb, we counted whether it was in a transitive or intransitive use (trans), in a passive or active use (pass), in a past participle or simple past use (vbn), in a causative or non-causative use (caus), and with an animate subject or not (anim).4 Note that, except for the vbn feature, for which we simply extract the POS tag from the corpus, all other counts are approximations to the actual linguistic behaviour of the verb, as we describe in detail below.

4 One additional feature was recorded: the log frequency of the verb in the 65 million word corpus, motivated by the conjecture that the frequency of a verb may help in predicting its class. In our machine learning experiments, however, this conjecture was not borne out, as the frequency feature did not improve performance. This is the case for experiments on all of the verbs, as well as for separate experiments on the group 1 verbs (which were matched across the classes for frequency) and the group 2 verbs (which were not). We therefore limit discussion here to the thematically-motivated features.


The first three counts (trans, pass, vbn) were performed on the tagged ACL/DCI corpus available from the Linguistic Data Consortium, which includes the Brown Corpus (of one million words) and years 1987–1989 of the Wall Street Journal, a combined corpus in excess of 65 million words. The counts for these features proceeded as follows:

° trans: A number, a pronoun, a determiner, an adjective, or a noun were considered to be indication of a potential object of the verb. A verb occurrence preceded by forms of the verb be or immediately followed by a potential object was counted as transitive; otherwise the occurrence was counted as intransitive (specifically, if the verb was followed by a punctuation sign (commas, colons, full stops) or by a conjunction, a particle, a date, or a preposition).

° pass: A main verb (i.e., tagged VBD) was counted as active. A token with tag VBN was also counted as active if the closest preceding auxiliary was have, while it was counted as passive if the closest preceding auxiliary was be.

° vbn: The counts for VBN/VBD were simply done based on the POS label within the tagged corpus.

Each of the above three counts was normalized over all occurrences of the "-ed" form of the verb, yielding a single relative frequency measure for each verb for that feature: i.e., percent transitive (versus intransitive) use, percent active (versus passive) use, and percent VBN (versus VBD) use, respectively.
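As a concrete illustration, the trans count and its normalization can be sketched as a single pass over a POS-tagged corpus. The tag lists and be-forms below are our own simplifications of the heuristics just described, not the authors' actual extraction patterns:

```python
# Sketch of the trans heuristic over sentences of (word, POS) pairs in
# Penn Treebank style. The tag lists are illustrative simplifications.
OBJECT_TAGS = {"CD", "PRP", "DT", "JJ", "NN", "NNS", "NNP"}  # number/pronoun/det/adj/noun
BE_FORMS = {"be", "is", "are", "was", "were", "been", "being"}

def trans_ratio(tagged_sents, verb):
    """Relative frequency of transitive use for one '-ed' verb form."""
    trans = intrans = 0
    for sent in tagged_sents:
        for i, (word, tag) in enumerate(sent):
            if word != verb or tag not in {"VBD", "VBN"}:
                continue
            prev_word = sent[i - 1][0].lower() if i > 0 else ""
            next_tag = sent[i + 1][1] if i + 1 < len(sent) else ""
            # Preceded by a form of 'be' (passive, hence transitive) or
            # followed by a potential object: transitive; else intransitive.
            if prev_word in BE_FORMS or next_tag in OBJECT_TAGS:
                trans += 1
            else:
                intrans += 1
    total = trans + intrans
    return trans / total if total else 0.0

sents = [
    [("The", "DT"), ("boy", "NN"), ("kicked", "VBD"), ("the", "DT"), ("ball", "NN")],
    [("Prices", "NNS"), ("jumped", "VBD"), (".", ".")],
]
print(trans_ratio(sents, "kicked"))  # 1.0
print(trans_ratio(sents, "jumped"))  # 0.0
```

The pass and vbn counts admit analogous one-pass implementations over the same tagged input.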

The last two counts (caus and anim) were performed on a parsed version of the 1988 year of the Wall Street Journal, so that we could extract subjects and objects of the verbs more accurately. This corpus of 29 million words was provided to us by Michael Collins, and was automatically parsed with the parser described in Collins (1997).5 The counts and their justification are described here:

° caus: As discussed above, the object of a causative transitive is the same semantic argument of the verb as the subject of the intransitive. The causative feature was approximated by the following steps, intended to capture the degree to which the subject of a verb can also occur as its object. Specifically, for each verb occurrence, the subject and object (if there was one) were extracted from the parsed corpus. The observed subjects across all occurrences of the verb were placed into one multiset of nouns, and the observed objects into a second multiset of nouns. (A multiset, or bag, was used so that our representation indicated the number of times each noun was used as either subject or object.) Then the proportion of overlap between the two multisets was calculated. We define overlap as the largest multiset of elements belonging to both the

5 Readers might be concerned about the portability of this method to languages for which no large parsed corpus is available. It is possible that using a fully parsed corpus is not necessary. Our results were replicated in English without the need for a fully parsed corpus (Anoop Sarkar, p.c., citing a project report by Wootiporn Tripasai). Our method was applied to 23 million words of the WSJ that were automatically tagged with Ratnaparkhi's maximum entropy tagger (Ratnaparkhi 1996) and chunked with the partial parser CASS (Abney 1996). The results are very similar to ours (best accuracy 66.6%), suggesting that a more accurate tagger than the one used on our corpus might in fact be sufficient to overcome the fact that no full parse is available.


subject and the object multisets; e.g., the overlap between {a, a, a, b} and {a} is {a, a, a}. The proportion is the ratio between the cardinality of the overlap multiset and the sum of the cardinality of the subject and object multisets. For example, for the simple sets of characters above, the ratio would be 3/5, yielding a value of 60% for the caus feature.
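The overlap computation can be sketched with Python Counters standing in for multisets. The reading of "largest multiset of elements belonging to both" below is our reconstruction, chosen so as to reproduce the worked example, and is not the authors' actual script:

```python
from collections import Counter

def caus_score(subjects, objects):
    """Approximate the caus feature: the degree to which a verb's
    subjects also occur as its objects, via multiset overlap.
    One reading of the overlap definition, chosen to reproduce the
    worked example: {a, a, a, b} vs {a} -> overlap {a, a, a}."""
    subj, obj = Counter(subjects), Counter(objects)
    shared = set(subj) & set(obj)              # noun types seen in both roles
    subj_side = sum(subj[n] for n in shared)   # subject tokens with a shared type
    obj_side = sum(obj[n] for n in shared)     # object tokens with a shared type
    overlap = max(subj_side, obj_side)         # the larger of the two multisets
    total = sum(subj.values()) + sum(obj.values())
    return overlap / total if total else 0.0

# The paper's example: overlap cardinality 3, total cardinality 5.
print(caus_score(["a", "a", "a", "b"], ["a"]))  # 0.6
```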

° anim: A problem with a feature like animacy is that it requires either manual determination of the animacy of extracted subjects, or reference to an on-line resource such as WordNet for determining animacy. To approximate animacy with a feature that can be extracted automatically, and without reference to a resource external to the corpus, we take advantage of the well-attested animacy hierarchy, according to which pronouns are the most animate (Silverstein 1976; Dixon 1994). The hypothesis is that the words I, we, you, she, he, and they most often refer to animate entities. This hypothesis was confirmed by extracting 100 occurrences of the pronoun they, which can be either animate or inanimate, from our 65 million word corpus. The occurrences immediately preceded a verb. After eliminating repetitions, 94 occurrences were left, which were classified by hand, yielding 71 animate pronouns, 11 inanimate pronouns, and 12 unclassified occurrences (for lack of sufficient context to recover the antecedent of the pronoun with certainty). Thus, at least 76% of usages of they were animate; we assume the percentage of animate usages of the other pronouns to be even higher. Since the hypothesis was confirmed, we count pronouns (other than it) in subject position (Kariaeva [1999]; cf. Aone and McKee [1996]). The values for the feature were determined by automatically extracting all subject/verb tuples including our 59 example verbs from the parsed corpus, and computing the ratio of occurrences of pronoun subjects to all subjects for each verb.
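A sketch of the anim computation over extracted subject/verb tuples; the tuple format is an assumption for illustration:

```python
# Pronouns counted as subjects for the anim feature; 'it' is excluded.
SUBJECT_PRONOUNS = {"i", "we", "you", "she", "he", "they"}

def anim_ratio(subject_verb_tuples, verb):
    """Proportion of pronoun subjects among all subjects of a given verb."""
    subjects = [subj.lower() for subj, v in subject_verb_tuples if v == verb]
    if not subjects:
        return 0.0
    pronoun_subjects = sum(1 for s in subjects if s in SUBJECT_PRONOUNS)
    return pronoun_subjects / len(subjects)

tuples = [("they", "marched"), ("troops", "marched"),
          ("she", "marched"), ("he", "marched")]
print(anim_ratio(tuples, "marched"))  # 0.75
```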

Finally, as indicated in Table 5, the verbs are designated as belonging to "group 1" or "group 2". All the verbs are treated equally in our data analysis and in the machine learning experiments, but this designation does indicate a difference in details of the counting procedures described above. The verbs in group 1 had been used in an earlier study in which it was important to minimize noisy data (Stevenson and Merlo 1997a), so they generally underwent greater manual intervention in the counts. In adding group 2 for the classification experiment, we chose to minimize the intervention, in order to demonstrate that the classification process is robust enough to withstand the resulting noise in the data.

For group 2, the transitivity, voice, and VBN counts were done automatically, without any manual intervention. For group 1, these three counts were done automatically by regular expression patterns, and then subjected to correction, partly by hand and partly automatically, by one of the authors. For transitivity, the adjustments vary for the individual verbs. Most of the reassignments from a transitive to an intransitive labelling occurred when the following noun was not the direct object, but rather a measure phrase or a date. Most of the reassignments from intransitive to transitive occurred when a particle or a preposition following the verb did not introduce a prepositional phrase, but instead indicated a passive form (by) or was part of a phrasal verb. Some verbs were mostly used adjectivally, in which case they were excluded from the transitivity counts. For voice, the required adjustments included cases of coordination of the past participle, when the verb was preceded by a conjunction or a comma.


Table 6
Aggregated relative frequency data for the five features. E = unergatives; A = unaccusatives; O = object-drops.

              trans         pass          vbn           caus          anim
Class   N     Mean   SD     Mean   SD     Mean   SD     Mean   SD     Mean   SD

E       20    0.23   0.23   0.07   0.12   0.21   0.26   0.00   0.00   0.25   0.24
A       19    0.40   0.24   0.33   0.27   0.65   0.27   0.12   0.14   0.07   0.09
O       20    0.62   0.25   0.31   0.26   0.65   0.23   0.04   0.07   0.15   0.14

These were collected and classified by hand as passive or active, based on intuition. Similarly, partial adjustments to the VBN counts were made by hand.

For the causativity feature, subjects and objects were determined by manual inspection of the corpus for verbs belonging to group 1, while they were extracted automatically from the parsed corpus for group 2. The group 1 verbs were sampled in three ways, depending on total frequency. For verbs with less than 150 occurrences, all instances of the verbs were used for subject/object extraction. For verbs whose total frequency was greater than 150, but whose VBD frequency was in the range 100–200, we extracted subjects and objects of the VBD occurrences only. For higher frequency verbs, we used only the first 100 VBD occurrences.6 The same script for computing the overlap of the extracted subjects and objects was then used on the resulting subject/verb and verb/object tuples, for both group 1 and group 2 verbs.

The animacy feature was calculated over subject/verb tuples extracted automatically for both groups of verbs from the parsed corpus.

3.2 Data Analysis
The data collection described above yields the following data points in total: trans, 27,403; pass, 20,481; vbn, 36,297; caus, 11,307; anim, 7,542. (Different features yield different totals because they were sampled independently, and the search patterns to extract some features are more imprecise than others.) The aggregate means by class of the normalized frequencies for all verbs are shown in Table 6; item by item distributions are provided in Appendix A, and raw counts are available from the authors. Note that aggregate means are shown for illustration purposes only: all machine learning experiments are performed on the individual normalized frequencies for each verb, as given in Appendix A.

The observed distributions of each feature are indeed roughly as expected according to the description in Section 2. Unergatives show a very low relative frequency of the trans feature, followed by unaccusatives, then object-drop verbs. Unaccusative verbs show a high frequency of the caus feature and a low frequency of the anim feature compared to the other classes. Somewhat unexpectedly, object-drop verbs exhibit a non-zero mean caus value (almost half the verbs have a caus value greater than zero), leading to a three-way causative distinction among the verb classes. We suspect that the approximation that we used for causative use (the overlap between subjects

6 For this last set of high-frequency verbs (exploded, jumped, opened, played, rushed), we used the first 100 occurrences as the simplest way to collect the sample. In response to an anonymous reviewer's concern, we later verified that these counts were not different from counts obtained by random sampling of 100 VBD occurrences. A paired t-test of the two sets of counts (first-100 sampling and random sampling) indicates that the two sets of counts are not statistically different (t = 1.283, DF = 4, p = 0.2687).


Table 7
Manually (Man) and automatically (Aut) calculated features for a random sample of verbs. T = trans; P = pass; V = vbn; C = caus; A = anim.

     Unergative                  Unaccusative                Object-Drop
     hopped        scurried      folded        stabilized    inherited     swallowed
     Man    Aut    Man    Aut    Man    Aut    Man    Aut    Man    Aut    Man    Aut

T    0.21   0.21   0.00   0.00   0.71   0.23   0.24   0.18   1.00   0.64   0.96   0.35
P    0.00   0.00   0.00   0.00   0.44   0.33   0.19   0.13   0.39   0.13   0.54   0.44
V    0.03   0.00   0.10   0.00   0.56   0.73   0.71   0.92   0.56   0.60   0.64   0.79
C    0.00   0.00   0.00   0.00   0.54   0.00   0.24   0.35   0.00   0.06   0.00   0.04
A    0.93   1.00   0.90   0.14   0.23   0.00   0.02   0.00   0.58   0.32   0.35   0.22

and objects for a verb) also captures a "reciprocity" effect for some object-drop verbs (such as call), in which subjects and objects can be similar types of entities. Finally, although expected to be a redundant indicator of transitivity, pass and vbn, unlike trans, have very similar values for unaccusative and object-drop verbs, indicating that their distributions are sensitive to factors we have not yet investigated.

One issue we must address is how precisely the automatic counts reflect the actual linguistic behaviour of the verbs. That is, we must be assured that the patterns we note in the data in Table 6 are accurate reflections of the differential behaviour of the verb classes, and not an artifact of the way in which we estimate the features, or a result of inaccuracies in the counts. In order to evaluate the accuracy of our feature counts, we selected two verbs from each class, and determined the "true" value of each feature for each of those six verbs through manual counting. The six verbs were randomly selected from the group 2 subset of the verbs, since counts for group 2 verbs (as explained above) had not undergone manual correction. This allows us to determine the accuracy of the fully automatic counting procedures. The selected verbs (and their frequencies) are: hopped (29), scurried (21), folded (189), stabilized (286), inherited (357), swallowed (152). For verbs that had a frequency of over 100 in the "-ed" form, we performed the manual counts on the first 100 occurrences.

Table 7 shows the results of the manual counts, reported as proportions to facilitate comparison to the normalized automatic counts, shown in adjoining columns. We observe first that, overall, most errors in the automatic counts occur in the unaccusative and object-drop verbs. While tagging errors affect the vbn feature for all of the verbs somewhat, we note that trans and pass are consistently underestimated for unaccusative and object-drop verbs. These errors make the unaccusative and object-drop feature values more similar to each other, and therefore potentially harder to distinguish. Furthermore, because the trans and pass values are underestimated by the automatic counts, and therefore lower in value, they are also closer to the values for the unergative verbs. For the caus feature, we predict the highest values for the unaccusative verbs, and while that prediction is confirmed, the automatic counts for that class also show the most errors. Finally, although the general pattern of higher values for the anim feature of unergatives and object-drop verbs is preserved in the automatic counts, the feature is underestimated for almost all the verbs, again making the values for that feature closer across the classes than they are in reality.

We conclude that, although there are inaccuracies in all the counts, the general patterns expected based on our analysis of the verb classes hold in both the manual and automatic counts. Errors in the estimating and counting procedures are therefore


not likely to be responsible for the pattern of data in Table 6 above, which generally matches our predictions. Furthermore, the errors, at least for this random sample of verbs, occur in a direction that makes our task of distinguishing the classes more difficult, and indicates that developing more accurate search patterns may possibly sharpen the class distinctions and improve the classification performance.

4. Experiments in Classification

In this section, we turn to our computational experiments, which investigate whether the statistical indicators of thematic properties that we have developed can in fact be used to classify verbs. Recall that the task we have set ourselves is that of automatically learning the best class for a set of usages of a verb, as opposed to classifying individual occurrences of the verb. The frequency distributions of our features yield a vector for each verb that represents the estimated values for the verb on each dimension across the entire corpus:

Vector template: [verb-name, trans, pass, vbn, caus, anim, class]

Example: [opened, .69, .09, .21, .16, .36, unacc]

The resulting set of 59 vectors constitutes the data for our machine learning experiments. We use this data to train an automatic classifier to determine, given the feature values for a new verb (not from the training set), which of the three major classes of English optionally intransitive verbs it belongs to.

4.1 Experimental Methodology
In pilot experiments on a subset of the features, we investigated a number of supervised machine learning methods that produce automatic classifiers (decision tree induction, rule learning, and two types of neural networks), as well as hierarchical clustering; see Stevenson et al. (1999) for more detail. Because we achieved approximately the same level of performance in all cases, we narrowed our further experimentation to the publicly available version of the C5.0 machine learning system (http://www.rulequest.com), a newer version of C4.5 (Quinlan 1992), due to its ease of use and wide availability. The C5.0 system generates both decision trees and corresponding rule sets from a training set of known classifications. In our experiments, we found little to no difference in performance between the trees and rule sets, and report only the rule set results.

In the experiments below, we follow two methodologies in training and testing, each of which tests a subset of cases held out from the training data. Thus, in all cases, the results we report are on test data that was never seen in training.7

The first training and testing methodology we follow is 10-fold cross-validation. In this approach, the system randomly divides the data into ten parts, and runs ten times on a different 90% training data / 10% test data split, yielding an average accuracy and standard error across the ten test sets. This training methodology is very useful for

7 One anonymous reviewer raised the concern that we do not test on verbs that were unseen by the authors prior to finalizing the specific features to count. However, this does not reduce the generality of our results. The features we use are motivated by linguistic theory, and derived from the set of thematic properties that discriminate the verb classes. It is therefore very unlikely that they are skewed to the particular verbs we have chosen. Furthermore, our cross-validation experiments, described in the next subsection, show that our results hold across a very large number of randomly selected subsets of this sample of verbs.


our application, as it yields performance measures across a large number of training data/test data sets, avoiding the problems of outliers in a single random selection from a relatively small data set such as ours.
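The splitting step of one cross-validation run can be sketched as follows; the shuffling and fold assignment are a generic illustration of the protocol, not C5.0's internal procedure:

```python
import random

def ten_fold_splits(items, seed=0):
    """Yield (train, test) pairs for one 10-fold cross-validation run:
    each item appears in exactly one test set across the ten splits."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    folds = [shuffled[i::10] for i in range(10)]  # ten near-equal parts
    for i in range(10):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# With 59 verbs, each test set holds 5 or 6 verbs; train and test partition the data.
splits = list(ten_fold_splits(range(59)))
print(len(splits), len(splits[0][0]) + len(splits[0][1]))  # 10 59
```

Repeating this with 50 different seeds reproduces the "repeated 50 times" design of the next section.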

The second methodology is a single hold-out training and testing approach. Here the system is run N times, where N is the size of the data set (i.e., the 59 verbs in our case), each time holding out a single data vector as the test case and using the remaining N-1 vectors as the training set. The single hold-out methodology yields an overall accuracy rate (when the results are averaged across all N trials), but also, unlike cross-validation, gives us classification results on each individual data vector. This property enables us to analyze differential performance on the individual verbs and across the different verb classes.
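The single hold-out loop can be sketched as below. We use a toy nearest-centroid classifier purely as a stand-in for C5.0 (which is a separate tool); only the hold-out protocol itself mirrors the methodology described here:

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def leave_one_out_accuracy(data):
    """data: list of (feature_vector, class_label) pairs.
    Each item is held out once; the remaining N-1 items train a
    nearest-centroid classifier (a stand-in for C5.0, illustration only)."""
    correct = 0
    for i, (vec, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        by_class = {}
        for features, lab in train:
            by_class.setdefault(lab, []).append(features)
        centroids = {lab: centroid(fs) for lab, fs in by_class.items()}
        predicted = min(centroids, key=lambda lab: math.dist(vec, centroids[lab]))
        correct += (predicted == label)
    return correct / len(data)

# Toy, well-separated data: every hold-out case is classified correctly.
toy = [([0.1, 0.1], "unerg"), ([0.0, 0.2], "unerg"),
       ([0.9, 0.8], "unacc"), ([1.0, 0.9], "unacc")]
print(leave_one_out_accuracy(toy))  # 1.0
```

Because a prediction is recorded for every held-out item, this protocol supports the per-verb and per-class error analysis discussed below.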

Under both training and testing methodologies, the baseline (chance) performance in this task, a three-way classification, is 33.9%. In the single hold-out methodology, there are 59 test cases, with 20, 19, and 20 verbs each from the unergative, unaccusative, and object-drop classes, respectively. Chance performance of picking a single class label as a default and assigning it to all cases would yield at most 20 out of the 59 cases correct, or 33.9%. For the cross-validation methodology, the determination of a baseline is slightly more complex, as we are testing on a random selection of 10% of the full data set in each run. The 33.9% figure represents the expected relative proportion of a test set that would be labelled correctly by assignment of a default class label to the entire test set. Although the precise make-up of the test cases varies, on average the test set will represent the class membership proportions of the entire set of verbs. Thus, as with the single hold-out approach, chance accuracy corresponds to a maximum of 20/59, or 33.9%, of the test set being labelled correctly.

The theoretical maximum accuracy for the task is of course 100%, although in Section 5 we discuss some classification results from human experts that indicate that a more realistic expectation is much lower (around 87%).

4.2 Results Using 10-Fold Cross-Validation
We first report the results of experiments using a training methodology of 10-fold cross-validation repeated 50 times. This means that the 10-fold cross-validation procedure is repeated for 50 different random divisions of the data. The numbers reported are the averages of the results over all the trials. That is, the average accuracy and standard error from each random division of the data (a single cross-validation run, including 10 training and test sets) are averaged across the 50 different random divisions. This large number of experimental trials gives us a very tight bound on the mean accuracy reported, enabling us to determine with high confidence the statistical significance of differences in results.

Table 8 shows that performance of classification using individual features varies greatly, from little above the baseline to almost 22% above the baseline, or a reduction of a third of the error rate: a very good result for a single feature. (All reported accuracies in Table 8 are statistically distinct at the p < .01 level, using an ANOVA [df = 249, F = 334.72] with a Tukey-Kramer post test.)

The first line of Table 9 shows that the combination of all features achieves an accuracy of 69.8%, which is 35.9% over the baseline, for a reduction in the error rate of 54%. This is a rather considerable result, given the very low baseline (33.9%). Moreover, recall that our training and testing sets are always disjoint (cf. Lapata and Brew [1999]; Siegel [1999]); in other words, we are predicting the classification of verbs that were never seen in the training corpus, the hardest situation for a classification algorithm.

The second through sixth lines of Table 9 show the accuracy achieved on each subset of features that results from removing a single feature. This allows us to evaluate


Table 8
Percent accuracy and standard error of the verb classification task, using each feature individually, under a training methodology of 10-fold cross-validation repeated 50 times.

Feature   Accuracy   SE

caus      55.7       .1
vbn       52.5       .5
pass      50.2       .5
trans     47.1       .4
anim      35.3       .5

Table 9
Percent accuracy and standard error of the verb classification task, using features in combination, under a training methodology of 10-fold cross-validation repeated 50 times.

    Features Used               Feature Not Used   Accuracy   SE

1   trans pass vbn caus anim    (none)             69.8       .5
2   trans vbn caus anim         pass               69.8       .5
3   trans pass vbn anim         caus               67.3       .6
4   trans pass caus anim        vbn                66.5       .5
5   trans pass vbn caus         anim               63.2       .6
6   pass vbn caus anim          trans              61.6       .6

the contribution of each feature to the performance of the classification process, by comparing the performance of the subset without it to the performance using the full set of features. We see that the removal of pass (second line) has no effect on the results, while removal of the remaining features yields a 2–8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant at the p < .05 level, except for between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 37.52] with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted by the individual feature behavior. For example, caus, which is the best individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, anim and trans, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.

The general behaviour in classification based on individual features, and on size 4 and size 5 subsets of features, is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance; Table 10 summarizes this information. In the first data column, the table illustrates the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size n subsets in which each feature occurs. For example, the second data cell in the second row (54.9%) indicates the average accuracy of all subsets of size 2 that contain the feature vbn. The last row of the table indicates, for each feature, the average accuracy of all subsets containing that feature.


Table 10
Average percent accuracy of feature subsets, by subset size and by sets of each size including each feature.

Subset   Mean Accuracy    Mean Accuracy of Subsets that Include Each Feature
Size     by Subset Size   vbn     pass    trans   anim    caus

1        48.2             52.5    50.2    47.1    35.3    55.7
2        55.1             54.9    52.8    56.4    58.0    57.6
3        60.5             60.1    58.5    62.3    61.1    60.5
4        65.7             65.5    64.7    66.7    66.3    65.3
5        69.8             69.8    69.8    69.8    69.8    69.8

Mean Acc/Feature          60.6    59.2    60.5    58.1    61.8

The first observation, that more features perform better, is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that, on average, larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation, that the performance of individual features is not always a predictor of their performance in combination, is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above average performance for all subsets of size 2 or greater.

4.3 Results Using Single Hold-Out Methodology

One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, which we turn to in Section 4.4.
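The single hold-out procedure described above amounts to leave-one-out evaluation. A minimal sketch follows; note that the 1-nearest-neighbour learner used here is only a stand-in for the C5.0 decision-tree induction system used in the actual experiments:

```python
def leave_one_out(vectors, labels, train_and_classify):
    """Single hold-out evaluation: train on all but one item, test on the
    held-out item, and repeat for every item.  Returns the per-item
    predictions and the overall accuracy."""
    predictions = []
    for i in range(len(vectors)):
        train_X = vectors[:i] + vectors[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(train_and_classify(train_X, train_y, vectors[i]))
    accuracy = sum(p == g for p, g in zip(predictions, labels)) / len(labels)
    return predictions, accuracy

# A trivial 1-nearest-neighbour classifier as a stand-in learner.
def nn_classify(train_X, train_y, x):
    dists = [sum((a - b) ** 2 for a, b in zip(v, x)) for v in train_X]
    return train_y[dists.index(min(dists))]
```

Because every verb serves as a test case exactly once, the per-item predictions can then be grouped by class to compute the per-class scores reported below.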

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance; in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being


Merlo and Stevenson Statistical Verb Classification

Table 11
Percent accuracy of the verb classification task using features in combination, under a single hold-out training methodology.

       Features Used               Feature Not Used   Accuracy on All Verbs

  1    trans pass vbn caus anim                       69.5
  2    trans vbn caus anim         pass               71.2
  3    trans pass vbn anim         caus               62.7
  4    trans pass caus anim        vbn                61.0
  5    trans pass vbn caus         anim               61.0
  6    pass vbn caus anim          trans              64.4

Table 12
F score of classification within each class, under a single hold-out training methodology.

       Features Used               Feature     F score (%)  F score (%)  F score (%)
                                   Not Used    for Unergs   for Unaccs   for Objdrops

  1    trans pass vbn caus anim                73.9         68.6         64.9
  2    trans vbn caus anim         pass        76.2         75.7         61.6
  3    trans pass vbn anim         caus        65.1         60.0         62.8
  4    trans pass caus anim        vbn         66.7         65.0         51.3
  5    trans pass vbn caus         anim        72.7         47.0         60.0
  6    pass vbn caus anim          trans       78.1         51.5         61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5–8% negative effect on performance, again similar to the cross-validation results. (Although note that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9) than the unaccusative and object-drop verbs (F scores of 68.6 and 64.9, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs; that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence, here we use the balanced F score, which calculates an overall measure of performance as 2PR / (P + R), in which P (precision) is truePositives / (truePositives + falsePositives) and R (recall) is truePositives / (truePositives + falseNegatives).
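The balanced F score defined in this footnote can be computed directly from per-class counts; a minimal sketch:

```python
def f_score(tp, fp, fn):
    """Balanced F score (beta = 1): harmonic mean of precision and recall,
    computed for one class from its true positives, false positives, and
    false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Applied per class to the single hold-out predictions, this yields the entries of Table 12.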


frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8) but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.
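The grouping of verbs into log-frequency bands and the per-band error counts reported above can be sketched as follows; the verb records here are illustrative placeholders, not the actual 59-verb data:

```python
# Hypothetical (verb, log-frequency, correctly-classified) records; the real
# data are the 59 verbs and their single hold-out classification outcomes.
results = [("jumped", 1.5, True), ("floated", 2.3, False), ("played", 3.4, True)]

def errors_by_band(records):
    """Group verbs into log-frequency bands (<2, 2-3, >3) and report
    (number of errors, number of verbs) for each non-empty band."""
    bands = {"<2": [], "2-3": [], ">3": []}
    for verb, logf, correct in records:
        key = "<2" if logf < 2 else "2-3" if logf <= 3 else ">3"
        bands[key].append(correct)
    return {k: (sum(not c for c in v), len(v)) for k, v in bands.items() if v}
```

Dividing errors by verbs per band gives the error proportions quoted in the text (e.g., 7/24 = 29%).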

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification

We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans, pass, vbn, caus, anim) with the class labels assigned using each size-4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature, in going from the four-feature to five-feature set. Thus, we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2, and summarized here in Table 13.

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently "uncontrolled", and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.


Table 13
Expected class discriminations for each feature.

  Feature          Expected Frequency Pattern

  Transitivity     Unerg < Unacc < ObjDrop
  Causativity      Unerg, ObjDrop < Unacc
  Animacy          Unacc < Unerg, ObjDrop
  Passive Voice    Unerg < Unacc < ObjDrop
  VBN Tag          Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features compared to two of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops.

                               Assigned Class

                All features      w/o caus        w/o anim
                 E   A   O       E   A   O       E   A   O

  Actual   E     -   1   2       -   4   2       -   2   2
  Class    A     4   -   3       5   -   2       5   -   6
           O     5   3   -       4   5   -       3   5   -

Table 15
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops.

                                    Assigned Class

                All features     w/o trans       w/o pass        w/o vbn
                 E   A   O       E   A   O       E   A   O       E   A   O

  Actual   E     -   1   2       -   2   2       -   1   3       -   1   5
  Class    A     4   -   3       3   -   7       1   -   4       2   -   4
           O     5   3   -       2   5   -       5   3   -       4   6   -

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
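The pairwise confusability measure described above can be read off a confusion matrix mechanically; the sketch below uses the off-diagonal error counts of the all-features panel of Table 14:

```python
# Off-diagonal error counts from the "all features" panel of Table 14
# (keys are (actual, assigned) class pairs; E = unergative, A = unaccusative,
# O = object-drop; diagonal cells are correct classifications).
errors = {("E", "A"): 1, ("E", "O"): 2,
          ("A", "E"): 4, ("A", "O"): 3,
          ("O", "E"): 5, ("O", "A"): 3}

def confusability(c1, c2, errs):
    """Confusability of two classes: the sum of the errors in both
    directions between them (the two cells opposite the diagonal)."""
    return errs.get((c1, c2), 0) + errs.get((c2, c1), 0)
```

For example, the unergative/unaccusative confusability with all five features is 1 + 4 = 5 errors, matching the figure discussed in the text.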

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features


we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives, and between unaccusatives and object-drops, is 9 and 7 errors, respectively; but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors from 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated to trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First, consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans (that is, also making a three-way distinction among the classes), although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method to evaluating our approach to verb classification.


Table 16
Percent agreement (Agr) and pair-wise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

           Program        E1            E2            E3
           Agr    K       Agr    K     Agr    K      Agr    K

  E1       59     .36
  E2       68     .50     75     .59
  E3       66     .49     70     .53   77     .66
  Levin    69.5   .54     71     .56   86.5   .80    83     .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments that measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results, provided by the non-forced-choice task, where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from the University of Chicago Press. The verbs were to be classified into the three target classes (unergative, unaccusative, and object-drop), which were described in the instructions.10 (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html.)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: a verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: a verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: a verb that assigns an agent role to the subject and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training (which yields an accuracy of 69.5%), because those results provide the classification for each of the individual verbs.


an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55–57) (using the z distribution to determine significance; p < .0001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
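The pairwise Kappa values in Table 16 measure agreement above chance, K = (Po - Pe) / (1 - Pe). The paper follows Klauer's (1987) formulation; the sketch below implements the standard two-annotator (Cohen's) kappa, which illustrates the same agreement-over-chance idea:

```python
from collections import Counter

def kappa(labels1, labels2):
    """Cohen's kappa for two annotators: observed agreement corrected for
    the chance agreement implied by each annotator's label proportions."""
    assert len(labels1) == len(labels2)
    n = len(labels1)
    p_observed = sum(a == b for a, b in zip(labels1, labels2)) / n
    c1, c2 = Counter(labels1), Counter(labels2)
    p_chance = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)
```

Note that two annotator pairs with the same percent agreement can receive different kappa values when their label proportions, and hence their chance agreement, differ, exactly as discussed above.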

Evaluating the experts' performance summarized in Table 16, we can make two observations that confirm our expectations. First, the task is difficult, i.e., not performed at 100% (or close) even by trained experts when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .60) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
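The majority-vote combination described above, with a fallback for verbs on which all three experts disagree, can be sketched as:

```python
from collections import Counter

def majority_label(votes, fallback):
    """Label given by at least two of three experts; when all three
    disagree, defer to a fallback (e.g., the most accurate expert)."""
    top, count = Counter(votes).most_common(1)[0]
    return top if count >= 2 else fallback(votes)

# Example call: three expert labels for one verb, with the fallback
# returning the first expert's label (a stand-in for the most accurate one).
label = majority_label(["unerg", "unerg", "unacc"], fallback=lambda v: v[0])
```

Applying this per verb over the three expert questionnaires yields the combined classification evaluated against the gold standard above.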

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task (86.5%) as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than one of the human experts, in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.

One interesting question is whether experts and program disagree on the same verbs and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard. However, in 9 cases, at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3), and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This


clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1,000), and those with log frequency over 3 (i.e., frequency over 1,000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops); the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops); and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote: verbs with frequency less than 100 yield an accuracy of 92%, K = .84; verbs with frequency between 100 and 1,000, accuracy 80%, K = .69; and verbs with frequency over 1,000, accuracy 90%, K = .82. For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5%, K = .74; verbs with frequency between 100 and 1,000, accuracy 84%, K = .76; and verbs with frequency over 1,000, accuracy 90%, K = .82.

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest-frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs, taken from Levin's (1993) electronic index, as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = .64) with each other, and 86% (K = .80) and 69% (K = .57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = .56] with each other, and 68% [K = .55] and 60.5% [K = .46] with the gold standard, respectively.) These results show that the forced-


choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally in syntactic alternations, giving rise to differences in subcategorization frames, or in the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs, their argument structure, are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989] and Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a


"change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples, meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998; Siegel 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993]; Sanfilippo and Poznanski [1992]; Manning [1993]; Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary, reusable resources such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed, which can work on a rough collection of texts and do not require a carefully balanced corpus or time-consuming semantic tagging.

Computational Linguistics, Volume 27, Number 3

7. Related Work

We conclude from the discussion above that our own work and work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy, compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multi-lingual texts. Like ours, their work goes beyond statistics over subcategorizations to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.


Merlo and Stevenson: Statistical Verb Classification

Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45% with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
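The F-score we calculate from Schulte im Walde's reported precision and recall can be reproduced directly; the function below is the standard F-measure.

```python
def f_score(precision, recall, beta=1.0):
    """F-measure; with beta=1, the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Schulte im Walde's reported precision (61%) and recall (36%)
f = f_score(0.61, 0.36)          # approximately 0.45, the 45% cited in the text
balanced = f_score(0.45, 0.45)   # the same F arises from balanced P = R = 45%
```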

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83% respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.
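McCarthy's similarity test can be illustrated with a toy sketch. The head-noun counts below are invented for illustration, and cosine similarity between head-word counts stands in for her actual measure (which compares probability distributions over WordNet classes, not raw head words).

```python
from collections import Counter
from math import sqrt

def cosine(c1, c2):
    """Cosine similarity between two Counters of observed head nouns."""
    dot = sum(c1[w] * c2[w] for w in c1)  # Counter returns 0 for missing keys
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Hypothetical head nouns for the alternating slots of causative "melt":
# intransitive subject vs. transitive object draw on similar fillers.
intrans_subj = Counter({"ice": 4, "snow": 3, "butter": 2})
trans_obj = Counter({"ice": 3, "butter": 3, "chocolate": 1})

# Hypothetical non-alternating slot pairing for "play": disjoint fillers.
play_subj = Counter({"children": 5, "band": 2})
play_obj = Counter({"music": 4, "game": 3})

alternating = cosine(intrans_subj, trans_obj)
non_alternating = cosine(play_subj, play_obj)
```

Under the hypothesis being tested, the alternating slot pair scores higher than the non-alternating one, signalling participation in the alternation.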

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition, we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and Lapata and Brew (1999) in attempting to learn semantic notions from distributions of indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature), or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).
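A minimal sketch of the pronoun-based approximation of subject animacy, assuming a particular inventory of animate personal pronouns (the exact list is an assumption for illustration, not the one used in the experiments):

```python
def anim_feature(subject_heads):
    """Approximate subject animacy as the proportion of subjects realized
    as animate personal pronouns, with no external semantic resource.
    The pronoun inventory below is an illustrative assumption."""
    animate_pronouns = {"i", "we", "you", "he", "she", "they",
                        "us", "him", "her", "them"}
    if not subject_heads:
        return 0.0
    hits = sum(1 for h in subject_heads if h.lower() in animate_pronouns)
    return hits / len(subject_heads)

# Hypothetical extracted subject heads for some verb: 3 of 4 are pronouns
anim = anim_feature(["she", "investors", "he", "they"])
```

Because first- and second-person pronouns and animate third-person pronouns can only refer to animate entities, their rate in subject position is a cheap proxy for the animacy of a verb's subjects.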

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets, introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification, i.e., sets of verbs that have the same multi-way classifications according to Levin (1993), can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated: in our approach it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.
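The two reductions in error rate follow directly from the accuracy, baseline, and upper-bound figures reported above; the formula measures how much of the gap between the baseline and the ceiling the classifier closes.

```python
def error_rate_reduction(accuracy, baseline, ceiling=100.0):
    """Fraction of the baseline-to-ceiling gap closed by the classifier:
    (accuracy - baseline) / (ceiling - baseline), all in percent."""
    return (accuracy - baseline) / (ceiling - baseline)

# 69.8% accuracy against a 33.9% chance baseline
vs_perfect = error_rate_reduction(69.8, 33.9)        # ceiling = 100%
vs_expert = error_rate_reduction(69.8, 33.9, 86.5)   # expert upper bound
```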

Acknowledgments
We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM); the United States NSF (grants 9702331 and 9818322 to SS); the Canadian NSERC (grant to SS); the University of Toronto; and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: the Kappa statistic. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.

Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe Zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: An overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1–2):217–222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.

Pinker, Steven. 1989. Learnability and Cognition: the Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.



Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs
           Freq  vbn   pass  trans caus  anim
Min Value  8     0.00  0.00  0.00  0.00  0.00
Max Value  4088  1.00  0.39  0.74  0.00  1.00

floated    176   0.43  0.26  0.74  0.00  0.17
hurried    86    0.40  0.31  0.50  0.00  0.37
jumped     4088  0.09  0.00  0.03  0.00  0.20
leaped     225   0.09  0.00  0.05  0.00  0.13
marched    238   0.10  0.01  0.09  0.00  0.12
paraded    33    0.73  0.39  0.46  0.00  0.50
raced      123   0.01  0.00  0.06  0.00  0.15
rushed     467   0.22  0.12  0.20  0.00  0.10
vaulted    54    0.00  0.00  0.41  0.00  0.03
wandered   67    0.02  0.00  0.03  0.00  0.32
galloped   12    1.00  0.00  0.00  0.00  0.00
glided     14    0.00  0.00  0.08  0.00  0.50
hiked      25    0.28  0.12  0.29  0.00  0.40
hopped     29    0.00  0.00  0.21  0.00  1.00
jogged     8     0.29  0.00  0.29  0.00  0.33
scooted    10    0.00  0.00  0.43  0.00  0.00
scurried   21    0.00  0.00  0.00  0.00  0.14
skipped    82    0.22  0.02  0.64  0.00  0.16
tiptoed    12    0.17  0.00  0.00  0.00  0.00
trotted    37    0.19  0.17  0.07  0.00  0.18

Unaccusative Verbs
            Freq  vbn   pass  trans caus  anim
Min Value   13    0.16  0.00  0.02  0.00  0.00
Max Value   5543  0.95  0.80  0.76  0.41  0.36

boiled      58    0.92  0.70  0.42  0.00  0.00
cracked     175   0.61  0.19  0.76  0.02  0.14
dissolved   226   0.51  0.58  0.71  0.05  0.11
exploded    409   0.34  0.02  0.66  0.37  0.04
flooded     235   0.47  0.57  0.44  0.04  0.03
fractured   55    0.95  0.76  0.51  0.00  0.00
hardened    123   0.92  0.55  0.56  0.12  0.00
melted      70    0.80  0.44  0.02  0.00  0.19
opened      3412  0.21  0.09  0.69  0.16  0.36
solidified  34    0.65  0.21  0.68  0.00  0.12
collapsed   950   0.16  0.00  0.16  0.01  0.02
cooled      232   0.85  0.21  0.29  0.13  0.11
folded      189   0.73  0.33  0.23  0.00  0.00
widened     1155  0.18  0.02  0.13  0.41  0.01
changed     5543  0.73  0.23  0.47  0.22  0.08
cleared     1145  0.58  0.40  0.50  0.31  0.06
divided     1539  0.93  0.80  0.17  0.10  0.05
simmered    13    0.83  0.00  0.09  0.00  0.00
stabilized  286   0.92  0.13  0.18  0.35  0.00



Object-Drop Verbs
           Freq   vbn   pass  trans caus  anim
Min Value  39     0.10  0.04  0.21  0.00  0.00
Max Value  15063  0.95  0.99  1.00  0.24  0.42

carved     185    0.85  0.66  0.98  0.00  0.00
danced     88     0.22  0.14  0.37  0.00  0.00
kicked     308    0.30  0.18  0.97  0.00  0.33
knitted    39     0.95  0.99  0.93  0.00  0.00
painted    506    0.72  0.18  0.71  0.00  0.38
played     2689   0.38  0.16  0.24  0.00  0.00
reaped     172    0.56  0.05  0.90  0.00  0.22
typed      57     0.81  0.74  0.81  0.00  0.00
washed     137    0.79  0.60  1.00  0.00  0.00
yelled     74     0.10  0.04  0.38  0.00  0.00
borrowed   1188   0.77  0.15  0.60  0.13  0.19
inherited  357    0.60  0.13  0.64  0.06  0.32
organized  1504   0.85  0.38  0.65  0.18  0.07
rented     232    0.72  0.22  0.61  0.00  0.42
sketched   44     0.67  0.17  0.44  0.00  0.20
cleaned    160    0.83  0.47  0.21  0.05  0.21
packed     376    0.84  0.12  0.40  0.05  0.19
studied    901    0.66  0.17  0.57  0.05  0.11
swallowed  152    0.79  0.44  0.35  0.04  0.22
called     15063  0.56  0.22  0.72  0.24  0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.

Accuracy  SE   Features                    Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim    57.3      0.5  trans caus
69.8      0.5  trans vbn caus anim         57.3      0.5  pass vbn anim
67.3      0.6  trans pass vbn anim         56.7      0.5  pass caus anim
66.7      0.5  trans vbn anim              55.7      0.5  vbn caus
66.5      0.5  trans pass caus anim        55.7      0.1  caus
64.4      0.5  trans vbn caus              55.4      0.4  pass caus
63.2      0.6  trans pass vbn caus         55.0      0.6  trans pass vbn
63.0      0.5  trans pass anim             54.7      0.4  trans pass
62.9      0.4  trans caus anim             54.2      0.5  trans vbn
62.1      0.5  caus anim                   52.5      0.5  vbn
61.7      0.5  trans pass caus             50.9      0.5  pass anim
61.6      0.6  pass vbn caus anim          50.2      0.6  pass vbn
60.1      0.4  vbn caus anim               50.2      0.5  pass
59.5      0.6  trans anim                  47.1      0.4  trans
59.4      0.5  vbn anim                    35.3      0.5  anim
57.4      0.6  pass vbn caus
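The overlap test described above can be written out directly, using accuracy and standard-error pairs from the table.

```python
def ci95(accuracy, se, t=2.01):
    """95% confidence range: accuracy +/- t * SE (t = 2.01 for df = 49)."""
    return (accuracy - t * se, accuracy + t * se)

def significantly_different(acc1, se1, acc2, se2):
    """Two results differ at p < .05 iff their 95% intervals do not overlap."""
    lo1, hi1 = ci95(acc1, se1)
    lo2, hi2 = ci95(acc2, se2)
    return hi1 < lo2 or hi2 < lo1

# Full feature set (69.8, SE 0.5) vs. trans/vbn/pass only (55.0, SE 0.6):
# intervals are disjoint, so the difference is significant.
diff = significantly_different(69.8, 0.5, 55.0, 0.6)

# Two nearby subsets (67.3, SE 0.6) vs. (66.7, SE 0.5): intervals overlap.
same = significantly_different(67.3, 0.6, 66.7, 0.5)
```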




Table 4
The features and expected behavior.

Feature        Expected Frequency Pattern   Explanation

Transitivity   Unerg < Unacc < ObjDrop      Unaccusatives and unergatives have a causative transitive, hence lower transitive use. Furthermore, unergatives have an agentive object, hence very low transitive use.

Causativity    Unerg, ObjDrop < Unacc       Object-drop verbs do not have a causal agent, hence low "causative" use. Unergatives are rare in the transitive, hence low causative use.

Animacy        Unacc < Unerg, ObjDrop       Unaccusatives have a Theme subject in the intransitive, hence lower use of animate subjects.

Passive Voice  Unerg < Unacc < ObjDrop      Passive implies transitive use, hence correlated with the transitive feature.

VBN Tag        Unerg < Unacc < ObjDrop      Passive implies past participle use (VBN), hence correlated with transitive (and passive).

large corpus, given the current state of annotation. However, we do assume that currently available corpora, such as the Wall Street Journal (WSJ), provide a representative and large enough sample of language from which to gather corpus counts that can approximate the distributional patterns of the verb class alternations. Our work draws on two text corpora: one, an automatically tagged combined corpus of 65 million words (primarily WSJ); the second, an automatically parsed corpus of 29 million words (a subset of the WSJ text from the first corpus). Using these corpora, we develop counting procedures that yield relative frequency distributions for approximations to the five linguistic features we have determined, over a sample of verbs from our three classes.

3.1 Materials and Method

We chose a set of 20 verbs from each class, based primarily on the classification in Levin (1993).3 The complete list of verbs appears in Table 5; the group 1/group 2 designation is explained below, in the section on counting. As indicated in the table, unergatives are manner-of-motion verbs (from the "run" class in Levin), unaccusatives are change-of-state verbs (from several of the classes in Levin's change-of-state super-class), while object-drop verbs were taken from a variety of classes in Levin's classification, all of which undergo the unexpressed object alternation. The most frequently used classes are verbs of change of possession, image-creation verbs, and verbs of creation and transformation. The selection of verbs was based partly on our intuitive judgment that the verbs were likely to be used with sufficient frequency in the WSJ. Also, each

3 We used an equal number of verbs from each class in order to have a balanced group of items. One potential disadvantage of this decision is that each verb class is represented equally, even though they may not be equally frequent in the corpora. Although we lose the relative frequency information among the classes that could provide a better bias for assigning a default classification (i.e., the most frequent one), we have the advantage that our classifier will be equally informed (in terms of number of exemplars) about each class.

Note that there are only 19 unaccusative verbs because ripped, which was initially counted in the unaccusatives, was then excluded from the analysis, as it occurred mostly in a very different usage in the corpus (as a verb particle in ripped off) from the intended optionally intransitive usage.


Computational Linguistics Volume 27 Number 3

Table 5
Verbs used in the experiments

Class Name     Description                      Selected Verbs

Unergative     manner of motion                 jumped, rushed, marched, leaped, floated, raced, hurried, wandered, vaulted, paraded (group 1); galloped, glided, hiked, hopped, jogged, scooted, scurried, skipped, tiptoed, trotted (group 2)

Unaccusative   change of state                  opened, exploded, flooded, dissolved, cracked, hardened, boiled, melted, fractured, solidified (group 1); collapsed, cooled, folded, widened, changed, cleared, divided, simmered, stabilized (group 2)

Object-Drop    unexpressed object alternation   played, painted, kicked, carved, reaped, washed, danced, yelled, typed, knitted (group 1); borrowed, inherited, organized, rented, sketched, cleaned, packed, studied, swallowed, called (group 2)

verb presents the same form in the simple past and in the past participle (the regular "-ed" form). In order to simplify the counting procedure, we included only the "-ed" form of the verb, on the assumption that counts on this single verb form would approximate the distribution of the features across all forms of the verb. Additionally, as far as we were able given the preceding constraints, we selected verbs that could occur in the transitive and in the passive. Finally, we aimed for a frequency cut-off of 10 occurrences or more for each verb, although for unergatives we had to use one verb (jogged) that only occurred 8 times, in order to have 20 verbs that satisfied the other criteria above.

In performing this kind of corpus analysis, one has to recognize the fact that current corpus annotations do not distinguish verb senses. In these counts, we did not distinguish a core sense of the verb from an extended use of the verb. So, for instance, the sentence Consumer spending jumped 1.7% in February after a sharp drop the month before (WSJ 1987) is counted as an occurrence of the manner-of-motion verb jump in its intransitive form. This particular sense extension has a transitive alternant, but not a causative transitive (i.e., Consumer spending jumped the barrier, but not Low taxes jumped consumer spending). Thus, while the possible subcategorizations remain the same, rates of transitivity and causativity may be different than for the literal manner-of-motion sense. This is an unavoidable result of using simple automatic extraction methods, given the current state of annotation of corpora.

For each occurrence of each verb, we counted whether it was in a transitive or intransitive use (trans), in a passive or active use (pass), in a past participle or simple past use (vbn), in a causative or non-causative use (caus), and with an animate subject or not (anim).4 Note that, except for the vbn feature, for which we simply extract the POS tag from the corpus, all other counts are approximations to the actual linguistic behaviour of the verb, as we describe in detail below.

4 One additional feature was recorded: the log frequency of the verb in the 65 million word corpus, motivated by the conjecture that the frequency of a verb may help in predicting its class. In our machine learning experiments, however, this conjecture was not borne out, as the frequency feature did not improve performance. This is the case for experiments on all of the verbs, as well as for separate experiments on the group 1 verbs (which were matched across the classes for frequency) and the group 2 verbs (which were not). We therefore limit discussion here to the thematically-motivated features.


The first three counts (trans, pass, vbn) were performed on the tagged ACL/DCI corpus available from the Linguistic Data Consortium, which includes the Brown Corpus (of one million words) and years 1987-1989 of the Wall Street Journal, a combined corpus in excess of 65 million words. The counts for these features proceeded as follows:

° trans: A number, a pronoun, a determiner, an adjective, or a noun was considered to be an indication of a potential object of the verb. A verb occurrence preceded by forms of the verb be, or immediately followed by a potential object, was counted as transitive; otherwise the occurrence was counted as intransitive (specifically, if the verb was followed by a punctuation sign, such as a comma, colon, or full stop, or by a conjunction, a particle, a date, or a preposition).

° pass: A main verb (i.e., tagged VBD) was counted as active. A token with tag VBN was also counted as active if the closest preceding auxiliary was have, while it was counted as passive if the closest preceding auxiliary was be.

° vbn: The counts for VBN/VBD were simply done based on the POS label within the tagged corpus.

Each of the above three counts was normalized over all occurrences of the "-ed" form of the verb, yielding a single relative frequency measure for each verb for that feature: i.e., percent transitive (versus intransitive) use, percent active (versus passive) use, and percent VBN (versus VBD) use, respectively.
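The tag-based counting and normalization can be sketched as follows. This is a rough illustration of the heuristics described above, not the authors' actual scripts; the (word, tag) input format and the exact membership of the tag and auxiliary sets are assumptions, with tags following the Penn Treebank conventions used in the tagged corpus:

```python
# Sketch of the trans/pass/vbn counting heuristics over a POS-tagged
# corpus, represented as sentences of (word, tag) pairs (lowercased words).

OBJECT_TAGS = {"CD", "PRP", "DT", "JJ", "NN", "NNS", "NNP"}  # number, pronoun, det, adj, noun
BE_FORMS = {"be", "is", "are", "was", "were", "been", "being"}
HAVE_FORMS = {"have", "has", "had", "having"}

def count_features(tagged_sents, verb):
    """Relative frequencies of trans/pass/vbn over all '-ed' occurrences of verb."""
    counts = {"occ": 0, "trans": 0, "pass": 0, "vbn": 0}
    for sent in tagged_sents:
        for i, (word, tag) in enumerate(sent):
            if word != verb or tag not in ("VBD", "VBN"):
                continue
            counts["occ"] += 1
            # vbn: read directly off the POS tag
            if tag == "VBN":
                counts["vbn"] += 1
            # trans: preceded by a form of 'be', or followed by a potential object
            prev = sent[i - 1][0] if i > 0 else None
            next_tag = sent[i + 1][1] if i + 1 < len(sent) else None
            if prev in BE_FORMS or next_tag in OBJECT_TAGS:
                counts["trans"] += 1
            # pass: a VBN token whose closest preceding auxiliary is a form of 'be'
            if tag == "VBN":
                for w, _t in reversed(sent[:i]):
                    if w in BE_FORMS:
                        counts["pass"] += 1
                        break
                    if w in HAVE_FORMS:
                        break
    occ = counts["occ"] or 1
    # normalize each count over all '-ed' occurrences of the verb
    return {f: counts[f] / occ for f in ("trans", "pass", "vbn")}
```

On a tiny hand-built corpus, "the man opened the door" counts as transitive (VBD followed by a determiner), "the door was opened" as transitive, passive, and VBN, and "the door opened." as intransitive active.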

The last two counts (caus and anim) were performed on a parsed version of the 1988 year of the Wall Street Journal, so that we could extract subjects and objects of the verbs more accurately. This corpus of 29 million words was provided to us by Michael Collins, and was automatically parsed with the parser described in Collins (1997).5 The counts and their justification are described here:

° caus: As discussed above, the object of a causative transitive is the same semantic argument of the verb as the subject of the intransitive. The causative feature was approximated by the following steps, intended to capture the degree to which the subject of a verb can also occur as its object. Specifically, for each verb occurrence, the subject and object (if there was one) were extracted from the parsed corpus. The observed subjects across all occurrences of the verb were placed into one multiset of nouns, and the observed objects into a second multiset of nouns. (A multiset, or bag, was used so that our representation indicated the number of times each noun was used as either subject or object.) Then the proportion of overlap between the two multisets was calculated. We define overlap as the largest multiset of elements belonging to both the

5 Readers might be concerned about the portability of this method to languages for which no large parsed corpus is available. It is possible that using a fully parsed corpus is not necessary. Our results were replicated in English without the need for a fully parsed corpus (Anoop Sarkar, p.c., citing a project report by Wootiporn Tripasai). Our method was applied to 23 million words of the WSJ that were automatically tagged with Ratnaparkhi's maximum entropy tagger (Ratnaparkhi 1996) and chunked with the partial parser CASS (Abney 1996). The results are very similar to ours (best accuracy 66.6%), suggesting that a more accurate tagger than the one used on our corpus might in fact be sufficient to overcome the fact that no full parse is available.


subject and the object multisets; e.g., the overlap between {a, a, a, b} and {a} is {a, a, a}. The proportion is the ratio between the cardinality of the overlap multiset and the sum of the cardinalities of the subject and object multisets. For example, for the simple sets of characters above, the ratio would be 3/5, yielding a value of .60 for the caus feature.

° anim: A problem with a feature like animacy is that it requires either manual determination of the animacy of extracted subjects, or reference to an on-line resource such as WordNet for determining animacy. To approximate animacy with a feature that can be extracted automatically, and without reference to a resource external to the corpus, we take advantage of the well-attested animacy hierarchy, according to which pronouns are the most animate (Silverstein 1976; Dixon 1994). The hypothesis is that the words I, we, you, she, he, and they most often refer to animate entities. This hypothesis was confirmed by extracting 100 occurrences of the pronoun they, which can be either animate or inanimate, from our 65 million word corpus. The occurrences immediately preceded a verb. After eliminating repetitions, 94 occurrences were left, which were classified by hand, yielding 71 animate pronouns, 11 inanimate pronouns, and 12 unclassified occurrences (for lack of sufficient context to recover the antecedent of the pronoun with certainty). Thus, at least 76% of usages of they were animate; we assume the percentage of animate usages of the other pronouns to be even higher. Since the hypothesis was confirmed, we count pronouns (other than it) in subject position (Kariaeva [1999]; cf. Aone and McKee [1996]). The values for the feature were determined by automatically extracting all subject/verb tuples including our 59 example verbs from the parsed corpus, and computing the ratio of occurrences of pronoun subjects to all subjects for each verb.
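The caus and anim approximations reduce to short computations over the extracted tuples. The sketch below is an illustration rather than the original scripts; in particular, the worked example in the text ({a, a, a, b} versus {a} yielding .60) pins down the overlap as taking, for each noun occurring in both multisets, the larger of its two counts:

```python
from collections import Counter

def caus_score(subjects, objects):
    """Overlap between subject and object multisets, as defined in the text:
    for each noun occurring in both, the overlap keeps the larger count;
    the score is |overlap| / (|subjects| + |objects|)."""
    subj, obj = Counter(subjects), Counter(objects)
    overlap = sum(max(subj[n], obj[n]) for n in subj.keys() & obj.keys())
    total = sum(subj.values()) + sum(obj.values())
    return overlap / total if total else 0.0

# Pronouns on the animacy hierarchy ('it' is deliberately excluded).
ANIMATE_PRONOUNS = {"i", "we", "you", "she", "he", "they"}

def anim_score(subjects):
    """Ratio of (non-'it') pronoun subjects to all subjects of a verb."""
    if not subjects:
        return 0.0
    return sum(1 for s in subjects if s.lower() in ANIMATE_PRONOUNS) / len(subjects)
```

With the text's example, `caus_score(["a", "a", "a", "b"], ["a"])` reproduces the value .60.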

Finally, as indicated in Table 5, the verbs are designated as belonging to "group 1" or "group 2". All the verbs are treated equally in our data analysis and in the machine learning experiments, but this designation does indicate a difference in details of the counting procedures described above. The verbs in group 1 had been used in an earlier study in which it was important to minimize noisy data (Stevenson and Merlo 1997a), so they generally underwent greater manual intervention in the counts. In adding group 2 for the classification experiment, we chose to minimize the intervention, in order to demonstrate that the classification process is robust enough to withstand the resulting noise in the data.

For group 2, the transitivity, voice, and VBN counts were done automatically, without any manual intervention. For group 1, these three counts were done automatically by regular expression patterns, and then subjected to correction, partly by hand and partly automatically, by one of the authors. For transitivity, the adjustments vary for the individual verbs. Most of the reassignments from a transitive to an intransitive labelling occurred when the following noun was not the direct object, but rather a measure phrase or a date. Most of the reassignments from intransitive to transitive occurred when a particle or a preposition following the verb did not introduce a prepositional phrase, but instead indicated a passive form (by) or was part of a phrasal verb. Some verbs were mostly used adjectivally, in which case they were excluded from the transitivity counts. For voice, the required adjustments included cases of coordination of the past participle, when the verb was preceded by a conjunction or a comma.


Table 6
Aggregated relative frequency data for the five features. E = unergatives, A = unaccusatives, O = object-drops

                 trans        pass         vbn          caus         anim
Class   N    Mean   SD    Mean   SD    Mean   SD    Mean   SD    Mean   SD
E       20   .23    .23   .07    .12   .21    .26   .00    .00   .25    .24
A       19   .40    .24   .33    .27   .65    .27   .12    .14   .07    .09
O       20   .62    .25   .31    .26   .65    .23   .04    .07   .15    .14

These were collected and classified by hand as passive or active, based on intuition. Similarly, partial adjustments to the VBN counts were made by hand.

For the causativity feature, subjects and objects were determined by manual inspection of the corpus for verbs belonging to group 1, while they were extracted automatically from the parsed corpus for group 2. The group 1 verbs were sampled in three ways, depending on total frequency. For verbs with less than 150 occurrences, all instances of the verbs were used for subject/object extraction. For verbs whose total frequency was greater than 150, but whose VBD frequency was in the range 100-200, we extracted subjects and objects of the VBD occurrences only. For higher frequency verbs, we used only the first 100 VBD occurrences.6 The same script for computing the overlap of the extracted subjects and objects was then used on the resulting subject/verb and verb/object tuples, for both group 1 and group 2 verbs.

The animacy feature was calculated over subject/verb tuples extracted automatically, for both groups of verbs, from the parsed corpus.

3.2 Data Analysis

The data collection described above yields the following data points in total: trans 27,403; pass 20,481; vbn 36,297; caus 11,307; anim 7,542. (Different features yield different totals because they were sampled independently, and the search patterns to extract some features are more imprecise than others.) The aggregate means by class of the normalized frequencies for all verbs are shown in Table 6; item-by-item distributions are provided in Appendix A, and raw counts are available from the authors. Note that aggregate means are shown for illustration purposes only: all machine learning experiments are performed on the individual normalized frequencies for each verb, as given in Appendix A.

The observed distributions of each feature are indeed roughly as expected, according to the description in Section 2. Unergatives show a very low relative frequency of the trans feature, followed by unaccusatives, then object-drop verbs. Unaccusative verbs show a high frequency of the caus feature and a low frequency of the anim feature, compared to the other classes. Somewhat unexpectedly, object-drop verbs exhibit a non-zero mean caus value (almost half the verbs have a caus value greater than zero), leading to a three-way causative distinction among the verb classes. We suspect that the approximation that we used for causative use, the overlap between subjects

6 For this last set of high-frequency verbs (exploded, jumped, opened, played, rushed), we used the first 100 occurrences as the simplest way to collect the sample. In response to an anonymous reviewer's concern, we later verified that these counts were not different from counts obtained by random sampling of 100 VBD occurrences. A paired t-test of the two sets of counts (first-100 sampling and random sampling) indicates that the two sets of counts are not statistically different (t = 1.283, DF = 4, p = 0.2687).


Table 7
Manually (Man) and automatically (Aut) calculated features for a random sample of verbs. T = trans, P = pass, V = vbn, C = caus, A = anim

         Unergative                   Unaccusative                  Object-Drop
     hopped       scurried       folded       stabilized       inherited     swallowed
     Man   Aut    Man   Aut     Man   Aut    Man   Aut        Man   Aut     Man   Aut
T    .21   .21    .00   .00     .71   .23    .24   .18        1.00  .64     .96   .35
P    .00   .00    .00   .00     .44   .33    .19   .13        .39   .13     .54   .44
V    .03   .00    .10   .00     .56   .73    .71   .92        .56   .60     .64   .79
C    .00   .00    .00   .00     .54   .00    .24   .35        .00   .06     .00   .04
A    .93   1.00   .90   .14     .23   .00    .02   .00        .58   .32     .35   .22

and objects for a verb, also captures a "reciprocity" effect for some object-drop verbs (such as call), in which subjects and objects can be similar types of entities. Finally, although expected to be redundant indicators of transitivity, pass and vbn, unlike trans, have very similar values for unaccusative and object-drop verbs, indicating that their distributions are sensitive to factors we have not yet investigated.

One issue we must address is how precisely the automatic counts reflect the actual linguistic behaviour of the verbs. That is, we must be assured that the patterns we note in the data in Table 6 are accurate reflections of the differential behaviour of the verb classes, and not an artifact of the way in which we estimate the features, or a result of inaccuracies in the counts. In order to evaluate the accuracy of our feature counts, we selected two verbs from each class and determined the "true" value of each feature for each of those six verbs through manual counting. The six verbs were randomly selected from the group 2 subset of the verbs, since counts for group 2 verbs (as explained above) had not undergone manual correction. This allows us to determine the accuracy of the fully automatic counting procedures. The selected verbs (and their frequencies) are: hopped (29), scurried (21), folded (189), stabilized (286), inherited (357), swallowed (152). For verbs that had a frequency of over 100 in the "-ed" form, we performed the manual counts on the first 100 occurrences.

Table 7 shows the results of the manual counts, reported as proportions to facilitate comparison to the normalized automatic counts shown in adjoining columns. We observe first that, overall, most errors in the automatic counts occur in the unaccusative and object-drop verbs. While tagging errors affect the vbn feature for all of the verbs somewhat, we note that trans and pass are consistently underestimated for unaccusative and object-drop verbs. These errors make the unaccusative and object-drop feature values more similar to each other, and therefore potentially harder to distinguish. Furthermore, because the trans and pass values are underestimated by the automatic counts, and therefore lower in value, they are also closer to the values for the unergative verbs. For the caus feature, we predict the highest values for the unaccusative verbs, and while that prediction is confirmed, the automatic counts for that class also show the most errors. Finally, although the general pattern of higher values for the anim feature of unergatives and object-drop verbs is preserved in the automatic counts, the feature is underestimated for almost all the verbs, again making the values for that feature closer across the classes than they are in reality.

We conclude that, although there are inaccuracies in all the counts, the general patterns expected based on our analysis of the verb classes hold in both the manual and automatic counts. Errors in the estimating and counting procedures are therefore


not likely to be responsible for the pattern of data in Table 6 above, which generally matches our predictions. Furthermore, the errors, at least for this random sample of verbs, occur in a direction that makes our task of distinguishing the classes more difficult, and indicate that developing more accurate search patterns may possibly sharpen the class distinctions and improve the classification performance.

4. Experiments in Classification

In this section, we turn to our computational experiments, which investigate whether the statistical indicators of thematic properties that we have developed can in fact be used to classify verbs. Recall that the task we have set ourselves is that of automatically learning the best class for a set of usages of a verb, as opposed to classifying individual occurrences of the verb. The frequency distributions of our features yield a vector for each verb that represents the estimated values for the verb on each dimension, across the entire corpus.

Vector template: [verb-name, trans, pass, vbn, caus, anim, class]

Example: [opened, .69, .09, .21, .16, .36, unacc]

The resulting set of 59 vectors constitutes the data for our machine learning experiments. We use this data to train an automatic classifier to determine, given the feature values for a new verb (not from the training set), which of the three major classes of English optionally intransitive verbs it belongs to.

4.1 Experimental Methodology

In pilot experiments on a subset of the features, we investigated a number of supervised machine learning methods that produce automatic classifiers (decision tree induction, rule learning, and two types of neural networks), as well as hierarchical clustering; see Stevenson et al. (1999) for more detail. Because we achieved approximately the same level of performance in all cases, we narrowed our further experimentation to the publicly available version of the C5.0 machine learning system (http://www.rulequest.com), a newer version of C4.5 (Quinlan 1992), due to its ease of use and wide availability. The C5.0 system generates both decision trees and corresponding rule sets from a training set of known classifications. In our experiments, we found little to no difference in performance between the trees and rule sets, and report only the rule set results.

In the experiments below, we follow two methodologies in training and testing, each of which tests a subset of cases held out from the training data. Thus, in all cases, the results we report are on test data that was never seen in training.7

The first training and testing methodology we follow is 10-fold cross-validation. In this approach, the system randomly divides the data into ten parts, and runs ten times on a different 90% training data / 10% test data split, yielding an average accuracy and standard error across the ten test sets. This training methodology is very useful for

7 One anonymous reviewer raised the concern that we do not test on verbs that were unseen by the authors prior to finalizing the specific features to count. However, this does not reduce the generality of our results. The features we use are motivated by linguistic theory, and derived from the set of thematic properties that discriminate the verb classes. It is therefore very unlikely that they are skewed to the particular verbs we have chosen. Furthermore, our cross-validation experiments, described in the next subsection, show that our results hold across a very large number of randomly selected subsets of this sample of verbs.


our application, as it yields performance measures across a large number of training data/test data sets, avoiding the problems of outliers in a single random selection from a relatively small data set such as ours.

The second methodology is a single hold-out training and testing approach. Here the system is run N times, where N is the size of the data set (i.e., the 59 verbs in our case), each time holding out a single data vector as the test case and using the remaining N-1 vectors as the training set. The single hold-out methodology yields an overall accuracy rate (when the results are averaged across all N trials), but also, unlike cross-validation, gives us classification results on each individual data vector. This property enables us to analyze differential performance on the individual verbs, and across the different verb classes.
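The single hold-out (leave-one-out) loop can be sketched as follows, with a simple 1-nearest-neighbour classifier standing in for C5.0 and toy vectors standing in for the real data of Appendix A:

```python
# Leave-one-out (single hold-out) evaluation over verb vectors.
# A 1-nearest-neighbour classifier is a stand-in for the paper's C5.0;
# the toy vectors below are illustrative, not the actual data.

def nearest_neighbour(train, test_vec):
    """Return the class label of the training vector closest to test_vec."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    _, label = min(train, key=lambda item: sq_dist(item[0], test_vec))
    return label

def leave_one_out(data):
    """Hold out each vector in turn; return per-item predictions and accuracy."""
    preds = []
    for i, (vec, _label) in enumerate(data):
        train = data[:i] + data[i + 1:]  # the remaining N-1 vectors
        preds.append(nearest_neighbour(train, vec))
    correct = sum(p == lab for p, (_, lab) in zip(preds, data))
    return preds, correct / len(data)

# Toy (trans, pass, vbn, caus, anim) vectors with class labels.
toy = [
    ((0.20, 0.10, 0.20, 0.00, 0.30), "unerg"),
    ((0.25, 0.05, 0.15, 0.00, 0.25), "unerg"),
    ((0.40, 0.30, 0.60, 0.10, 0.05), "unacc"),
    ((0.45, 0.35, 0.70, 0.15, 0.10), "unacc"),
    ((0.60, 0.30, 0.65, 0.05, 0.15), "objdrop"),
    ((0.65, 0.25, 0.60, 0.03, 0.20), "objdrop"),
]
preds, acc = leave_one_out(toy)
```

Because every held-out item is predicted separately, the per-item `preds` list supports exactly the per-verb and per-class analysis described above.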

Under both training and testing methodologies, the baseline (chance) performance in this task, a three-way classification, is 33.9%. In the single hold-out methodology, there are 59 test cases, with 20, 19, and 20 verbs from the unergative, unaccusative, and object-drop classes, respectively. Chance performance of picking a single class label as a default and assigning it to all cases would yield at most 20 out of the 59 cases correct, or 33.9%. For the cross-validation methodology, the determination of a baseline is slightly more complex, as we are testing on a random selection of 10% of the full data set in each run. The 33.9% figure represents the expected relative proportion of a test set that would be labelled correctly by assignment of a default class label to the entire test set. Although the precise make-up of the test cases varies, on average the test set will represent the class membership proportions of the entire set of verbs. Thus, as with the single hold-out approach, chance accuracy corresponds to a maximum of 20/59, or 33.9%, of the test set being labelled correctly.

The theoretical maximum accuracy for the task is of course 100%, although in Section 5 we discuss some classification results from human experts that indicate that a more realistic expectation is much lower (around 87%).

4.2 Results Using 10-Fold Cross-Validation

We first report the results of experiments using a training methodology of 10-fold cross-validation, repeated 50 times. This means that the 10-fold cross-validation procedure is repeated for 50 different random divisions of the data. The numbers reported are the averages of the results over all the trials. That is, the average accuracy and standard error from each random division of the data (a single cross-validation run, including 10 training and test sets) are averaged across the 50 different random divisions. This large number of experimental trials gives us a very tight bound on the mean accuracy reported, enabling us to determine with high confidence the statistical significance of differences in results.

Table 8 shows that performance of classification using individual features varies greatly, from little above the baseline to almost 22% above the baseline, or a reduction of a third of the error rate: a very good result for a single feature. (All reported accuracies in Table 8 are statistically distinct at the p < .01 level, using an ANOVA [df = 249, F = 33472] with a Tukey-Kramer post test.)

The first line of Table 9 shows that the combination of all features achieves an accuracy of 69.8%, which is 35.9% over the baseline, for a reduction in the error rate of 54%. This is a rather considerable result, given the very low baseline (33.9%). Moreover, recall that our training and testing sets are always disjoint (cf. Lapata and Brew [1999]; Siegel [1999]); in other words, we are predicting the classification of verbs that were never seen in the training corpus, the hardest situation for a classification algorithm.

The second through sixth lines of Table 9 show the accuracy achieved on each subset of features that results from removing a single feature. This allows us to evaluate


Table 8
Percent accuracy and standard error of the verb classification task, using each feature individually, under a training methodology of 10-fold cross-validation repeated 50 times

Feature   Accuracy   SE

caus      55.7       0.1
vbn       52.5       0.5
pass      50.2       0.5
trans     47.1       0.4
anim      35.3       0.5

Table 9
Percent accuracy and standard error of the verb classification task, using features in combination, under a training methodology of 10-fold cross-validation repeated 50 times

    Features Used              Feature Not Used   Accuracy   SE

1   trans pass vbn caus anim                      69.8       0.5
2   trans vbn caus anim        pass               69.8       0.5
3   trans pass vbn anim        caus               67.3       0.6
4   trans pass caus anim       vbn                66.5       0.5
5   trans pass vbn caus        anim               63.2       0.6
6   pass vbn caus anim         trans              61.6       0.6

the contribution of each feature to the performance of the classification process, by comparing the performance of the subset without it to the performance using the full set of features. We see that the removal of pass (second line) has no effect on the results, while removal of the remaining features yields a 2-8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant at the p < .05 level, except for those between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 3752] with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted by the individual feature behavior. For example, caus, which is the best individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, anim and trans, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.

The general behaviour in classification based on individual features, and on size 4 and size 5 subsets of features, is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance; Table 10 summarizes this information. In the first data column, the table gives the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size n subsets in which each feature occurs. For example, the second data cell in the second row (54.9) indicates the average accuracy of all subsets of size 2 that contain the feature vbn. The last row of the table indicates, for each feature, the average accuracy of all subsets containing that feature.


Table 10
Average percent accuracy of feature subsets, by subset size and by sets of each size including each feature

                    Mean Accuracy    Mean Accuracy of Subsets that Include Each Feature
Subset Size         by Subset Size   vbn    pass   trans  anim   caus

1                   48.2             52.5   50.2   47.1   35.3   55.7
2                   55.1             54.9   52.8   56.4   58.0   57.6
3                   60.5             60.1   58.5   62.3   61.1   60.5
4                   65.7             65.5   64.7   66.7   66.3   65.3
5                   69.8             69.8   69.8   69.8   69.8   69.8

Mean Acc/Feature                     60.6   59.2   60.5   58.1   61.8

The first observation, that more features perform better, is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that, on average, larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation, that the performance of individual features is not always a predictor of their performance in combination, is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above average performance for all subsets of size 2 or greater.
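The summary in Table 10 can be recomputed directly from the Appendix B accuracies; the sketch below transcribes those values and averages them by subset size and by feature (transcription from Appendix B, so any copying error there would propagate here):

```python
# Accuracies (%) of all 31 feature subsets, transcribed from Appendix B.
ACC = {
    "trans pass vbn caus anim": 69.8, "trans vbn caus anim": 69.8,
    "trans pass vbn anim": 67.3, "trans vbn anim": 66.7,
    "trans pass caus anim": 66.5, "trans vbn caus": 64.4,
    "trans pass vbn caus": 63.2, "trans pass anim": 63.0,
    "trans caus anim": 62.9, "caus anim": 62.1,
    "trans pass caus": 61.7, "pass vbn caus anim": 61.6,
    "vbn caus anim": 60.1, "trans anim": 59.5, "vbn anim": 59.4,
    "pass vbn caus": 57.4, "trans caus": 57.3, "pass vbn anim": 57.3,
    "pass caus anim": 56.7, "vbn caus": 55.7, "caus": 55.7,
    "pass caus": 55.4, "trans pass vbn": 55.0, "trans pass": 54.7,
    "trans vbn": 54.2, "vbn": 52.5, "pass anim": 50.9,
    "pass vbn": 50.2, "pass": 50.2, "trans": 47.1, "anim": 35.3,
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def mean_by_size(n):
    """Average accuracy over all subsets of a given size."""
    return mean(a for k, a in ACC.items() if len(k.split()) == n)

def mean_for_feature(feature, n):
    """Average accuracy over the size-n subsets containing the feature."""
    return mean(a for k, a in ACC.items()
                if len(k.split()) == n and feature in k.split())
```

For instance, `mean_by_size(1)` recovers the 48.2 in the first row of Table 10, and `mean_for_feature("trans", 2)` the 56.4 in its second row.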

4.3 Results Using Single Hold-Out Methodology

One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system, but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case, and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features, as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, which we turn to in Section 4.4.

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance—in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being


Merlo and Stevenson   Statistical Verb Classification

Table 11
Percent accuracy of the verb classification task using features in combination, under a single hold-out training methodology.

   Features Used                Feature Not Used    Accuracy (%) on All Verbs

1  trans pass vbn caus anim     —                   69.5
2  trans vbn caus anim          pass                71.2
3  trans pass vbn anim          caus                62.7
4  trans pass caus anim         vbn                 61.0
5  trans pass vbn caus          anim                61.0
6  pass vbn caus anim           trans               64.4

Table 12
F score of classification within each class, under a single hold-out training methodology.

   Features Used                Feature     F score (%)  F score (%)  F score (%)
                                Not Used    for Unergs   for Unaccs   for Objdrops

1  trans pass vbn caus anim     —           73.9         68.6         64.9
2  trans vbn caus anim          pass        76.2         75.7         61.6
3  trans pass vbn anim          caus        65.1         60.0         62.8
4  trans pass caus anim         vbn         66.7         65.0         51.3
5  trans pass vbn caus          anim        72.7         47.0         60.0
6  pass vbn caus anim           trans       78.1         51.5         61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5–8% negative effect on performance, again similar to the cross-validation results. (Although note that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9%) than the unaccusative and object-drop verbs (F scores of 68.6% and 64.9%, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs—that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence, here we use the balanced F score, which calculates an overall measure of performance as 2PR/(P + R), in which P (precision) is truePositives / (truePositives + falsePositives) and R (recall) is truePositives / (truePositives + falseNegatives).
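The measures defined in footnote 8 can be implemented directly; a minimal sketch:

```python
# Precision, recall, and the balanced F score 2PR/(P+R) from footnote 8,
# computed from true-positive, false-positive, and false-negative counts.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)  # harmonic mean of precision and recall
```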


Computational Linguistics Volume 27 Number 3

frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8) but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.
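The grouping used in this analysis can be sketched as follows. The bin boundaries (log10 frequency 2 and 3) follow the text; the verbs in the usage example are invented for illustration, not the paper's verb list.

```python
import math

# Group (frequency, correctly_classified) pairs into log-frequency bins and
# report errors, totals, and the error rate per bin, as in the text.

def error_rate_by_log_freq(verbs, boundaries=(2, 3)):
    """verbs: list of (frequency, correctly_classified) pairs."""
    lo, hi = boundaries
    bins = {"low": [0, 0], "mid": [0, 0], "high": [0, 0]}  # [errors, total]
    for freq, correct in verbs:
        logf = math.log10(freq)
        key = "low" if logf < lo else ("mid" if logf < hi else "high")
        bins[key][1] += 1
        if not correct:
            bins[key][0] += 1
    return {k: (err, tot, err / tot if tot else 0.0)
            for k, (err, tot) in bins.items()}
```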

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification

We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans, pass, vbn, caus, anim) with the class labels assigned using each size-4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature in going from the four-feature to the five-feature set. Thus, we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2 and summarized here in Table 13.

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently "uncontrolled," and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.



Table 13
Expected class discriminations for each feature.

Feature          Expected Frequency Pattern

Transitivity     Unerg < Unacc < ObjDrop
Causativity      Unerg, ObjDrop < Unacc
Animacy          Unacc < Unerg, ObjDrop
Passive Voice    Unerg < Unacc, ObjDrop
VBN Tag          Unerg < Unacc, ObjDrop

Table 14
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features compared to two of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                        Assigned Class

             All features     w/o caus       w/o anim
              E   A   O        E   A   O      E   A   O

Actual   E    —   1   2        —   4   2      —   2   2
Class    A    4   —   3        5   —   2      5   —   6
         O    5   3   —        4   5   —      3   5   —

Table 15
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                              Assigned Class

             All features    w/o trans      w/o pass       w/o vbn
              E   A   O       E   A   O      E   A   O      E   A   O

Actual   E    —   1   2       —   2   2      —   1   3      —   1   5
Class    A    4   —   3       3   —   7      1   —   4      2   —   4
         O    5   3   —       2   5   —      5   3   —      4   6   —

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
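This pairwise measure is easy to compute from a confusion matrix: the confusability of two classes is the sum of the two cells opposite the diagonal. The sketch below encodes the all-features panel of Table 14 and recovers the values discussed in the text (5 errors for E/A, 6 for A/O, 7 for E/O).

```python
# Confusability of a class pair: the sum of the two off-diagonal cells.
# The matrix is the all-features panel of Table 14 (rows = actual class,
# columns = assigned class; diagonal cells are not error counts, hence 0).

def confusability(matrix, c1, c2):
    return matrix[c1][c2] + matrix[c2][c1]

all_features = {
    "E": {"E": 0, "A": 1, "O": 2},
    "A": {"E": 4, "A": 0, "O": 3},
    "O": {"E": 5, "A": 3, "O": 0},
}
```

For example, confusability(all_features, "E", "A") sums the one unergative labelled unaccusative and the four unaccusatives labelled unergative, giving 5.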

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features



we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature compared to the errors with the caus feature added to the set). We see that, without the caus feature, the confusability between unaccusatives and unergatives and between unaccusatives and object-drops is 9 and 7 errors, respectively, but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added: the confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors of 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated to trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First, consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans—that is, also making a three-way distinction among the classes—although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method in evaluating our approach to verb classification.



Table 16
Percent agreement (Agr) and pairwise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

         Program         E1             E2             E3
         Agr     K       Agr    K       Agr    K       Agr    K

E1       59     .36
E2       68     .50      75    .59
E3       66     .49      70    .53      77    .66
Levin    69.5   .54      71    .56      86.5  .80      83    .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments, which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard) as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results—provided by the non-forced-choice task—where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from Chicago University Press. The verbs were to be classified into the three target classes—unergative, unaccusative, and object-drop—which were described in the instructions.10 (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training—which yields an accuracy of 69.5%—because those results provide the classification for each of the individual verbs.



an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55–57) (using the z distribution to determine significance; p < .0001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0, for no agreement above chance, to 1, for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect then that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
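As a rough illustration of chance-corrected agreement, the standard two-rater Cohen's kappa can be computed as below. Note that this is a sketch of the general idea; the paper follows Klauer's (1987) formulation, which may differ in detail.

```python
from collections import Counter

# Two-rater chance-corrected agreement: K = (Po - Pe) / (1 - Pe), where Po is
# observed agreement and Pe is the agreement expected by chance given each
# rater's label proportions. Standard Cohen's kappa, shown for illustration.

def kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement depends on each rater's label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Because the expected-agreement term depends on each rater's label proportions, two rater pairs with identical percent agreement can receive different kappa values, as noted in the text.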

Evaluating the experts' performance, summarized in Table 16, we can remark two things, which confirm our expectations. First, the task is difficult—i.e., not performed at 100% (or close) even by trained experts when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .60) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
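The majority-vote combination described above can be sketched as follows; treating the first rater as the fallback "most accurate expert" is an assumption of this sketch, and the class labels in the usage example are illustrative.

```python
from collections import Counter

# Majority-vote combination of expert labels: each verb gets the label given
# by at least two experts; with no majority, fall back to one designated
# expert (here, by index).

def majority_vote(expert_labels, fallback_index=0):
    """expert_labels: list of per-expert label lists, one label per verb."""
    combined = []
    for labels in zip(*expert_labels):        # labels for one verb
        winner, count = Counter(labels).most_common(1)[0]
        combined.append(winner if count >= 2 else labels[fallback_index])
    return combined
```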

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task—86.5%—as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than one of the human experts, in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.

One interesting question is whether experts and program disagree on the same verbs and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard. However, in 9 cases, at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3) and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This



clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1000), and those with log frequency over 3 (i.e., frequency over 1000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops); the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops); and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote: verbs with frequency less than 100 yield an accuracy of 92% (K = .84); verbs with frequency between 100 and 1000, accuracy 80% (K = .69); and verbs with frequency over 1000, accuracy 90% (K = .82). For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5% (K = .74); verbs with frequency between 100 and 1000, accuracy 84% (K = .76); and verbs with frequency over 1000, accuracy 90% (K = .82).

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest-frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs, taken from Levin's (1993) electronic index as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = 0.64) with each other, and 86% (K = 0.80) and 69% (K = 0.57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = 0.56] with each other, and 68% [K = 0.55] and 60.5% [K = 0.46] with the gold standard, respectively.) These results show that the forced-



choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally, in syntactic alternations, giving rise to differences in subcategorization frames, or the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs—their argument structure—are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989], Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a



"change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples, meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998; Siegel 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993], Sanfilippo and Poznanski [1992], Manning [1993], Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that are easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed, which can work

399

Computational Linguistics Volume 27 Number 3

on a rough collection of texts and do not require a carefully balanced corpus or time-consuming semantic tagging

7. Related Work

We conclude from the discussion above that our own work and the work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens with an improvement of 10% over the baseline (an error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens with an improvement of 11% (an error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes) they achieve 83.9% accuracy compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multi-lingual texts. Like ours, their work goes beyond statistics over subcategorizations to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.
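The error rate reductions quoted for Lapata and Brew follow directly from their accuracy and baseline figures; as a quick check (the helper function is ours, the numbers are theirs):

```python
def error_rate_reduction(accuracy, baseline):
    """Percentage of the baseline error eliminated: (acc - base) / (100 - base)."""
    return 100.0 * (accuracy - baseline) / (100.0 - baseline)

# Verbs disambiguated by subcategorization frame: 91.8% vs. a 65.7% baseline
print(round(error_rate_reduction(91.8, 65.7)))  # → 76
# Verbs sharing subcategorizations across classes: 83.9% vs. a 61.3% baseline
print(round(error_rate_reduction(83.9, 61.3)))  # → 58
```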


Merlo and Stevenson, Statistical Verb Classification

Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45% with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
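Our F-score figure for Schulte im Walde's best result is the balanced (beta = 1) harmonic mean of her reported precision and recall:

```python
def f_score(precision, recall, beta=1.0):
    """F-measure; beta = 1 weights precision and recall equally."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision 61%, recall 36% (Schulte im Walde 2000, subcategorization only)
print(round(f_score(61.0, 36.0)))  # → 45
```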

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.
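McCarthy's measure proper compares probability distributions over WordNet classes; purely as an illustration of the simpler head-word variant mentioned above, the slot comparison can be sketched as a cosine similarity over slot-filler counts (the similarity measure and the example fillers are our invented assumptions, not her implementation):

```python
from collections import Counter
from math import sqrt

def cosine(c1, c2):
    """Cosine similarity between two head-word frequency distributions."""
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

# Hypothetical fillers of the two alternating slots of a causative verb like "melt":
intransitive_subjects = Counter(["ice", "snow", "butter", "ice"])
transitive_objects = Counter(["ice", "butter", "chocolate", "snow"])

# A high similarity between the alternating slots is taken as evidence
# that the verb participates in the alternation.
sim = cosine(intransitive_subjects, transitive_objects)
```

For a verb that does not alternate, the two slots would draw on largely disjoint head words, driving the similarity toward zero.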

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and of Lapata and Brew (1999) in attempting to learn semantic notions from distributions of

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.


indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and of the animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to the external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature), or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).
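The pronoun-based approximation of the anim feature can be sketched as follows. This is a simplified illustration: the pronoun list and the list-of-subjects input format are our assumptions, not the paper's actual extraction patterns.

```python
# Personal pronouns taken as a proxy for animate subjects (our assumption).
ANIMATE_PRONOUNS = {"i", "we", "you", "he", "she", "they",
                    "me", "us", "him", "her", "them"}

def anim_feature(subjects):
    """Approximate subject animacy for a verb as the fraction of its
    observed subjects that are realized as personal pronouns."""
    if not subjects:
        return 0.0
    hits = sum(1 for s in subjects if s.lower() in ANIMATE_PRONOUNS)
    return hits / len(subjects)

# Hypothetical subjects observed for one verb across a corpus:
print(anim_feature(["he", "the boy", "they", "prices"]))  # → 0.5
```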

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification, i.e., sets of verbs that have the same multi-way classifications according to Levin (1993), can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, and for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach, it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of


the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.
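Both reductions follow from the accuracy, baseline, and expert upper-bound figures reported earlier; the computation (ours) generalizes the usual error-rate-reduction formula by letting the ceiling be either 100% or the human upper bound:

```python
def error_reduction(accuracy, baseline, ceiling=100.0):
    """Reduction in error rate relative to a baseline, where error is
    measured against a ceiling (100% for a perfect task, or a human
    expert upper bound)."""
    return 100.0 * (accuracy - baseline) / (ceiling - baseline)

# 69.8% accuracy, 33.9% baseline, 86.5% expert upper bound
print(round(error_reduction(69.8, 33.9)))        # → 54
print(round(error_reduction(69.8, 33.9, 86.5)))  # → 68
```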

Acknowledgments

We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM), the United States NSF (grants 9702331 and 9818322 to SS), the Canadian NSERC (grant to SS), the University of Toronto, and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8-15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191-202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117-142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85-116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3-20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243-262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356-363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53-64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: the Kappa statistics. Computational Linguistics, 22(2):249-254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16-23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520-528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293-299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1-55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71-100.


Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322-327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547-619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53-110.

Jakobson, Roman. 1971. Signe Zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211-219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126-1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680-686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397-404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266-274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1-62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235-242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256-263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493-1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134-142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659-1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161-188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: an overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1-13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65-89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1-2):217-222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157-189.


Pinker, Steven. 1989. Learnability and Cognition: the Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133-142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1-2):127-160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49-56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80-87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747-753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112-119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112-171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237-265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401-430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715-720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1-2):349-399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15-21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177-184, Vancouver, Canada.


Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs
           Freq  vbn   pass  trans caus  anim

Min Value     8  0.00  0.00  0.00  0.00  0.00
Max Value  4088  1.00  0.39  0.74  0.00  1.00

floated     176  0.43  0.26  0.74  0.00  0.17
hurried      86  0.40  0.31  0.50  0.00  0.37
jumped     4088  0.09  0.00  0.03  0.00  0.20
leaped      225  0.09  0.00  0.05  0.00  0.13
marched     238  0.10  0.01  0.09  0.00  0.12
paraded      33  0.73  0.39  0.46  0.00  0.50
raced       123  0.01  0.00  0.06  0.00  0.15
rushed      467  0.22  0.12  0.20  0.00  0.10
vaulted      54  0.00  0.00  0.41  0.00  0.03
wandered     67  0.02  0.00  0.03  0.00  0.32
galloped     12  1.00  0.00  0.00  0.00  0.00
glided       14  0.00  0.00  0.08  0.00  0.50
hiked        25  0.28  0.12  0.29  0.00  0.40
hopped       29  0.00  0.00  0.21  0.00  1.00
jogged        8  0.29  0.00  0.29  0.00  0.33
scooted      10  0.00  0.00  0.43  0.00  0.00
scurried     21  0.00  0.00  0.00  0.00  0.14
skipped      82  0.22  0.02  0.64  0.00  0.16
tiptoed      12  0.17  0.00  0.00  0.00  0.00
trotted      37  0.19  0.17  0.07  0.00  0.18

Unaccusative Verbs
            Freq  vbn   pass  trans caus  anim

Min Value     13  0.16  0.00  0.02  0.00  0.00
Max Value   5543  0.95  0.80  0.76  0.41  0.36

boiled        58  0.92  0.70  0.42  0.00  0.00
cracked      175  0.61  0.19  0.76  0.02  0.14
dissolved    226  0.51  0.58  0.71  0.05  0.11
exploded     409  0.34  0.02  0.66  0.37  0.04
flooded      235  0.47  0.57  0.44  0.04  0.03
fractured     55  0.95  0.76  0.51  0.00  0.00
hardened     123  0.92  0.55  0.56  0.12  0.00
melted        70  0.80  0.44  0.02  0.00  0.19
opened      3412  0.21  0.09  0.69  0.16  0.36
solidified    34  0.65  0.21  0.68  0.00  0.12
collapsed    950  0.16  0.00  0.16  0.01  0.02
cooled       232  0.85  0.21  0.29  0.13  0.11
folded       189  0.73  0.33  0.23  0.00  0.00
widened     1155  0.18  0.02  0.13  0.41  0.01
changed     5543  0.73  0.23  0.47  0.22  0.08
cleared     1145  0.58  0.40  0.50  0.31  0.06
divided     1539  0.93  0.80  0.17  0.10  0.05
simmered      13  0.83  0.00  0.09  0.00  0.00
stabilized   286  0.92  0.13  0.18  0.35  0.00


Object-Drop Verbs
            Freq  vbn   pass  trans caus  anim

Min Value     39  0.10  0.04  0.21  0.00  0.00
Max Value  15063  0.95  0.99  1.00  0.24  0.42

carved       185  0.85  0.66  0.98  0.00  0.00
danced        88  0.22  0.14  0.37  0.00  0.00
kicked       308  0.30  0.18  0.97  0.00  0.33
knitted       39  0.95  0.99  0.93  0.00  0.00
painted      506  0.72  0.18  0.71  0.00  0.38
played      2689  0.38  0.16  0.24  0.00  0.00
reaped       172  0.56  0.05  0.90  0.00  0.22
typed         57  0.81  0.74  0.81  0.00  0.00
washed       137  0.79  0.60  1.00  0.00  0.00
yelled        74  0.10  0.04  0.38  0.00  0.00
borrowed    1188  0.77  0.15  0.60  0.13  0.19
inherited    357  0.60  0.13  0.64  0.06  0.32
organized   1504  0.85  0.38  0.65  0.18  0.07
rented       232  0.72  0.22  0.61  0.00  0.42
sketched      44  0.67  0.17  0.44  0.00  0.20
cleaned      160  0.83  0.47  0.21  0.05  0.21
packed       376  0.84  0.12  0.40  0.05  0.19
studied      901  0.66  0.17  0.57  0.05  0.11
swallowed    152  0.79  0.44  0.35  0.04  0.22
called     15063  0.56  0.22  0.72  0.24  0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.
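The overlap test just described can be written out directly (the helper functions are ours; the example figures are the full feature set, 69.8 with SE 0.5, against trans pass vbn alone, 55.0 with SE 0.6):

```python
def ci95(accuracy, se, t=2.01):
    """95% confidence interval (low, high); t = 2.01 for df = 49, two-tailed."""
    return (accuracy - t * se, accuracy + t * se)

def significantly_different(acc1, se1, acc2, se2):
    """True if the two 95% confidence intervals do not overlap (p < .05)."""
    lo1, hi1 = ci95(acc1, se1)
    lo2, hi2 = ci95(acc2, se2)
    return hi1 < lo2 or hi2 < lo1

print(significantly_different(69.8, 0.5, 55.0, 0.6))  # → True
```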

Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim
69.8      0.5  trans vbn caus anim
67.3      0.6  trans pass vbn anim
66.7      0.5  trans vbn anim
66.5      0.5  trans pass caus anim
64.4      0.5  trans vbn caus
63.2      0.6  trans pass vbn caus
63.0      0.5  trans pass anim
62.9      0.4  trans caus anim
62.1      0.5  caus anim
61.7      0.5  trans pass caus
61.6      0.6  pass vbn caus anim
60.1      0.4  vbn caus anim
59.5      0.6  trans anim
59.4      0.5  vbn anim
57.4      0.6  pass vbn caus
57.3      0.5  trans caus
57.3      0.5  pass vbn anim
56.7      0.5  pass caus anim
55.7      0.5  vbn caus
55.7      0.1  caus
55.4      0.4  pass caus
55.0      0.6  trans pass vbn
54.7      0.4  trans pass
54.2      0.5  trans vbn
52.5      0.5  vbn
50.9      0.5  pass anim
50.2      0.6  pass vbn
50.2      0.5  pass
47.1      0.4  trans
35.3      0.5  anim


Table 5
Verbs used in the experiments.

Class Name    Description                      Selected Verbs

Unergative    manner of motion                 jumped, rushed, marched, leaped, floated, raced, hurried, wandered, vaulted, paraded (group 1); galloped, glided, hiked, hopped, jogged, scooted, scurried, skipped, tiptoed, trotted (group 2)

Unaccusative  change of state                  opened, exploded, flooded, dissolved, cracked, hardened, boiled, melted, fractured, solidified (group 1); collapsed, cooled, folded, widened, changed, cleared, divided, simmered, stabilized (group 2)

Object-Drop   unexpressed object alternation   played, painted, kicked, carved, reaped, washed, danced, yelled, typed, knitted (group 1); borrowed, inherited, organized, rented, sketched, cleaned, packed, studied, swallowed, called (group 2)

verb presents the same form in the simple past and in the past participle (the regular "-ed" form). In order to simplify the counting procedure, we included only the "-ed" form of the verb, on the assumption that counts on this single verb form would approximate the distribution of the features across all forms of the verb. Additionally, as far as we were able given the preceding constraints, we selected verbs that could occur in the transitive and in the passive. Finally, we aimed for a frequency cut-off of 10 occurrences or more for each verb, although for unergatives we had to use one verb (jogged) that only occurred 8 times, in order to have 20 verbs that satisfied the other criteria above.

In performing this kind of corpus analysis, one has to recognize the fact that current corpus annotations do not distinguish verb senses. In these counts, we did not distinguish a core sense of the verb from an extended use of the verb. So, for instance, the sentence Consumer spending jumped 1.7% in February after a sharp drop the month before (WSJ 1987) is counted as an occurrence of the manner-of-motion verb jump in its intransitive form. This particular sense extension has a transitive alternant, but not a causative transitive (i.e., Consumer spending jumped the barrier, but not Low taxes jumped consumer spending). Thus, while the possible subcategorizations remain the same, rates of transitivity and causativity may be different than for the literal manner-of-motion sense. This is an unavoidable result of using simple automatic extraction methods, given the current state of annotation of corpora.

For each occurrence of each verb, we counted whether it was in a transitive or intransitive use (trans); in a passive or active use (pass); in a past participle or simple past use (vbn); in a causative or non-causative use (caus); and with an animate subject or not (anim).4 Note that, except for the vbn feature, for which we simply extract the POS tag from the corpus, all other counts are approximations to the actual linguistic behaviour of the verb, as we describe in detail below.

4 One additional feature was recorded: the log frequency of the verb in the 65-million-word corpus, motivated by the conjecture that the frequency of a verb may help in predicting its class. In our machine learning experiments, however, this conjecture was not borne out, as the frequency feature did not improve performance. This is the case for experiments on all of the verbs, as well as for separate experiments on the group 1 verbs (which were matched across the classes for frequency) and the group 2 verbs (which were not). We therefore limit discussion here to the thematically-motivated features.


Merlo and Stevenson   Statistical Verb Classification

The first three counts (trans, pass, vbn) were performed on the tagged ACL/DCI corpus available from the Linguistic Data Consortium, which includes the Brown Corpus (of one million words) and years 1987–1989 of the Wall Street Journal, a combined corpus in excess of 65 million words. The counts for these features proceeded as follows:

° trans: A number, a pronoun, a determiner, an adjective, or a noun was considered to be an indication of a potential object of the verb. A verb occurrence preceded by forms of the verb be, or immediately followed by a potential object, was counted as transitive; otherwise, the occurrence was counted as intransitive (specifically, if the verb was followed by a punctuation sign (commas, colons, full stops), or by a conjunction, a particle, a date, or a preposition).

° pass: A main verb (i.e., tagged VBD) was counted as active. A token with tag VBN was also counted as active if the closest preceding auxiliary was have, while it was counted as passive if the closest preceding auxiliary was be.

° vbn: The counts for VBN/VBD were simply done based on the POS label within the tagged corpus.

Each of the above three counts was normalized over all occurrences of the "-ed" form of the verb, yielding a single relative frequency measure for each verb for that feature: i.e., percent transitive (versus intransitive) use, percent active (versus passive) use, and percent VBN (versus VBD) use, respectively.
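As a concrete illustration, the per-occurrence transitivity heuristic and its normalization might be sketched as follows. The tag sets and the function are our assumptions for illustration, not the authors' actual regular-expression scripts:

```python
# Sketch of the trans feature, following the counting rules above.
# Tag names are Penn Treebank; the exact sets below are assumptions,
# not the authors' actual search patterns.
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}
OBJECT_TAGS = {"CD", "PRP", "DT", "JJ", "NN", "NNS", "NNP"}  # number, pronoun, det, adj, noun

def trans_feature(occurrences):
    """occurrences: one (prev_word, next_tag) pair per '-ed' token of a verb.
    Returns the relative frequency of transitive use for that verb."""
    transitive = sum(1 for prev, nxt in occurrences
                     if prev.lower() in BE_FORMS or nxt in OBJECT_TAGS)
    return transitive / len(occurrences)

# Hypothetical occurrences of one verb: two transitive contexts, two intransitive.
print(trans_feature([("was", "IN"), ("they", "DT"), ("they", ","), ("it", "IN")]))  # 0.5
```

The pass and vbn counts would be normalized in the same way, dividing the count for the marked value by the total number of "-ed" occurrences.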

The last two counts (caus and anim) were performed on a parsed version of the 1988 year of the Wall Street Journal, so that we could extract subjects and objects of the verbs more accurately. This corpus of 29 million words was provided to us by Michael Collins, and was automatically parsed with the parser described in Collins (1997).5 The counts and their justification are described here:

° caus: As discussed above, the object of a causative transitive is the same semantic argument of the verb as the subject of the intransitive. The causative feature was approximated by the following steps, intended to capture the degree to which the subject of a verb can also occur as its object. Specifically, for each verb occurrence, the subject and object (if there was one) were extracted from the parsed corpus. The observed subjects across all occurrences of the verb were placed into one multiset of nouns, and the observed objects into a second multiset of nouns. (A multiset, or bag, was used so that our representation indicated the number of times each noun was used as either subject or object.) Then the proportion of overlap between the two multisets was calculated. We define overlap as the largest multiset of elements belonging to both the

5 Readers might be concerned about the portability of this method to languages for which no large parsed corpus is available. It is possible that using a fully parsed corpus is not necessary. Our results were replicated in English without the need for a fully parsed corpus (Anoop Sarkar, p.c., citing a project report by Wootiporn Tripasai). Our method was applied to 23 million words of the WSJ that were automatically tagged with Ratnaparkhi's maximum entropy tagger (Ratnaparkhi 1996) and chunked with the partial parser CASS (Abney 1996). The results are very similar to ours (best accuracy 66.6%), suggesting that a more accurate tagger than the one used on our corpus might in fact be sufficient to overcome the fact that no full parse is available.


subject and the object multisets; e.g., the overlap between {a, a, a, b} and {a} is {a, a, a}. The proportion is the ratio between the cardinality of the overlap multiset and the sum of the cardinalities of the subject and object multisets. For example, for the simple sets of characters above, the ratio would be 3/5, yielding a value of 0.60 for the caus feature.
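Under the reading of "overlap" that reproduces the {a, a, a, b} / {a} example (for each noun occurring in both roles, take the larger of its two counts), the caus computation can be sketched as follows; the function name and this interpretation of the overlap definition are our assumptions:

```python
from collections import Counter

def caus_score(subjects, objects):
    """Approximate the caus feature: the degree to which a verb's subjects
    also occur as its objects, via multiset overlap."""
    subj, obj = Counter(subjects), Counter(objects)
    # For each noun seen in both roles, keep the larger of its two counts;
    # this reading reproduces the example above: overlap({a,a,a,b}, {a}) = {a,a,a}.
    overlap = sum(max(subj[n], obj[n]) for n in subj.keys() & obj.keys())
    return overlap / (sum(subj.values()) + sum(obj.values()))

print(caus_score(["a", "a", "a", "b"], ["a"]))  # 3 / 5 = 0.6
```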

° anim: A problem with a feature like animacy is that it requires either manual determination of the animacy of extracted subjects, or reference to an on-line resource such as WordNet for determining animacy. To approximate animacy with a feature that can be extracted automatically, and without reference to a resource external to the corpus, we take advantage of the well-attested animacy hierarchy, according to which pronouns are the most animate (Silverstein 1976; Dixon 1994). The hypothesis is that the words I, we, you, she, he, and they most often refer to animate entities. This hypothesis was confirmed by extracting 100 occurrences of the pronoun they (which can be either animate or inanimate) from our 65-million-word corpus. The occurrences immediately preceded a verb. After eliminating repetitions, 94 occurrences were left, which were classified by hand, yielding 71 animate pronouns, 11 inanimate pronouns, and 12 unclassified occurrences (for lack of sufficient context to recover the antecedent of the pronoun with certainty). Thus, at least 76% of usages of they were animate; we assume the percentage of animate usages of the other pronouns to be even higher. Since the hypothesis was confirmed, we count pronouns (other than it) in subject position (Kariaeva [1999]; cf. Aone and McKee [1996]). The values for the feature were determined by automatically extracting all subject/verb tuples including our 59 example verbs from the parsed corpus, and computing the ratio of occurrences of pronoun subjects to all subjects for each verb.
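Given the extracted subjects for a verb, the anim ratio itself is a one-line computation. A minimal sketch, assuming the subject/verb tuple extraction has already happened upstream:

```python
# Sketch of the anim feature: ratio of pronoun subjects (excluding 'it')
# to all subjects for one verb. Function name is our own.
PRONOUNS = {"i", "we", "you", "she", "he", "they"}  # 'it' deliberately excluded

def anim_feature(subjects):
    """subjects: the extracted subject tokens for one verb."""
    pronoun_count = sum(1 for s in subjects if s.lower() in PRONOUNS)
    return pronoun_count / len(subjects)

print(anim_feature(["they", "He", "spending", "company"]))  # 2 / 4 = 0.5
```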

Finally, as indicated in Table 5, the verbs are designated as belonging to "group 1" or "group 2". All the verbs are treated equally in our data analysis and in the machine learning experiments, but this designation does indicate a difference in details of the counting procedures described above. The verbs in group 1 had been used in an earlier study in which it was important to minimize noisy data (Stevenson and Merlo 1997a), so they generally underwent greater manual intervention in the counts. In adding group 2 for the classification experiment, we chose to minimize the intervention, in order to demonstrate that the classification process is robust enough to withstand the resulting noise in the data.

For group 2, the transitivity, voice, and VBN counts were done automatically, without any manual intervention. For group 1, these three counts were done automatically by regular expression patterns, and then subjected to correction, partly by hand and partly automatically, by one of the authors. For transitivity, the adjustments vary for the individual verbs. Most of the reassignments from a transitive to an intransitive labelling occurred when the following noun was not the direct object, but rather a measure phrase or a date. Most of the reassignments from intransitive to transitive occurred when a particle or a preposition following the verb did not introduce a prepositional phrase, but instead indicated a passive form (by) or was part of a phrasal verb. Some verbs were mostly used adjectivally, in which case they were excluded from the transitivity counts. For voice, the required adjustments included cases of coordination of the past participle, when the verb was preceded by a conjunction or a comma.


Table 6
Aggregated relative frequency data for the five features. E = unergatives, A = unaccusatives, O = object-drops.

            trans        pass         vbn          caus         anim
Class  N    Mean   SD    Mean   SD    Mean   SD    Mean   SD    Mean   SD
E      20   0.23   0.23  0.07   0.12  0.21   0.26  0.00   0.00  0.25   0.24
A      19   0.40   0.24  0.33   0.27  0.65   0.27  0.12   0.14  0.07   0.09
O      20   0.62   0.25  0.31   0.26  0.65   0.23  0.04   0.07  0.15   0.14

These were collected and classified by hand as passive or active, based on intuition. Similarly, partial adjustments to the VBN counts were made by hand.

For the causativity feature, subjects and objects were determined by manual inspection of the corpus for verbs belonging to group 1, while they were extracted automatically from the parsed corpus for group 2. The group 1 verbs were sampled in three ways, depending on total frequency. For verbs with less than 150 occurrences, all instances of the verbs were used for subject/object extraction. For verbs whose total frequency was greater than 150, but whose VBD frequency was in the range 100–200, we extracted subjects and objects of the VBD occurrences only. For higher-frequency verbs, we used only the first 100 VBD occurrences.6 The same script for computing the overlap of the extracted subjects and objects was then used on the resulting subject/verb and verb/object tuples, for both group 1 and group 2 verbs.

The animacy feature was calculated over subject/verb tuples extracted automatically, for both groups of verbs, from the parsed corpus.

3.2 Data Analysis

The data collection described above yields the following data points in total: trans 27,403; pass 20,481; vbn 36,297; caus 11,307; anim 7,542. (Different features yield different totals because they were sampled independently, and the search patterns to extract some features are more imprecise than others.) The aggregate means by class of the normalized frequencies for all verbs are shown in Table 6; item-by-item distributions are provided in Appendix A, and raw counts are available from the authors. Note that aggregate means are shown for illustration purposes only; all machine learning experiments are performed on the individual normalized frequencies for each verb, as given in Appendix A.

The observed distributions of each feature are indeed roughly as expected according to the description in Section 2. Unergatives show a very low relative frequency of the trans feature, followed by unaccusatives, then object-drop verbs. Unaccusative verbs show a high frequency of the caus feature, and a low frequency of the anim feature, compared to the other classes. Somewhat unexpectedly, object-drop verbs exhibit a non-zero mean caus value (almost half the verbs have a caus value greater than zero), leading to a three-way causative distinction among the verb classes. We suspect that the approximation that we used for causative use, the overlap between subjects

6 For this last set of high-frequency verbs (exploded, jumped, opened, played, rushed), we used the first 100 occurrences as the simplest way to collect the sample. In response to an anonymous reviewer's concern, we later verified that these counts were not different from counts obtained by random sampling of 100 VBD occurrences. A paired t-test of the two sets of counts (first-100 sampling and random sampling) indicates that the two sets of counts are not statistically different (t = 1.283, DF = 4, p = 0.2687).


Table 7
Manually (Man) and automatically (Aut) calculated features for a random sample of verbs. T = trans, P = pass, V = vbn, C = caus, A = anim.

     Unergative                 Unaccusative               Object-Drop
     hopped       scurried      folded       stabilized    inherited    swallowed
     Man    Aut   Man    Aut    Man    Aut   Man    Aut    Man    Aut   Man    Aut
T    0.21   0.21  0.00   0.00   0.71   0.23  0.24   0.18   1.00   0.64  0.96   0.35
P    0.00   0.00  0.00   0.00   0.44   0.33  0.19   0.13   0.39   0.13  0.54   0.44
V    0.03   0.00  0.10   0.00   0.56   0.73  0.71   0.92   0.56   0.60  0.64   0.79
C    0.00   0.00  0.00   0.00   0.54   0.00  0.24   0.35   0.00   0.06  0.00   0.04
A    0.93   1.00  0.90   0.14   0.23   0.00  0.02   0.00   0.58   0.32  0.35   0.22

and objects for a verb, also captures a "reciprocity" effect for some object-drop verbs (such as call), in which subjects and objects can be similar types of entities. Finally, although expected to be a redundant indicator of transitivity, pass and vbn, unlike trans, have very similar values for unaccusative and object-drop verbs, indicating that their distributions are sensitive to factors we have not yet investigated.

One issue we must address is how precisely the automatic counts reflect the actual linguistic behaviour of the verbs. That is, we must be assured that the patterns we note in the data in Table 6 are accurate reflections of the differential behaviour of the verb classes, and not an artifact of the way in which we estimate the features, or a result of inaccuracies in the counts. In order to evaluate the accuracy of our feature counts, we selected two verbs from each class, and determined the "true" value of each feature for each of those six verbs through manual counting. The six verbs were randomly selected from the group 2 subset of the verbs, since counts for group 2 verbs (as explained above) had not undergone manual correction. This allows us to determine the accuracy of the fully automatic counting procedures. The selected verbs (and their frequencies) are: hopped (29), scurried (21), folded (189), stabilized (286), inherited (357), swallowed (152). For verbs that had a frequency of over 100 in the "-ed" form, we performed the manual counts on the first 100 occurrences.

Table 7 shows the results of the manual counts, reported as proportions to facilitate comparison to the normalized automatic counts shown in adjoining columns. We observe first that, overall, most errors in the automatic counts occur in the unaccusative and object-drop verbs. While tagging errors affect the vbn feature for all of the verbs somewhat, we note that trans and pass are consistently underestimated for unaccusative and object-drop verbs. These errors make the unaccusative and object-drop feature values more similar to each other, and therefore potentially harder to distinguish. Furthermore, because the trans and pass values are underestimated by the automatic counts, and therefore lower in value, they are also closer to the values for the unergative verbs. For the caus feature, we predict the highest values for the unaccusative verbs, and while that prediction is confirmed, the automatic counts for that class also show the most errors. Finally, although the general pattern of higher values for the anim feature of unergatives and object-drop verbs is preserved in the automatic counts, the feature is underestimated for almost all the verbs, again making the values for that feature closer across the classes than they are in reality.

We conclude that, although there are inaccuracies in all the counts, the general patterns expected based on our analysis of the verb classes hold in both the manual and automatic counts. Errors in the estimating and counting procedures are therefore


not likely to be responsible for the pattern of data in Table 6 above, which generally matches our predictions. Furthermore, the errors, at least for this random sample of verbs, occur in a direction that makes our task of distinguishing the classes more difficult, and indicates that developing more accurate search patterns may possibly sharpen the class distinctions and improve the classification performance.

4. Experiments in Classification

In this section we turn to our computational experiments, which investigate whether the statistical indicators of thematic properties that we have developed can in fact be used to classify verbs. Recall that the task we have set ourselves is that of automatically learning the best class for a set of usages of a verb, as opposed to classifying individual occurrences of the verb. The frequency distributions of our features yield a vector for each verb that represents the estimated values for the verb on each dimension, across the entire corpus:

Vector template: [verb-name, trans, pass, vbn, caus, anim, class]

Example: [opened, .69, .09, .21, .16, .36, unacc]

The resulting set of 59 vectors constitutes the data for our machine learning experiments. We use this data to train an automatic classifier to determine, given the feature values for a new verb (not from the training set), which of the three major classes of English optionally intransitive verbs it belongs to.

4.1 Experimental Methodology

In pilot experiments on a subset of the features, we investigated a number of supervised machine learning methods that produce automatic classifiers (decision tree induction, rule learning, and two types of neural networks), as well as hierarchical clustering; see Stevenson et al. (1999) for more detail. Because we achieved approximately the same level of performance in all cases, we narrowed our further experimentation to the publicly available version of the C5.0 machine learning system (http://www.rulequest.com), a newer version of C4.5 (Quinlan 1992), due to its ease of use and wide availability. The C5.0 system generates both decision trees and corresponding rule sets from a training set of known classifications. In our experiments we found little to no difference in performance between the trees and rule sets, and report only the rule set results.
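C5.0 is a commercial RuleQuest tool rather than a drop-in library; as a rough stand-in, scikit-learn's CART decision tree can be trained on per-verb vectors of the same shape. The feature values below are illustrative class profiles loosely modelled on Table 6, not the paper's actual data:

```python
# Decision-tree classification of per-verb feature vectors. This uses
# scikit-learn's CART implementation as a stand-in for C5.0; the data
# below are hypothetical profiles, not the paper's 59 verb vectors.
from sklearn.tree import DecisionTreeClassifier

# [trans, pass, vbn, caus, anim] per verb (illustrative values)
X = [
    [0.23, 0.07, 0.21, 0.00, 0.25],  # unergative-like profile
    [0.20, 0.05, 0.15, 0.00, 0.30],  # unergative-like profile
    [0.40, 0.33, 0.65, 0.12, 0.07],  # unaccusative-like profile
    [0.45, 0.30, 0.70, 0.15, 0.05],  # unaccusative-like profile
    [0.62, 0.31, 0.65, 0.04, 0.15],  # object-drop-like profile
    [0.60, 0.28, 0.60, 0.05, 0.18],  # object-drop-like profile
]
y = ["unerg", "unerg", "unacc", "unacc", "objdrop", "objdrop"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[0.22, 0.06, 0.20, 0.00, 0.27]]))
```

The query vector sits squarely inside the unergative region on every feature, so the induced tree routes it to that class.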

In the experiments below, we follow two methodologies in training and testing, each of which tests a subset of cases held out from the training data. Thus, in all cases, the results we report are on test data that was never seen in training.7

The first training and testing methodology we follow is 10-fold cross-validation. In this approach, the system randomly divides the data into ten parts, and runs ten times on a different 90%-training-data/10%-test-data split, yielding an average accuracy and standard error across the ten test sets. This training methodology is very useful for

7 One anonymous reviewer raised the concern that we do not test on verbs that were unseen by the authors prior to finalizing the specific features to count. However, this does not reduce the generality of our results. The features we use are motivated by linguistic theory, and derived from the set of thematic properties that discriminate the verb classes. It is therefore very unlikely that they are skewed to the particular verbs we have chosen. Furthermore, our cross-validation experiments, described in the next subsection, show that our results hold across a very large number of randomly selected subsets of this sample of verbs.


our application, as it yields performance measures across a large number of training-data/test-data sets, avoiding the problems of outliers in a single random selection from a relatively small data set such as ours.

The second methodology is a single hold-out training and testing approach. Here the system is run N times, where N is the size of the data set (i.e., the 59 verbs in our case), each time holding out a single data vector as the test case and using the remaining N-1 vectors as the training set. The single hold-out methodology yields an overall accuracy rate (when the results are averaged across all N trials), but also, unlike cross-validation, gives us classification results on each individual data vector. This property enables us to analyze differential performance on the individual verbs, and across the different verb classes.
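The hold-out loop itself is simple to state in code. The sketch below runs leave-one-out with a nearest-centroid rule standing in for C5.0, on hypothetical two-feature vectors; both the classifier and the data are illustrative, not the paper's:

```python
# Leave-one-out (single hold-out) evaluation, with a nearest-centroid
# classifier as a stand-in for C5.0. Data are hypothetical.
def nearest_centroid(train, test_vec):
    """train: list of (vector, label). Returns the label whose class
    centroid is closest (squared Euclidean distance) to test_vec."""
    by_class = {}
    for vec, label in train:
        by_class.setdefault(label, []).append(vec)
    centroids = {lab: [sum(col) / len(vecs) for col in zip(*vecs)]
                 for lab, vecs in by_class.items()}
    def dist(c, v):
        return sum((a - b) ** 2 for a, b in zip(c, v))
    return min(centroids, key=lambda lab: dist(centroids[lab], test_vec))

def leave_one_out(data):
    """Hold out each vector in turn; return overall accuracy."""
    correct = 0
    for i, (vec, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        correct += nearest_centroid(train, vec) == label
    return correct / len(data)

# Hypothetical [trans, caus] vectors for six verbs.
data = [([0.20, 0.00], "unerg"), ([0.25, 0.00], "unerg"),
        ([0.40, 0.12], "unacc"), ([0.45, 0.15], "unacc"),
        ([0.60, 0.04], "objdrop"), ([0.62, 0.05], "objdrop")]
print(leave_one_out(data))  # 1.0 on this easily separable toy set
```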

Under both training and testing methodologies, the baseline (chance) performance in this task, a three-way classification, is 33.9%. In the single hold-out methodology, there are 59 test cases, with 20, 19, and 20 verbs each from the unergative, unaccusative, and object-drop classes, respectively. Chance performance of picking a single class label as a default and assigning it to all cases would yield at most 20 out of the 59 cases correct, or 33.9%. For the cross-validation methodology, the determination of a baseline is slightly more complex, as we are testing on a random selection of 10% of the full data set in each run. The 33.9% figure represents the expected relative proportion of a test set that would be labelled correctly by assignment of a default class label to the entire test set. Although the precise make-up of the test cases varies, on average the test set will represent the class membership proportions of the entire set of verbs. Thus, as with the single hold-out approach, chance accuracy corresponds to a maximum of 20/59, or 33.9%, of the test set being labelled correctly.

The theoretical maximum accuracy for the task is, of course, 100%, although in Section 5 we discuss some classification results from human experts that indicate that a more realistic expectation is much lower (around 87%).

4.2 Results Using 10-Fold Cross-Validation

We first report the results of experiments using a training methodology of 10-fold cross-validation, repeated 50 times. This means that the 10-fold cross-validation procedure is repeated for 50 different random divisions of the data. The numbers reported are the averages of the results over all the trials. That is, the average accuracy and standard error from each random division of the data (a single cross-validation run, including 10 training and test sets) are averaged across the 50 different random divisions. This large number of experimental trials gives us a very tight bound on the mean accuracy reported, enabling us to determine with high confidence the statistical significance of differences in results.

Table 8 shows that performance of classification using individual features varies greatly, from little above the baseline to almost 22% above the baseline, or a reduction of a third of the error rate, a very good result for a single feature. (All reported accuracies in Table 8 are statistically distinct at the p < .01 level, using an ANOVA [df = 249, F = 33472] with a Tukey-Kramer post test.)

The first line of Table 9 shows that the combination of all features achieves an accuracy of 69.8%, which is 35.9% over the baseline, for a reduction in the error rate of 54%. This is a rather considerable result, given the very low baseline (33.9%). Moreover, recall that our training and testing sets are always disjoint (cf. Lapata and Brew [1999]; Siegel [1999]); in other words, we are predicting the classification of verbs that were never seen in the training corpus, the hardest situation for a classification algorithm.

The second through sixth lines of Table 9 show the accuracy achieved on each subset of features that results from removing a single feature. This allows us to evaluate


Table 8
Percent accuracy and standard error of the verb classification task, using each feature individually, under a training methodology of 10-fold cross-validation repeated 50 times.

Feature   Accuracy   SE
caus      55.7       .1
vbn       52.5       .5
pass      50.2       .5
trans     47.1       .4
anim      35.3       .5

Table 9
Percent accuracy and standard error of the verb classification task, using features in combination, under a training methodology of 10-fold cross-validation repeated 50 times.

   Features Used               Feature Not Used   Accuracy   SE
1  trans pass vbn caus anim    (none)             69.8       .5
2  trans vbn caus anim         pass               69.8       .5
3  trans pass vbn anim         caus               67.3       .6
4  trans pass caus anim        vbn                66.5       .5
5  trans pass vbn caus         anim               63.2       .6
6  pass vbn caus anim          trans              61.6       .6

the contribution of each feature to the performance of the classification process, by comparing the performance of the subset without it to the performance using the full set of features. We see that the removal of pass (second line) has no effect on the results, while removal of the remaining features yields a 2–8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant at the p < .05 level, except for between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 3752] with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted by the individual feature behavior. For example, caus, which is the best individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, anim and trans, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.

The general behaviour in classification, based on individual features and on size 4 and size 5 subsets of features, is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance; Table 10 summarizes this information. In the first data column, the table illustrates the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size n subsets in which each feature occurs. For example, the second data cell in the second row (54.9) indicates the average accuracy of all subsets of size 2 that contain the feature vbn. The last row of the table indicates, for each feature, the average accuracy of all subsets containing that feature.


Table 10
Average percent accuracy of feature subsets, by subset size and by sets of each size including each feature.

                              Mean Accuracy of Subsets that Include Each Feature
Subset    Mean Accuracy
Size      by Subset Size      vbn     pass    trans   anim    caus
1         48.2                52.5    50.2    47.1    35.3    55.7
2         55.1                54.9    52.8    56.4    58.0    57.6
3         60.5                60.1    58.5    62.3    61.1    60.5
4         65.7                65.5    64.7    66.7    66.3    65.3
5         69.8                69.8    69.8    69.8    69.8    69.8

Mean Acc./Feature             60.6    59.2    60.5    58.1    61.8

The first observation, that more features perform better, is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that, on average, larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation, that the performance of individual features is not always a predictor of their performance in combination, is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above-average performance for all subsets of size 2 or greater.
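The row averages of Table 10 are straightforward to compute once per-subset accuracies are available; a sketch, where the accuracies are placeholders rather than the Appendix B values:

```python
# Mean accuracy by subset size, as in the first data column of Table 10.
# The accuracy values below are placeholders, not the Appendix B results.
from itertools import combinations
from statistics import mean

FEATURES = ("trans", "pass", "vbn", "caus", "anim")

def mean_accuracy_by_size(acc):
    """acc maps each feature subset (a frozenset) to its accuracy.
    Returns {subset size: mean accuracy over all subsets of that size}."""
    return {k: mean(acc[frozenset(s)] for s in combinations(FEATURES, k))
            for k in range(1, len(FEATURES) + 1)}

# Placeholder accuracies: each subset scored by its size (illustration only).
acc = {frozenset(s): 40.0 + 6.0 * k
       for k in range(1, 6) for s in combinations(FEATURES, k)}
print(mean_accuracy_by_size(acc))  # {1: 46.0, 2: 52.0, 3: 58.0, 4: 64.0, 5: 70.0}
```

The per-feature columns follow the same pattern, restricting each size-k average to the subsets containing the feature of interest.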

4.3 Results Using Single Hold-Out Methodology

One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system, but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case, and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case, and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, which we turn to in Section 4.4.

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance; in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being


Table 11
Percent accuracy of the verb classification task, using features in combination, under a single hold-out training methodology.

   Features Used               Feature Not Used   Accuracy on All Verbs
1  trans pass vbn caus anim    (none)             69.5
2  trans vbn caus anim         pass               71.2
3  trans pass vbn anim         caus               62.7
4  trans pass caus anim        vbn                61.0
5  trans pass vbn caus         anim               61.0
6  pass vbn caus anim          trans              64.4

Table 12
F score of classification within each class, under a single hold-out training methodology.

   Features Used               Feature Not Used   F (%) Unergs   F (%) Unaccs   F (%) Objdrops
1  trans pass vbn caus anim    (none)             73.9           68.6           64.9
2  trans vbn caus anim         pass               76.2           75.7           61.6
3  trans pass vbn anim         caus               65.1           60.0           62.8
4  trans pass caus anim        vbn                66.7           65.0           51.3
5  trans pass vbn caus         anim               72.7           47.0           60.0
6  pass vbn caus anim          trans              78.1           51.5           61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5–8% negative effect on performance, again similar to the cross-validation results. (Although note that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9%) than the unaccusative and object-drop verbs (F scores of 68.6% and 64.9%, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs, that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence, here we use the balanced F score, which calculates an overall measure of performance as 2PR / (P + R), in which P (precision) is truePositives / (truePositives + falsePositives), and R (recall) is truePositives / (truePositives + falseNegatives).
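The per-class balanced F score of footnote 8 is a few lines of code; the counts in the usage example are made up for illustration:

```python
# Balanced F score for one class, from raw true/false positive and
# false negative counts, as defined in footnote 8.
def f_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g., 15 of a class's 20 verbs recovered, with 5 verbs wrongly assigned to it:
print(f_score(tp=15, fp=5, fn=5))  # precision = recall = 0.75, so F = 0.75
```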


Computational Linguistics Volume 27 Number 3

frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8) but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.
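The log-frequency banding behind this error analysis can be sketched as follows; the verb list, frequencies, and classification outcomes below are invented for illustration, not the paper's data:

```python
import math
from collections import defaultdict

def bin_by_log_frequency(verbs):
    """Group (verb, corpus_freq, correct) triples into log10-frequency bands
    and report (errors, total, percent error) per band."""
    bins = defaultdict(lambda: [0, 0])  # band -> [errors, total]
    for verb, freq, correct in verbs:
        logf = math.log10(freq)
        band = "<2" if logf < 2 else "2-3" if logf < 3 else ">3"
        bins[band][1] += 1
        if not correct:
            bins[band][0] += 1
    return {band: (e, n, round(100 * e / n)) for band, (e, n) in bins.items()}

# Invented data: (verb, corpus frequency, correctly classified?)
sample = [("jump", 80, True), ("march", 90, False),
          ("open", 400, True), ("boil", 2000, False)]
print(bin_by_log_frequency(sample))
```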

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification

We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans, pass, vbn, caus, anim) with the class labels assigned using each size-4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature, in going from the four-feature to the five-feature set. Thus we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2 and summarized here in Table 13.
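The subset construction behind this comparison (hold out one feature at a time from the full set) can be sketched as follows; the feature names come from the text, while the function name is ours:

```python
from itertools import combinations

FEATURES = ("trans", "pass", "vbn", "caus", "anim")

def ablation_subsets(features=FEATURES):
    """All size-(n-1) subsets of the feature set; comparing classifications
    under each subset against the full set isolates one feature's contribution."""
    return [tuple(s) for s in combinations(features, len(features) - 1)]

for subset in ablation_subsets():
    held_out = (set(FEATURES) - set(subset)).pop()
    print(f"without {held_out}: train on {subset}")
```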

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently "uncontrolled", and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.


Merlo and Stevenson: Statistical Verb Classification

Table 13
Expected class discriminations for each feature.

Feature          Expected Frequency Pattern
Transitivity     Unerg < Unacc < ObjDrop
Causativity      Unerg, ObjDrop < Unacc
Animacy          Unacc < Unerg, ObjDrop
Passive Voice    Unerg < Unacc < ObjDrop
VBN Tag          Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features compared to two of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops.

                          Assigned Class
               All features    w/o caus       w/o anim
               E    A    O     E    A    O    E    A    O
Actual    E    –    1    2     –    4    2    –    2    2
Class     A    4    –    3     5    –    2    5    –    6
          O    5    3    –     4    5    –    3    5    –

Table 15
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops.

                                Assigned Class
               All features   w/o trans    w/o pass     w/o vbn
               E   A   O      E   A   O    E   A   O    E   A   O
Actual    E    –   1   2      –   2   2    –   1   3    –   1   5
Class     A    4   –   3      3   –   7    1   –   4    2   –   4
          O    5   3   –      2   5   –    5   3   –    4   6   –

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
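This pairwise confusability can be read off the two off-diagonal cells mechanically. The error counts below are taken from the "all features" panel of Table 14 (the diagonal is zero here, since only misclassifications are tabulated); the function name is ours:

```python
def confusability(error_matrix, classes, c1, c2):
    """Sum of the two off-diagonal error cells for a pair of classes:
    c1 labelled as c2, plus c2 labelled as c1."""
    i, j = classes.index(c1), classes.index(c2)
    return error_matrix[i][j] + error_matrix[j][i]

# Error counts from the "all features" panel of Table 14.
classes = ["unerg", "unacc", "objdrop"]
errors = [[0, 1, 2],
          [4, 0, 3],
          [5, 3, 0]]
print(confusability(errors, classes, "unacc", "objdrop"))  # 6, as reported in the text
```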

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features


we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives and between unaccusatives and object-drops is 9 and 7 errors, respectively, but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added: the confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors from 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated to trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans (that is, also making a three-way distinction among the classes), although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method in evaluating our approach to verb classification.


Table 16
Percent agreement (Agr) and pairwise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

          Program        E1             E2             E3
          Agr     K      Agr     K      Agr     K      Agr     K
E1        59      .36
E2        68      .50    75      .59
E3        66      .49    70      .53    77      .66
Levin     69.5    .54    71      .56    86.5    .80    83      .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results (provided by the non-forced-choice task) where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from the University of Chicago Press. The verbs were to be classified into the three target classes (unergative, unaccusative, and object-drop) which were described in the instructions.10 (All materials and instructions are available at http://www.latl.unige.ch/latl/personal/paola.html.)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training (which yields an accuracy of 69.5%), because those results provide the classification for each of the individual verbs.


an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55–57) (using the z distribution to determine significance; p < .0001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect then that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
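The underlying idea can be sketched with Cohen's pairwise formulation (the paper itself follows Klauer's procedure, which may differ in detail); the toy labellings below are invented:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Pairwise agreement above chance between two annotators (Cohen's kappa).
    Chance agreement depends on each annotator's label proportions."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    chance = sum((freq_a[c] / n) * (freq_b[c] / n)
                 for c in set(labels_a) | set(labels_b))
    return (observed - chance) / (1 - chance)

# Invented toy labellings over three classes, not the paper's data.
expert_1 = ["E", "E", "A", "O", "A", "O"]
expert_2 = ["E", "A", "A", "O", "A", "E"]
print(round(cohens_kappa(expert_1, expert_2), 2))  # 0.5
```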

Evaluating the experts' performance, summarized in Table 16, we can note two things which confirm our expectations. First, the task is difficult (i.e., not performed at 100%, or close, even by trained experts when compared to the gold standard), with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a three-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .60) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
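The majority-vote combination with the tie-breaking fallback can be sketched as follows; the class names and example votes are illustrative:

```python
from collections import Counter

def majority_label(votes, fallback):
    """Label given by at least two experts; when all three disagree,
    fall back to the most accurate expert's label."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else fallback

# Three experts' labels for one verb; fallback is the best expert's choice.
print(majority_label(["unerg", "unerg", "objdrop"], fallback="unerg"))   # unerg
print(majority_label(["unerg", "unacc", "objdrop"], fallback="unacc"))   # unacc
```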

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task (86.5%) as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than one of the human experts, in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.
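The 68% figure can be recomputed from the accuracy, chance baseline, and expert upper bound given in the text; the helper name is ours:

```python
def error_reduction_over_chance(accuracy, chance, upper_bound=1.0):
    """Fraction of the chance-to-upper-bound gap covered by the classifier."""
    return (accuracy - chance) / (upper_bound - chance)

# Figures from the text: 69.5% accuracy, 34% chance baseline, 86.5% expert upper bound.
print(round(100 * error_reduction_over_chance(0.695, 0.34, 0.865)))  # 68
```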

One interesting question is whether experts and program disagree on the same verbs and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard. However, in 9 cases at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3) and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This


clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1,000), and those with log frequency over 3 (i.e., frequency over 1,000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops); the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops); and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote: verbs with frequency less than 100 yield an accuracy of 92% (K = .84); verbs with frequency between 100 and 1,000, accuracy 80% (K = .69); and verbs with frequency over 1,000, accuracy 90% (K = .82). For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5% (K = .74); verbs with frequency between 100 and 1,000, accuracy 84% (K = .76); and verbs with frequency over 1,000, accuracy 90% (K = .82).

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest-frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs, taken from Levin's (1993) electronic index as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = .64) with each other, and 86% (K = .80) and 69% (K = .57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = .56] with each other, and 68% [K = .55] and 60.5% [K = .46] with the gold standard, respectively.) These results show that the forced-


choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally in syntactic alternations, giving rise to differences in subcategorization frames or the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs (their argument structure) are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989] and Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a


"change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples, meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998, 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful in predicting semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993]; Sanfilippo and Poznanski [1992]; Manning [1993]; Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed, which can work


on a rough collection of texts, and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and the work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly, as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multi-lingual texts. Like ours, their work goes beyond statistics over subcategorizations, to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.
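The error-rate reductions cited here follow directly from the reported accuracies and baselines; the helper name is ours:

```python
def error_rate_reduction(accuracy, baseline):
    """Proportion of the baseline error eliminated: (acc - base) / (1 - base)."""
    return (accuracy - baseline) / (1.0 - baseline)

# Figures reported for Lapata and Brew (1999), expressed as proportions.
print(round(100 * error_rate_reduction(0.918, 0.657)))  # 76
print(round(100 * error_rate_reduction(0.839, 0.613)))  # 58
```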


Merlo and Stevenson    Statistical Verb Classification

Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45%, with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
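The F-score we derive from Schulte im Walde's precision and recall figures follows from the standard balanced F-measure; a quick check (a generic sketch, not code from either study):

```python
def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Schulte im Walde (2000): precision 61%, recall 36%
print(round(f_score(0.61, 0.36), 2))  # 0.45
```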

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and Lapata and Brew (1999) in attempting to learn semantic notions from distributions of

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.


Computational Linguistics    Volume 27, Number 3

indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and of animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature), or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification (i.e., sets of verbs that have the same multi-way classifications according to Levin [1993]) can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction or of the words in the construction). Thus the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of



the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.
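The error-rate reductions quoted here follow directly from the accuracy figures reported earlier (69.8% accuracy, 33.9% chance baseline, 86.5% expert upper bound). A minimal sketch of the computation (the function name is ours):

```python
def error_reduction(accuracy, baseline, ceiling=100.0):
    """Fraction of the error over the baseline that is eliminated,
    measured against a perfect (or expert) performance ceiling."""
    return (accuracy - baseline) / (ceiling - baseline)

# against perfect (100%) performance
print(round(100 * error_reduction(69.8, 33.9)))        # 54
# against the 86.5% expert-based upper bound
print(round(100 * error_reduction(69.8, 33.9, 86.5)))  # 68
```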

Acknowledgments

We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM); the United States NSF (grants 9702331 and 9818322 to SS); the Canadian NSERC (grant to SS); the University of Toronto; and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8-15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191-202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117-142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85-116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3-20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243-262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356-363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53-64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249-254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16-23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520-528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293-299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1-55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71-100.

Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322-327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547-619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53-110.

Jakobson, Roman. 1971. Signe zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211-219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126-1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680-686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397-404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266-274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1-62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235-242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256-263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493-1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: The case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134-142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659-1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161-188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: An overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1-13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65-89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1-2):217-222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157-189.

Pinker, Steven. 1989. Learnability and Cognition: The Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133-142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: An information-theoretic model and its computational realization. Cognition, 61(1-2):127-160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49-56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80-87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747-753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using Machine Learning Methods to Combine Corpus-Based Indicators for Aspectual Classification of Clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112-119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112-171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237-265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401-430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715-720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1-2):349-399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15-21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177-184, Vancouver, Canada.



Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs
              Freq   vbn    pass   trans  caus   anim
Min Value        8   0.00   0.00   0.00   0.00   0.00
Max Value     4088   1.00   0.39   0.74   0.00   1.00

floated        176   0.43   0.26   0.74   0.00   0.17
hurried         86   0.40   0.31   0.50   0.00   0.37
jumped        4088   0.09   0.00   0.03   0.00   0.20
leaped         225   0.09   0.00   0.05   0.00   0.13
marched        238   0.10   0.01   0.09   0.00   0.12
paraded         33   0.73   0.39   0.46   0.00   0.50
raced          123   0.01   0.00   0.06   0.00   0.15
rushed         467   0.22   0.12   0.20   0.00   0.10
vaulted         54   0.00   0.00   0.41   0.00   0.03
wandered        67   0.02   0.00   0.03   0.00   0.32
galloped        12   1.00   0.00   0.00   0.00   0.00
glided          14   0.00   0.00   0.08   0.00   0.50
hiked           25   0.28   0.12   0.29   0.00   0.40
hopped          29   0.00   0.00   0.21   0.00   1.00
jogged           8   0.29   0.00   0.29   0.00   0.33
scooted         10   0.00   0.00   0.43   0.00   0.00
scurried        21   0.00   0.00   0.00   0.00   0.14
skipped         82   0.22   0.02   0.64   0.00   0.16
tiptoed         12   0.17   0.00   0.00   0.00   0.00
trotted         37   0.19   0.17   0.07   0.00   0.18

Unaccusative Verbs
              Freq   vbn    pass   trans  caus   anim
Min Value       13   0.16   0.00   0.02   0.00   0.00
Max Value     5543   0.95   0.80   0.76   0.41   0.36

boiled          58   0.92   0.70   0.42   0.00   0.00
cracked        175   0.61   0.19   0.76   0.02   0.14
dissolved      226   0.51   0.58   0.71   0.05   0.11
exploded       409   0.34   0.02   0.66   0.37   0.04
flooded        235   0.47   0.57   0.44   0.04   0.03
fractured       55   0.95   0.76   0.51   0.00   0.00
hardened       123   0.92   0.55   0.56   0.12   0.00
melted          70   0.80   0.44   0.02   0.00   0.19
opened        3412   0.21   0.09   0.69   0.16   0.36
solidified      34   0.65   0.21   0.68   0.00   0.12
collapsed      950   0.16   0.00   0.16   0.01   0.02
cooled         232   0.85   0.21   0.29   0.13   0.11
folded         189   0.73   0.33   0.23   0.00   0.00
widened       1155   0.18   0.02   0.13   0.41   0.01
changed       5543   0.73   0.23   0.47   0.22   0.08
cleared       1145   0.58   0.40   0.50   0.31   0.06
divided       1539   0.93   0.80   0.17   0.10   0.05
simmered        13   0.83   0.00   0.09   0.00   0.00
stabilized     286   0.92   0.13   0.18   0.35   0.00



Object-Drop Verbs
              Freq   vbn    pass   trans  caus   anim
Min Value       39   0.10   0.04   0.21   0.00   0.00
Max Value    15063   0.95   0.99   1.00   0.24   0.42

carved         185   0.85   0.66   0.98   0.00   0.00
danced          88   0.22   0.14   0.37   0.00   0.00
kicked         308   0.30   0.18   0.97   0.00   0.33
knitted         39   0.95   0.99   0.93   0.00   0.00
painted        506   0.72   0.18   0.71   0.00   0.38
played        2689   0.38   0.16   0.24   0.00   0.00
reaped         172   0.56   0.05   0.90   0.00   0.22
typed           57   0.81   0.74   0.81   0.00   0.00
washed         137   0.79   0.60   1.00   0.00   0.00
yelled          74   0.10   0.04   0.38   0.00   0.00
borrowed      1188   0.77   0.15   0.60   0.13   0.19
inherited      357   0.60   0.13   0.64   0.06   0.32
organized     1504   0.85   0.38   0.65   0.18   0.07
rented         232   0.72   0.22   0.61   0.00   0.42
sketched        44   0.67   0.17   0.44   0.00   0.20
cleaned        160   0.83   0.47   0.21   0.05   0.21
packed         376   0.84   0.12   0.40   0.05   0.19
studied        901   0.66   0.17   0.57   0.05   0.11
swallowed      152   0.79   0.44   0.35   0.04   0.22
called       15063   0.56   0.22   0.72   0.24   0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.

Accuracy  SE   Features                      Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim      57.3      0.5  trans caus
69.8      0.5  trans vbn caus anim           57.3      0.5  pass vbn anim
67.3      0.6  trans pass vbn anim           56.7      0.5  pass caus anim
66.7      0.5  trans vbn anim                55.7      0.5  vbn caus
66.5      0.5  trans pass caus anim          55.7      0.1  caus
64.4      0.5  trans vbn caus                55.4      0.4  pass caus
63.2      0.6  trans pass vbn caus           55.0      0.6  trans pass vbn
63.0      0.5  trans pass anim               54.7      0.4  trans pass
62.9      0.4  trans caus anim               54.2      0.5  trans vbn
62.1      0.5  caus anim                     52.5      0.5  vbn
61.7      0.5  trans pass caus               50.9      0.5  pass anim
61.6      0.6  pass vbn caus anim            50.2      0.6  pass vbn
60.1      0.4  vbn caus anim                 50.2      0.5  pass
59.5      0.6  trans anim                    47.1      0.4  trans
59.4      0.5  vbn anim                      35.3      0.5  anim
57.4      0.6  pass vbn caus
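The interval-overlap test described above can be sketched as follows (a generic illustration of the procedure, not the authors' code):

```python
def significantly_different(acc1, se1, acc2, se2, t=2.01):
    """Test two accuracies at p < .05 by checking whether their 95%
    confidence intervals (accuracy +/- t * SE, df = 49) overlap."""
    lo1, hi1 = acc1 - t * se1, acc1 + t * se1
    lo2, hi2 = acc2 - t * se2, acc2 + t * se2
    return hi1 < lo2 or hi2 < lo1  # disjoint intervals -> significant

# full feature set vs. the trans-pass-vbn subset
print(significantly_different(69.8, 0.5, 55.0, 0.6))  # True
```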




The first three counts (trans, pass, vbn) were performed on the tagged ACL/DCI corpus available from the Linguistic Data Consortium, which includes the Brown Corpus (of one million words) and years 1987-1989 of the Wall Street Journal, a combined corpus in excess of 65 million words. The counts for these features proceeded as follows:

° trans: A number, a pronoun, a determiner, an adjective, or a noun was considered to be an indication of a potential object of the verb. A verb occurrence preceded by forms of the verb be, or immediately followed by a potential object, was counted as transitive; otherwise the occurrence was counted as intransitive (specifically, if the verb was followed by a punctuation sign (commas, colons, full stops), or by a conjunction, a particle, a date, or a preposition).

° pass: A main verb (i.e., tagged VBD) was counted as active. A token with tag VBN was also counted as active if the closest preceding auxiliary was have, while it was counted as passive if the closest preceding auxiliary was be.

° vbn: The counts for VBN/VBD were simply done based on the POS label within the tagged corpus.

Each of the above three counts was normalized over all occurrences of the "-ed" form of the verb, yielding a single relative frequency measure for each verb for that feature: i.e., percent transitive (versus intransitive) use, percent active (versus passive) use, and percent VBN (versus VBD) use, respectively.
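The pass heuristic, for instance, can be sketched over Penn Treebank tags roughly as follows. This is our simplified reconstruction of the rule described above, not the original extraction code; the auxiliary word lists are our assumption:

```python
# Surface forms of "be" and "have" (our assumption).
BE = {"be", "am", "is", "are", "was", "were", "been", "being"}
HAVE = {"have", "has", "had", "having"}

def voice_of_vbn(tagged_tokens, i):
    """Classify the VBN token at position i as passive if the closest
    preceding auxiliary is a form of "be", active if it is "have"."""
    assert tagged_tokens[i][1] == "VBN"
    for word, _tag in reversed(tagged_tokens[:i]):
        w = word.lower()
        if w in BE:
            return "passive"
        if w in HAVE:
            return "active"
    return "active"  # no auxiliary found: default to active (our choice)

sent = [("the", "DT"), ("door", "NN"), ("was", "VBD"), ("opened", "VBN")]
print(voice_of_vbn(sent, 3))  # passive
```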

The last two counts (caus and anim) were performed on a parsed version of the 1988 year of the Wall Street Journal, so that we could extract subjects and objects of the verbs more accurately. This corpus of 29 million words was provided to us by Michael Collins, and was automatically parsed with the parser described in Collins (1997).5 The counts and their justification are described here.

° caus: As discussed above, the object of a causative transitive is the same semantic argument of the verb as the subject of the intransitive. The causative feature was approximated by the following steps, intended to capture the degree to which the subject of a verb can also occur as its object. Specifically, for each verb occurrence, the subject and object (if there was one) were extracted from the parsed corpus. The observed subjects across all occurrences of the verb were placed into one multiset of nouns, and the observed objects into a second multiset of nouns. (A multiset, or bag, was used so that our representation indicated the number of times each noun was used as either subject or object.) Then the proportion of overlap between the two multisets was calculated. We define overlap as the largest multiset of elements belonging to both the

5 Readers might be concerned about the portability of this method to languages for which no large parsed corpus is available. It is possible that using a fully parsed corpus is not necessary. Our results were replicated in English without the need for a fully parsed corpus (Anoop Sarkar, p.c., citing a project report by Wootiporn Tripasai). Our method was applied to 23 million words of the WSJ that were automatically tagged with Ratnaparkhi's maximum entropy tagger (Ratnaparkhi 1996) and chunked with the partial parser CASS (Abney 1996). The results are very similar to ours (best accuracy 66.6%), suggesting that a more accurate tagger than the one used on our corpus might in fact be sufficient to overcome the fact that no full parse is available.



subject and the object multisets; e.g., the overlap between {a, a, a, b} and {a} is {a, a, a}. The proportion is the ratio between the cardinality of the overlap multiset and the sum of the cardinalities of the subject and object multisets. For example, for the simple sets of characters above, the ratio would be 3/5, yielding a value of 60% for the caus feature.
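A minimal sketch of this overlap computation, using Python's Counter to represent the multisets (the function name is ours, and the "larger count" reading of the overlap definition is our interpretation of the {a, a, a, b} vs. {a} example):

```python
from collections import Counter

def caus_score(subjects, objects):
    """Overlap ratio between the subject and object noun multisets.
    For each noun occurring in both roles, the overlap keeps the larger
    of its two counts, so {a,a,a,b} vs. {a} overlap in {a,a,a}."""
    subj, obj = Counter(subjects), Counter(objects)
    overlap = sum(max(subj[n], obj[n]) for n in subj.keys() & obj.keys())
    total = sum(subj.values()) + sum(obj.values())
    return overlap / total if total else 0.0

print(caus_score(["a", "a", "a", "b"], ["a"]))  # 0.6
```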

° anim: A problem with a feature like animacy is that it requires either manual determination of the animacy of extracted subjects, or reference to an on-line resource such as WordNet for determining animacy. To approximate animacy with a feature that can be extracted automatically, and without reference to a resource external to the corpus, we take advantage of the well-attested animacy hierarchy, according to which pronouns are the most animate (Silverstein 1976; Dixon 1994). The hypothesis is that the words I, we, you, she, he, and they most often refer to animate entities. This hypothesis was confirmed by extracting 100 occurrences of the pronoun they (which can be either animate or inanimate) from our 65 million word corpus. The occurrences immediately preceded a verb. After eliminating repetitions, 94 occurrences were left, which were classified by hand, yielding 71 animate pronouns, 11 inanimate pronouns, and 12 unclassified occurrences (for lack of sufficient context to recover the antecedent of the pronoun with certainty). Thus at least 76% of usages of they were animate; we assume the percentage of animate usages of the other pronouns to be even higher. Since the hypothesis was confirmed, we count pronouns (other than it) in subject position (Kariaeva [1999]; cf. Aone and McKee [1996]). The values for the feature were determined by automatically extracting all subject/verb tuples including our 59 example verbs from the parsed corpus, and computing the ratio of occurrences of pronoun subjects to all subjects for each verb.
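The anim value is then a simple ratio over a verb's extracted subjects; a sketch under the pronoun-based approximation above (names are ours):

```python
# Pronouns taken as animate per the hierarchy discussed above;
# "it" is deliberately excluded.
ANIMATE_PRONOUNS = {"i", "we", "you", "she", "he", "they"}

def anim_score(subjects):
    """Proportion of a verb's subject tokens that are animate pronouns."""
    if not subjects:
        return 0.0
    pron = sum(1 for s in subjects if s.lower() in ANIMATE_PRONOUNS)
    return pron / len(subjects)

print(anim_score(["they", "door", "He", "storm"]))  # 0.5
```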

Finally, as indicated in Table 5, the verbs are designated as belonging to "group 1" or "group 2". All the verbs are treated equally in our data analysis and in the machine learning experiments, but this designation does indicate a difference in details of the counting procedures described above. The verbs in group 1 had been used in an earlier study in which it was important to minimize noisy data (Stevenson and Merlo 1997a), so they generally underwent greater manual intervention in the counts. In adding group 2 for the classification experiment, we chose to minimize the intervention, in order to demonstrate that the classification process is robust enough to withstand the resulting noise in the data.

For group 2, the transitivity, voice, and VBN counts were done automatically, without any manual intervention. For group 1, these three counts were done automatically by regular expression patterns, and then subjected to correction, partly by hand and partly automatically, by one of the authors. For transitivity, the adjustments vary for the individual verbs. Most of the reassignments from a transitive to an intransitive labelling occurred when the following noun was not the direct object, but rather a measure phrase or a date. Most of the reassignments from intransitive to transitive occurred when a particle or a preposition following the verb did not introduce a prepositional phrase, but instead indicated a passive form (by) or was part of a phrasal verb. Some verbs were mostly used adjectivally, in which case they were excluded from the transitivity counts. For voice, the required adjustments included cases of coordination of the past participle, when the verb was preceded by a conjunction or a comma.


Merlo and Stevenson Statistical Verb Classification

Table 6
Aggregated relative frequency data for the five features. E = unergatives, A = unaccusatives, O = object-drops.

           trans        pass         vbn          caus         anim
Class  N   Mean   SD    Mean   SD    Mean   SD    Mean   SD    Mean   SD
E      20   .23   .23    .07   .12    .21   .26    .00   .00    .25   .24
A      19   .40   .24    .33   .27    .65   .27    .12   .14    .07   .09
O      20   .62   .25    .31   .26    .65   .23    .04   .07    .15   .14

These were collected and classified by hand as passive or active, based on intuition. Similarly, partial adjustments to the VBN counts were made by hand.

For the causativity feature, subjects and objects were determined by manual inspection of the corpus for verbs belonging to group 1, while they were extracted automatically from the parsed corpus for group 2. The group 1 verbs were sampled in three ways, depending on total frequency. For verbs with less than 150 occurrences, all instances of the verbs were used for subject/object extraction. For verbs whose total frequency was greater than 150, but whose VBD frequency was in the range 100–200, we extracted subjects and objects of the VBD occurrences only. For higher frequency verbs, we used only the first 100 VBD occurrences.6 The same script for computing the overlap of the extracted subjects and objects was then used on the resulting subject/verb and verb/object tuples, for both group 1 and group 2 verbs.

The animacy feature was calculated over subject/verb tuples extracted automatically, for both groups of verbs, from the parsed corpus.

3.2 Data Analysis
The data collection described above yields the following data points in total: trans 27,403; pass 20,481; vbn 36,297; caus 11,307; anim 7,542. (Different features yield different totals because they were sampled independently, and the search patterns to extract some features are more imprecise than others.) The aggregate means by class of the normalized frequencies for all verbs are shown in Table 6; item-by-item distributions are provided in Appendix A, and raw counts are available from the authors. Note that aggregate means are shown for illustration purposes only; all machine learning experiments are performed on the individual normalized frequencies for each verb, as given in Appendix A.

The observed distributions of each feature are indeed roughly as expected according to the description in Section 2. Unergatives show a very low relative frequency of the trans feature, followed by unaccusatives, then object-drop verbs. Unaccusative verbs show a high frequency of the caus feature, and a low frequency of the anim feature, compared to the other classes. Somewhat unexpectedly, object-drop verbs exhibit a non-zero mean caus value (almost half the verbs have a caus value greater than zero), leading to a three-way causative distinction among the verb classes. We suspect that the approximation that we used for causative use, the overlap between subjects

6 For this last set of high-frequency verbs (exploded, jumped, opened, played, rushed), we used the first 100 occurrences as the simplest way to collect the sample. In response to an anonymous reviewer's concern, we later verified that these counts were not different from counts obtained by random sampling of 100 VBD occurrences. A paired t-test of the two sets of counts (first 100 sampling and random sampling) indicates that the two sets of counts are not statistically different (t = 1.283, DF = 4, p = 0.2687).


Table 7
Manually (Man) and automatically (Aut) calculated features for a random sample of verbs. T = trans, P = pass, V = vbn, C = caus, A = anim.

     Unergative              Unaccusative             Object-Drop
     hopped     scurried     folded     stabilized    inherited   swallowed
     Man  Aut   Man   Aut    Man  Aut   Man   Aut     Man   Aut   Man  Aut
T    .21  .21   .00   .00    .71  .23   .24   .18    1.00   .64   .96  .35
P    .00  .00   .00   .00    .44  .33   .19   .13     .39   .13   .54  .44
V    .03  .00   .10   .00    .56  .73   .71   .92     .56   .60   .64  .79
C    .00  .00   .00   .00    .54  .00   .24   .35     .00   .06   .00  .04
A    .93 1.00   .90   .14    .23  .00   .02   .00     .58   .32   .35  .22

and objects for a verb, also captures a "reciprocity" effect for some object-drop verbs (such as call), in which subjects and objects can be similar types of entities. Finally, although expected to be a redundant indicator of transitivity, pass and vbn, unlike trans, have very similar values for unaccusative and object-drop verbs, indicating that their distributions are sensitive to factors we have not yet investigated.

One issue we must address is how precisely the automatic counts reflect the actual linguistic behaviour of the verbs. That is, we must be assured that the patterns we note in the data in Table 6 are accurate reflections of the differential behaviour of the verb classes, and not an artifact of the way in which we estimate the features, or a result of inaccuracies in the counts. In order to evaluate the accuracy of our feature counts, we selected two verbs from each class and determined the "true" value of each feature for each of those six verbs through manual counting. The six verbs were randomly selected from the group 2 subset of the verbs, since counts for group 2 verbs (as explained above) had not undergone manual correction. This allows us to determine the accuracy of the fully automatic counting procedures. The selected verbs (and their frequencies) are: hopped (29), scurried (21), folded (189), stabilized (286), inherited (357), swallowed (152). For verbs that had a frequency of over 100 in the "-ed" form, we performed the manual counts on the first 100 occurrences.

Table 7 shows the results of the manual counts, reported as proportions to facilitate comparison to the normalized automatic counts shown in adjoining columns. We observe first that, overall, most errors in the automatic counts occur in the unaccusative and object-drop verbs. While tagging errors affect the vbn feature for all of the verbs somewhat, we note that trans and pass are consistently underestimated for unaccusative and object-drop verbs. These errors make the unaccusative and object-drop feature values more similar to each other, and therefore potentially harder to distinguish. Furthermore, because the trans and pass values are underestimated by the automatic counts, and therefore lower in value, they are also closer to the values for the unergative verbs. For the caus feature, we predict the highest values for the unaccusative verbs, and while that prediction is confirmed, the automatic counts for that class also show the most errors. Finally, although the general pattern of higher values for the anim feature of unergatives and object-drop verbs is preserved in the automatic counts, the feature is underestimated for almost all the verbs, again making the values for that feature closer across the classes than they are in reality.

We conclude that, although there are inaccuracies in all the counts, the general patterns expected based on our analysis of the verb classes hold in both the manual and automatic counts. Errors in the estimating and counting procedures are therefore


not likely to be responsible for the pattern of data in Table 6 above, which generally matches our predictions. Furthermore, the errors, at least for this random sample of verbs, occur in a direction that makes our task of distinguishing the classes more difficult, and indicates that developing more accurate search patterns may possibly sharpen the class distinctions and improve the classification performance.

4. Experiments in Classification

In this section, we turn to our computational experiments, which investigate whether the statistical indicators of thematic properties that we have developed can in fact be used to classify verbs. Recall that the task we have set ourselves is that of automatically learning the best class for a set of usages of a verb, as opposed to classifying individual occurrences of the verb. The frequency distributions of our features yield a vector for each verb that represents the estimated values for the verb on each dimension, across the entire corpus:

Vector template: [verb-name, trans, pass, vbn, caus, anim, class]

Example: [opened, .69, .09, .21, .16, .36, unacc]

The resulting set of 59 vectors constitutes the data for our machine learning experiments. We use this data to train an automatic classifier to determine, given the feature values for a new verb (not from the training set), which of the three major classes of English optionally intransitive verbs it belongs to.
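
As a toy illustration of this setup (not the paper's C5.0 system), a nearest-neighbour rule over such vectors can assign a class to an unseen verb. All feature values below are invented for illustration, except those of opened, which come from the example above:

```python
import math

# Toy stand-in for the classifier: 1-nearest-neighbour over the
# five-feature verb vectors [trans, pass, vbn, caus, anim].
TRAIN = [
    ("opened", (0.69, 0.09, 0.21, 0.16, 0.36), "unacc"),    # from the text
    ("jumped", (0.25, 0.05, 0.20, 0.00, 0.30), "unerg"),    # invented
    ("played", (0.60, 0.30, 0.65, 0.05, 0.15), "objdrop"),  # invented
]

def classify(vector):
    """Label an unseen verb vector with the class of its nearest
    training vector (Euclidean distance)."""
    _, _, label = min(TRAIN, key=lambda row: math.dist(vector, row[1]))
    return label

print(classify((0.65, 0.10, 0.25, 0.14, 0.33)))  # nearest to opened: unacc
```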

4.1 Experimental Methodology
In pilot experiments on a subset of the features, we investigated a number of supervised machine learning methods that produce automatic classifiers (decision tree induction, rule learning, and two types of neural networks), as well as hierarchical clustering; see Stevenson et al. (1999) for more detail. Because we achieved approximately the same level of performance in all cases, we narrowed our further experimentation to the publicly available version of the C5.0 machine learning system (http://www.rulequest.com), a newer version of C4.5 (Quinlan 1992), due to its ease of use and wide availability. The C5.0 system generates both decision trees and corresponding rule sets from a training set of known classifications. In our experiments, we found little to no difference in performance between the trees and rule sets, and report only the rule set results.

In the experiments below, we follow two methodologies in training and testing, each of which tests a subset of cases held out from the training data. Thus, in all cases, the results we report are on test data that was never seen in training.7

The first training and testing methodology we follow is 10-fold cross-validation. In this approach, the system randomly divides the data into ten parts, and runs ten times on a different 90%-training-data/10%-test-data split, yielding an average accuracy and standard error across the ten test sets. This training methodology is very useful for

7 One anonymous reviewer raised the concern that we do not test on verbs that were unseen by the authors prior to finalizing the specific features to count. However, this does not reduce the generality of our results. The features we use are motivated by linguistic theory, and derived from the set of thematic properties that discriminate the verb classes. It is therefore very unlikely that they are skewed to the particular verbs we have chosen. Furthermore, our cross-validation experiments, described in the next subsection, show that our results hold across a very large number of randomly selected subsets of this sample of verbs.


our application, as it yields performance measures across a large number of training data/test data sets, avoiding the problems of outliers in a single random selection from a relatively small data set such as ours.

The second methodology is a single hold-out training and testing approach. Here the system is run N times, where N is the size of the data set (i.e., the 59 verbs in our case), each time holding out a single data vector as the test case, and using the remaining N-1 vectors as the training set. The single hold-out methodology yields an overall accuracy rate (when the results are averaged across all N trials), but also, unlike cross-validation, gives us classification results on each individual data vector. This property enables us to analyze differential performance on the individual verbs, and across the different verb classes.
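
The single hold-out protocol can be sketched as follows; the train_classifier argument is a hypothetical stand-in for the learner:

```python
def leave_one_out(data, train_classifier):
    """Single hold-out (leave-one-out) evaluation: train on N-1 verb
    vectors, classify the held-out verb, and repeat for every verb.
    Returns overall accuracy and the per-verb assigned classes."""
    assigned = {}
    for i, (verb, vector, true_class) in enumerate(data):
        training = data[:i] + data[i + 1:]      # hold out one verb
        classifier = train_classifier(training)
        assigned[verb] = classifier(vector)
    correct = sum(assigned[v] == c for v, _, c in data)
    return correct / len(data), assigned
```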

Under both training and testing methodologies, the baseline (chance) performance in this task, a three-way classification, is 33.9%. In the single hold-out methodology, there are 59 test cases, with 20, 19, and 20 verbs each from the unergative, unaccusative, and object-drop classes, respectively. Chance performance of picking a single class label as a default and assigning it to all cases would yield at most 20 out of the 59 cases correct, or 33.9%. For the cross-validation methodology, the determination of a baseline is slightly more complex, as we are testing on a random selection of 10% of the full data set in each run. The 33.9% figure represents the expected relative proportion of a test set that would be labelled correctly by assignment of a default class label to the entire test set. Although the precise make-up of the test cases varies, on average the test set will represent the class membership proportions of the entire set of verbs. Thus, as with the single hold-out approach, chance accuracy corresponds to a maximum of 20/59, or 33.9%, of the test set being labelled correctly.

The theoretical maximum accuracy for the task is, of course, 100%, although in Section 5 we discuss some classification results from human experts that indicate that a more realistic expectation is much lower (around 87%).

4.2 Results Using 10-Fold Cross-Validation
We first report the results of experiments using a training methodology of 10-fold cross-validation, repeated 50 times. This means that the 10-fold cross-validation procedure is repeated for 50 different random divisions of the data. The numbers reported are the averages of the results over all the trials. That is, the average accuracy and standard error from each random division of the data (a single cross-validation run, including 10 training and test sets) are averaged across the 50 different random divisions. This large number of experimental trials gives us a very tight bound on the mean accuracy reported, enabling us to determine with high confidence the statistical significance of differences in results.
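
A sketch of the repeated cross-validation protocol; the evaluate argument, which trains on the training fold and returns accuracy on the test fold, is a hypothetical placeholder:

```python
import random

def repeated_cv(data, evaluate, folds=10, repeats=50, seed=0):
    """10-fold cross-validation repeated over `repeats` random
    divisions of the data; returns the grand mean accuracy."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)               # a new random division
        for k in range(folds):
            test = shuffled[k::folds]       # roughly one tenth of the data
            train = [d for d in shuffled if d not in test]
            accuracies.append(evaluate(train, test))
    return sum(accuracies) / len(accuracies)
```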

Table 8 shows that performance of classification using individual features varies greatly, from little above the baseline to almost 22% above the baseline, or a reduction of a third of the error rate, a very good result for a single feature. (All reported accuracies in Table 8 are statistically distinct at the p < .01 level, using an ANOVA [df = 249, F = 334.72] with a Tukey-Kramer post test.)

The first line of Table 9 shows that the combination of all features achieves an accuracy of 69.8%, which is 35.9% over the baseline, for a reduction in the error rate of 54%. This is a rather considerable result, given the very low baseline (33.9%). Moreover, recall that our training and testing sets are always disjoint (cf. Lapata and Brew [1999], Siegel [1999]); in other words, we are predicting the classification of verbs that were never seen in the training corpus, the hardest situation for a classification algorithm.

The second through sixth lines of Table 9 show the accuracy achieved on each subset of features that results from removing a single feature. This allows us to evaluate


Table 8
Percent accuracy and standard error of the verb classification task, using each feature individually, under a training methodology of 10-fold cross-validation repeated 50 times.

Feature   Accuracy   SE
caus        55.7     .1
vbn         52.5     .5
pass        50.2     .5
trans       47.1     .4
anim        35.3     .5

Table 9
Percent accuracy and standard error of the verb classification task, using features in combination, under a training methodology of 10-fold cross-validation repeated 50 times.

   Features Used              Feature Not Used   Accuracy   SE
1  trans pass vbn caus anim   (none)               69.8     .5
2  trans vbn caus anim        pass                 69.8     .5
3  trans pass vbn anim        caus                 67.3     .6
4  trans pass caus anim       vbn                  66.5     .5
5  trans pass vbn caus        anim                 63.2     .6
6  pass vbn caus anim         trans                61.6     .6

the contribution of each feature to the performance of the classification process, by comparing the performance of the subset without it to the performance using the full set of features. We see that the removal of pass (second line) has no effect on the results, while removal of the remaining features yields a 2–8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant at the p < .05 level, except for between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 37.52] with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted by the individual feature behavior. For example, caus, which is the best individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, anim and trans, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.

The general behaviour in classification based on individual features, and on size 4 and size 5 subsets of features, is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance; Table 10 summarizes this information. In the first data column, the table illustrates the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size n subsets in which each feature occurs. For example, the second data cell in the second row (54.9%) indicates the average accuracy of all subsets of size 2 that contain the feature vbn. The last row of the table indicates, for each feature, the average accuracy of all subsets containing that feature.


Table 10
Average percent accuracy of feature subsets, by subset size and by sets of each size including each feature.

Subset   Mean Accuracy    Mean Accuracy of Subsets that Include Each Feature
Size     by Subset Size   vbn     pass    trans   anim    caus
1            48.2         52.5    50.2    47.1    35.3    55.7
2            55.1         54.9    52.8    56.4    58.0    57.6
3            60.5         60.1    58.5    62.3    61.1    60.5
4            65.7         65.5    64.7    66.7    66.3    65.3
5            69.8         69.8    69.8    69.8    69.8    69.8

Mean Acc/
Feature                   60.6    59.2    60.5    58.1    61.8

The first observation, that more features perform better, is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that, on average, larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation, that the performance of individual features is not always a predictor of their performance in combination, is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above average performance for all subsets of size 2 or greater.

4.3 Results Using Single Hold-Out Methodology
One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system, but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case, and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case, and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb, and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, which we turn to in Section 4.4.

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance; in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being


Table 11
Percent accuracy of the verb classification task, using features in combination, under a single hold-out training methodology.

   Features Used              Feature Not Used   Accuracy on All Verbs
1  trans pass vbn caus anim   (none)                    69.5
2  trans vbn caus anim        pass                      71.2
3  trans pass vbn anim        caus                      62.7
4  trans pass caus anim       vbn                       61.0
5  trans pass vbn caus        anim                      61.0
6  pass vbn caus anim         trans                     64.4

Table 12
F score of classification within each class, under a single hold-out training methodology.

   Features Used              Feature    F score (%)   F score (%)   F score (%)
                              Not Used   for Unergs    for Unaccs    for Objdrops
1  trans pass vbn caus anim   (none)        73.9          68.6          64.9
2  trans vbn caus anim        pass          76.2          75.7          61.6
3  trans pass vbn anim        caus          65.1          60.0          62.8
4  trans pass caus anim       vbn           66.7          65.0          51.3
5  trans pass vbn caus        anim          72.7          47.0          60.0
6  pass vbn caus anim         trans         78.1          51.5          61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5–8% negative effect on performance, again similar to the cross-validation results. (Although note that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9%) than the unaccusative and object-drop verbs (F scores of 68.6% and 64.9%, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs; that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results, there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence, here we use the balanced F score, which calculates an overall measure of performance as 2PR / (P + R), in which P (precision) is truePositives / (truePositives + falsePositives), and R (recall) is truePositives / (truePositives + falseNegatives).
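
Concretely, the balanced F score described in this footnote can be computed as follows; the example counts are hypothetical:

```python
def f_score(tp, fp, fn):
    """Balanced F score: 2PR / (P + R), with
    P = tp / (tp + fp) and R = tp / (tp + fn)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-class counts: 17 verbs of the class found, 3 missed,
# 6 verbs of other classes mislabelled into it.
print(round(f_score(tp=17, fp=6, fn=3), 3))  # 0.791
```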


frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8), but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns, for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification
We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans, pass, vbn, caus, anim) with the class labels assigned using each size 4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature, in going from the four-feature to the five-feature set. Thus, we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2, and summarized here in Table 13.

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors, but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently "uncontrolled", and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.


Table 13
Expected class discriminations for each feature.

Feature          Expected Frequency Pattern

Transitivity     Unerg < Unacc < ObjDrop

Causativity      Unerg, ObjDrop < Unacc

Animacy          Unacc < Unerg, ObjDrop

Passive Voice    Unerg < Unacc < ObjDrop

VBN Tag          Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating number of errors in classification, by verb class, for the full set of five features compared to two of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                           Assigned Class
              All features      w/o caus        w/o anim
              E    A    O     E    A    O     E    A    O
Actual   E    -    1    2     -    4    2     -    2    2
Class    A    4    -    3     5    -    2     5    -    6
         O    5    3    -     4    5    -     3    5    -

Table 15
Confusion matrix indicating number of errors in classification, by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                              Assigned Class
              All features    w/o trans     w/o pass      w/o vbn
              E   A   O      E   A   O     E   A   O     E   A   O
Actual   E    -   1   2      -   2   2     -   1   3     -   1   5
Class    A    4   -   3      3   -   7     1   -   4     2   -   4
         O    5   3   -      2   5   -     5   3   -     4   6   -

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
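
This pairwise confusability measure can be read off a confusion matrix mechanically; the error counts below are those of the "all features" panel of Table 14:

```python
# errors[actual][assigned]: off-diagonal cells of the confusion matrix
# ("all features" panel of Table 14).
errors = {
    "E": {"A": 1, "O": 2},
    "A": {"E": 4, "O": 3},
    "O": {"E": 5, "A": 3},
}

def confusability(c1, c2):
    """Sum of the two cells opposite the diagonal for a pair of classes."""
    return errors[c1][c2] + errors[c2][c1]

print(confusability("E", "A"))  # 1 + 4 = 5
print(confusability("A", "O"))  # 3 + 3 = 6
print(confusability("E", "O"))  # 2 + 5 = 7
```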

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features


we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature, compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives, and between unaccusatives and object-drops, is 9 and 7 errors, respectively; but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature, compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors from 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated to trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First, consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans (that is, also making a three-way distinction among the classes), although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method in evaluating our approach to verb classification.


Merlo and Stevenson Statistical Verb Classification

Table 16
Percent agreement (Agr) and pairwise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

         Program       E1            E2            E3
         Agr    K      Agr    K     Agr    K      Agr    K
E1       59     .36
E2       68     .50    75     .59
E3       66     .49    70     .53   77     .66
Levin    69.5   .54    71     .56   86.5   .80    83     .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results, provided by the non-forced-choice task, where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index available from Chicago University Press. The verbs were to be classified into the three target classes (unergative, unaccusative, and object-drop), which were described in the instructions.10 (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html.)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training, which yields an accuracy of 69.5%, because those results provide the classification for each of the individual verbs.


an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55–57) (using the z distribution to determine significance; p < .0001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
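As a rough illustration, agreement corrected for chance can be sketched with the standard Cohen-style formulation, where expected chance agreement is computed from each annotator's label proportions. (The paper follows Klauer's [1987] pairwise calculation, which may differ in detail; the toy labels below are invented.)

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Agreement of two annotators corrected for chance, where chance
    is estimated from each annotator's own label proportions."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Toy example: two annotators labeling ten verbs with the three classes.
a = ["unerg", "unerg", "unacc", "objdrop", "unacc",
     "unerg", "objdrop", "unacc", "unerg", "objdrop"]
b = ["unerg", "unacc", "unacc", "objdrop", "unacc",
     "unerg", "unerg", "unacc", "unerg", "objdrop"]
print(round(cohen_kappa(a, b), 2))
```

Note how the same 80% observed agreement yields a lower Kappa when the annotators' label proportions make chance agreement high, which is exactly the point made in the text.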

Evaluating the experts' performance, summarized in Table 16, we can remark two things, which confirm our expectations. First, the task is difficult, i.e., not performed at 100% (or close) even by trained experts when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .60) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task, 86.5%, as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than one of the human experts in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.
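The approximately 68% figure can be reconstructed as the proportion of the achievable gain, from the 34% chance baseline to the 86.5% expert upper bound, that the program's 69.5% accuracy captures:

```python
def error_reduction(accuracy, baseline, upper_bound):
    """Proportion of the achievable gain (baseline -> upper bound)
    captured by a classifier."""
    return (accuracy - baseline) / (upper_bound - baseline)

# Figures from the paper: 34% chance baseline, 86.5% expert upper bound,
# 69.5% program accuracy under single hold-out training.
reduction = error_reduction(0.695, 0.34, 0.865)
print(f"error rate reduced over chance by {reduction:.0%}")  # ~68%
```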

One interesting question is whether experts and program disagree on the same verbs and show similar patterns of errors. The program makes 18 errors in total compared to the gold standard. However, in 9 cases at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3) and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This


clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1,000), and those with log frequency over 3 (i.e., frequency over 1,000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops); the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops); and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote, verbs with frequency less than 100 yield an accuracy of 92% (K = .84); verbs with frequency between 100 and 1,000, accuracy 80% (K = .69); and verbs with frequency over 1,000, accuracy 90% (K = .82). For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5% (K = .74); verbs with frequency between 100 and 1,000, accuracy 84% (K = .76); and verbs with frequency over 1,000, accuracy 90% (K = .82).
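The grouping by log frequency described above can be sketched as follows; the verb counts in the example are invented for illustration, and only the band boundaries come from the text.

```python
import math

def frequency_band(count):
    """Assign a verb to a low/mid/high band by log10 corpus frequency."""
    logf = math.log10(count)
    if logf < 2:      # frequency below 100
        return "low"
    elif logf < 3:    # frequency between 100 and 1,000
        return "mid"
    return "high"     # frequency above 1,000

# Hypothetical per-verb corpus counts; the bands mirror the grouping above.
counts = {"gallop": 72, "melt": 340, "play": 5400}
bands = {verb: frequency_band(c) for verb, c in counts.items()}
print(bands)
```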

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs taken from Levin's (1993) electronic index, as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = .64) with each other, and 86% (K = .80) and 69% (K = .57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = .56] with each other, and 68% [K = .55] and 60.5% [K = .46] with the gold standard, respectively.) These results show that the forced-choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally, in syntactic alternations, giving rise to differences in subcategorization frames, or the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification
Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs (their argument structure) are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989], Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics
Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a


"change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998; Siegel 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora
Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993], Sanfilippo and Poznanski [1992], Manning [1993], Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed which can work


on a rough collection of texts, and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and the work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multilingual texts. Like ours, their work goes beyond statistics over subcategorizations, to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.


Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45% with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
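The F-score calculated here follows the standard balanced F-measure applied to the reported precision and recall:

```python
def f_score(precision, recall, beta=1.0):
    """F-measure; beta = 1 weights precision and recall equally."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision 61% and recall 36%, as reported for the
# subcategorization-only clustering, give roughly 45%.
print(round(f_score(0.61, 0.36), 2))
```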

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.
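A minimal sketch of this kind of alternation test, assuming similarity of selectional preferences is measured by a symmetrized, smoothed relative entropy (not necessarily McCarthy's exact measure), with invented preference distributions over a few coarse semantic classes:

```python
import math

def kl(p, q):
    """Relative entropy D(p || q), in bits, over a shared outcome set."""
    return sum(p[c] * math.log(p[c] / q[c], 2) for c in p if p[c] > 0)

def slot_dissimilarity(pref_a, pref_b, alpha=0.99):
    """Symmetrized, smoothed KL; small values mean similar preferences."""
    def skew(p, q):
        mix = {c: alpha * q.get(c, 0) + (1 - alpha) * p[c] for c in p}
        return kl(p, mix)
    return 0.5 * (skew(pref_a, pref_b) + skew(pref_b, pref_a))

# Hypothetical preference distributions for the intransitive-subject and
# transitive-object slots of one verb, plus an unrelated slot distribution.
intrans_subj = {"animate": 0.2, "artifact": 0.5, "substance": 0.3}
trans_obj    = {"animate": 0.1, "artifact": 0.6, "substance": 0.3}
unrelated    = {"animate": 0.8, "artifact": 0.1, "substance": 0.1}

# Alternating slots should look more alike than non-alternating ones.
print(slot_dissimilarity(intrans_subj, trans_obj) <
      slot_dissimilarity(intrans_subj, unrelated))
```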

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition, we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and Lapata and Brew (1999) in attempting to learn semantic notions from distributions of

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.


indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature) or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations, across a text, between thematic arguments in different alternations (with our caus feature).

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification (i.e., sets of verbs that have the same multi-way classifications according to Levin [1993]) can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.

Acknowledgments

We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM), the United States NSF (grants 9702331 and 9818322 to SS), the Canadian NSERC (grant to SS), the University of Toronto, and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.


Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relations. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: An overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1–2):217–222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.


Pinker, Steven. 1989. Learnability and Cognition: the Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.


Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs
             Freq   vbn   pass  trans  caus  anim

Min Value       8   0.00  0.00  0.00   0.00  0.00
Max Value    4088   1.00  0.39  0.74   0.00  1.00

floated       176   0.43  0.26  0.74   0.00  0.17
hurried        86   0.40  0.31  0.50   0.00  0.37
jumped       4088   0.09  0.00  0.03   0.00  0.20
leaped        225   0.09  0.00  0.05   0.00  0.13
marched       238   0.10  0.01  0.09   0.00  0.12
paraded        33   0.73  0.39  0.46   0.00  0.50
raced         123   0.01  0.00  0.06   0.00  0.15
rushed        467   0.22  0.12  0.20   0.00  0.10
vaulted        54   0.00  0.00  0.41   0.00  0.03
wandered       67   0.02  0.00  0.03   0.00  0.32
galloped       12   1.00  0.00  0.00   0.00  0.00
glided         14   0.00  0.00  0.08   0.00  0.50
hiked          25   0.28  0.12  0.29   0.00  0.40
hopped         29   0.00  0.00  0.21   0.00  1.00
jogged          8   0.29  0.00  0.29   0.00  0.33
scooted        10   0.00  0.00  0.43   0.00  0.00
scurried       21   0.00  0.00  0.00   0.00  0.14
skipped        82   0.22  0.02  0.64   0.00  0.16
tiptoed        12   0.17  0.00  0.00   0.00  0.00
trotted        37   0.19  0.17  0.07   0.00  0.18

Unaccusative Verbs
             Freq   vbn   pass  trans  caus  anim

Min Value      13   0.16  0.00  0.02   0.00  0.00
Max Value    5543   0.95  0.80  0.76   0.41  0.36

boiled         58   0.92  0.70  0.42   0.00  0.00
cracked       175   0.61  0.19  0.76   0.02  0.14
dissolved     226   0.51  0.58  0.71   0.05  0.11
exploded      409   0.34  0.02  0.66   0.37  0.04
flooded       235   0.47  0.57  0.44   0.04  0.03
fractured      55   0.95  0.76  0.51   0.00  0.00
hardened      123   0.92  0.55  0.56   0.12  0.00
melted         70   0.80  0.44  0.02   0.00  0.19
opened       3412   0.21  0.09  0.69   0.16  0.36
solidified     34   0.65  0.21  0.68   0.00  0.12
collapsed     950   0.16  0.00  0.16   0.01  0.02
cooled        232   0.85  0.21  0.29   0.13  0.11
folded        189   0.73  0.33  0.23   0.00  0.00
widened      1155   0.18  0.02  0.13   0.41  0.01
changed      5543   0.73  0.23  0.47   0.22  0.08
cleared      1145   0.58  0.40  0.50   0.31  0.06
divided      1539   0.93  0.80  0.17   0.10  0.05
simmered       13   0.83  0.00  0.09   0.00  0.00
stabilized    286   0.92  0.13  0.18   0.35  0.00


Object-Drop Verbs
             Freq   vbn   pass  trans  caus  anim

Min Value      39   0.10  0.04  0.21   0.00  0.00
Max Value   15063   0.95  0.99  1.00   0.24  0.42

carved        185   0.85  0.66  0.98   0.00  0.00
danced         88   0.22  0.14  0.37   0.00  0.00
kicked        308   0.30  0.18  0.97   0.00  0.33
knitted        39   0.95  0.99  0.93   0.00  0.00
painted       506   0.72  0.18  0.71   0.00  0.38
played       2689   0.38  0.16  0.24   0.00  0.00
reaped        172   0.56  0.05  0.90   0.00  0.22
typed          57   0.81  0.74  0.81   0.00  0.00
washed        137   0.79  0.60  1.00   0.00  0.00
yelled         74   0.10  0.04  0.38   0.00  0.00
borrowed     1188   0.77  0.15  0.60   0.13  0.19
inherited     357   0.60  0.13  0.64   0.06  0.32
organized    1504   0.85  0.38  0.65   0.18  0.07
rented        232   0.72  0.22  0.61   0.00  0.42
sketched       44   0.67  0.17  0.44   0.00  0.20
cleaned       160   0.83  0.47  0.21   0.05  0.21
packed        376   0.84  0.12  0.40   0.05  0.19
studied       901   0.66  0.17  0.57   0.05  0.11
swallowed     152   0.79  0.44  0.35   0.04  0.22
called      15063   0.56  0.22  0.72   0.24  0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.

Accuracy  SE   Features                    Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim    57.3      0.5  trans caus
69.8      0.5  trans vbn caus anim         57.3      0.5  pass vbn anim
67.3      0.6  trans pass vbn anim         56.7      0.5  pass caus anim
66.7      0.5  trans vbn anim              55.7      0.5  vbn caus
66.5      0.5  trans pass caus anim        55.7      0.1  caus
64.4      0.5  trans vbn caus              55.4      0.4  pass caus
63.2      0.6  trans pass vbn caus         55.0      0.6  trans pass vbn
63.0      0.5  trans pass anim             54.7      0.4  trans pass
62.9      0.4  trans caus anim             54.2      0.5  trans vbn
62.1      0.5  caus anim                   52.5      0.5  vbn
61.7      0.5  trans pass caus             50.9      0.5  pass anim
61.6      0.6  pass vbn caus anim          50.2      0.6  pass vbn
60.1      0.4  vbn caus anim               50.2      0.5  pass
59.5      0.6  trans anim                  47.1      0.4  trans
59.4      0.5  vbn anim                    35.3      0.5  anim
57.4      0.6  pass vbn caus
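The overlap check described above can be sketched in a few lines of Python; this is a minimal illustration (the function name is ours), with the accuracy/SE pairs taken from the table:

```python
def significantly_different(acc1, se1, acc2, se2, t=2.01):
    """Test two accuracies for a significant difference at p < .05
    by checking whether their 95% confidence intervals overlap
    (df = 49, t = 2.01, as described above)."""
    lo1, hi1 = acc1 - t * se1, acc1 + t * se1
    lo2, hi2 = acc2 - t * se2, acc2 + t * se2
    # non-overlapping intervals indicate a significant difference
    return hi1 < lo2 or hi2 < lo1

# The two top-scoring feature sets are indistinguishable from each
# other, but distinguishable from trans-caus alone:
print(significantly_different(69.8, 0.5, 69.8, 0.5))  # False
print(significantly_different(69.8, 0.5, 57.3, 0.5))  # True
```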


subject and the object multisets; e.g., the overlap between {a, a, a, b} and {a} is {a, a, a}. The proportion is the ratio between the cardinality of the overlap multiset and the sum of the cardinality of the subject and object multisets. For example, for the simple sets of characters above, the ratio would be 3/5, yielding a value of 60% for the caus feature.
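As a rough sketch, the ratio can be computed as follows. The reading of "overlap" here (keeping the larger of the two counts for each noun type seen in both roles) is chosen to reproduce the worked example above; the function name is ours:

```python
from collections import Counter

def caus_feature(subjects, objects):
    # Multiset overlap between the subjects and objects of a verb,
    # normalized by the summed sizes of the two multisets.
    subj, obj = Counter(subjects), Counter(objects)
    # keep the larger count for each noun type occurring in both roles,
    # so that {a, a, a, b} vs. {a} yields the overlap {a, a, a}
    overlap = sum(max(subj[w], obj[w]) for w in subj.keys() & obj.keys())
    total = sum(subj.values()) + sum(obj.values())
    return overlap / total if total else 0.0

print(caus_feature(["a", "a", "a", "b"], ["a"]))  # 0.6
```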

° anim: A problem with a feature like animacy is that it requires either manual determination of the animacy of extracted subjects, or reference to an on-line resource such as WordNet for determining animacy. To approximate animacy with a feature that can be extracted automatically, and without reference to a resource external to the corpus, we take advantage of the well-attested animacy hierarchy, according to which pronouns are the most animate (Silverstein 1976; Dixon 1994). The hypothesis is that the words I, we, you, she, he, and they most often refer to animate entities. This hypothesis was confirmed by extracting 100 occurrences of the pronoun they, which can be either animate or inanimate, from our 65 million word corpus. The occurrences immediately preceded a verb. After eliminating repetitions, 94 occurrences were left, which were classified by hand, yielding 71 animate pronouns, 11 inanimate pronouns, and 12 unclassified occurrences (for lack of sufficient context to recover the antecedent of the pronoun with certainty). Thus, at least 76% of usages of they were animate; we assume the percentage of animate usages of the other pronouns to be even higher. Since the hypothesis was confirmed, we count pronouns (other than it) in subject position (Kariaeva [1999]; cf. Aone and McKee [1996]). The values for the feature were determined by automatically extracting all subject/verb tuples including our 59 example verbs from the parsed corpus, and computing the ratio of occurrences of pronoun subjects to all subjects for each verb.
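A minimal sketch of this ratio, assuming subject/verb tuples represented as (subject, verb) string pairs (the tuple format and names are ours, not from the paper):

```python
# Pronouns assumed animate, per the animacy hierarchy discussed above
# ("it" is deliberately excluded).
ANIMATE_PRONOUNS = {"i", "we", "you", "she", "he", "they"}

def anim_feature(subject_verb_pairs, verb):
    # Ratio of pronoun subjects to all subjects for the given verb.
    subjects = [s.lower() for s, v in subject_verb_pairs if v == verb]
    if not subjects:
        return 0.0
    return sum(s in ANIMATE_PRONOUNS for s in subjects) / len(subjects)
```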

Finally, as indicated in Table 5, the verbs are designated as belonging to "group 1" or "group 2". All the verbs are treated equally in our data analysis and in the machine learning experiments, but this designation does indicate a difference in details of the counting procedures described above. The verbs in group 1 had been used in an earlier study in which it was important to minimize noisy data (Stevenson and Merlo 1997a), so they generally underwent greater manual intervention in the counts. In adding group 2 for the classification experiment, we chose to minimize the intervention, in order to demonstrate that the classification process is robust enough to withstand the resulting noise in the data.

For group 2, the transitivity, voice, and VBN counts were done automatically, without any manual intervention. For group 1, these three counts were done automatically by regular expression patterns, and then subjected to correction, partly by hand and partly automatically, by one of the authors. For transitivity, the adjustments vary for the individual verbs. Most of the reassignments from a transitive to an intransitive labelling occurred when the following noun was not the direct object, but rather a measure phrase or a date. Most of the reassignments from intransitive to transitive occurred when a particle or a preposition following the verb did not introduce a prepositional phrase, but instead indicated a passive form (by) or was part of a phrasal verb. Some verbs were mostly used adjectivally, in which case they were excluded from the transitivity counts. For voice, the required adjustments included cases of coordination of the past participle, when the verb was preceded by a conjunction or a comma.


Table 6
Aggregated relative frequency data for the five features. E = unergatives, A = unaccusatives, O = object-drops.

               trans        pass         vbn          caus         anim
Class   N      Mean  SD     Mean  SD     Mean  SD     Mean  SD     Mean  SD

E       20     0.23  0.23   0.07  0.12   0.21  0.26   0.00  0.00   0.25  0.24
A       19     0.40  0.24   0.33  0.27   0.65  0.27   0.12  0.14   0.07  0.09
O       20     0.62  0.25   0.31  0.26   0.65  0.23   0.04  0.07   0.15  0.14

These were collected and classified by hand as passive or active, based on intuition. Similarly, partial adjustments to the VBN counts were made by hand.

For the causativity feature, subjects and objects were determined by manual inspection of the corpus for verbs belonging to group 1, while they were extracted automatically from the parsed corpus for group 2. The group 1 verbs were sampled in three ways, depending on total frequency. For verbs with less than 150 occurrences, all instances of the verbs were used for subject/object extraction. For verbs whose total frequency was greater than 150, but whose VBD frequency was in the range 100–200, we extracted subjects and objects of the VBD occurrences only. For higher frequency verbs, we used only the first 100 VBD occurrences.6 The same script for computing the overlap of the extracted subjects and objects was then used on the resulting subject/verb and verb/object tuples, for both group 1 and group 2 verbs.

The animacy feature was calculated over subject/verb tuples extracted automatically, for both groups of verbs, from the parsed corpus.

3.2 Data Analysis

The data collection described above yields the following data points in total: trans 27,403; pass 20,481; vbn 36,297; caus 11,307; anim 7,542. (Different features yield different totals because they were sampled independently, and the search patterns to extract some features are more imprecise than others.) The aggregate means by class of the normalized frequencies for all verbs are shown in Table 6; item by item distributions are provided in Appendix A, and raw counts are available from the authors. Note that aggregate means are shown for illustration purposes only; all machine learning experiments are performed on the individual normalized frequencies for each verb, as given in Appendix A.

The observed distributions of each feature are indeed roughly as expected, according to the description in Section 2. Unergatives show a very low relative frequency of the trans feature, followed by unaccusatives, then object-drop verbs. Unaccusative verbs show a high frequency of the caus feature, and a low frequency of the anim feature, compared to the other classes. Somewhat unexpectedly, object-drop verbs exhibit a non-zero mean caus value (almost half the verbs have a caus value greater than zero), leading to a three-way causative distinction among the verb classes. We suspect that the approximation that we used for causative use, the overlap between subjects

6 For this last set of high-frequency verbs (exploded, jumped, opened, played, rushed), we used the first 100 occurrences as the simplest way to collect the sample. In response to an anonymous reviewer's concern, we later verified that these counts were not different from counts obtained by random sampling of 100 VBD occurrences. A paired t-test of the two sets of counts (first 100 sampling and random sampling) indicates that the two sets of counts are not statistically different (t = 1.283, DF = 4, p = 0.2687).


Table 7
Manually (Man) and automatically (Aut) calculated features for a random sample of verbs. T = trans, P = pass, V = vbn, C = caus, A = anim.

       Unergative                Unaccusative                Object-Drop
       hopped       scurried     folded       stabilized     inherited    swallowed
       Man   Aut    Man   Aut    Man   Aut    Man   Aut      Man   Aut    Man   Aut

T      0.21  0.21   0.00  0.00   0.71  0.23   0.24  0.18     1.00  0.64   0.96  0.35
P      0.00  0.00   0.00  0.00   0.44  0.33   0.19  0.13     0.39  0.13   0.54  0.44
V      0.03  0.00   0.10  0.00   0.56  0.73   0.71  0.92     0.56  0.60   0.64  0.79
C      0.00  0.00   0.00  0.00   0.54  0.00   0.24  0.35     0.00  0.06   0.00  0.04
A      0.93  1.00   0.90  0.14   0.23  0.00   0.02  0.00     0.58  0.32   0.35  0.22

and objects for a verb, also captures a "reciprocity" effect for some object-drop verbs (such as call), in which subjects and objects can be similar types of entities. Finally, although expected to be a redundant indicator of transitivity, pass and vbn, unlike trans, have very similar values for unaccusative and object-drop verbs, indicating that their distributions are sensitive to factors we have not yet investigated.

One issue we must address is how precisely the automatic counts reflect the actual linguistic behaviour of the verbs. That is, we must be assured that the patterns we note in the data in Table 6 are accurate reflections of the differential behaviour of the verb classes, and not an artifact of the way in which we estimate the features, or a result of inaccuracies in the counts. In order to evaluate the accuracy of our feature counts, we selected two verbs from each class and determined the "true" value of each feature for each of those six verbs through manual counting. The six verbs were randomly selected from the group 2 subset of the verbs, since counts for group 2 verbs (as explained above) had not undergone manual correction. This allows us to determine the accuracy of the fully automatic counting procedures. The selected verbs (and their frequencies) are: hopped (29), scurried (21), folded (189), stabilized (286), inherited (357), swallowed (152). For verbs that had a frequency of over 100 in the "-ed" form, we performed the manual counts on the first 100 occurrences.

Table 7 shows the results of the manual counts, reported as proportions to facilitate comparison to the normalized automatic counts, shown in adjoining columns. We observe first that, overall, most errors in the automatic counts occur in the unaccusative and object-drop verbs. While tagging errors affect the vbn feature for all of the verbs somewhat, we note that trans and pass are consistently underestimated for unaccusative and object-drop verbs. These errors make the unaccusative and object-drop feature values more similar to each other, and therefore potentially harder to distinguish. Furthermore, because the trans and pass values are underestimated by the automatic counts, and therefore lower in value, they are also closer to the values for the unergative verbs. For the caus feature, we predict the highest values for the unaccusative verbs, and while that prediction is confirmed, the automatic counts for that class also show the most errors. Finally, although the general pattern of higher values for the anim feature of unergatives and object-drop verbs is preserved in the automatic counts, the feature is underestimated for almost all the verbs, again making the values for that feature closer across the classes than they are in reality.

We conclude that, although there are inaccuracies in all the counts, the general patterns expected based on our analysis of the verb classes hold in both the manual and automatic counts. Errors in the estimating and counting procedures are therefore not likely to be responsible for the pattern of data in Table 6 above, which generally matches our predictions. Furthermore, the errors, at least for this random sample of verbs, occur in a direction that makes our task of distinguishing the classes more difficult, and indicates that developing more accurate search patterns may possibly sharpen the class distinctions and improve the classification performance.

4. Experiments in Classification

In this section, we turn to our computational experiments, which investigate whether the statistical indicators of thematic properties that we have developed can in fact be used to classify verbs. Recall that the task we have set ourselves is that of automatically learning the best class for a set of usages of a verb, as opposed to classifying individual occurrences of the verb. The frequency distributions of our features yield a vector for each verb that represents the estimated values for the verb on each dimension, across the entire corpus.

Vector template: [verb-name, trans, pass, vbn, caus, anim, class]

Example: [opened, .69, .09, .21, .16, .36, unacc]

The resulting set of 59 vectors constitutes the data for our machine learning experiments. We use this data to train an automatic classifier to determine, given the feature values for a new verb (not from the training set), which of the three major classes of English optionally intransitive verbs it belongs to.

4.1 Experimental Methodology

In pilot experiments on a subset of the features, we investigated a number of supervised machine learning methods that produce automatic classifiers (decision tree induction, rule learning, and two types of neural networks), as well as hierarchical clustering; see Stevenson et al. (1999) for more detail. Because we achieved approximately the same level of performance in all cases, we narrowed our further experimentation to the publicly available version of the C5.0 machine learning system (http://www.rulequest.com), a newer version of C4.5 (Quinlan 1992), due to its ease of use and wide availability. The C5.0 system generates both decision trees and corresponding rule sets from a training set of known classifications. In our experiments, we found little to no difference in performance between the trees and rule sets, and report only the rule set results.

In the experiments below we follow two methodologies in training and testing, each of which tests a subset of cases held out from the training data. Thus, in all cases, the results we report are on test data that was never seen in training.7

The first training and testing methodology we follow is 10-fold cross-validation. In this approach, the system randomly divides the data into ten parts and runs ten times on a different 90% training data/10% test data split, yielding an average accuracy and standard error across the ten test sets. This training methodology is very useful for

7 One anonymous reviewer raised the concern that we do not test on verbs that were unseen by the authors prior to finalizing the specific features to count. However, this does not reduce the generality of our results. The features we use are motivated by linguistic theory and derived from the set of thematic properties that discriminate the verb classes. It is therefore very unlikely that they are skewed to the particular verbs we have chosen. Furthermore, our cross-validation experiments, described in the next subsection, show that our results hold across a very large number of randomly selected subsets of this sample of verbs.

Computational Linguistics, Volume 27, Number 3

our application, as it yields performance measures across a large number of training data/test data splits, avoiding the problems of outliers in a single random selection from a relatively small data set such as ours.
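The repeated cross-validation procedure can be sketched as follows. This is a generic harness, not the authors' C5.0 setup: train_and_test stands in for any supplied learner, and the majority-class learner shown is only the chance baseline discussed in the text.

```python
import random
from statistics import mean, stdev

def majority_baseline(train, test):
    # Chance-level learner: always predict the most frequent training class.
    labels = [label for _, label in train]
    default = max(set(labels), key=labels.count)
    return sum(label == default for _, label in test) / len(test)

def cross_validate(X, y, train_and_test, k=10, repeats=50, seed=0):
    # Repeated k-fold cross-validation: each repeat shuffles the data,
    # splits it into k folds, and averages test accuracy across the folds;
    # per-repeat accuracies are then summarized across the repeats.
    rng = random.Random(seed)
    indices = list(range(len(X)))
    repeat_accuracies = []
    for _ in range(repeats):
        rng.shuffle(indices)
        folds = [indices[i::k] for i in range(k)]
        fold_accuracies = []
        for fold in folds:
            test = [(X[i], y[i]) for i in fold]
            train = [(X[i], y[i]) for i in indices if i not in fold]
            fold_accuracies.append(train_and_test(train, test))
        repeat_accuracies.append(mean(fold_accuracies))
    return mean(repeat_accuracies), stdev(repeat_accuracies)
```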

The second methodology is a single hold-out training and testing approach. Here the system is run N times, where N is the size of the data set (i.e., the 59 verbs in our case), each time holding out a single data vector as the test case and using the remaining N-1 vectors as the training set. The single hold-out methodology yields an overall accuracy rate (when the results are averaged across all N trials), but also, unlike cross-validation, gives us classification results on each individual data vector. This property enables us to analyze differential performance on the individual verbs and across the different verb classes.
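A minimal sketch of the single hold-out (leave-one-out) procedure follows; the 1-nearest-neighbour classifier here is merely a stand-in for C5.0, which we do not reproduce.

```python
def nearest_neighbour(train, x):
    # Stand-in classifier: label of the closest training vector
    # (squared Euclidean distance).
    closest = min(train, key=lambda item: sum((a - b) ** 2
                                              for a, b in zip(item[0], x)))
    return closest[1]

def leave_one_out(X, y, train_and_classify):
    # Single hold-out: train on N-1 vectors, classify the held-out one,
    # and keep every individual prediction for per-verb/per-class analysis.
    predictions = []
    for i in range(len(X)):
        train = [(X[j], y[j]) for j in range(len(X)) if j != i]
        predictions.append(train_and_classify(train, X[i]))
    accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
    return predictions, accuracy
```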

Under both training and testing methodologies, the baseline (chance) performance in this task, a three-way classification, is 33.9%. In the single hold-out methodology, there are 59 test cases, with 20, 19, and 20 verbs each from the unergative, unaccusative, and object-drop classes, respectively. Chance performance of picking a single class label as a default and assigning it to all cases would yield at most 20 out of the 59 cases correct, or 33.9%. For the cross-validation methodology, the determination of a baseline is slightly more complex, as we are testing on a random selection of 10% of the full data set in each run. The 33.9% figure represents the expected relative proportion of a test set that would be labelled correctly by assignment of a default class label to the entire test set. Although the precise make-up of the test cases varies, on average the test set will represent the class membership proportions of the entire set of verbs. Thus, as with the single hold-out approach, chance accuracy corresponds to a maximum of 20/59, or 33.9%, of the test set being labelled correctly.

The theoretical maximum accuracy for the task is, of course, 100%, although in Section 5 we discuss some classification results from human experts that indicate that a more realistic expectation is much lower (around 87%).

4.2 Results Using 10-Fold Cross-Validation
We first report the results of experiments using a training methodology of 10-fold cross-validation repeated 50 times. This means that the 10-fold cross-validation procedure is repeated for 50 different random divisions of the data. The numbers reported are the averages of the results over all the trials. That is, the average accuracy and standard error from each random division of the data (a single cross-validation run, including 10 training and test sets) are averaged across the 50 different random divisions. This large number of experimental trials gives us a very tight bound on the mean accuracy reported, enabling us to determine with high confidence the statistical significance of differences in results.

Table 8 shows that performance of classification using individual features varies greatly, from little above the baseline to almost 22% above the baseline, the latter a reduction of a third of the error rate and a very good result for a single feature. (All reported accuracies in Table 8 are statistically distinct at the p < .01 level, using an ANOVA [df = 249, F = 334.72] with a Tukey-Kramer post test.)

The first line of Table 9 shows that the combination of all features achieves an accuracy of 69.8%, which is 35.9% over the baseline, for a reduction in the error rate of 54%. This is a rather considerable result given the very low baseline (33.9%). Moreover, recall that our training and testing sets are always disjoint (cf. Lapata and Brew [1999], Siegel [1999]); in other words, we are predicting the classification of verbs that were never seen in the training corpus, the hardest situation for a classification algorithm.

The second through sixth lines of Table 9 show the accuracy achieved on each subset of features that results from removing a single feature. This allows us to evaluate

Merlo and Stevenson, Statistical Verb Classification

Table 8
Percent accuracy and standard error of the verb classification task using each feature individually, under a training methodology of 10-fold cross-validation repeated 50 times.

Feature   Accuracy   SE
caus      55.7       .1
vbn       52.5       .5
pass      50.2       .5
trans     47.1       .4
anim      35.3       .5

Table 9
Percent accuracy and standard error of the verb classification task using features in combination, under a training methodology of 10-fold cross-validation repeated 50 times.

   Features Used              Feature Not Used   Accuracy   SE
1  trans pass vbn caus anim   -                  69.8       .5
2  trans vbn caus anim        pass               69.8       .5
3  trans pass vbn anim        caus               67.3       .6
4  trans pass caus anim       vbn                66.5       .5
5  trans pass vbn caus        anim               63.2       .6
6  pass vbn caus anim         trans              61.6       .6

the contribution of each feature to the performance of the classification process, by comparing the performance of the subset without it to the performance using the full set of features. We see that the removal of pass (second line) has no effect on the results, while removal of the remaining features yields a 2% to 8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant at the p < .05 level, except for those between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 37.52] with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted from the individual feature behavior. For example, caus, which is the best individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, anim and trans, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.

The general behaviour in classification based on individual features and on size 4 and size 5 subsets of features is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance; Table 10 summarizes this information. In the first data column, the table gives the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size n subsets in which each feature occurs. For example, the second data cell in the second row (54.9) indicates the average accuracy of all subsets of size 2 that contain the feature vbn. The last row of the table indicates, for each feature, the average accuracy of all subsets containing that feature.

Table 10
Average percent accuracy of feature subsets, by subset size and by sets of each size including each feature.

                            Mean Accuracy of Subsets that Include Each Feature
Subset   Mean Accuracy
Size     by Subset Size    vbn     pass    trans   anim    caus
1        48.2              52.5    50.2    47.1    35.3    55.7
2        55.1              54.9    52.8    56.4    58.0    57.6
3        60.5              60.1    58.5    62.3    61.1    60.5
4        65.7              65.5    64.7    66.7    66.3    65.3
5        69.8              69.8    69.8    69.8    69.8    69.8

Mean Acc/Feature           60.6    59.2    60.5    58.1    61.8

The first observation, that more features perform better, is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that on average larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation, that the performance of individual features is not always a predictor of their performance in combination, is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above average performance for all subsets of size 2 or greater.
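The two summaries in Table 10 can be computed by enumerating all feature subsets, along the following lines; accuracy_of stands in for a full training-and-testing run on a given feature combination.

```python
from itertools import combinations
from statistics import mean

FEATURES = ("trans", "pass", "vbn", "caus", "anim")

def average_accuracy_by_subset(accuracy_of):
    # For each subset size n, compute (a) the mean accuracy over all
    # subsets of that size and (b) the mean accuracy of the size-n subsets
    # containing each feature -- the two summaries reported in Table 10.
    by_size = {}
    by_size_and_feature = {}
    for n in range(1, len(FEATURES) + 1):
        accs = {s: accuracy_of(s) for s in combinations(FEATURES, n)}
        by_size[n] = mean(accs.values())
        for f in FEATURES:
            by_size_and_feature[(n, f)] = mean(
                a for s, a in accs.items() if f in s)
    return by_size, by_size_and_feature
```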

4.3 Results Using Single Hold-Out Methodology
One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system, but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, to which we turn in Section 4.4.

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance; in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being

Table 11
Percent accuracy of the verb classification task using features in combination, under a single hold-out training methodology.

   Features Used              Feature Not Used   Accuracy on All Verbs
1  trans pass vbn caus anim   -                  69.5
2  trans vbn caus anim        pass               71.2
3  trans pass vbn anim        caus               62.7
4  trans pass caus anim       vbn                61.0
5  trans pass vbn caus        anim               61.0
6  pass vbn caus anim         trans              64.4

Table 12
F score of classification within each class, under a single hold-out training methodology.

   Features Used              Feature    F score (%)   F score (%)   F score (%)
                              Not Used   for Unergs    for Unaccs    for Objdrops
1  trans pass vbn caus anim   -          73.9          68.6          64.9
2  trans vbn caus anim        pass       76.2          75.7          61.6
3  trans pass vbn anim        caus       65.1          60.0          62.8
4  trans pass caus anim       vbn        66.7          65.0          51.3
5  trans pass vbn caus        anim       72.7          47.0          60.0
6  pass vbn caus anim         trans      78.1          51.5          61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5% to 8% negative effect on performance, again similar to the cross-validation results. (Note, though, that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9) than the unaccusative and object-drop verbs (F scores of 68.6 and 64.9, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs, that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence, here we use the balanced F score, which calculates an overall measure of performance as 2PR / (P + R), in which P (precision) is truePositives / (truePositives + falsePositives) and R (recall) is truePositives / (truePositives + falseNegatives).
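The balanced F score defined in this footnote can be computed per class as follows (a direct transcription of the formulas; the function itself is ours):

```python
def f_score(gold, predicted, cls):
    # Balanced F score for one class: P = TP/(TP+FP), R = TP/(TP+FN),
    # F = 2PR/(P+R).
    pairs = list(zip(gold, predicted))
    tp = sum(g == cls and p == cls for g, p in pairs)
    fp = sum(g != cls and p == cls for g, p in pairs)
    fn = sum(g == cls and p != cls for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```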

frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8) but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns, for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification
We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans, pass, vbn, caus, anim) with the class labels assigned using each size 4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature in going from the four-feature to the five-feature set. Thus we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2 and summarized here in Table 13.

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently "uncontrolled", and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.

Table 13
Expected class discriminations for each feature.

Feature          Expected Frequency Pattern
Transitivity     Unerg < Unacc < ObjDrop
Causativity      Unerg, ObjDrop < Unacc
Animacy          Unacc < Unerg, ObjDrop
Passive Voice    Unerg < Unacc < ObjDrop
VBN Tag          Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features compared to two of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                        Assigned Class
            All features     w/o caus       w/o anim
            E   A   O       E   A   O      E   A   O
Actual  E   -   1   2       -   4   2      -   2   2
Class   A   4   -   3       5   -   2      5   -   6
        O   5   3   -       4   5   -      3   5   -

Table 15
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                        Assigned Class
            All features    w/o trans      w/o pass       w/o vbn
            E   A   O      E   A   O      E   A   O      E   A   O
Actual  E   -   1   2      -   2   2      -   1   3      -   1   5
Class   A   4   -   3      3   -   7      1   -   4      2   -   4
        O   5   3   -      2   5   -      5   3   -      4   6   -

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
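The confusability measure described above amounts to summing the two off-diagonal cells for a pair of classes. A short sketch, using the "all features" panel of Table 14 as data:

```python
def confusability(errors, c1, c2):
    # Errors in the two cells opposite the diagonal: verbs of c1 assigned
    # c2, plus verbs of c2 assigned c1.
    return errors.get((c1, c2), 0) + errors.get((c2, c1), 0)

# The "all features" panel of Table 14, as (actual, assigned) -> error count.
all_features = {("E", "A"): 1, ("E", "O"): 2,
                ("A", "E"): 4, ("A", "O"): 3,
                ("O", "E"): 5, ("O", "A"): 3}
```

For instance, confusability(all_features, "A", "O") gives the 6 unaccusative/object-drop errors cited in the discussion of Table 14.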

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features

we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives and between unaccusatives and object-drops is 9 and 7 errors, respectively, but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors from 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated with trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First, consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans, that is, also making a three-way distinction among the classes, although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method in evaluating our approach to verb classification.

Table 16
Percent agreement (Agr) and pair-wise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

         Program       E1            E2            E3
         Agr    K      Agr    K      Agr    K      Agr    K
E1       59     .36
E2       68     .50    75     .59
E3       66     .49    70     .53    77     .66
Levin    69.5   .54    71     .56    86.5   .80    83     .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results, provided by the non-forced-choice task, where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from Chicago University Press. The verbs were to be classified into the three target classes (unergative, unaccusative, and object-drop), which were described in the instructions.10 (All materials and instructions are available at http://www.latl.unige.ch/latl/personal/paola.html.)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training (which yields an accuracy of 69.5%), because those results provide the classification for each of the individual verbs.

an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55-57), using the z distribution to determine significance (p < .0001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
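As an illustration of the statistic (not Klauer's exact calculation, which we do not reproduce), the standard two-rater Kappa with chance agreement derived from each rater's own label proportions can be computed as:

```python
def cohens_kappa(labels1, labels2):
    # K = (Po - Pe) / (1 - Pe): observed agreement Po corrected by the
    # chance agreement Pe implied by each annotator's own label
    # proportions, so Pe varies with how often each category is used.
    n = len(labels1)
    observed = sum(a == b for a, b in zip(labels1, labels2)) / n
    classes = set(labels1) | set(labels2)
    expected = sum((labels1.count(c) / n) * (labels2.count(c) / n)
                   for c in classes)
    return (observed - expected) / (1 - expected)
```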

Evaluating the experts' performance, summarized in Table 16, we can remark two things, which confirm our expectations. First, the task is difficult, i.e., not performed at 100% (or close) even by trained experts when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, with K ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .6) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, creating a new classification by majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
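The majority-vote combination described above can be sketched as follows; the fallback index selects the most accurate expert's label when all three experts disagree.

```python
from collections import Counter

def majority_vote(expert_labels, fallback):
    # expert_labels: one (label, label, label) tuple per verb.
    # Take any label chosen by at least two experts; otherwise fall back
    # to the label of the expert at index `fallback`.
    combined = []
    for votes in expert_labels:
        label, count = Counter(votes).most_common(1)[0]
        combined.append(label if count >= 2 else votes[fallback])
    return combined
```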

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task, 86.5%, as the maximum achievable accuracy in classification, our algorithm reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than that of one of the human experts in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.

One interesting question is whether experts and program disagree on the same verbs and show similar patterns of errors. The program makes 18 errors in total compared to the gold standard. However, in 9 cases at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3) and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This

clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1,000), and those with log frequency over 3 (i.e., frequency over 1,000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops), the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops), and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote, verbs with frequency less than 100 yield an accuracy of 92% (K = .84), verbs with frequency between 100 and 1,000 an accuracy of 80% (K = .69), and verbs with frequency over 1,000 an accuracy of 90% (K = .82). For the "best" expert the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5% (K = .74), verbs with frequency between 100 and 1,000 an accuracy of 84% (K = .76), and verbs with frequency over 1,000 an accuracy of 90% (K = .82).
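The three-way grouping can be sketched as follows (an illustrative reconstruction of the binning described above; the band names are ours):

```python
import math

def freq_band(freq):
    """Assign a verb to a log-frequency band: below 100, between
    100 and 1,000, or above 1,000 occurrences."""
    lf = math.log10(freq)
    if lf < 2:        # frequency below 100
        return "low"
    elif lf <= 3:     # frequency between 100 and 1,000
        return "mid"
    return "high"     # frequency above 1,000

print(freq_band(85), freq_band(400), freq_band(2500))  # low mid high
```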

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest-frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs taken from Levin's (1993) electronic index, as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = .64) with each other, and 86% (K = .80) and 69% (K = .57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = .56] with each other, and 68% [K = .55] and 60.5% [K = .46] with the gold standard, respectively.) These results show that the forced-choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for the computational experiments.
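The pairwise agreement figures throughout this section are reported with the kappa statistic (Carletta 1996). A minimal sketch of kappa for two annotators over the same items, K = (p_o - p_e) / (1 - p_e):

```python
from collections import Counter

def cohen_kappa(a, b):
    """Kappa for two annotators: observed agreement p_o corrected
    by the agreement p_e expected from the label distributions."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# toy example with hypothetical labels
print(round(cohen_kappa(["u", "u", "o", "o"], ["u", "o", "o", "o"]), 2))  # 0.5
```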

6. Discussion

The work presented here contributes to some central issues in computational linguistics by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally in syntactic alternations, giving rise to differences in subcategorization frames, or in the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs (their argument structure) are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989] and Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy of our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a "change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998; Siegel 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to usefully predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993]; Sanfilippo and Poznanski [1992]; Manning [1993]; Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989) or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed which can work on a rough collection of texts, and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and the work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multi-lingual texts. Like ours, their work goes beyond statistics over subcategorizations to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.


Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45% with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
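The F-score is the harmonic mean of precision and recall, which for the reported 61% precision and 36% recall indeed rounds to 45%:

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Schulte im Walde (2000): precision 61%, recall 36%
print(round(f_score(0.61, 0.36), 2))  # 0.45
```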

McCarthy (2000) proposes a method to identify diathesis alternations After learn-ing subcategorization frames based on a parsed corpus selectional preferences areacquired for slots of the subcategorization frames using probability distributions overWordnet classes Alternations are detected by testing the hypothesis that given anyverb the selectional preferences for arguments occurring in alternating slots will bemore similar to each other than those for slots that do not alternate For instance givena verb participating in the causative alternation its selectional preferences for the sub-ject in an intransitive use and for the object in a transitive use will be more similarto each other than the selectional preferences for these two slots of a verb that doesnot participate in the causative alternation This method achieves the best accuracyfor the causative and the conative alternations (73 and 83 respectively) despitesparseness of data McCarthy reports that a simpler measure of selectional preferencesbased simply on head words yields a lower 63 accuracy Since this latter measureis very similar to our caus feature we think that our results would also improve byadopting a similar method of abstracting from head words to classes
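The head-word variant of this idea can be sketched roughly as follows. The verb, nouns, counts, and the similarity measure (cosine over count vectors) are all invented stand-ins for illustration, not McCarthy's actual procedure:

```python
from collections import Counter
from math import sqrt

def cosine(c1, c2):
    """Cosine similarity between two head-word count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# hypothetical head-word counts for the two alternating slots of "break":
# subject of intransitive uses vs. object of transitive uses
intrans_subj = Counter({"window": 5, "vase": 3, "glass": 2})
trans_obj = Counter({"window": 4, "glass": 3, "law": 1})

# high similarity between the slots suggests the verb alternates
print(cosine(intrans_subj, trans_obj) > 0.5)  # True
```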

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and of Lapata and Brew (1999) in attempting to learn semantic notions from distributions of indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and of animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to the external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at the relative frequency of subcategorization frames (as with our trans feature), or the relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).
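To illustrate how the five features (trans, vbn, pass, caus, anim) can drive a three-way classification, here is a toy sketch using a stand-in learner (nearest centroid); the feature values and centroids are invented for illustration and are not the trained model from the paper:

```python
def classify(verb_vec, centroids):
    """Assign a verb's five-dimensional feature vector
    (trans, vbn, pass, caus, anim) to the nearest class centroid."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(centroids, key=lambda c: sq_dist(verb_vec, centroids[c]))

# invented per-class mean feature values, for illustration only
centroids = {
    "unergative":   (0.25, 0.20, 0.10, 0.10, 0.60),
    "unaccusative": (0.40, 0.35, 0.30, 0.40, 0.25),
    "object-drop":  (0.65, 0.45, 0.40, 0.15, 0.40),
}

print(classify((0.30, 0.22, 0.12, 0.12, 0.55), centroids))  # unergative
```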

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful, and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept which proposes that "regular" ambiguity in classification, i.e., sets of verbs that have the same multi-way classifications according to Levin (1993), can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction, or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems which go beyond the simplest assumptions, and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and a learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated: in our approach it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered) using only five features that are readily extractable from automatically annotated corpora.
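Both reductions follow directly from the reported accuracy (69.8%), chance baseline (33.9%), and expert upper bound (86.5%):

```python
def error_reduction(accuracy, baseline, ceiling=1.0):
    """Relative reduction in error over chance, optionally measured
    against an expert-based ceiling instead of perfect accuracy."""
    return (accuracy - baseline) / (ceiling - baseline)

print(round(error_reduction(0.698, 0.339), 2))         # 0.54
print(round(error_reduction(0.698, 0.339, 0.865), 2))  # 0.68
```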

Acknowledgments

We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM), the United States NSF (grants 9702331 and 9818322 to SS), the Canadian NSERC (grant to SS), the University of Toronto, and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions, and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: the Kappa statistic. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.


Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relations. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy Diana and Anna Korhonen 1998Detecting verbal participation in diathesisalternations In Proceedings of the 36thAnnual Meeting of the ACL and the 17thInternational Conference on ComputationalLinguistics (COLING-ACL rsquo98)pages 1493ndash1495 Montreal Universit Acirce deMontreal

Merlo Paola and Suzanne Stevenson 1998What grammars tell us about corpora thecase of reduced relative clauses InProceedings of the Sixth Workshop on VeryLarge Corpora pages 134ndash142 Montreal

Merlo Paola and Suzanne Stevenson 2000aEstablishing the upper-bound andinter-judge agreement in a verbclassication task In Second InternationalConference on Language Resources andEvaluation (LREC-2000) volume 3pages 1659ndash1664

Merlo Paola and Suzanne Stevenson 2000bLexical syntax and parsing architectureIn Matthew Crocker Martin Pickeringand Charles Clifton editors Architecturesand Mechanisms for Language ProcessingCambridge University Press Cambridgepages 161ndash188

Moravcsik Edith and Jessica Wirth 1983Markednessmdashan Overview In FredEckman Edith Moravcsik and JessicaWirth editors Markedness Plenum PressNew York NY pages 1ndash13

Oishi Akira and Yuji Matsumoto 1997Detecting the organization of semanticsubclasses of Japanese verbs InternationalJournal of Corpus Linguistics 2(1)65ndash89

Palmer Martha 2000 Consistent criteria forsense distinctions Special Issue ofComputers and the HumanitiesSENSEVAL98 Evaluating Word SenseDisambiguation Systems 34(1ndash2)217ndash222

Perlmutter David 1978 Impersonalpassives and the unaccusative hypothesisIn Proceedings of the Annual Meeting of theBerkeley Linguistics Society volume 4pages 157ndash189


Computational Linguistics Volume 27 Number 3

Pinker, Steven. 1989. Learnability and Cognition: the Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.


Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs

Verb        Freq   vbn   pass  trans  caus  anim

Min Value      8   0.00  0.00  0.00   0.00  0.00
Max Value   4088   1.00  0.39  0.74   0.00  1.00

floated      176   0.43  0.26  0.74   0.00  0.17
hurried       86   0.40  0.31  0.50   0.00  0.37
jumped      4088   0.09  0.00  0.03   0.00  0.20
leaped       225   0.09  0.00  0.05   0.00  0.13
marched      238   0.10  0.01  0.09   0.00  0.12
paraded       33   0.73  0.39  0.46   0.00  0.50
raced        123   0.01  0.00  0.06   0.00  0.15
rushed       467   0.22  0.12  0.20   0.00  0.10
vaulted       54   0.00  0.00  0.41   0.00  0.03
wandered      67   0.02  0.00  0.03   0.00  0.32
galloped      12   1.00  0.00  0.00   0.00  0.00
glided        14   0.00  0.00  0.08   0.00  0.50
hiked         25   0.28  0.12  0.29   0.00  0.40
hopped        29   0.00  0.00  0.21   0.00  1.00
jogged         8   0.29  0.00  0.29   0.00  0.33
scooted       10   0.00  0.00  0.43   0.00  0.00
scurried      21   0.00  0.00  0.00   0.00  0.14
skipped       82   0.22  0.02  0.64   0.00  0.16
tiptoed       12   0.17  0.00  0.00   0.00  0.00
trotted       37   0.19  0.17  0.07   0.00  0.18

Unaccusative Verbs

Verb        Freq   vbn   pass  trans  caus  anim

Min Value     13   0.16  0.00  0.02   0.00  0.00
Max Value   5543   0.95  0.80  0.76   0.41  0.36

boiled        58   0.92  0.70  0.42   0.00  0.00
cracked      175   0.61  0.19  0.76   0.02  0.14
dissolved    226   0.51  0.58  0.71   0.05  0.11
exploded     409   0.34  0.02  0.66   0.37  0.04
flooded      235   0.47  0.57  0.44   0.04  0.03
fractured     55   0.95  0.76  0.51   0.00  0.00
hardened     123   0.92  0.55  0.56   0.12  0.00
melted        70   0.80  0.44  0.02   0.00  0.19
opened      3412   0.21  0.09  0.69   0.16  0.36
solidified    34   0.65  0.21  0.68   0.00  0.12
collapsed    950   0.16  0.00  0.16   0.01  0.02
cooled       232   0.85  0.21  0.29   0.13  0.11
folded       189   0.73  0.33  0.23   0.00  0.00
widened     1155   0.18  0.02  0.13   0.41  0.01
changed     5543   0.73  0.23  0.47   0.22  0.08
cleared     1145   0.58  0.40  0.50   0.31  0.06
divided     1539   0.93  0.80  0.17   0.10  0.05
simmered      13   0.83  0.00  0.09   0.00  0.00
stabilized   286   0.92  0.13  0.18   0.35  0.00


Object-Drop Verbs

Verb        Freq   vbn   pass  trans  caus  anim

Min Value     39   0.10  0.04  0.21   0.00  0.00
Max Value  15063   0.95  0.99  1.00   0.24  0.42

carved       185   0.85  0.66  0.98   0.00  0.00
danced        88   0.22  0.14  0.37   0.00  0.00
kicked       308   0.30  0.18  0.97   0.00  0.33
knitted       39   0.95  0.99  0.93   0.00  0.00
painted      506   0.72  0.18  0.71   0.00  0.38
played      2689   0.38  0.16  0.24   0.00  0.00
reaped       172   0.56  0.05  0.90   0.00  0.22
typed         57   0.81  0.74  0.81   0.00  0.00
washed       137   0.79  0.60  1.00   0.00  0.00
yelled        74   0.10  0.04  0.38   0.00  0.00
borrowed    1188   0.77  0.15  0.60   0.13  0.19
inherited    357   0.60  0.13  0.64   0.06  0.32
organized   1504   0.85  0.38  0.65   0.18  0.07
rented       232   0.72  0.22  0.61   0.00  0.42
sketched      44   0.67  0.17  0.44   0.00  0.20
cleaned      160   0.83  0.47  0.21   0.05  0.21
packed       376   0.84  0.12  0.40   0.05  0.19
studied      901   0.66  0.17  0.57   0.05  0.11
swallowed    152   0.79  0.44  0.35   0.04  0.22
called     15063   0.56  0.22  0.72   0.24  0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.

Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim
69.8      0.5  trans vbn caus anim
67.3      0.6  trans pass vbn anim
66.7      0.5  trans vbn anim
66.5      0.5  trans pass caus anim
64.4      0.5  trans vbn caus
63.2      0.6  trans pass vbn caus
63.0      0.5  trans pass anim
62.9      0.4  trans caus anim
62.1      0.5  caus anim
61.7      0.5  trans pass caus
61.6      0.6  pass vbn caus anim
60.1      0.4  vbn caus anim
59.5      0.6  trans anim
59.4      0.5  vbn anim
57.4      0.6  pass vbn caus
57.3      0.5  trans caus
57.3      0.5  pass vbn anim
56.7      0.5  pass caus anim
55.7      0.5  vbn caus
55.7      0.1  caus
55.4      0.4  pass caus
55.0      0.6  trans pass vbn
54.7      0.4  trans pass
54.2      0.5  trans vbn
52.5      0.5  vbn
50.9      0.5  pass anim
50.2      0.6  pass vbn
50.2      0.5  pass
47.1      0.4  trans
35.3      0.5  anim
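The overlap test described above can be sketched in a few lines of Python; the function names are illustrative, not from the paper:

```python
# Significance check from Appendix B: two accuracies differ at p < .05
# only if their 95% confidence ranges (accuracy +/- 2.01 * SE) do not overlap.
T_CRIT = 2.01  # two-tailed t value for df = 49 at the 95% level

def confidence_interval(accuracy, se, t=T_CRIT):
    """Return the (low, high) 95% confidence range for an accuracy."""
    return accuracy - t * se, accuracy + t * se

def significantly_different(acc1, se1, acc2, se2, t=T_CRIT):
    """True if the two 95% confidence ranges do not overlap."""
    lo1, hi1 = confidence_interval(acc1, se1, t)
    lo2, hi2 = confidence_interval(acc2, se2, t)
    return hi1 < lo2 or hi2 < lo1

# Full feature set (69.8 +/- 2.01*0.5) vs. anim alone (35.3 +/- 2.01*0.5):
print(significantly_different(69.8, 0.5, 35.3, 0.5))  # -> True
# 66.7 (0.5) vs. 66.5 (0.5): the ranges overlap, so not significant:
print(significantly_different(66.7, 0.5, 66.5, 0.5))  # -> False
```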


Table 6
Aggregated relative frequency data for the five features. E = unergatives, A = unaccusatives, O = object-drops.

              trans        pass         vbn          caus         anim

Class  N   Mean  SD    Mean  SD    Mean  SD    Mean  SD    Mean  SD

E     20   0.23  0.23  0.07  0.12  0.21  0.26  0.00  0.00  0.25  0.24
A     19   0.40  0.24  0.33  0.27  0.65  0.27  0.12  0.14  0.07  0.09
O     20   0.62  0.25  0.31  0.26  0.65  0.23  0.04  0.07  0.15  0.14

These were collected and classified by hand as passive or active, based on intuition. Similarly, partial adjustments to the VBN counts were made by hand.

For the causativity feature, subjects and objects were determined by manual inspection of the corpus for verbs belonging to group 1, while they were extracted automatically from the parsed corpus for group 2. The group 1 verbs were sampled in three ways, depending on total frequency. For verbs with less than 150 occurrences, all instances of the verbs were used for subject/object extraction. For verbs whose total frequency was greater than 150 but whose VBD frequency was in the range 100–200, we extracted subjects and objects of the VBD occurrences only. For higher frequency verbs, we used only the first 100 VBD occurrences.6 The same script for computing the overlap of the extracted subjects and objects was then used on the resulting subject/verb and verb/object tuples for both group 1 and group 2 verbs.
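The text describes the caus estimate only as the overlap of a verb's extracted subjects and objects. The sketch below shows one plausible way such an overlap script could work; the particular ratio (overlapping tokens over all subject and object tokens) and the function name are assumptions for illustration, not the authors' actual code:

```python
# Hypothetical sketch of a subject/object overlap measure for the caus
# feature: how often does the same head noun appear in both the subject
# and the object slot of a verb? (Assumed formula, not the paper's script.)
from collections import Counter

def overlap_score(subjects, objects):
    """Proportion of subject/object tokens whose head noun occurs in both
    the subject and the object slot of the verb."""
    subj_counts = Counter(subjects)
    obj_counts = Counter(objects)
    shared = set(subj_counts) & set(obj_counts)
    overlapping = sum(subj_counts[w] + obj_counts[w] for w in shared)
    total = sum(subj_counts.values()) + sum(obj_counts.values())
    return overlapping / total if total else 0.0

# Toy example: "the sun melted the ice" alternating with "the ice melted".
subjects = ["sun", "ice", "butter"]
objects = ["ice", "butter"]
print(overlap_score(subjects, objects))  # 4 of 5 tokens overlap -> 0.8
```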

The animacy feature was calculated over subject/verb tuples extracted automatically, for both groups of verbs, from the parsed corpus.

3.2 Data Analysis

The data collection described above yields the following data points in total: trans 27,403; pass 20,481; vbn 36,297; caus 11,307; anim 7,542. (Different features yield different totals because they were sampled independently, and the search patterns to extract some features are more imprecise than others.) The aggregate means by class of the normalized frequencies for all verbs are shown in Table 6; item by item distributions are provided in Appendix A, and raw counts are available from the authors. Note that aggregate means are shown for illustration purposes only: all machine learning experiments are performed on the individual normalized frequencies for each verb, as given in Appendix A.

The observed distributions of each feature are indeed roughly as expected according to the description in Section 2. Unergatives show a very low relative frequency of the trans feature, followed by unaccusatives, then object-drop verbs. Unaccusative verbs show a high frequency of the caus feature and a low frequency of the anim feature compared to the other classes. Somewhat unexpectedly, object-drop verbs exhibit a non-zero mean caus value (almost half the verbs have a caus value greater than zero), leading to a three-way causative distinction among the verb classes. We suspect that the approximation that we used for causative use, the overlap between subjects

6 For this last set of high-frequency verbs (exploded, jumped, opened, played, rushed) we used the first 100 occurrences as the simplest way to collect the sample. In response to an anonymous reviewer's concern, we later verified that these counts were not different from counts obtained by random sampling of 100 VBD occurrences. A paired t-test of the two sets of counts (first-100 sampling and random sampling) indicates that the two sets of counts are not statistically different (t = 1.283, DF = 4, p = 0.2687).


Table 7
Manually (Man) and automatically (Aut) calculated features for a random sample of verbs. T = trans, P = pass, V = vbn, C = caus, A = anim.

        Unergative                Unaccusative              Object-Drop

     hopped       scurried     folded      stabilized    inherited    swallowed

     Man   Aut    Man   Aut    Man   Aut   Man   Aut     Man   Aut    Man   Aut

T   0.21  0.21   0.00  0.00   0.71  0.23  0.24  0.18    1.00  0.64   0.96  0.35
P   0.00  0.00   0.00  0.00   0.44  0.33  0.19  0.13    0.39  0.13   0.54  0.44
V   0.03  0.00   0.10  0.00   0.56  0.73  0.71  0.92    0.56  0.60   0.64  0.79
C   0.00  0.00   0.00  0.00   0.54  0.00  0.24  0.35    0.00  0.06   0.00  0.04
A   0.93  1.00   0.90  0.14   0.23  0.00  0.02  0.00    0.58  0.32   0.35  0.22

and objects for a verb, also captures a "reciprocity" effect for some object-drop verbs (such as call), in which subjects and objects can be similar types of entities. Finally, although expected to be redundant indicators of transitivity, pass and vbn, unlike trans, have very similar values for unaccusative and object-drop verbs, indicating that their distributions are sensitive to factors we have not yet investigated.

One issue we must address is how precisely the automatic counts reflect the actual linguistic behaviour of the verbs. That is, we must be assured that the patterns we note in the data in Table 6 are accurate reflections of the differential behaviour of the verb classes, and not an artifact of the way in which we estimate the features, or a result of inaccuracies in the counts. In order to evaluate the accuracy of our feature counts, we selected two verbs from each class and determined the "true" value of each feature for each of those six verbs through manual counting. The six verbs were randomly selected from the group 2 subset of the verbs, since counts for group 2 verbs (as explained above) had not undergone manual correction. This allows us to determine the accuracy of the fully automatic counting procedures. The selected verbs (and their frequencies) are hopped (29), scurried (21), folded (189), stabilized (286), inherited (357), swallowed (152). For verbs that had a frequency of over 100 in the "-ed" form, we performed the manual counts on the first 100 occurrences.

Table 7 shows the results of the manual counts, reported as proportions to facilitate comparison to the normalized automatic counts, shown in adjoining columns. We observe first that, overall, most errors in the automatic counts occur in the unaccusative and object-drop verbs. While tagging errors affect the vbn feature for all of the verbs somewhat, we note that trans and pass are consistently underestimated for unaccusative and object-drop verbs. These errors make the unaccusative and object-drop feature values more similar to each other, and therefore potentially harder to distinguish. Furthermore, because the trans and pass values are underestimated by the automatic counts, and therefore lower in value, they are also closer to the values for the unergative verbs. For the caus feature, we predict the highest values for the unaccusative verbs, and while that prediction is confirmed, the automatic counts for that class also show the most errors. Finally, although the general pattern of higher values for the anim feature of unergatives and object-drop verbs is preserved in the automatic counts, the feature is underestimated for almost all the verbs, again making the values for that feature closer across the classes than they are in reality.

We conclude that, although there are inaccuracies in all the counts, the general patterns expected based on our analysis of the verb classes hold in both the manual and automatic counts. Errors in the estimating and counting procedures are therefore


not likely to be responsible for the pattern of data in Table 6 above, which generally matches our predictions. Furthermore, the errors, at least for this random sample of verbs, occur in a direction that makes our task of distinguishing the classes more difficult, and indicate that developing more accurate search patterns may possibly sharpen the class distinctions and improve the classification performance.

4. Experiments in Classification

In this section we turn to our computational experiments, which investigate whether the statistical indicators of thematic properties that we have developed can in fact be used to classify verbs. Recall that the task we have set ourselves is that of automatically learning the best class for a set of usages of a verb, as opposed to classifying individual occurrences of the verb. The frequency distributions of our features yield a vector for each verb that represents the estimated values for the verb on each dimension, across the entire corpus:

Vector template: [verb-name, trans, pass, vbn, caus, anim, class]

Example: [opened, 0.69, 0.09, 0.21, 0.16, 0.36, unacc]

The resulting set of 59 vectors constitutes the data for our machine learning experiments. We use this data to train an automatic classifier to determine, given the feature values for a new verb (not from the training set), which of the three major classes of English optionally intransitive verbs it belongs to.
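The vector format can be made concrete with a small sketch. The paper trains C5.0 decision trees on these vectors; as a self-contained stand-in we use a simple 1-nearest-neighbour rule here, purely to illustrate the data layout. The feature values below are taken from Appendix A, in the order trans, pass, vbn, caus, anim:

```python
# Toy illustration of the verb-vector data format. The 1-NN rule is a
# stand-in for the C5.0 classifier actually used in the paper.
def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def classify(test_vector, training_data):
    """Assign the class of the nearest training vector."""
    _, label = min(training_data, key=lambda item: distance(item[0], test_vector))
    return label

# Three training vectors from Appendix A (trans, pass, vbn, caus, anim):
training_data = [
    ([0.03, 0.00, 0.09, 0.00, 0.20], "unerg"),    # jumped
    ([0.69, 0.09, 0.21, 0.16, 0.36], "unacc"),    # opened
    ([0.97, 0.18, 0.30, 0.00, 0.33], "objdrop"),  # kicked
]

# A held-out verb: exploded (trans 0.66, pass 0.02, vbn 0.34, caus 0.37, anim 0.04).
print(classify([0.66, 0.02, 0.34, 0.37, 0.04], training_data))  # -> unacc
```

On this toy example the held-out verb lands nearest to opened, matching exploded's actual unaccusative class; the real experiments, of course, use all 59 vectors and an induced decision tree rather than nearest neighbours.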

4.1 Experimental Methodology

In pilot experiments on a subset of the features, we investigated a number of supervised machine learning methods that produce automatic classifiers (decision tree induction, rule learning, and two types of neural networks), as well as hierarchical clustering; see Stevenson et al. (1999) for more detail. Because we achieved approximately the same level of performance in all cases, we narrowed our further experimentation to the publicly available version of the C5.0 machine learning system (http://www.rulequest.com), a newer version of C4.5 (Quinlan 1992), due to its ease of use and wide availability. The C5.0 system generates both decision trees and corresponding rule sets from a training set of known classifications. In our experiments we found little to no difference in performance between the trees and rule sets, and report only the rule set results.

In the experiments below, we follow two methodologies in training and testing, each of which tests a subset of cases held out from the training data. Thus in all cases, the results we report are on test data that was never seen in training.7

The first training and testing methodology we follow is 10-fold cross-validation. In this approach, the system randomly divides the data into ten parts and runs ten times on a different 90%-training-data/10%-test-data split, yielding an average accuracy and standard error across the ten test sets. This training methodology is very useful for

7 One anonymous reviewer raised the concern that we do not test on verbs that were unseen by the authors prior to finalizing the specific features to count. However, this does not reduce the generality of our results. The features we use are motivated by linguistic theory and derived from the set of thematic properties that discriminate the verb classes. It is therefore very unlikely that they are skewed to the particular verbs we have chosen. Furthermore, our cross-validation experiments, described in the next subsection, show that our results hold across a very large number of randomly selected subsets of this sample of verbs.


our application, as it yields performance measures across a large number of training data/test data sets, avoiding the problems of outliers in a single random selection from a relatively small data set such as ours.

The second methodology is a single hold-out training and testing approach. Here the system is run N times, where N is the size of the data set (i.e., the 59 verbs in our case), each time holding out a single data vector as the test case and using the remaining N-1 vectors as the training set. The single hold-out methodology yields an overall accuracy rate (when the results are averaged across all N trials), but also, unlike cross-validation, gives us classification results on each individual data vector. This property enables us to analyze differential performance on the individual verbs and across the different verb classes.
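The single hold-out (leave-one-out) protocol can be sketched as follows. Here `train_and_classify` stands in for the C5.0 training-and-prediction step, and the toy majority-class learner exists only to make the example runnable:

```python
# Leave-one-out evaluation: train on N-1 items, test on the held-out item,
# repeat N times, and keep per-item predictions for per-class analysis.
def leave_one_out(data, train_and_classify):
    """Return (overall_accuracy, per_item (gold, predicted) pairs)."""
    predictions = []
    for i, (vector, gold) in enumerate(data):
        training = data[:i] + data[i + 1:]          # hold out item i
        predictions.append((gold, train_and_classify(training, vector)))
    correct = sum(1 for gold, pred in predictions if gold == pred)
    return correct / len(data), predictions

def majority_class(training, vector):
    """Toy learner: ignore the vector, predict the most common label."""
    labels = [label for _, label in training]
    return max(set(labels), key=labels.count)

data = [([0.1], "a"), ([0.2], "a"), ([0.3], "a"), ([0.9], "b")]
accuracy, predictions = leave_one_out(data, majority_class)
print(accuracy)  # -> 0.75: the lone "b" item is always misclassified
```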

Under both training and testing methodologies, the baseline (chance) performance in this task, a three-way classification, is 33.9%. In the single hold-out methodology, there are 59 test cases, with 20, 19, and 20 verbs each from the unergative, unaccusative, and object-drop classes, respectively. Chance performance of picking a single class label as a default and assigning it to all cases would yield at most 20 out of the 59 cases correct, or 33.9%. For the cross-validation methodology, the determination of a baseline is slightly more complex, as we are testing on a random selection of 10% of the full data set in each run. The 33.9% figure represents the expected relative proportion of a test set that would be labelled correctly by assignment of a default class label to the entire test set. Although the precise make-up of the test cases varies, on average the test set will represent the class membership proportions of the entire set of verbs. Thus, as with the single hold-out approach, chance accuracy corresponds to a maximum of 20/59, or 33.9%, of the test set being labelled correctly.

The theoretical maximum accuracy for the task is of course 100%, although in Section 5 we discuss some classification results from human experts that indicate that a more realistic expectation is much lower (around 87%).

4.2 Results Using 10-Fold Cross-Validation

We first report the results of experiments using a training methodology of 10-fold cross-validation repeated 50 times. This means that the 10-fold cross-validation procedure is repeated for 50 different random divisions of the data. The numbers reported are the averages of the results over all the trials. That is, the average accuracy and standard error from each random division of the data (a single cross-validation run, including 10 training and test sets) are averaged across the 50 different random divisions. This large number of experimental trials gives us a very tight bound on the mean accuracy reported, enabling us to determine with high confidence the statistical significance of differences in results.

Table 8 shows that performance of classification using individual features varies greatly, from little above the baseline to almost 22% above the baseline, or a reduction of a third of the error rate, a very good result for a single feature. (All reported accuracies in Table 8 are statistically distinct at the p < .01 level, using an ANOVA [df = 249, F = 334.72] with a Tukey-Kramer post test.)

The first line of Table 9 shows that the combination of all features achieves an accuracy of 69.8%, which is 35.9% over the baseline, for a reduction in the error rate of 54%. This is a rather considerable result, given the very low baseline (33.9%). Moreover, recall that our training and testing sets are always disjoint (cf. Lapata and Brew [1999], Siegel [1999]); in other words, we are predicting the classification of verbs that were never seen in the training corpus, the hardest situation for a classification algorithm.

The second through sixth lines of Table 9 show the accuracy achieved on each subset of features that results from removing a single feature. This allows us to evaluate


Table 8
Percent accuracy and standard error of the verb classification task using each feature individually, under a training methodology of 10-fold cross-validation repeated 50 times.

Feature  Accuracy  SE

caus     55.7      0.1
vbn      52.5      0.5
pass     50.2      0.5
trans    47.1      0.4
anim     35.3      0.5

Table 9
Percent accuracy and standard error of the verb classification task using features in combination, under a training methodology of 10-fold cross-validation repeated 50 times.

                                Feature
   Features Used                Not Used   Accuracy  SE

1  trans pass vbn caus anim     -          69.8      0.5
2  trans vbn caus anim          pass       69.8      0.5
3  trans pass vbn anim          caus       67.3      0.6
4  trans pass caus anim         vbn        66.5      0.5
5  trans pass vbn caus          anim       63.2      0.6
6  pass vbn caus anim           trans      61.6      0.6

the contribution of each feature to the performance of the classification process, by comparing the performance of the subset without it to the performance using the full set of features. We see that the removal of pass (second line) has no effect on the results, while removal of the remaining features yields a 2–8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant at the p < .05 level, except for those between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 37.52] with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted from the individual feature behavior. For example, caus, which is the best individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, anim and trans, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.

The general behaviour in classification based on individual features and on size 4 and size 5 subsets of features is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance; Table 10 summarizes this information. In the first data column, the table illustrates the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size n subsets in which each feature occurs. For example, the second data cell in the second row (54.9%) indicates the average accuracy of all subsets of size 2 that contain the feature vbn. The last row of the table indicates, for each feature, the average accuracy of all subsets containing that feature.
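The aggregation behind Table 10 can be sketched directly from the Appendix B data: for a given feature and subset size, average the accuracies of all subsets of that size containing the feature. The four entries below are the four size-2 subsets containing caus, taken from Appendix B:

```python
# Reproduce one cell of Table 10 from Appendix B-style (subset, accuracy) pairs.
def mean_accuracy(results, feature, size):
    """Mean accuracy of all subsets of the given size containing the feature."""
    accs = [acc for feats, acc in results if feature in feats and len(feats) == size]
    return sum(accs) / len(accs)

# The four size-2 subsets containing caus, from Appendix B:
results = [
    ({"trans", "caus"}, 57.3),
    ({"vbn", "caus"}, 55.7),
    ({"pass", "caus"}, 55.4),
    ({"caus", "anim"}, 62.1),
]
print(round(mean_accuracy(results, "caus", 2), 1))  # -> 57.6
```

The result, 57.6, matches the caus entry for subset size 2 in Table 10.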


Table 10
Average percent accuracy of feature subsets, by subset size and by sets of each size including each feature.

Subset  Mean Accuracy   Mean Accuracy of Subsets that Include Each Feature
Size    by Subset Size  vbn    pass   trans  anim   caus

1       48.2            52.5   50.2   47.1   35.3   55.7
2       55.1            54.9   52.8   56.4   58.0   57.6
3       60.5            60.1   58.5   62.3   61.1   60.5
4       65.7            65.5   64.7   66.7   66.3   65.3
5       69.8            69.8   69.8   69.8   69.8   69.8

Mean Acc/Feature        60.6   59.2   60.5   58.1   61.8

The first observation, that more features perform better, is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that, on average, larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation, that the performance of individual features is not always a predictor of their performance in combination, is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above average performance for all subsets of size 2 or greater.

4.3 Results Using Single Hold-Out Methodology

One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system, but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, which we turn to in Section 4.4.

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance; in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being


Table 11
Percent accuracy of the verb classification task using features in combination, under a single hold-out training methodology.

                                Feature    Accuracy
   Features Used                Not Used   on All Verbs

1  trans pass vbn caus anim     -          69.5
2  trans vbn caus anim          pass       71.2
3  trans pass vbn anim          caus       62.7
4  trans pass caus anim         vbn        61.0
5  trans pass vbn caus          anim       61.0
6  pass vbn caus anim           trans      64.4

Table 12
F score of classification within each class, under a single hold-out training methodology.

                                Feature    F score (%)  F score (%)  F score (%)
   Features Used                Not Used   for Unergs   for Unaccs   for Objdrops

1  trans pass vbn caus anim     -          73.9         68.6         64.9
2  trans vbn caus anim          pass       76.2         75.7         61.6
3  trans pass vbn anim          caus       65.1         60.0         62.8
4  trans pass caus anim         vbn        66.7         65.0         51.3
5  trans pass vbn caus          anim       72.7         47.0         60.0
6  pass vbn caus anim           trans      78.1         51.5         61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5–8% negative effect on performance, again similar to the cross-validation results. (Although note that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9) than the unaccusative and object-drop verbs (F scores of 68.6 and 64.9, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs; that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence here we use the balanced F score, which calculates an overall measure of performance as 2PR / (P + R), in which P (precision) is truePositives / (truePositives + falsePositives), and R (recall) is truePositives / (truePositives + falseNegatives).
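The balanced F score defined in the footnote can be written out directly as P = TP/(TP+FP), R = TP/(TP+FN), F = 2PR/(P+R); the example counts below are invented for illustration:

```python
# Balanced F score (F1) from per-class true positives, false positives,
# and false negatives, as defined in footnote 8.
def f_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented example: 15 of 20 verbs of a class found, with 5 false positives.
print(f_score(tp=15, fp=5, fn=5))  # P = R = 0.75, so F = 0.75
```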

391

Computational Linguistics Volume 27 Number 3

frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8) but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.
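The grouping of verbs into log-frequency bands and the per-band error proportions can be sketched as follows; the frequencies and correctness flags below are invented for illustration, not the experimental data:

```python
import math
from collections import defaultdict

def error_rate_by_bin(verbs):
    """Group (frequency, correct?) pairs into log10-frequency bins
    and report the proportion of classification errors in each bin."""
    bins = defaultdict(lambda: [0, 0])  # bin -> [errors, total]
    for freq, correct in verbs:
        b = int(math.log10(freq))       # e.g. freq 300 -> bin 2
        bins[b][1] += 1
        if not correct:
            bins[b][0] += 1
    return {b: errs / total for b, (errs, total) in sorted(bins.items())}

# Hypothetical data: (corpus frequency, correctly classified?)
sample = [(40, True), (85, False), (300, True), (700, False), (2500, False)]
print(error_rate_by_bin(sample))  # -> {1: 0.5, 2: 0.5, 3: 1.0}
```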

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns, for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.⁹ The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification

We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans, pass, vbn, caus, anim) with the class labels assigned using each size-4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature, in going from the four-feature to the five-feature set. Thus we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2 and summarized here in Table 13.

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9. The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently "uncontrolled", and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.


Merlo and Stevenson Statistical Verb Classification

Table 13
Expected class discriminations for each feature.

Feature          Expected Frequency Pattern

Transitivity     Unerg < Unacc < ObjDrop
Causativity      Unerg, ObjDrop < Unacc
Animacy          Unacc < Unerg, ObjDrop
Passive Voice    Unerg < Unacc < ObjDrop
VBN Tag          Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating the number of errors in classification by verb class, for the full set of five features compared to two of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                          Assigned Class

             All features      w/o caus        w/o anim
              E   A   O       E   A   O       E   A   O
Actual   E    –   1   2       –   4   2       –   2   2
Class    A    4   –   3       5   –   2       5   –   6
         O    5   3   –       4   5   –       3   5   –

Table 15
Confusion matrix indicating the number of errors in classification by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                                Assigned Class

             All features     w/o trans      w/o pass       w/o vbn
              E   A   O      E   A   O      E   A   O      E   A   O
Actual   E    –   1   2      –   2   2      –   1   3      –   1   5
Class    A    4   –   3      3   –   7      1   –   4      2   –   4
         O    5   3   –      2   5   –      5   3   –      4   6   –

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes, in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
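This pairwise confusability measure can be computed directly from a confusion matrix of error counts; the sketch below uses the counts of the all-features panel of Table 14 (the class abbreviations follow the tables):

```python
from itertools import combinations

def confusability(matrix, classes):
    """For each pair of classes, sum the two off-diagonal cells:
    verbs of one class labelled as the other, and vice versa."""
    return {(a, b): matrix[a][b] + matrix[b][a]
            for a, b in combinations(classes, 2)}

# Error counts for the all-features classifier (Table 14, first panel).
errors = {'E': {'E': 0, 'A': 1, 'O': 2},
          'A': {'E': 4, 'A': 0, 'O': 3},
          'O': {'E': 5, 'A': 3, 'O': 0}}
print(confusability(errors, ['E', 'A', 'O']))
# -> {('E', 'A'): 5, ('E', 'O'): 7, ('A', 'O'): 6}
```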

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features


we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature, compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives, and between unaccusatives and object-drops, is 9 and 7 errors, respectively; but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature, compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors from 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated to trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First, consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans (that is, also making a three-way distinction among the classes), although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method in evaluating our approach to verb classification.


Table 16
Percent agreement (Agr) and pairwise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

          Program        E1             E2             E3
          Agr     K      Agr     K     Agr     K      Agr     K
E1        59%    .36
E2        68%    .50     75%    .59
E3        66%    .49     70%    .53    77%    .66
Levin     69.5%  .54     71%    .56    86.5%  .80     83%    .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results, provided by the non-forced-choice task, where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index available from Chicago University Press. The verbs were to be classified into the three target classes (unergative, unaccusative, and object-drop), which were described in the instructions.¹⁰ (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.¹¹ Assessing the percentage of verbs on which the experts agree gives us

10. The definitions of the classes were as follows. Unergative: a verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: a verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-drop: a verb that assigns an agent role to the subject and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11. In the comparison of the program to the experts, we use the results of the classifier under single hold-out training (which yields an accuracy of 69.5%), because those results provide the classification for each of the individual verbs.


an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55–57) (using the z distribution to determine significance; p < 0.001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
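As a rough illustration, a generic Cohen-style kappa over two annotators' labels can be computed as below. This is a simplified sketch with invented labels; the calculation reported in the paper follows Klauer (1987) and may differ in detail:

```python
from collections import Counter

def kappa(labels_a, labels_b):
    """Kappa: observed agreement between two annotators, corrected for
    the agreement expected by chance from their label proportions."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical class labels for six verbs from two annotators.
a = ['unerg', 'unacc', 'objdrop', 'unerg', 'unacc', 'objdrop']
b = ['unerg', 'unacc', 'unerg', 'unerg', 'objdrop', 'objdrop']
print(round(kappa(a, b), 2))  # -> 0.5
```

Note how chance agreement is driven by the two annotators' label proportions, which is why equal percent agreement can yield different kappa values.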

Evaluating the experts' performance, summarized in Table 16, we can remark two things, which confirm our expectations. First, the task is difficult: it is not performed at 100% (or close) even by trained experts, when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .6) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
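The majority-vote combination described above can be sketched as follows; the expert labels and the fallback choice here are invented for illustration:

```python
from collections import Counter

def majority_label(votes, fallback):
    """Return the label chosen by at least two of the three experts;
    if all three disagree, defer to a fallback (standing in for
    'the classification of the most accurate expert')."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else fallback(votes)

# Hypothetical expert judgments for two verbs; the fallback simply
# takes the first expert's label for illustration.
print(majority_label(['unerg', 'unerg', 'unacc'], fallback=lambda v: v[0]))
print(majority_label(['unerg', 'unacc', 'objdrop'], fallback=lambda v: v[0]))
```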

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task (86.5%) as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than that of one of the human experts in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.

One interesting question is whether experts and program disagree on the same verbs and show similar patterns of errors. The program makes 18 errors in total compared to the gold standard. However, in 9 cases at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3) and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This


clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1000), and those with log frequency over 3 (i.e., frequency over 1000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops); the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops); and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote: verbs with frequency less than 100 yield an accuracy of 92% (K = .84); verbs with frequency between 100 and 1000, accuracy 80% (K = .69); and verbs with frequency over 1000, accuracy 90% (K = .82). For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5% (K = .74); verbs with frequency between 100 and 1000, accuracy 84% (K = .76); and verbs with frequency over 1000, accuracy 90% (K = .82).

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest-frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs, taken from Levin's (1993) electronic index as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = .64) with each other, and 86% (K = .80) and 69% (K = .57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = .56] with each other, and 68% [K = .55] and 60.5% [K = .46] with the gold standard, respectively.) These results show that the forced-


choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for the computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally in syntactic alternations, giving rise to differences in subcategorization frames, or in the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs (their argument structure) are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989] and Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a


"change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples, meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998; Siegel 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to usefully predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993], Sanfilippo and Poznanski [1992], Manning [1993], Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features that tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed, which can work


on a rough collection of texts, and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and the work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition, or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy, compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multilingual texts. Like ours, their work goes beyond statistics over subcategorizations, to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.


Merlo and Stevenson    Statistical Verb Classification

Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45%, with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
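The balanced F-score we cite is the harmonic mean of the reported precision and recall; a one-line check (the helper is ours, not from either paper):

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    return 2 * precision * recall / (precision + recall)

# Schulte im Walde (2000): precision 61%, recall 36%
print(round(f_score(0.61, 0.36), 2))  # → 0.45
```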

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences based simply on head words yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.
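The detection idea can be sketched as comparing the filler distributions of the two candidate slots. In the sketch below, a plain cosine over head-word counts stands in for McCarthy's WordNet-based preference models, and the counts and threshold are invented for illustration only:

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = math.sqrt(sum(v * v for v in c1.values())) * \
           math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

# Toy head-word counts for the two slots of a causative candidate:
# intransitive subjects vs. transitive objects of a verb like "melt"
intrans_subj = Counter({"ice": 5, "snow": 3, "butter": 2})
trans_obj = Counter({"ice": 4, "butter": 3, "chocolate": 1})

# High similarity between the alternating slots suggests the verb alternates
print(cosine(intrans_subj, trans_obj) > 0.5)  # → True
```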

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition, we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and Lapata and Brew (1999) in attempting to learn semantic notions from distributions of

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.


Computational Linguistics Volume 27 Number 3

indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to the external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature), or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus, her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification (i.e., sets of verbs that have the same multi-way classifications according to Levin [1993]) can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).
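One simple way to realize such a combination, sketched here under our own assumptions (the interpolation weight and all scores are invented for illustration; this is not the authors' proposed model), is to interpolate a per-type class prior with per-instance evidence:

```python
def classify_token(type_prior, instance_scores, weight=0.7):
    """Combine a verb type's class bias with evidence from one usage.

    type_prior and instance_scores map class labels to scores in [0, 1];
    weight controls how much the cross-corpus type bias dominates.
    """
    combined = {c: weight * type_prior[c] +
                   (1 - weight) * instance_scores.get(c, 0.0)
                for c in type_prior}
    return max(combined, key=combined.get)

# Toy numbers: a verb type biased toward object-drop, but this particular
# token looks strongly unaccusative, so the instance evidence wins.
prior = {"unerg": 0.1, "unacc": 0.3, "objdrop": 0.6}
token = {"unerg": 0.05, "unacc": 0.9, "objdrop": 0.05}
print(classify_token(prior, token))  # → unacc
```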

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach, it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.

Acknowledgments
We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM), the United States NSF (grants 9702331 and 9818322 to SS), the Canadian NSERC (grant to SS), the University of Toronto, and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15. University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.

Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe Zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: an overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Computers and the Humanities, Special Issue: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1–2):217–222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.

Pinker, Steven. 1989. Learnability and Cognition: the Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.



Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs
           Freq   vbn   pass  trans caus  anim

Min Value     8   0.00  0.00  0.00  0.00  0.00
Max Value  4088   1.00  0.39  0.74  0.00  1.00

floated     176   0.43  0.26  0.74  0.00  0.17
hurried      86   0.40  0.31  0.50  0.00  0.37
jumped     4088   0.09  0.00  0.03  0.00  0.20
leaped      225   0.09  0.00  0.05  0.00  0.13
marched     238   0.10  0.01  0.09  0.00  0.12
paraded      33   0.73  0.39  0.46  0.00  0.50
raced       123   0.01  0.00  0.06  0.00  0.15
rushed      467   0.22  0.12  0.20  0.00  0.10
vaulted      54   0.00  0.00  0.41  0.00  0.03
wandered     67   0.02  0.00  0.03  0.00  0.32
galloped     12   1.00  0.00  0.00  0.00  0.00
glided       14   0.00  0.00  0.08  0.00  0.50
hiked        25   0.28  0.12  0.29  0.00  0.40
hopped       29   0.00  0.00  0.21  0.00  1.00
jogged        8   0.29  0.00  0.29  0.00  0.33
scooted      10   0.00  0.00  0.43  0.00  0.00
scurried     21   0.00  0.00  0.00  0.00  0.14
skipped      82   0.22  0.02  0.64  0.00  0.16
tiptoed      12   0.17  0.00  0.00  0.00  0.00
trotted      37   0.19  0.17  0.07  0.00  0.18

Unaccusative Verbs
            Freq   vbn   pass  trans caus  anim

Min Value     13   0.16  0.00  0.02  0.00  0.00
Max Value   5543   0.95  0.80  0.76  0.41  0.36

boiled        58   0.92  0.70  0.42  0.00  0.00
cracked      175   0.61  0.19  0.76  0.02  0.14
dissolved    226   0.51  0.58  0.71  0.05  0.11
exploded     409   0.34  0.02  0.66  0.37  0.04
flooded      235   0.47  0.57  0.44  0.04  0.03
fractured     55   0.95  0.76  0.51  0.00  0.00
hardened     123   0.92  0.55  0.56  0.12  0.00
melted        70   0.80  0.44  0.02  0.00  0.19
opened      3412   0.21  0.09  0.69  0.16  0.36
solidified    34   0.65  0.21  0.68  0.00  0.12
collapsed    950   0.16  0.00  0.16  0.01  0.02
cooled       232   0.85  0.21  0.29  0.13  0.11
folded       189   0.73  0.33  0.23  0.00  0.00
widened     1155   0.18  0.02  0.13  0.41  0.01
changed     5543   0.73  0.23  0.47  0.22  0.08
cleared     1145   0.58  0.40  0.50  0.31  0.06
divided     1539   0.93  0.80  0.17  0.10  0.05
simmered      13   0.83  0.00  0.09  0.00  0.00
stabilized   286   0.92  0.13  0.18  0.35  0.00



Object-Drop Verbs
            Freq   vbn   pass  trans caus  anim

Min Value     39   0.10  0.04  0.21  0.00  0.00
Max Value  15063   0.95  0.99  1.00  0.24  0.42

carved       185   0.85  0.66  0.98  0.00  0.00
danced        88   0.22  0.14  0.37  0.00  0.00
kicked       308   0.30  0.18  0.97  0.00  0.33
knitted       39   0.95  0.99  0.93  0.00  0.00
painted      506   0.72  0.18  0.71  0.00  0.38
played      2689   0.38  0.16  0.24  0.00  0.00
reaped       172   0.56  0.05  0.90  0.00  0.22
typed         57   0.81  0.74  0.81  0.00  0.00
washed       137   0.79  0.60  1.00  0.00  0.00
yelled        74   0.10  0.04  0.38  0.00  0.00
borrowed    1188   0.77  0.15  0.60  0.13  0.19
inherited    357   0.60  0.13  0.64  0.06  0.32
organized   1504   0.85  0.38  0.65  0.18  0.07
rented       232   0.72  0.22  0.61  0.00  0.42
sketched      44   0.67  0.17  0.44  0.00  0.20
cleaned      160   0.83  0.47  0.21  0.05  0.21
packed       376   0.84  0.12  0.40  0.05  0.19
studied      901   0.66  0.17  0.57  0.05  0.11
swallowed    152   0.79  0.44  0.35  0.04  0.22
called     15063   0.56  0.22  0.72  0.24  0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.

Accuracy  SE   Features                    Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim    57.3      0.5  trans caus
69.8      0.5  trans vbn caus anim         57.3      0.5  pass vbn anim
67.3      0.6  trans pass vbn anim         56.7      0.5  pass caus anim
66.7      0.5  trans vbn anim              55.7      0.5  vbn caus
66.5      0.5  trans pass caus anim        55.7      0.1  caus
64.4      0.5  trans vbn caus              55.4      0.4  pass caus
63.2      0.6  trans pass vbn caus         55.0      0.6  trans pass vbn
63.0      0.5  trans pass anim             54.7      0.4  trans pass
62.9      0.4  trans caus anim             54.2      0.5  trans vbn
62.1      0.5  caus anim                   52.5      0.5  vbn
61.7      0.5  trans pass caus             50.9      0.5  pass anim
61.6      0.6  pass vbn caus anim          50.2      0.6  pass vbn
60.1      0.4  vbn caus anim               50.2      0.5  pass
59.5      0.6  trans anim                  47.1      0.4  trans
59.4      0.5  vbn anim                    35.3      0.5  anim
57.4      0.6  pass vbn caus
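The overlap test described above is mechanical; a sketch in code (the helper names are ours; accuracies and standard errors are in percent, as in the table):

```python
def confidence_interval(accuracy, se, t=2.01):
    """95% confidence range for an accuracy (df = 49, t = 2.01)."""
    return (accuracy - t * se, accuracy + t * se)

def significantly_different(acc1, se1, acc2, se2):
    """True if the two 95% intervals do not overlap (p < .05)."""
    lo1, hi1 = confidence_interval(acc1, se1)
    lo2, hi2 = confidence_interval(acc2, se2)
    return hi1 < lo2 or hi2 < lo1

# Full feature set (69.8, SE 0.5) vs. trans alone (47.1, SE 0.4)
print(significantly_different(69.8, 0.5, 47.1, 0.4))  # → True
```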




Table 7
Manually (Man) and automatically (Aut) calculated features for a random sample of verbs. T = trans, P = pass, V = vbn, C = caus, A = anim.

             Unergative                Unaccusative               Object-Drop
      hopped       scurried      folded       stabilized    inherited    swallowed
      Man    Aut   Man    Aut    Man    Aut   Man    Aut    Man    Aut   Man    Aut
T     0.21   0.21  0.00   0.00   0.71   0.23  0.24   0.18   1.00   0.64  0.96   0.35
P     0.00   0.00  0.00   0.00   0.44   0.33  0.19   0.13   0.39   0.13  0.54   0.44
V     0.03   0.00  0.10   0.00   0.56   0.73  0.71   0.92   0.56   0.60  0.64   0.79
C     0.00   0.00  0.00   0.00   0.54   0.00  0.24   0.35   0.00   0.06  0.00   0.04
A     0.93   1.00  0.90   0.14   0.23   0.00  0.02   0.00   0.58   0.32  0.35   0.22

and objects for a verb, also captures a "reciprocity" effect for some object-drop verbs (such as call), in which subjects and objects can be similar types of entities. Finally, although expected to be a redundant indicator of transitivity, pass and vbn, unlike trans, have very similar values for unaccusative and object-drop verbs, indicating that their distributions are sensitive to factors we have not yet investigated.

One issue we must address is how precisely the automatic counts reflect the actual linguistic behaviour of the verbs. That is, we must be assured that the patterns we note in the data in Table 6 are accurate reflections of the differential behaviour of the verb classes, and not an artifact of the way in which we estimate the features, or a result of inaccuracies in the counts. In order to evaluate the accuracy of our feature counts, we selected two verbs from each class and determined the "true" value of each feature for each of those six verbs through manual counting. The six verbs were randomly selected from the group 2 subset of the verbs, since counts for group 2 verbs (as explained above) had not undergone manual correction. This allows us to determine the accuracy of the fully automatic counting procedures. The selected verbs (and their frequencies) are: hopped (29), scurried (21), folded (189), stabilized (286), inherited (357), swallowed (152). For verbs that had a frequency of over 100 in the "-ed" form, we performed the manual counts on the first 100 occurrences.

Table 7 shows the results of the manual counts, reported as proportions to facilitate comparison to the normalized automatic counts shown in adjoining columns. We observe first that, overall, most errors in the automatic counts occur in the unaccusative and object-drop verbs. While tagging errors affect the vbn feature for all of the verbs somewhat, we note that trans and pass are consistently underestimated for unaccusative and object-drop verbs. These errors make the unaccusative and object-drop feature values more similar to each other, and therefore potentially harder to distinguish. Furthermore, because the trans and pass values are underestimated by the automatic counts, and therefore lower in value, they are also closer to the values for the unergative verbs. For the caus feature, we predict the highest values for the unaccusative verbs, and while that prediction is confirmed, the automatic counts for that class also show the most errors. Finally, although the general pattern of higher values for the anim feature of unergatives and object-drop verbs is preserved in the automatic counts, the feature is underestimated for almost all the verbs, again making the values for that feature closer across the classes than they are in reality.

We conclude that, although there are inaccuracies in all the counts, the general patterns expected based on our analysis of the verb classes hold in both the manual and automatic counts. Errors in the estimating and counting procedures are therefore not likely to be responsible for the pattern of data in Table 6 above, which generally matches our predictions. Furthermore, the errors, at least for this random sample of verbs, occur in a direction that makes our task of distinguishing the classes more difficult, and indicates that developing more accurate search patterns may possibly sharpen the class distinctions and improve the classification performance.

4. Experiments in Classification

In this section, we turn to our computational experiments, which investigate whether the statistical indicators of thematic properties that we have developed can in fact be used to classify verbs. Recall that the task we have set ourselves is that of automatically learning the best class for a set of usages of a verb, as opposed to classifying individual occurrences of the verb. The frequency distributions of our features yield a vector for each verb that represents the estimated values for the verb on each dimension across the entire corpus.

Vector template: [verb-name trans pass vbn caus anim class]

Example: [opened .69 .09 .21 .16 .36 unacc]

The resulting set of 59 vectors constitutes the data for our machine learning experiments. We use this data to train an automatic classifier to determine, given the feature values for a new verb (not from the training set), which of the three major classes of English optionally intransitive verbs it belongs to.
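As a rough sketch of this data flow (not the authors' C5.0 system; we substitute a simple nearest-neighbor rule, with a handful of training vectors taken from Appendix A in trans, pass, vbn, caus, anim order):

```python
import math

# Feature order follows the vector template: trans, pass, vbn, caus, anim.
# Normalized feature values from Appendix A; two verbs per class.
TRAINING = [
    ("jumped",  [0.03, 0.00, 0.09, 0.00, 0.20], "unerg"),
    ("rushed",  [0.20, 0.12, 0.22, 0.00, 0.10], "unerg"),
    ("opened",  [0.69, 0.09, 0.21, 0.16, 0.36], "unacc"),
    ("cleared", [0.50, 0.40, 0.58, 0.31, 0.06], "unacc"),
    ("called",  [0.72, 0.22, 0.56, 0.24, 0.16], "objdrop"),
    ("played",  [0.24, 0.16, 0.38, 0.00, 0.00], "objdrop"),
]

def classify(features):
    """Assign a held-out verb the class of its nearest training vector."""
    _, _, label = min(TRAINING, key=lambda ex: math.dist(features, ex[1]))
    return label

# stabilized (unaccusative, held out of TRAINING), vector from Appendix A
print(classify([0.18, 0.13, 0.92, 0.35, 0.00]))  # → unacc
```

A real replication would instead train C5.0 (or another rule learner) on all 59 vectors and evaluate it by cross-validation, as described in Section 4.1.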

4.1 Experimental Methodology
In pilot experiments on a subset of the features, we investigated a number of supervised machine learning methods that produce automatic classifiers (decision tree induction, rule learning, and two types of neural networks), as well as hierarchical clustering; see Stevenson et al. (1999) for more detail. Because we achieved approximately the same level of performance in all cases, we narrowed our further experimentation to the publicly available version of the C5.0 machine learning system (http://www.rulequest.com), a newer version of C4.5 (Quinlan 1992), due to its ease of use and wide availability. The C5.0 system generates both decision trees and corresponding rule sets from a training set of known classifications. In our experiments we found little to no difference in performance between the trees and rule sets, and report only the rule set results.
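C5.0 itself is a standalone package, but the core idea of decision tree induction can be sketched in a few lines: pick the feature threshold that minimizes training errors. The toy "decision stump" below, on a single feature with hypothetical caus values, is an illustration only, not the authors' system:

```python
def train_stump(values, labels, classes=("unerg", "unacc")):
    """Induce a one-node 'decision tree': the threshold on a single feature
    that minimizes training errors, trying both label orientations."""
    best = None
    for t in sorted(set(values)):
        for lo, hi in [(classes[0], classes[1]), (classes[1], classes[0])]:
            errs = sum((lo if v <= t else hi) != l
                       for v, l in zip(values, labels))
            if best is None or errs < best[0]:
                best = (errs, t, lo, hi)
    return best[1:]  # (threshold, label_at_or_below, label_above)

# Hypothetical caus values: unaccusatives show more causative usage.
caus = [0.02, 0.05, 0.01, 0.30, 0.45, 0.25]
cls = ["unerg"] * 3 + ["unacc"] * 3
threshold, below, above = train_stump(caus, cls)
```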

In the experiments below we follow two methodologies in training and testing, each of which tests a subset of cases held out from the training data. Thus, in all cases, the results we report are on test data that was never seen in training.7

The first training and testing methodology we follow is 10-fold cross-validation. In this approach, the system randomly divides the data into ten parts and runs ten times, each time on a different split with 90% of the data for training and 10% for testing, yielding an average accuracy and standard error across the ten test sets. This training methodology is very useful for

7 One anonymous reviewer raised the concern that we do not test on verbs that were unseen by the authors prior to finalizing the specific features to count. However, this does not reduce the generality of our results. The features we use are motivated by linguistic theory and derived from the set of thematic properties that discriminate the verb classes. It is therefore very unlikely that they are skewed to the particular verbs we have chosen. Furthermore, our cross-validation experiments, described in the next subsection, show that our results hold across a very large number of randomly selected subsets of this sample of verbs.


Computational Linguistics Volume 27, Number 3

our application, as it yields performance measures across a large number of training data/test data sets, avoiding the problems of outliers in a single random selection from a relatively small data set such as ours.
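The repeated cross-validation procedure is straightforward to reproduce; the sketch below assumes an `evaluate(train, test)` callback returning per-fold accuracy, a placeholder for the actual C5.0 training step:

```python
import random

def ten_fold_splits(n, seed=0):
    """One 10-fold cross-validation run: yield disjoint (train, test) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::10] for k in range(10)]
    for k in range(10):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

def repeated_cv_accuracy(n, evaluate, runs=50):
    """Average fold accuracy over `runs` random divisions of the data
    (50 divisions in the experiments reported here)."""
    per_run = [sum(evaluate(tr, te) for tr, te in ten_fold_splits(n, seed=r)) / 10
               for r in range(runs)]
    return sum(per_run) / runs
```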

The second methodology is a single hold-out training and testing approach. Here the system is run N times, where N is the size of the data set (i.e., the 59 verbs in our case), each time holding out a single data vector as the test case and using the remaining N-1 vectors as the training set. The single hold-out methodology yields an overall accuracy rate (when the results are averaged across all N trials), but also, unlike cross-validation, gives us classification results on each individual data vector. This property enables us to analyze differential performance on the individual verbs and across the different verb classes.
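A minimal sketch of single hold-out (leave-one-out) evaluation; `train_and_classify` stands in for training the classifier on the N-1 remaining vectors and labelling the held-out one, and the toy data are hypothetical:

```python
from collections import Counter

def leave_one_out(vectors, train_and_classify):
    """Classify each item with a model trained on the remaining N-1 items,
    recording the label assigned to every individual case."""
    assigned = []
    for i, held_out in enumerate(vectors):
        train = vectors[:i] + vectors[i + 1:]
        assigned.append(train_and_classify(train, held_out))
    return assigned

# Toy stand-in classifier: always predict the majority class of the training set.
def majority_classifier(train, _held_out):
    labels = [label for _, label in train]
    return Counter(labels).most_common(1)[0][0]

data = [("v1", "unerg"), ("v2", "unerg"), ("v3", "unacc")]
predictions = leave_one_out(data, majority_classifier)
```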

Under both training and testing methodologies, the baseline (chance) performance in this task, a three-way classification, is 33.9%. In the single hold-out methodology there are 59 test cases, with 20, 19, and 20 verbs each from the unergative, unaccusative, and object-drop classes, respectively. Chance performance of picking a single class label as a default and assigning it to all cases would yield at most 20 out of the 59 cases correct, or 33.9%. For the cross-validation methodology the determination of a baseline is slightly more complex, as we are testing on a random selection of 10% of the full data set in each run. The 33.9% figure represents the expected relative proportion of a test set that would be labelled correctly by assignment of a default class label to the entire test set. Although the precise make-up of the test cases varies, on average the test set will represent the class membership proportions of the entire set of verbs. Thus, as with the single hold-out approach, chance accuracy corresponds to a maximum of 20/59, or 33.9%, of the test set being labelled correctly.
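The chance baseline is just the relative frequency of the largest class:

```python
from collections import Counter

def chance_baseline(labels):
    """Accuracy of the best single-label default: the proportion of the
    most frequent class in the data."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

# Class sizes from the text: 20 unergatives, 19 unaccusatives, 20 object-drops.
labels = ["unerg"] * 20 + ["unacc"] * 19 + ["objdrop"] * 20
baseline = chance_baseline(labels)  # 20/59, about 0.339
```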

The theoretical maximum accuracy for the task is of course 100%, although in Section 5 we discuss some classification results from human experts that indicate that a more realistic expectation is much lower (around 87%).

4.2 Results Using 10-Fold Cross-Validation
We first report the results of experiments using a training methodology of 10-fold cross-validation repeated 50 times. This means that the 10-fold cross-validation procedure is repeated for 50 different random divisions of the data. The numbers reported are the averages of the results over all the trials. That is, the average accuracy and standard error from each random division of the data (a single cross-validation run, including 10 training and test sets) are averaged across the 50 different random divisions. This large number of experimental trials gives us a very tight bound on the mean accuracy reported, enabling us to determine with high confidence the statistical significance of differences in results.

Table 8 shows that performance of classification using individual features varies greatly, from little above the baseline to almost 22% above the baseline, or a reduction of a third of the error rate, a very good result for a single feature. (All reported accuracies in Table 8 are statistically distinct at the p < .01 level, using an ANOVA [df = 249, F = 334.72] with a Tukey-Kramer post test.)

The first line of Table 9 shows that the combination of all features achieves an accuracy of 69.8%, which is 35.9% over the baseline, for a reduction in the error rate of 54%. This is a rather considerable result given the very low baseline (33.9%). Moreover, recall that our training and testing sets are always disjoint (cf. Lapata and Brew [1999]; Siegel [1999]); in other words, we are predicting the classification of verbs that were never seen in the training corpus, the hardest situation for a classification algorithm.

The second through sixth lines of Table 9 show the accuracy achieved on each subset of features that results from removing a single feature. This allows us to evaluate


Table 8
Percent accuracy and standard error of the verb classification task using each feature individually, under a training methodology of 10-fold cross-validation repeated 50 times.

Feature   Accuracy   SE
caus      55.7       .1
vbn       52.5       .5
pass      50.2       .5
trans     47.1       .4
anim      35.3       .5

Table 9
Percent accuracy and standard error of the verb classification task using features in combination, under a training methodology of 10-fold cross-validation repeated 50 times.

     Features Used              Feature Not Used   Accuracy   SE
  1  trans pass vbn caus anim                      69.8       .5
  2  trans vbn caus anim        pass               69.8       .5
  3  trans pass vbn anim        caus               67.3       .6
  4  trans pass caus anim       vbn                66.5       .5
  5  trans pass vbn caus        anim               63.2       .6
  6  pass vbn caus anim         trans              61.6       .6

the contribution of each feature to the performance of the classification process, by comparing the performance of the subset without it to the performance using the full set of features. We see that the removal of pass (second line) has no effect on the results, while removal of the remaining features yields a 2-8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant at the p < .05 level, except for between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 37.52] with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted from the individual feature behavior. For example, caus, which is the best individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, anim and trans, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.

The general behaviour in classification based on individual features and on size 4 and size 5 subsets of features is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance; Table 10 summarizes this information. In the first data column, the table illustrates the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size n subsets in which each feature occurs. For example, the second data cell in the second row (54.9) indicates the average accuracy of all subsets of size 2 that contain the feature vbn. The last row of the table indicates, for each feature, the average accuracy of all subsets containing that feature.


Table 10
Average percent accuracy of feature subsets, by subset size and by sets of each size including each feature.

                                       Mean Accuracy of Subsets
  Subset    Mean Accuracy              that Include Each Feature
  Size      by Subset Size      vbn     pass    trans   anim    caus
  1         48.2                52.5    50.2    47.1    35.3    55.7
  2         55.1                54.9    52.8    56.4    58.0    57.6
  3         60.5                60.1    58.5    62.3    61.1    60.5
  4         65.7                65.5    64.7    66.7    66.3    65.3
  5         69.8                69.8    69.8    69.8    69.8    69.8
  Mean Acc/Feature              60.6    59.2    60.5    58.1    61.8

The first observation, that more features perform better, is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that on average larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation, that the performance of individual features is not always a predictor of their performance in combination, is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above average performance for all subsets of size 2 or greater.

4.3 Results Using Single Hold-Out Methodology
One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments we used the same C5.0 system, but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case, and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, which we turn to in Section 4.4.

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance; in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being


Table 11
Percent accuracy of the verb classification task using features in combination, under a single hold-out training methodology.

     Features Used              Feature Not Used   Accuracy on All Verbs
  1  trans pass vbn caus anim                      69.5
  2  trans vbn caus anim        pass               71.2
  3  trans pass vbn anim        caus               62.7
  4  trans pass caus anim       vbn                61.0
  5  trans pass vbn caus        anim               61.0
  6  pass vbn caus anim         trans              64.4

Table 12
F score of classification within each class, under a single hold-out training methodology.

     Features Used              Feature    F score (%)   F score (%)   F score (%)
                                Not Used   for Unergs    for Unaccs    for Objdrops
  1  trans pass vbn caus anim              73.9          68.6          64.9
  2  trans vbn caus anim        pass       76.2          75.7          61.6
  3  trans pass vbn anim        caus       65.1          60.0          62.8
  4  trans pass caus anim       vbn        66.7          65.0          51.3
  5  trans pass vbn caus        anim       72.7          47.0          60.0
  6  pass vbn caus anim         trans      78.1          51.5          61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5-8% negative effect on performance, again similar to the cross-validation results. (Although note that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9) than the unaccusative and object-drop verbs (F scores of 68.6 and 64.9, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs, that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence here we use the balanced F score, which calculates an overall measure of performance as 2PR / (P + R), in which P (precision) is truePositives / (truePositives + falsePositives), and R (recall) is truePositives / (truePositives + falseNegatives).
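The balanced F score defined in footnote 8 can be computed per class directly from (actual, assigned) label pairs; the pairs below are illustrative only, not the experimental data:

```python
def per_class_f(pairs, cls):
    """Balanced F score (2PR / (P + R)) for one class, from a list of
    (actual, assigned) label pairs."""
    tp = sum(1 for a, p in pairs if a == cls and p == cls)
    fp = sum(1 for a, p in pairs if a != cls and p == cls)
    fn = sum(1 for a, p in pairs if a == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative (actual, assigned) pairs.
pairs = [("E", "E"), ("E", "A"), ("A", "A"), ("A", "E"), ("O", "O")]
```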


frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8) but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.
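The frequency grouping used in this error analysis amounts to binning on log10 corpus frequency; the bin boundaries follow the text, but the treatment of exact boundary values is our assumption:

```python
import math

def frequency_bin(freq):
    """Bin a verb by log10 corpus frequency: <2 (freq < 100),
    2-3 (100-1000), >3 (freq > 1000). Boundary handling is an assumption."""
    log_freq = math.log10(freq)
    if log_freq < 2:
        return "low"
    if log_freq <= 3:
        return "mid"
    return "high"
```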

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns, for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification
We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans, pass, vbn, caus, anim) with the class labels assigned using each size 4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature in going from the four-feature to the five-feature set. Thus, we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2, and summarized here in Table 13.

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors, but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently "uncontrolled", and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.


Table 13
Expected class discriminations for each feature.

  Feature          Expected Frequency Pattern
  Transitivity     Unerg < Unacc < ObjDrop
  Causativity      Unerg, ObjDrop < Unacc
  Animacy          Unacc < Unerg, ObjDrop
  Passive Voice    Unerg < Unacc < ObjDrop
  VBN Tag          Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features compared to two of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                          Assigned Class
              All features      w/o caus        w/o anim
              E    A    O       E    A    O     E    A    O
  Actual  E   -    1    2       -    4    2     -    2    2
  Class   A   4    -    3       5    -    2     5    -    6
          O   5    3    -       4    5    -     3    5    -

Table 15
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                          Assigned Class
              All features    w/o trans     w/o pass      w/o vbn
              E   A   O       E   A   O     E   A   O     E   A   O
  Actual  E   -   1   2       -   2   2     -   1   3     -   1   5
  Class   A   4   -   3       3   -   7     1   -   4     2   -   4
          O   5   3   -       2   5   -     5   3   -     4   6   -

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
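The confusability of a class pair is thus the sum of the two off-diagonal cells; the matrix below encodes the all-features panel of Table 14:

```python
def confusability(matrix, c1, c2):
    """Total errors between two classes: c1 verbs assigned c2, plus
    c2 verbs assigned c1 (the two cells opposite the diagonal)."""
    return matrix[c1][c2] + matrix[c2][c1]

# All-features panel of Table 14 (rows: actual class; columns: assigned class).
all_features = {
    "E": {"E": 0, "A": 1, "O": 2},
    "A": {"E": 4, "A": 0, "O": 3},
    "O": {"E": 5, "A": 3, "O": 0},
}
```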

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features


we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives, and between unaccusatives and object-drops, is 9 and 7 errors, respectively; but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors of 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated to trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First, consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans, that is, also making a three-way distinction among the classes, although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method in evaluating our approach to verb classification.


Table 16
Percent agreement (Agr) and pair-wise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

           Program         E1              E2              E3
           Agr    K        Agr    K        Agr    K        Agr    K
  E1       59     .36
  E2       68     .50      75     .59
  E3       66     .49      70     .53      77     .66
  Levin    69.5   .54      71     .56      86.5   .80      83     .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments, which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results, provided by the non-forced-choice task, where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from Chicago University Press. The verbs were to be classified into the three target classes (unergative, unaccusative, and object-drop), which were described in the instructions.10 (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject, and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training (which yields an accuracy of 69.5%), because those results provide the classification for each of the individual verbs.


an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55-57) (using the z distribution to determine significance; p < .0001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
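The paper computes Kappa following Klauer (1987); as a simple illustration of agreement corrected for chance, here is the standard Cohen's Kappa for two annotators (a stand-in, not necessarily the exact formulation used):

```python
def cohens_kappa(pairs):
    """Cohen's Kappa for two annotators, given (label1, label2) pairs:
    (observed agreement - expected chance agreement) / (1 - expected)."""
    n = len(pairs)
    observed = sum(1 for a, b in pairs if a == b) / n
    labels = {l for pair in pairs for l in pair}
    p1 = {l: sum(1 for a, _ in pairs if a == l) / n for l in labels}
    p2 = {l: sum(1 for _, b in pairs if b == l) / n for l in labels}
    expected = sum(p1[l] * p2[l] for l in labels)
    return (observed - expected) / (1 - expected)
```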

Evaluating the experts' performance, summarized in Table 16, we can remark two things, which confirm our expectations. First, the task is difficult, i.e., not performed at 100% (or close) even by trained experts when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .60) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
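The majority-vote combination can be sketched directly; the fallback to the most accurate expert applies only when all three experts disagree, and the example labels are hypothetical:

```python
from collections import Counter

def majority_label(expert_votes, fallback):
    """Label assigned by at least two experts; if all three disagree,
    defer to the fallback (the most accurate expert's) label."""
    label, count = Counter(expert_votes).most_common(1)[0]
    return label if count >= 2 else fallback

# Hypothetical expert judgments for one verb.
combined = majority_label(["unacc", "unacc", "objdrop"], fallback="unerg")
```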

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task, 86.5%, as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than one of the human experts, in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.

One interesting question is whether experts and program disagree on the same verbs, and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard. However, in 9 cases, at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3) and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This


Merlo and Stevenson, Statistical Verb Classification

clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1,000), and those with log frequency over 3 (i.e., frequency over 1,000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops); the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops); and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote, verbs with frequency less than 100 yield an accuracy of 92% (K = .84); verbs with frequency between 100 and 1,000, accuracy 80% (K = .69); and verbs with frequency over 1,000, accuracy 90% (K = .82). For the "best" expert the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5% (K = .74); verbs with frequency between 100 and 1,000, accuracy 84% (K = .76); and verbs with frequency over 1,000, accuracy 90% (K = .82).
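The grouping used above can be expressed directly in terms of log (base 10) frequency. A minimal sketch of the binning, using the cutoffs stated in the text (the band names are ours):

```python
import math

def frequency_band(freq):
    """Assign a corpus frequency to a band by its log (base 10):
    below 2 (frequency < 100), between 2 and 3 (100 to 1,000),
    and above 3 (frequency > 1,000)."""
    logf = math.log10(freq)
    if logf < 2:
        return "low"
    if logf <= 3:
        return "mid"
    return "high"
```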

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest-frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs taken from Levin's (1993) electronic index, as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = .64) with each other, and 86% (K = .80) and 69% (K = .57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = .56] with each other, and 68% [K = .55] and 60.5% [K = .46] with the gold standard, respectively.) These results show that the forced-choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, providing novel insights, data, and methodology in some cases, and reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally in syntactic alternations, giving rise to differences in subcategorization frames, in the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs (their argument structure) are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989] and Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy of our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a


"change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998, 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993]; Sanfilippo and Poznanski [1992]; Manning [1993]; Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features that tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989) or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed which can work


Computational Linguistics, Volume 27, Number 3

on a rough collection of texts and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and the work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multi-lingual texts. Like ours, their work goes beyond statistics over subcategorizations, to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.


Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and a recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45% with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
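The balanced F-score referred to here is the harmonic mean of precision and recall; with the reported precision of 61% and recall of 36%, it comes out at roughly 45%. A worked check (not code from either paper):

```python
def f_score(precision, recall, beta=1.0):
    """F-measure; beta = 1 weights precision and recall equally
    (the 'balanced' case, the harmonic mean of the two)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Here `f_score(0.61, 0.36)` evaluates to about 0.45, the figure cited in the text.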

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, for any given verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition, we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and of Lapata and Brew (1999) in attempting to learn semantic notions from distributions of

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.


indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, and pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and of animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to the external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature), or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification (i.e., sets of verbs that have the same multi-way classifications according to Levin [1993]) can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction, or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and a learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of


the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.
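Both reductions in error rate follow from the same calculation, relative either to perfect accuracy or to the expert-based upper bound of 86.5%. A worked check of the figures above (not code from the paper):

```python
def error_reduction(accuracy, baseline, upper_bound=1.0):
    """Proportion of the error between the baseline and the upper
    bound that the classifier eliminates."""
    return (accuracy - baseline) / (upper_bound - baseline)
```

With accuracy 69.8% and chance baseline 33.9%, `error_reduction(0.698, 0.339)` gives about 0.54, and `error_reduction(0.698, 0.339, 0.865)` about 0.68, matching the 54% and 68% figures cited in the text.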

Acknowledgments

We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM), the United States NSF (grants 9702331 and 9818322 to SS), the Canadian NSERC (grant to SS), the University of Toronto, and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions, and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.


Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe Zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: An Overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1–2):217–222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.


Computational Linguistics Volume 27 Number 3

Pinker, Steven. 1989. Learnability and Cognition: The Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.


Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs
             Freq    vbn    pass   trans  caus   anim
Min Value       8    0.00   0.00   0.00   0.00   0.00
Max Value    4088    1.00   0.39   0.74   0.00   1.00

floated       176    0.43   0.26   0.74   0.00   0.17
hurried        86    0.40   0.31   0.50   0.00   0.37
jumped       4088    0.09   0.00   0.03   0.00   0.20
leaped        225    0.09   0.00   0.05   0.00   0.13
marched       238    0.10   0.01   0.09   0.00   0.12
paraded        33    0.73   0.39   0.46   0.00   0.50
raced         123    0.01   0.00   0.06   0.00   0.15
rushed        467    0.22   0.12   0.20   0.00   0.10
vaulted        54    0.00   0.00   0.41   0.00   0.03
wandered       67    0.02   0.00   0.03   0.00   0.32
galloped       12    1.00   0.00   0.00   0.00   0.00
glided         14    0.00   0.00   0.08   0.00   0.50
hiked          25    0.28   0.12   0.29   0.00   0.40
hopped         29    0.00   0.00   0.21   0.00   1.00
jogged          8    0.29   0.00   0.29   0.00   0.33
scooted        10    0.00   0.00   0.43   0.00   0.00
scurried       21    0.00   0.00   0.00   0.00   0.14
skipped        82    0.22   0.02   0.64   0.00   0.16
tiptoed        12    0.17   0.00   0.00   0.00   0.00
trotted        37    0.19   0.17   0.07   0.00   0.18

Unaccusative Verbs
             Freq    vbn    pass   trans  caus   anim
Min Value      13    0.16   0.00   0.02   0.00   0.00
Max Value    5543    0.95   0.80   0.76   0.41   0.36

boiled         58    0.92   0.70   0.42   0.00   0.00
cracked       175    0.61   0.19   0.76   0.02   0.14
dissolved     226    0.51   0.58   0.71   0.05   0.11
exploded      409    0.34   0.02   0.66   0.37   0.04
flooded       235    0.47   0.57   0.44   0.04   0.03
fractured      55    0.95   0.76   0.51   0.00   0.00
hardened      123    0.92   0.55   0.56   0.12   0.00
melted         70    0.80   0.44   0.02   0.00   0.19
opened       3412    0.21   0.09   0.69   0.16   0.36
solidified     34    0.65   0.21   0.68   0.00   0.12
collapsed     950    0.16   0.00   0.16   0.01   0.02
cooled        232    0.85   0.21   0.29   0.13   0.11
folded        189    0.73   0.33   0.23   0.00   0.00
widened      1155    0.18   0.02   0.13   0.41   0.01
changed      5543    0.73   0.23   0.47   0.22   0.08
cleared      1145    0.58   0.40   0.50   0.31   0.06
divided      1539    0.93   0.80   0.17   0.10   0.05
simmered       13    0.83   0.00   0.09   0.00   0.00
stabilized    286    0.92   0.13   0.18   0.35   0.00


Object-Drop Verbs
             Freq    vbn    pass   trans  caus   anim
Min Value      39    0.10   0.04   0.21   0.00   0.00
Max Value   15063    0.95   0.99   1.00   0.24   0.42

carved        185    0.85   0.66   0.98   0.00   0.00
danced         88    0.22   0.14   0.37   0.00   0.00
kicked        308    0.30   0.18   0.97   0.00   0.33
knitted        39    0.95   0.99   0.93   0.00   0.00
painted       506    0.72   0.18   0.71   0.00   0.38
played       2689    0.38   0.16   0.24   0.00   0.00
reaped        172    0.56   0.05   0.90   0.00   0.22
typed          57    0.81   0.74   0.81   0.00   0.00
washed        137    0.79   0.60   1.00   0.00   0.00
yelled         74    0.10   0.04   0.38   0.00   0.00
borrowed     1188    0.77   0.15   0.60   0.13   0.19
inherited     357    0.60   0.13   0.64   0.06   0.32
organized    1504    0.85   0.38   0.65   0.18   0.07
rented        232    0.72   0.22   0.61   0.00   0.42
sketched       44    0.67   0.17   0.44   0.00   0.20
cleaned       160    0.83   0.47   0.21   0.05   0.21
packed        376    0.84   0.12   0.40   0.05   0.19
studied       901    0.66   0.17   0.57   0.05   0.11
swallowed     152    0.79   0.44   0.35   0.04   0.22
called      15063    0.56   0.22   0.72   0.24   0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.

Accuracy   SE    Features

69.8       0.5   trans pass vbn caus anim
69.8       0.5   trans vbn caus anim
67.3       0.6   trans pass vbn anim
66.7       0.5   trans vbn anim
66.5       0.5   trans pass caus anim
64.4       0.5   trans vbn caus
63.2       0.6   trans pass vbn caus
63.0       0.5   trans pass anim
62.9       0.4   trans caus anim
62.1       0.5   caus anim
61.7       0.5   trans pass caus
61.6       0.6   pass vbn caus anim
60.1       0.4   vbn caus anim
59.5       0.6   trans anim
59.4       0.5   vbn anim
57.4       0.6   pass vbn caus
57.3       0.5   trans caus
57.3       0.5   pass vbn anim
56.7       0.5   pass caus anim
55.7       0.5   vbn caus
55.7       0.1   caus
55.4       0.4   pass caus
55.0       0.6   trans pass vbn
54.7       0.4   trans pass
54.2       0.5   trans vbn
52.5       0.5   vbn
50.9       0.5   pass anim
50.2       0.6   pass vbn
50.2       0.5   pass
47.1       0.4   trans
35.3       0.5   anim
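The overlap test just described can be written out concretely. The sketch below is illustrative (not the authors' code); the two comparisons use accuracies and standard errors taken from the listing above.

```python
# Sketch of the confidence-interval overlap test described in Appendix B.
# A result is "acc +/- t * SE", with t = 2.01 (two-tailed, df = 49).

T_CRIT = 2.01  # t value for a 95% confidence interval with df = 49

def conf_interval(acc, se, t=T_CRIT):
    """Return the (low, high) 95% confidence range for an accuracy."""
    return (acc - t * se, acc + t * se)

def is_significant(acc1, se1, acc2, se2):
    """Two results differ significantly (p < .05) iff their ranges do not overlap."""
    lo1, hi1 = conf_interval(acc1, se1)
    lo2, hi2 = conf_interval(acc2, se2)
    return hi1 < lo2 or hi2 < lo1

# Full feature set (69.8 +/- 0.5) vs. the set without caus (67.3 +/- 0.6):
print(is_significant(69.8, 0.5, 67.3, 0.6))  # ranges do not overlap -> True

# Full feature set vs. the set without pass (69.8 +/- 0.5): same accuracy,
print(is_significant(69.8, 0.5, 69.8, 0.5))  # ranges overlap -> False
```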


not likely to be responsible for the pattern of data in Table 6 above, which generally matches our predictions. Furthermore, the errors, at least for this random sample of verbs, occur in a direction that makes our task of distinguishing the classes more difficult, and indicates that developing more accurate search patterns may possibly sharpen the class distinctions and improve the classification performance.

4. Experiments in Classification

In this section we turn to our computational experiments that investigate whether the statistical indicators of thematic properties that we have developed can in fact be used to classify verbs. Recall that the task we have set ourselves is that of automatically learning the best class for a set of usages of a verb, as opposed to classifying individual occurrences of the verb. The frequency distributions of our features yield a vector for each verb that represents the estimated values for the verb on each dimension across the entire corpus.

Vector template: [verb-name trans pass vbn caus anim class]

Example: [opened 0.69 0.09 0.21 0.16 0.36 unacc]

The resulting set of 59 vectors constitutes the data for our machine learning experiments. We use this data to train an automatic classifier to determine, given the feature values for a new verb (not from the training set), which of the three major classes of English optionally intransitive verbs it belongs to.
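For concreteness, such a verb vector can be represented as follows. This is an illustrative sketch, not the authors' data format; the field names mirror the template above (`pass` is renamed `pass_` only because `pass` is a Python keyword), and the numbers are the normalized feature values from Appendix A for opened.

```python
from dataclasses import dataclass

@dataclass
class VerbVector:
    """One data point for the learner: five normalized feature values plus a class label."""
    verb: str
    trans: float   # proportion of transitive uses
    pass_: float   # proportion of passive uses ("pass" is a Python keyword)
    vbn: float     # proportion of VBN-tagged uses
    caus: float    # causativity measure
    anim: float    # animacy-of-subject measure
    cls: str       # one of "unerg", "unacc", "objdrop"

# The example vector above: [opened 0.69 0.09 0.21 0.16 0.36 unacc]
opened = VerbVector("opened", 0.69, 0.09, 0.21, 0.16, 0.36, "unacc")
print(opened.cls)  # unacc
```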

4.1 Experimental Methodology

In pilot experiments on a subset of the features, we investigated a number of supervised machine learning methods that produce automatic classifiers (decision tree induction, rule learning, and two types of neural networks), as well as hierarchical clustering; see Stevenson et al. (1999) for more detail. Because we achieved approximately the same level of performance in all cases, we narrowed our further experimentation to the publicly available version of the C5.0 machine learning system (http://www.rulequest.com), a newer version of C4.5 (Quinlan 1992), due to its ease of use and wide availability. The C5.0 system generates both decision trees and corresponding rule sets from a training set of known classifications. In our experiments we found little to no difference in performance between the trees and rule sets, and report only the rule set results.

In the experiments below we follow two methodologies in training and testing, each of which tests a subset of cases held out from the training data. Thus in all cases the results we report are on test data that was never seen in training.7

The first training and testing methodology we follow is 10-fold cross-validation. In this approach, the system randomly divides the data into ten parts and runs ten times on a different 90%-training-data/10%-test-data split, yielding an average accuracy and standard error across the ten test sets. This training methodology is very useful for our application, as it yields performance measures across a large number of training data/test data sets, avoiding the problems of outliers in a single random selection from a relatively small data set such as ours.

7 One anonymous reviewer raised the concern that we do not test on verbs that were unseen by the authors prior to finalizing the specific features to count. However, this does not reduce the generality of our results. The features we use are motivated by linguistic theory and derived from the set of thematic properties that discriminate the verb classes. It is therefore very unlikely that they are skewed to the particular verbs we have chosen. Furthermore, our cross-validation experiments, described in the next subsection, show that our results hold across a very large number of randomly selected subsets of this sample of verbs.

The second methodology is a single hold-out training and testing approach. Here the system is run N times, where N is the size of the data set (i.e., the 59 verbs in our case), each time holding out a single data vector as the test case and using the remaining N-1 vectors as the training set. The single hold-out methodology yields an overall accuracy rate (when the results are averaged across all N trials), but also, unlike cross-validation, gives us classification results on each individual data vector. This property enables us to analyze differential performance on the individual verbs and across the different verb classes.
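Again as an illustrative sketch (with the same trivial majority-class stand-in rather than C5.0), the single hold-out loop looks like this; its value is that it records an assigned class for every individual verb.

```python
def single_hold_out(data, train_and_classify):
    """Leave-one-out: train on N-1 vectors, classify the held-out one, N times."""
    per_item = []  # (held-out item, assigned class) for each trial
    for i, held_out in enumerate(data):
        train = data[:i] + data[i + 1:]
        per_item.append((held_out, train_and_classify(train, held_out)))
    accuracy = sum(item[-1] == assigned for item, assigned in per_item) / len(data)
    return accuracy, per_item

def majority_classifier(train, item):
    """Stand-in classifier: predict the most frequent training class."""
    labels = [cls for _, cls in train]
    return max(set(labels), key=labels.count)

verbs = [(f"v{i}", c) for i, c in
         enumerate(["unerg"] * 20 + ["unacc"] * 19 + ["objdrop"] * 20)]
acc, results = single_hold_out(verbs, majority_classifier)
print(len(results))  # 59 individual classifications, one per verb
```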

Under both training and testing methodologies, the baseline (chance) performance in this task, a three-way classification, is 33.9%. In the single hold-out methodology there are 59 test cases, with 20, 19, and 20 verbs each from the unergative, unaccusative, and object-drop classes, respectively. Chance performance of picking a single class label as a default and assigning it to all cases would yield at most 20 out of the 59 cases correct, or 33.9%. For the cross-validation methodology, the determination of a baseline is slightly more complex, as we are testing on a random selection of 10% of the full data set in each run. The 33.9% figure represents the expected relative proportion of a test set that would be labelled correctly by assignment of a default class label to the entire test set. Although the precise make-up of the test cases varies, on average the test set will represent the class membership proportions of the entire set of verbs. Thus, as with the single hold-out approach, chance accuracy corresponds to a maximum of 20/59, or 33.9%, of the test set being labelled correctly.

The theoretical maximum accuracy for the task is of course 100%, although in Section 5 we discuss some classification results from human experts that indicate that a more realistic expectation is much lower (around 87%).

4.2 Results Using 10-Fold Cross-Validation

We first report the results of experiments using a training methodology of 10-fold cross-validation repeated 50 times. This means that the 10-fold cross-validation procedure is repeated for 50 different random divisions of the data. The numbers reported are the averages of the results over all the trials. That is, the average accuracy and standard error from each random division of the data (a single cross-validation run including 10 training and test sets) are averaged across the 50 different random divisions. This large number of experimental trials gives us a very tight bound on the mean accuracy reported, enabling us to determine with high confidence the statistical significance of differences in results.

Table 8 shows that performance of classification using individual features varies greatly, from little above the baseline to almost 22% above the baseline, or a reduction of a third of the error rate, a very good result for a single feature. (All reported accuracies in Table 8 are statistically distinct at the p < .01 level, using an ANOVA [df = 249, F = 334.72] with a Tukey-Kramer post test.)

The first line of Table 9 shows that the combination of all features achieves an accuracy of 69.8%, which is 35.9% over the baseline, for a reduction in the error rate of 54%. This is a rather considerable result, given the very low baseline (33.9%). Moreover, recall that our training and testing sets are always disjoint (cf. Lapata and Brew [1999], Siegel [1999]); in other words, we are predicting the classification of verbs that were never seen in the training corpus, the hardest situation for a classification algorithm.

The second through sixth lines of Table 9 show the accuracy achieved on each subset of features that results from removing a single feature. This allows us to evaluate


Table 8
Percent accuracy and standard error of the verb classification task using each feature individually, under a training methodology of 10-fold cross-validation repeated 50 times.

Feature   Accuracy   SE

caus      55.7       0.1
vbn       52.5       0.5
pass      50.2       0.5
trans     47.1       0.4
anim      35.3       0.5

Table 9
Percent accuracy and standard error of the verb classification task using features in combination, under a training methodology of 10-fold cross-validation repeated 50 times.

   Features Used               Feature Not Used    Accuracy   SE

1  trans pass vbn caus anim    --                  69.8       0.5
2  trans vbn caus anim         pass                69.8       0.5
3  trans pass vbn anim         caus                67.3       0.6
4  trans pass caus anim        vbn                 66.5       0.5
5  trans pass vbn caus         anim                63.2       0.6
6  pass vbn caus anim          trans               61.6       0.6

the contribution of each feature to the performance of the classification process, by comparing the performance of the subset without it to the performance using the full set of features. We see that the removal of pass (second line) has no effect on the results, while removal of the remaining features yields a 2–8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant at the p < .05 level, except for between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 37.52] with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted by the individual feature behavior. For example, caus, which is the best individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, anim and trans, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.

The general behaviour in classification based on individual features and on size 4 and size 5 subsets of features is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance. Table 10 summarizes this information. In the first data column, the table illustrates the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size n subsets in which each feature occurs. For example, the second data cell in the second row (54.9) indicates the average accuracy of all subsets of size 2 that contain the feature vbn. The last row of the table indicates, for each feature, the average accuracy of all subsets containing that feature.


Table 10
Average percent accuracy of feature subsets, by subset size and by sets of each size including each feature.

                             Mean Accuracy of Subsets that Include Each Feature
Subset   Mean Accuracy
Size     by Subset Size      vbn    pass   trans   anim   caus

1        48.2                52.5   50.2   47.1    35.3   55.7
2        55.1                54.9   52.8   56.4    58.0   57.6
3        60.5                60.1   58.5   62.3    61.1   60.5
4        65.7                65.5   64.7   66.7    66.3   65.3
5        69.8                69.8   69.8   69.8    69.8   69.8

Mean Acc/Feature             60.6   59.2   60.5    58.1   61.8

The first observation, that more features perform better, is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that on average larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation, that the performance of individual features is not always a predictor of their performance in combination, is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above average performance for all subsets of size 2 or greater.
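These per-feature averages can be recomputed directly from the Appendix B listing; the sketch below (illustrative, not the authors' code) derives the per-feature, per-size means of Table 10 (e.g., 52.5 for vbn alone, 52.8 for size-2 subsets containing pass) and the table's last row, taken here as the mean over the five per-size means.

```python
from statistics import mean

# The 31 (accuracy, feature subset) pairs from Appendix B.
SUBSETS = [
    (69.8, "trans pass vbn caus anim"), (69.8, "trans vbn caus anim"),
    (67.3, "trans pass vbn anim"), (66.7, "trans vbn anim"),
    (66.5, "trans pass caus anim"), (64.4, "trans vbn caus"),
    (63.2, "trans pass vbn caus"), (63.0, "trans pass anim"),
    (62.9, "trans caus anim"), (62.1, "caus anim"),
    (61.7, "trans pass caus"), (61.6, "pass vbn caus anim"),
    (60.1, "vbn caus anim"), (59.5, "trans anim"),
    (59.4, "vbn anim"), (57.4, "pass vbn caus"),
    (57.3, "trans caus"), (57.3, "pass vbn anim"),
    (56.7, "pass caus anim"), (55.7, "vbn caus"),
    (55.7, "caus"), (55.4, "pass caus"),
    (55.0, "trans pass vbn"), (54.7, "trans pass"),
    (54.2, "trans vbn"), (52.5, "vbn"),
    (50.9, "pass anim"), (50.2, "pass vbn"),
    (50.2, "pass"), (47.1, "trans"), (35.3, "anim"),
]

def mean_for(feature, size=None):
    """Mean accuracy of all subsets containing `feature` (optionally of one size)."""
    accs = [acc for acc, feats in SUBSETS
            if feature in feats.split() and (size is None or len(feats.split()) == size)]
    return round(mean(accs), 1)

def feature_row_mean(feature):
    """Last row of Table 10: mean over subset sizes of the per-size means."""
    return round(mean(mean_for(feature, size=s) for s in range(1, 6)), 1)

print(mean_for("vbn", size=1))   # 52.5
print(mean_for("pass", size=2))  # 52.8
print(feature_row_mean("caus"))  # 61.8
```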

4.3 Results Using Single Hold-Out Methodology

One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system, but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, which we turn to in Section 4.4.

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance; in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being


Table 11
Percent accuracy of the verb classification task using features in combination, under a single hold-out training methodology.

   Features Used               Feature Not Used    Accuracy on All Verbs

1  trans pass vbn caus anim    --                  69.5
2  trans vbn caus anim         pass                71.2
3  trans pass vbn anim         caus                62.7
4  trans pass caus anim        vbn                 61.0
5  trans pass vbn caus         anim                61.0
6  pass vbn caus anim          trans               64.4

Table 12
F score of classification within each class, under a single hold-out training methodology.

   Features Used               Feature    F score (%)   F score (%)   F score (%)
                               Not Used   for Unergs    for Unaccs    for Objdrops

1  trans pass vbn caus anim    --         73.9          68.6          64.9
2  trans vbn caus anim         pass       76.2          75.7          61.6
3  trans pass vbn anim         caus       65.1          60.0          62.8
4  trans pass caus anim        vbn        66.7          65.0          51.3
5  trans pass vbn caus         anim       72.7          47.0          60.0
6  pass vbn caus anim          trans      78.1          51.5          61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5–8% negative effect on performance, again similar to the cross-validation results. (Although note that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9%) than the unaccusative and object-drop verbs (F scores of 68.6% and 64.9%, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs; that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence, here we use the balanced F score, which calculates an overall measure of performance as 2PR / (P + R), in which P (precision) is truePositives / (truePositives + falsePositives) and R (recall) is truePositives / (truePositives + falseNegatives).
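The formulas in footnote 8 can be checked numerically. In the sketch below, the counts are read off the "all features" panel of Table 14 for the unergative class: 20 unergatives with 3 misclassified gives 17 true positives and 3 false negatives, and the 4 unaccusatives plus 5 object-drops labelled unergative give 9 false positives.

```python
def f_score(tp, fp, fn):
    """Balanced F score: harmonic mean of precision and recall (footnote 8)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Unergative class, "all features" panel of Table 14:
print(round(100 * f_score(tp=17, fp=9, fn=3), 1))  # 73.9, as in line 1 of Table 12
```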


frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8) but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification

We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans, pass, vbn, caus, anim) with the class labels assigned using each size 4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature in going from the four-feature to five-feature set. Thus we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2 and summarized here in Table 13.

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors, but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently "uncontrolled", and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.


Table 13
Expected class discriminations for each feature.

Feature          Expected Frequency Pattern

Transitivity     Unerg < Unacc < ObjDrop

Causativity      Unerg, ObjDrop < Unacc

Animacy          Unacc < Unerg, ObjDrop

Passive Voice    Unerg < Unacc < ObjDrop

VBN Tag          Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features compared to two of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops.

                               Assigned Class

                 All features      w/o caus         w/o anim
                  E    A    O      E    A    O      E    A    O

Actual    E       -    1    2      -    4    2      -    2    2
Class     A       4    -    3      5    -    2      5    -    6
          O       5    3    -      4    5    -      3    5    -

Table 15
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops.

                                        Assigned Class

                 All features      w/o trans        w/o pass         w/o vbn
                  E    A    O      E    A    O      E    A    O      E    A    O

Actual    E       -    1    2      -    2    2      -    1    3      -    1    5
Class     A       4    -    3      3    -    7      1    -    4      2    -    4
          O       5    3    -      2    5    -      5    3    -      4    6    -

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
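The confusability computation just described can be sketched directly from the "all features" panel of Table 14 (an illustrative sketch, not the authors' code): for classes X and Y, add the errors X-labelled-as-Y to the errors Y-labelled-as-X.

```python
# errors[actual][assigned] for the "all features" panel of Table 14;
# diagonal cells (correct classifications) are not part of the error matrix.
ALL_FEATURES = {
    "E": {"A": 1, "O": 2},   # E = unergative
    "A": {"E": 4, "O": 3},   # A = unaccusative
    "O": {"E": 5, "A": 3},   # O = object-drop
}

def confusability(errors, x, y):
    """Sum of the two off-diagonal cells for the class pair (x, y)."""
    return errors[x][y] + errors[y][x]

print(confusability(ALL_FEATURES, "E", "A"))  # 5: unergative vs. unaccusative
print(confusability(ALL_FEATURES, "A", "O"))  # 6: unaccusative vs. object-drop
print(confusability(ALL_FEATURES, "E", "O"))  # 7: unergative vs. object-drop
```

These three values are exactly the five-feature confusabilities discussed in the surrounding text.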

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features


Computational Linguistics Volume 27 Number 3

we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives, and between unaccusatives and object-drops, is 9 and 7 errors, respectively; but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added: the confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors from 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated with trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans—that is, also making a three-way distinction among the classes—although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method in evaluating our approach to verb classification.


Table 16
Percent agreement (Agr) and pairwise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

           Program         E1            E2            E3
         Agr     K      Agr    K      Agr    K      Agr    K
E1       59%    .36
E2       68%    .50     75%   .59
E3       66%    .49     70%   .53    77%   .66
Levin    69.5%  .54     71%   .56    86.5%  .80    83%   .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results—provided by the non-forced-choice task—where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index available from Chicago University Press. The verbs were to be classified into the three target classes—unergative, unaccusative, and object-drop—which were described in the instructions.10 (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html.)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training—which yields an accuracy of 69.5%—because those results provide the classification for each of the individual verbs.


an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55–57) (using the z distribution to determine significance; p < .0001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
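The computation can be sketched as follows (a standard two-annotator Cohen-style kappa, for illustration only; the paper's exact formulation follows Klauer [1987] and may differ in detail):

```python
from collections import Counter

def kappa(labels_a, labels_b):
    """Agreement of two annotators over chance.

    Expected (chance) agreement depends on each annotator's own
    label proportions, so two pairs with identical raw agreement
    can have different kappa values."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    prop_a = Counter(labels_a)
    prop_b = Counter(labels_b)
    # Chance agreement: sum over classes of p_a(c) * p_b(c).
    expected = sum((prop_a[c] / n) * (prop_b[c] / n) for c in prop_a)
    return (observed - expected) / (1 - expected)
```

Perfect agreement yields 1.0; agreement at chance level yields 0.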

Evaluating the experts' performance summarized in Table 16, we can remark two things, which confirm our expectations. First, the task is difficult—i.e., not performed at 100% (or close) even by trained experts when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .6) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
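The combination scheme can be sketched as (an illustrative helper, not from the original paper; `best_expert_index` stands in for the position of the most accurate expert's label):

```python
from collections import Counter

def combined_label(expert_labels, best_expert_index):
    """Majority vote over the experts' labels for one verb: the
    label given by at least two experts wins; if all experts
    disagree, fall back to the most accurate expert's label."""
    label, count = Counter(expert_labels).most_common(1)[0]
    if count >= 2:
        return label
    return expert_labels[best_expert_index]
```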

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task—86.5%—as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than that of one of the human experts in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.

One interesting question is whether experts and program disagree on the same verbs and show similar patterns of errors. The program makes 18 errors in total compared to the gold standard. However, in 9 cases at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3) and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This


clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1,000), and those with log frequency over 3 (i.e., frequency over 1,000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops), the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops), and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote, verbs with frequency less than 100 yield an accuracy of 92% (K = .84); verbs with frequency between 100 and 1,000, an accuracy of 80% (K = .69); and verbs with frequency over 1,000, an accuracy of 90% (K = .82). For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5% (K = .74); verbs with frequency between 100 and 1,000, an accuracy of 84% (K = .76); and verbs with frequency over 1,000, an accuracy of 90% (K = .82).
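The binning itself is simple (a sketch, not from the original paper; the placement of verbs whose frequency is exactly 100 or 1,000 is our assumption, as the text does not specify boundary handling):

```python
import math

def frequency_band(count):
    """Assign a verb to a log-frequency band: 'low' if log10 < 2
    (frequency below 100), 'mid' if 2 <= log10 <= 3 (frequency
    100 to 1,000), 'high' if log10 > 3 (frequency above 1,000)."""
    logf = math.log10(count)
    if logf < 2:
        return "low"
    if logf <= 3:
        return "mid"
    return "high"
```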

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest-frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs, taken from Levin's (1993) electronic index as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = .64) with each other, and 86% (K = .80) and 69% (K = .57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = .56] with each other, and 68% [K = .55] and 60.5% [K = .46] with the gold standard, respectively.) These results show that the forced-choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally, in syntactic alternations giving rise to differences in subcategorization frames or the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs—their argument structure—are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular Pinker [1989]; Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a


"change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples, meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998, 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993]; Sanfilippo and Poznanski [1992]; Manning [1993]; Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that are easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed, which can work


on a rough collection of texts, and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and the work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly, as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes—more similar to our situation, in that subcategorization alone cannot distinguish the classes—they achieve 83.9% accuracy, compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multi-lingual texts. Like ours, their work goes beyond statistics over subcategorizations, to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.


Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45%, with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
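The F-score calculation referred to here is the balanced harmonic mean of precision and recall, which can be checked directly:

```python
def f_score(precision, recall):
    """Balanced F-score (F1): the harmonic mean of precision and
    recall, weighting the two equally."""
    return 2 * precision * recall / (precision + recall)

# Schulte im Walde's reported precision (61%) and recall (36%)
# give an F-score of roughly 45%.
```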

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition, we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words—i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and of Lapata and Brew (1999) in attempting to learn semantic notions from distributions of

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.


indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomous distinctions studied by Siegel. Like Lapata and Brew, three of our indicators—trans, vbn, pass—are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators—caus and anim—can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification—using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and of the animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to the external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature) or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).
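As a concrete illustration, the pronoun-based approximation of subject animacy can be sketched as follows. This is a minimal sketch under our assumptions: the pronoun inventory and the list of extracted subject heads below are hypothetical, and the actual feature is computed over automatically parsed corpus data.

```python
# Sketch of an anim-style feature: the proportion of a verb's subjects
# that are personal pronouns, used as a stand-in for subject animacy
# without consulting an external semantic lexicon.
PERSONAL_PRONOUNS = {"i", "we", "you", "he", "she", "they"}  # assumed inventory

def animacy_estimate(subjects):
    """Fraction of observed subject heads that are personal pronouns."""
    if not subjects:
        return 0.0
    animate = sum(1 for s in subjects if s.lower() in PERSONAL_PRONOUNS)
    return animate / len(subjects)

# Hypothetical subject heads collected for one verb across a corpus:
print(animacy_estimate(["He", "the door", "she", "they", "the price"]))  # 0.6
```

The key design point is that no semantic resource is consulted: pronouns are a closed class, so the count is cheap and robust.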

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus, her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification, i.e., sets of verbs that have the same multi-way classifications according to Levin (1993), can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999], Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large, syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach, it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of


the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.

Acknowledgments
We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM), the United States NSF (grants 9702331 and 9818322 to SS), the Canadian NSERC (grant to SS), the University of Toronto, and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15. University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: the Kappa statistic. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.


Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: An overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1–2):217–222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.


Pinker, Steven. 1989. Learnability and Cognition: The Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.


Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs
            Freq   vbn    pass   trans  caus   anim

Min Value      8   0.00   0.00   0.00   0.00   0.00
Max Value   4088   1.00   0.39   0.74   0.00   1.00

floated      176   0.43   0.26   0.74   0.00   0.17
hurried       86   0.40   0.31   0.50   0.00   0.37
jumped      4088   0.09   0.00   0.03   0.00   0.20
leaped       225   0.09   0.00   0.05   0.00   0.13
marched      238   0.10   0.01   0.09   0.00   0.12
paraded       33   0.73   0.39   0.46   0.00   0.50
raced        123   0.01   0.00   0.06   0.00   0.15
rushed       467   0.22   0.12   0.20   0.00   0.10
vaulted       54   0.00   0.00   0.41   0.00   0.03
wandered      67   0.02   0.00   0.03   0.00   0.32
galloped      12   1.00   0.00   0.00   0.00   0.00
glided        14   0.00   0.00   0.08   0.00   0.50
hiked         25   0.28   0.12   0.29   0.00   0.40
hopped        29   0.00   0.00   0.21   0.00   1.00
jogged         8   0.29   0.00   0.29   0.00   0.33
scooted       10   0.00   0.00   0.43   0.00   0.00
scurried      21   0.00   0.00   0.00   0.00   0.14
skipped       82   0.22   0.02   0.64   0.00   0.16
tiptoed       12   0.17   0.00   0.00   0.00   0.00
trotted       37   0.19   0.17   0.07   0.00   0.18

Unaccusative Verbs
            Freq   vbn    pass   trans  caus   anim

Min Value     13   0.16   0.00   0.02   0.00   0.00
Max Value   5543   0.95   0.80   0.76   0.41   0.36

boiled        58   0.92   0.70   0.42   0.00   0.00
cracked      175   0.61   0.19   0.76   0.02   0.14
dissolved    226   0.51   0.58   0.71   0.05   0.11
exploded     409   0.34   0.02   0.66   0.37   0.04
flooded      235   0.47   0.57   0.44   0.04   0.03
fractured     55   0.95   0.76   0.51   0.00   0.00
hardened     123   0.92   0.55   0.56   0.12   0.00
melted        70   0.80   0.44   0.02   0.00   0.19
opened      3412   0.21   0.09   0.69   0.16   0.36
solidified    34   0.65   0.21   0.68   0.00   0.12
collapsed    950   0.16   0.00   0.16   0.01   0.02
cooled       232   0.85   0.21   0.29   0.13   0.11
folded       189   0.73   0.33   0.23   0.00   0.00
widened     1155   0.18   0.02   0.13   0.41   0.01
changed     5543   0.73   0.23   0.47   0.22   0.08
cleared     1145   0.58   0.40   0.50   0.31   0.06
divided     1539   0.93   0.80   0.17   0.10   0.05
simmered      13   0.83   0.00   0.09   0.00   0.00
stabilized   286   0.92   0.13   0.18   0.35   0.00


Object-Drop Verbs
            Freq   vbn    pass   trans  caus   anim

Min Value     39   0.10   0.04   0.21   0.00   0.00
Max Value  15063   0.95   0.99   1.00   0.24   0.42

carved       185   0.85   0.66   0.98   0.00   0.00
danced        88   0.22   0.14   0.37   0.00   0.00
kicked       308   0.30   0.18   0.97   0.00   0.33
knitted       39   0.95   0.99   0.93   0.00   0.00
painted      506   0.72   0.18   0.71   0.00   0.38
played      2689   0.38   0.16   0.24   0.00   0.00
reaped       172   0.56   0.05   0.90   0.00   0.22
typed         57   0.81   0.74   0.81   0.00   0.00
washed       137   0.79   0.60   1.00   0.00   0.00
yelled        74   0.10   0.04   0.38   0.00   0.00
borrowed    1188   0.77   0.15   0.60   0.13   0.19
inherited    357   0.60   0.13   0.64   0.06   0.32
organized   1504   0.85   0.38   0.65   0.18   0.07
rented       232   0.72   0.22   0.61   0.00   0.42
sketched      44   0.67   0.17   0.44   0.00   0.20
cleaned      160   0.83   0.47   0.21   0.05   0.21
packed       376   0.84   0.12   0.40   0.05   0.19
studied      901   0.66   0.17   0.57   0.05   0.11
swallowed    152   0.79   0.44   0.35   0.04   0.22
called     15063   0.56   0.22   0.72   0.24   0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.

Accuracy  SE   Features                    Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim    57.3      0.5  trans caus
69.8      0.5  trans vbn caus anim         57.3      0.5  pass vbn anim
67.3      0.6  trans pass vbn anim         56.7      0.5  pass caus anim
66.7      0.5  trans vbn anim              55.7      0.5  vbn caus
66.5      0.5  trans pass caus anim        55.7      0.1  caus
64.4      0.5  trans vbn caus              55.4      0.4  pass caus
63.2      0.6  trans pass vbn caus         55.0      0.6  trans pass vbn
63.0      0.5  trans pass anim             54.7      0.4  trans pass
62.9      0.4  trans caus anim             54.2      0.5  trans vbn
62.1      0.5  caus anim                   52.5      0.5  vbn
61.7      0.5  trans pass caus             50.9      0.5  pass anim
61.6      0.6  pass vbn caus anim          50.2      0.6  pass vbn
60.1      0.4  vbn caus anim               50.2      0.5  pass
59.5      0.6  trans anim                  47.1      0.4  trans
59.4      0.5  vbn anim                    35.3      0.5  anim
57.4      0.6  pass vbn caus
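The overlap check described above can be sketched as follows; a minimal sketch, where the constant 2.01 is the t value for df = 49 quoted in the text, and the example accuracies are taken from the table.

```python
# Two accuracies differ at p < .05 only if their 95% confidence
# intervals (accuracy +/- 2.01 * SE, with t = 2.01 for df = 49) are disjoint.
T_VALUE = 2.01

def confidence_interval(accuracy, stderr):
    half_width = T_VALUE * stderr
    return (accuracy - half_width, accuracy + half_width)

def significantly_different(acc1, se1, acc2, se2):
    lo1, hi1 = confidence_interval(acc1, se1)
    lo2, hi2 = confidence_interval(acc2, se2)
    return hi1 < lo2 or hi2 < lo1  # True when the intervals do not overlap

# Full feature set (69.8, SE 0.5) vs. trans alone (47.1, SE 0.4):
print(significantly_different(69.8, 0.5, 47.1, 0.4))  # True
# caus alone (55.7, SE 0.1) vs. pass+caus (55.4, SE 0.4): intervals overlap.
print(significantly_different(55.7, 0.1, 55.4, 0.4))  # False
```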


our application, as it yields performance measures across a large number of training data/test data sets, avoiding the problems of outliers in a single random selection from a relatively small data set such as ours.

The second methodology is a single hold-out training and testing approach. Here, the system is run N times, where N is the size of the data set (i.e., the 59 verbs in our case), each time holding out a single data vector as the test case and using the remaining N-1 vectors as the training set. The single hold-out methodology yields an overall accuracy rate (when the results are averaged across all N trials), but also, unlike cross-validation, gives us classification results on each individual data vector. This property enables us to analyze differential performance on the individual verbs and across the different verb classes.
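The single hold-out regime can be sketched as follows. This is a minimal sketch: a nearest-neighbour learner stands in for the C5.0 decision-tree system actually used, and the toy two-feature vectors are hypothetical stand-ins for the 59 five-feature verb vectors.

```python
# Single hold-out (leave-one-out): train on N-1 vectors, test on the
# held-out one, and record the assigned class for every item.
def nearest_neighbour(train, test_vec):
    """Stand-in classifier: label of the closest training vector."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda item: dist(item[0], test_vec))[1]

def leave_one_out(data):
    """data: list of (feature_vector, class_label) pairs.
    Returns overall accuracy plus the per-item predictions that this
    regime (unlike cross-validation) makes available."""
    predictions = []
    for i, (vec, _label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        predictions.append(nearest_neighbour(train, vec))
    correct = sum(p == d[1] for p, d in zip(predictions, data))
    return correct / len(data), predictions

# Hypothetical toy data in place of the real verb vectors:
toy = [((0.9, 0.1), "unerg"), ((0.8, 0.2), "unerg"),
       ((0.1, 0.9), "unacc"), ((0.2, 0.8), "unacc")]
acc, preds = leave_one_out(toy)
print(acc)  # 1.0
```

The per-item predictions are what make the later per-class error analysis possible.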

Under both training and testing methodologies, the baseline (chance) performance in this task, a three-way classification, is 33.9%. In the single hold-out methodology, there are 59 test cases, with 20, 19, and 20 verbs each from the unergative, unaccusative, and object-drop classes, respectively. Chance performance of picking a single class label as a default and assigning it to all cases would yield at most 20 out of the 59 cases correct, or 33.9%. For the cross-validation methodology, the determination of a baseline is slightly more complex, as we are testing on a random selection of 10% of the full data set in each run. The 33.9% figure represents the expected relative proportion of a test set that would be labelled correctly by assignment of a default class label to the entire test set. Although the precise make-up of the test cases varies, on average the test set will represent the class membership proportions of the entire set of verbs. Thus, as with the single hold-out approach, chance accuracy corresponds to a maximum of 20/59, or 33.9%, of the test set being labelled correctly.

The theoretical maximum accuracy for the task is, of course, 100%, although in Section 5 we discuss some classification results from human experts that indicate that a more realistic expectation is much lower (around 87%).

4.2 Results Using 10-Fold Cross-Validation
We first report the results of experiments using a training methodology of 10-fold cross-validation repeated 50 times. This means that the 10-fold cross-validation procedure is repeated for 50 different random divisions of the data. The numbers reported are the averages of the results over all the trials. That is, the average accuracy and standard error from each random division of the data (a single cross-validation run, including 10 training and test sets) are averaged across the 50 different random divisions. This large number of experimental trials gives us a very tight bound on the mean accuracy reported, enabling us to determine with high confidence the statistical significance of differences in results.
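The repeated cross-validation regime can be sketched as follows. This is a minimal sketch under our assumptions: the learner is abstracted as a `classify(train, vec)` function, and the majority-class learner and toy data below are hypothetical stand-ins for C5.0 and the verb vectors.

```python
import random

# 10-fold cross-validation repeated over several random divisions of the
# data: each repeat reshuffles, partitions into folds, and the fold
# accuracies are averaged; run means are then averaged across repeats.
def cross_validate(data, classify, folds=10, repeats=50, seed=0):
    rng = random.Random(seed)
    run_means = []
    for _ in range(repeats):
        items = data[:]
        rng.shuffle(items)  # a new random division of the data
        fold_accs = []
        for f in range(folds):
            test = [items[i] for i in range(len(items)) if i % folds == f]
            train = [items[i] for i in range(len(items)) if i % folds != f]
            correct = sum(classify(train, vec) == label for vec, label in test)
            fold_accs.append(correct / len(test))
        run_means.append(sum(fold_accs) / folds)
    return sum(run_means) / repeats

# Toy check with a majority-class "learner" on hypothetical data:
def majority(train, vec):
    labels = [label for _, label in train]
    return max(set(labels), key=labels.count)

data = [((float(i),), "objdrop") for i in range(30)] + \
       [((float(i),), "unerg") for i in range(10)]
print(round(cross_validate(data, majority, repeats=5), 2))  # 0.75
```

Averaging over many random divisions is what tightens the bound on the mean accuracy.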

Table 8 shows that performance of classification using individual features varies greatly, from little above the baseline to almost 22% above the baseline, or a reduction of a third of the error rate, a very good result for a single feature. (All reported accuracies in Table 8 are statistically distinct at the p < .01 level, using an ANOVA [df = 249, F = 33472] with a Tukey-Kramer post test.)

The first line of Table 9 shows that the combination of all features achieves an accuracy of 69.8%, which is 35.9% over the baseline, for a reduction in the error rate of 54%. This is a rather considerable result, given the very low baseline (33.9%). Moreover, recall that our training and testing sets are always disjoint (cf. Lapata and Brew [1999], Siegel [1999]); in other words, we are predicting the classification of verbs that were never seen in the training corpus, the hardest situation for a classification algorithm.

The second through sixth lines of Table 9 show the accuracy achieved on each subset of features that results from removing a single feature. This allows us to evaluate


Table 8
Percent accuracy and standard error of the verb classification task, using each feature individually, under a training methodology of 10-fold cross-validation repeated 50 times.

Feature   Accuracy   SE

caus      55.7       0.1
vbn       52.5       0.5
pass      50.2       0.5
trans     47.1       0.4
anim      35.3       0.5

Table 9
Percent accuracy and standard error of the verb classification task, using features in combination, under a training methodology of 10-fold cross-validation repeated 50 times.

   Features Used               Feature Not Used   Accuracy   SE

1  trans pass vbn caus anim                       69.8       0.5
2  trans vbn caus anim         pass               69.8       0.5
3  trans pass vbn anim         caus               67.3       0.6
4  trans pass caus anim        vbn                66.5       0.5
5  trans pass vbn caus         anim               63.2       0.6
6  pass vbn caus anim          trans              61.6       0.6

the contribution of each feature to the performance of the classification process, by comparing the performance of the subset without it to the performance using the full set of features. We see that the removal of pass (second line) has no effect on the results, while removal of the remaining features yields a 2–8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant at the p < .05 level, except for those between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 3752] with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted by the individual feature behavior. For example, caus, which is the best individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, anim and trans, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.

The general behaviour in classification based on individual features, and on size-4 and size-5 subsets of features, is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance. Table 10 summarizes this information. In the first data column, the table illustrates the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size-n subsets in which each feature occurs. For example, the second data cell in the second row (54.9%) indicates the average accuracy of all subsets of size 2 that contain the feature vbn. The last row of the table indicates, for each feature, the average accuracy of all subsets containing that feature.
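The aggregation behind Table 10 can be sketched as follows; a minimal sketch over three illustrative subset accuracies from Appendix B (caus alone, caus+anim, trans+anim), rather than the full 31.

```python
# Table 10-style aggregation: group the accuracy of every feature subset
# by its size, and by feature membership within a size, then average.
subset_accuracy = {
    frozenset({"caus"}): 55.7,           # values from Appendix B
    frozenset({"caus", "anim"}): 62.1,
    frozenset({"trans", "anim"}): 59.5,
}

def mean_accuracy_by_size(table, size):
    accs = [a for s, a in table.items() if len(s) == size]
    return sum(accs) / len(accs)

def mean_accuracy_with_feature(table, feature, size):
    accs = [a for s, a in table.items() if feature in s and len(s) == size]
    return sum(accs) / len(accs)

print(round(mean_accuracy_by_size(subset_accuracy, 2), 1))              # 60.8
print(round(mean_accuracy_with_feature(subset_accuracy, "anim", 2), 1)) # 60.8
```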


Table 10
Average percent accuracy of feature subsets, by subset size and by sets of each size including each feature.

Subset   Mean Accuracy    Mean Accuracy of Subsets that Include Each Feature
Size     by Subset Size   vbn    pass   trans  anim   caus

1        48.2             52.5   50.2   47.1   35.3   55.7
2        55.1             54.9   52.8   56.4   58.0   57.6
3        60.5             60.1   58.5   62.3   61.1   60.5
4        65.7             65.5   64.7   66.7   66.3   65.3
5        69.8             69.8   69.8   69.8   69.8   69.8

Mean Acc/Feature          60.6   59.2   60.5   58.1   61.8

The first observation, that more features perform better, is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that, on average, larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation, that the performance of individual features is not always a predictor of their performance in combination, is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above average performance for all subsets of size 2 or greater.

4.3 Results Using Single Hold-Out Methodology
One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system, but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case, and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, which we turn to in Section 4.4.

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance; in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being


Table 11
Percent accuracy of the verb classification task, using features in combination, under a single hold-out training methodology.

   Features Used               Feature Not Used   Accuracy on All Verbs

1  trans pass vbn caus anim                       69.5
2  trans vbn caus anim         pass               71.2
3  trans pass vbn anim         caus               62.7
4  trans pass caus anim        vbn                61.0
5  trans pass vbn caus         anim               61.0
6  pass vbn caus anim          trans              64.4

Table 12
F score (%) of classification within each class, under a single hold-out training methodology

   Features Used             Feature Not Used    F (%) Unergs   F (%) Unaccs   F (%) Objdrops

1  trans pass vbn caus anim  (none)              73.9           68.6           64.9
2  trans vbn caus anim       pass                76.2           75.7           61.6
3  trans pass vbn anim       caus                65.1           60.0           62.8
4  trans pass caus anim      vbn                 66.7           65.0           51.3
5  trans pass vbn caus       anim                72.7           47.0           60.0
6  pass vbn caus anim        trans               78.1           51.5           61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5–8% negative effect on performance, again similar to the cross-validation results. (Although note that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9%) than the unaccusative and object-drop verbs (F scores of 68.6% and 64.9%, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs; that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results, there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence, here we use the balanced F score, which calculates an overall measure of performance as 2PR / (P + R), in which P (precision) is truePositives / (truePositives + falsePositives), and R (recall) is truePositives / (truePositives + falseNegatives).
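The balanced F score defined in this footnote can be computed directly from the per-verb (assigned, actual) label pairs produced by the single hold-out runs; a minimal sketch:

```python
# Per-class balanced F score as defined in footnote 8:
# P = tp / (tp + fp), R = tp / (tp + fn), F = 2PR / (P + R).

def f_score(pairs, cls):
    """pairs: (assigned, actual) label pairs; cls: the class to score."""
    tp = sum(1 for a, g in pairs if a == cls and g == cls)
    fp = sum(1 for a, g in pairs if a == cls and g != cls)
    fn = sum(1 for a, g in pairs if a != cls and g == cls)
    if tp == 0:  # avoid division by zero when the class is never found
        return 0.0
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)
```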


Computational Linguistics Volume 27 Number 3

frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8) but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.
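The binned error rates quoted above follow directly from the reported counts:

```python
# Error rate per log-frequency bin, from the error/verb counts reported
# above; the proportions show the relation to frequency is not linear.
bins = {
    "log frequency < 2": (7, 24),
    "log frequency 2-3": (7, 25),
    "log frequency > 3": (4, 10),
}
error_rates = {name: round(100 * errors / total)
               for name, (errors, total) in bins.items()}
```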

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification

We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans pass vbn caus anim) with the class labels assigned using each size 4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature in going from the four-feature to five-feature set. Thus we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2 and summarized here in Table 13.

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors, but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently "uncontrolled", and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.



Table 13
Expected class discriminations for each feature

Feature          Expected Frequency Pattern

Transitivity     Unerg < Unacc < ObjDrop
Causativity      Unerg, ObjDrop < Unacc
Animacy          Unacc < Unerg, ObjDrop
Passive Voice    Unerg < Unacc < ObjDrop
VBN Tag          Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features compared to two of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops

                       Assigned Class
           All features    w/o caus       w/o anim
            E   A   O      E   A   O      E   A   O

Actual  E       1   2          4   2          2   2
Class   A   4       3      5       2      5       6
        O   5   3          4   5          3   5

Table 15
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops

                              Assigned Class
           All features    w/o trans      w/o pass       w/o vbn
            E   A   O      E   A   O      E   A   O      E   A   O

Actual  E       1   2          2   2          1   3          1   5
Class   A   4       3      3       7      1       4      2       4
        O   5   3          2   5          5   3          4   6

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
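The confusability of a class pair, as used throughout the analysis below, is just the sum of the two off-diagonal cells; a small sketch over the "All features" panel of Table 14:

```python
# Confusability of two classes = errors in their two off-diagonal cells:
# verbs of class x labelled y, plus verbs of class y labelled x.

def confusability(errors, x, y):
    """errors[actual][assigned] holds off-diagonal error counts."""
    return errors[x].get(y, 0) + errors[y].get(x, 0)


# "All features" panel of Table 14 (E = unerg, A = unacc, O = objdrop).
all_features = {
    "E": {"A": 1, "O": 2},
    "A": {"E": 4, "O": 3},
    "O": {"E": 5, "A": 3},
}
```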

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features



we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature, compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives, and between unaccusatives and object-drops, is 9 and 7 errors, respectively; but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature, compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors of 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated to trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans (that is, also making a three-way distinction among the classes), although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method to evaluating our approach to verb classification.



Table 16
Percent agreement (Agr) and pair-wise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin)

         Program        E1             E2             E3
        Agr    K      Agr    K      Agr    K      Agr    K

E1      59    .36
E2      68    .50     75    .59
E3      66    .49     70    .53     77    .66
Levin   69.5  .54     71    .56     86.5  .80     83    .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results, provided by the non-forced-choice task, where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from Chicago University Press. The verbs were to be classified into the three target classes (unergative, unaccusative, and object-drop), which were described in the instructions.10 (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject, and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training (which yields an accuracy of 69.5%), because those results provide the classification for each of the individual verbs.



an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55–57) (using the z distribution to determine significance; p < .001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
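A chance-corrected agreement score of this kind can be sketched as follows. The paper computes K following Klauer (1987); this sketch uses the standard pairwise formulation, with chance agreement derived from each judge's label proportions, as an illustration rather than a reproduction of the paper's exact calculation.

```python
# Pairwise chance-corrected agreement: K = (Po - Pe) / (1 - Pe), where Po is
# observed agreement and Pe is the agreement expected by chance from each
# judge's label proportions. Illustrative; the paper follows Klauer (1987).

def kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    classes = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in classes)
    return (observed - expected) / (1 - expected)
```

Note how two judge pairs with the same percent agreement get different K values if their label proportions, and hence their expected chance agreement, differ.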

Evaluating the experts' performance, summarized in Table 16, we can remark two things, which confirm our expectations. First, the task is difficult; that is, it is not performed at 100% (or close) even by trained experts when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .60) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
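The majority-vote combination described above, with a fallback to the most accurate expert when all three experts disagree, can be sketched as:

```python
# Majority vote over the expert labels for one verb, falling back to a
# designated most-accurate expert when no label gets at least two votes.
from collections import Counter


def majority_label(votes, fallback_index):
    """votes: one label per expert; fallback_index: most accurate expert."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else votes[fallback_index]
```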

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task (86.5%) as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than one of the human experts, in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.
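The 68% figure follows from treating the best expert as the effective ceiling; the arithmetic, as a sketch:

```python
# Error-rate reduction over chance: how much of the gap between the chance
# baseline (34%) and the expert ceiling (86.5%) the program (69.5%) covers.

def error_reduction(accuracy, baseline, ceiling):
    return (accuracy - baseline) / (ceiling - baseline)
```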

One interesting question is whether experts and program disagree on the same verbs, and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard. However, in 9 cases, at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3) and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This



clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1000), and those with log frequency over 3 (i.e., frequency over 1000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops), the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops), and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification, nor the accuracy of the expert that had the best agreement with Levin, was linearly affected by frequency. For the majority vote: verbs with frequency less than 100 yield an accuracy of 92%, K = .84; verbs with frequency between 100 and 1000, accuracy 80%, K = .69; and for verbs with frequency over 1000, accuracy 90%, K = .82. For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5%, K = .74; verbs with frequency between 100 and 1000, accuracy 84%, K = .76; and verbs with frequency over 1000, accuracy 90%, K = .82.

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs, taken from Levin's (1993) electronic index as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = .64) with each other, and 86% (K = .80) and 69% (K = .57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = .56] with each other, and 68% [K = .55] and 60.5% [K = .46] with the gold standard, respectively.) These results show that the forced-



choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally in syntactic alternations, giving rise to differences in subcategorization frames, or the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that the relational properties of verbs (their argument structure) are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989], Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a



"change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning, and those that do not, could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples, meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998; Siegel 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993], Sanfilippo and Poznanski [1992], Manning [1993], Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed, which can work



on a rough collection of texts, and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and the work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition, or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE, and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multilingual texts. Like ours, their work goes beyond statistics over subcategorizations to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.
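The error rate reductions quoted here and throughout follow the standard calculation: the gain over the baseline, taken relative to the baseline's error. A minimal sketch:

```python
def error_reduction(accuracy, baseline):
    """Relative reduction in error rate over a baseline (both in percent)."""
    return 100 * (accuracy - baseline) / (100 - baseline)

# Lapata and Brew's two results, as reported above:
print(round(error_reduction(91.8, 65.7)))  # 76
print(round(error_reduction(83.9, 61.3)))  # 58
```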


Merlo and Stevenson Statistical Verb Classification

Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45% with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
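The F-score calculated here is the balanced harmonic mean of precision and recall; the arithmetic can be sketched as:

```python
def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Schulte im Walde's best result: precision 61%, recall 36%
print(round(100 * f_score(0.61, 0.36)))  # 45
```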

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences based simply on head words yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.
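McCarthy's similarity test can be illustrated schematically. The sketch below uses raw head-word counts and cosine similarity — that is, the simpler head-word variant mentioned above rather than her WordNet-based probability distributions — and the verb and all counts are invented for illustration:

```python
from collections import Counter
import math

def cosine(c1, c2):
    """Cosine similarity between two head-word frequency distributions."""
    dot = sum(c1[w] * c2[w] for w in c1)  # Counter returns 0 for absent words
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Hypothetical head-word counts for a causative verb like 'melt':
intrans_subj = Counter({"ice": 5, "snow": 3, "butter": 2})   # "the ice melted"
trans_obj = Counter({"ice": 4, "butter": 3, "chocolate": 1}) # "... melted the ice"
trans_subj = Counter({"sun": 4, "heat": 3, "cook": 1})       # non-alternating slot

alternating = cosine(intrans_subj, trans_obj)
non_alternating = cosine(intrans_subj, trans_subj)
print(alternating > non_alternating)  # True: shared heads signal the alternation
```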

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and Lapata and Brew (1999) in attempting to learn semantic notions from distributions of indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature) or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).
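To make the feature definitions concrete, the sketch below computes rough analogues of the five counts from hypothetical per-occurrence annotations. The occurrence representation, the example sentences, and the Jaccard-style overlap used for caus are our own simplifications for illustration, not the paper's exact extraction procedure:

```python
from collections import namedtuple

# Hypothetical per-occurrence annotation of a verb in a tagged corpus.
Occ = namedtuple(
    "Occ",
    "transitive tagged_vbn passive subject_head object_head subject_is_pronoun")

def features(occurrences):
    n = len(occurrences)
    trans = sum(o.transitive for o in occurrences) / n   # relative freq. of transitive use
    vbn = sum(o.tagged_vbn for o in occurrences) / n     # relative freq. of VBN tagging
    passive = sum(o.passive for o in occurrences) / n    # relative freq. of passive use
    # caus: overlap between subject heads and object heads across occurrences,
    # a rough proxy for causative use (an assumption, not the paper's formula).
    subjects = {o.subject_head for o in occurrences}
    objects = {o.object_head for o in occurrences if o.object_head}
    union = subjects | objects
    caus = len(subjects & objects) / len(union) if union else 0.0
    # anim: pronoun subjects as a countable proxy for subject animacy.
    anim = sum(o.subject_is_pronoun for o in occurrences) / n
    return {"trans": trans, "vbn": vbn, "pass": passive, "caus": caus, "anim": anim}

occs = [
    Occ(True, False, False, "he", "door", True),       # he opened the door
    Occ(False, False, False, "door", None, False),     # the door opened
    Occ(True, False, False, "wind", "window", False),  # the wind opened the window
    Occ(False, True, True, "window", None, False),     # the window was opened
]
print(features(occs))  # trans 0.5, vbn 0.25, pass 0.25, caus 0.5, anim 0.25
```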

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification, i.e., sets of verbs that have the same multi-way classifications according to Levin (1993), can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).
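One simple way to realize such a combination is linear interpolation between the type-level bias and the instance-level evidence. The weighting scheme and all numbers below are hypothetical illustrations, not a method proposed in the paper:

```python
def classify_token(type_bias, token_scores, weight=0.5):
    """Interpolate a verb-type prior with per-instance evidence and pick the
    highest-scoring class. Scheme and weight are hypothetical."""
    combined = {c: weight * type_bias[c] + (1 - weight) * token_scores[c]
                for c in type_bias}
    return max(combined, key=combined.get)

# Hypothetical numbers: the type statistics favour one class, but the
# features of this particular instance point elsewhere.
bias = {"unergative": 0.7, "unaccusative": 0.2, "object-drop": 0.1}
instance = {"unergative": 0.2, "unaccusative": 0.6, "object-drop": 0.2}
print(classify_token(bias, instance, weight=0.3))  # unaccusative
```

With a low weight the instance evidence dominates; raising the weight toward 1 recovers the pure type-level classification.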

Finally, corpus-based learning techniques collect statistical information related to language use and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach, it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.

Acknowledgments
We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM), the United States NSF (grants 9702331 and 9818322 to SS), the Canadian NSERC (grant to SS), the University of Toronto, and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: The Kappa statistics. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.

Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: The case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: An overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1–2):217–222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.

Pinker, Steven. 1989. Learnability and Cognition: The Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: An information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.


Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs

            Freq   vbn    pass   trans  caus   anim

Min Value   8      0.00   0.00   0.00   0.00   0.00
Max Value   4088   1.00   0.39   0.74   0.00   1.00

floated     176    0.43   0.26   0.74   0.00   0.17
hurried     86     0.40   0.31   0.50   0.00   0.37
jumped      4088   0.09   0.00   0.03   0.00   0.20
leaped      225    0.09   0.00   0.05   0.00   0.13
marched     238    0.10   0.01   0.09   0.00   0.12
paraded     33     0.73   0.39   0.46   0.00   0.50
raced       123    0.01   0.00   0.06   0.00   0.15
rushed      467    0.22   0.12   0.20   0.00   0.10
vaulted     54     0.00   0.00   0.41   0.00   0.03
wandered    67     0.02   0.00   0.03   0.00   0.32
galloped    12     1.00   0.00   0.00   0.00   0.00
glided      14     0.00   0.00   0.08   0.00   0.50
hiked       25     0.28   0.12   0.29   0.00   0.40
hopped      29     0.00   0.00   0.21   0.00   1.00
jogged      8      0.29   0.00   0.29   0.00   0.33
scooted     10     0.00   0.00   0.43   0.00   0.00
scurried    21     0.00   0.00   0.00   0.00   0.14
skipped     82     0.22   0.02   0.64   0.00   0.16
tiptoed     12     0.17   0.00   0.00   0.00   0.00
trotted     37     0.19   0.17   0.07   0.00   0.18

Unaccusative Verbs

            Freq   vbn    pass   trans  caus   anim

Min Value   13     0.16   0.00   0.02   0.00   0.00
Max Value   5543   0.95   0.80   0.76   0.41   0.36

boiled      58     0.92   0.70   0.42   0.00   0.00
cracked     175    0.61   0.19   0.76   0.02   0.14
dissolved   226    0.51   0.58   0.71   0.05   0.11
exploded    409    0.34   0.02   0.66   0.37   0.04
flooded     235    0.47   0.57   0.44   0.04   0.03
fractured   55     0.95   0.76   0.51   0.00   0.00
hardened    123    0.92   0.55   0.56   0.12   0.00
melted      70     0.80   0.44   0.02   0.00   0.19
opened      3412   0.21   0.09   0.69   0.16   0.36
solidified  34     0.65   0.21   0.68   0.00   0.12
collapsed   950    0.16   0.00   0.16   0.01   0.02
cooled      232    0.85   0.21   0.29   0.13   0.11
folded      189    0.73   0.33   0.23   0.00   0.00
widened     1155   0.18   0.02   0.13   0.41   0.01
changed     5543   0.73   0.23   0.47   0.22   0.08
cleared     1145   0.58   0.40   0.50   0.31   0.06
divided     1539   0.93   0.80   0.17   0.10   0.05
simmered    13     0.83   0.00   0.09   0.00   0.00
stabilized  286    0.92   0.13   0.18   0.35   0.00


Object-Drop Verbs

            Freq    vbn    pass   trans  caus   anim

Min Value   39      0.10   0.04   0.21   0.00   0.00
Max Value   15063   0.95   0.99   1.00   0.24   0.42

carved      185     0.85   0.66   0.98   0.00   0.00
danced      88      0.22   0.14   0.37   0.00   0.00
kicked      308     0.30   0.18   0.97   0.00   0.33
knitted     39      0.95   0.99   0.93   0.00   0.00
painted     506     0.72   0.18   0.71   0.00   0.38
played      2689    0.38   0.16   0.24   0.00   0.00
reaped      172     0.56   0.05   0.90   0.00   0.22
typed       57      0.81   0.74   0.81   0.00   0.00
washed      137     0.79   0.60   1.00   0.00   0.00
yelled      74      0.10   0.04   0.38   0.00   0.00
borrowed    1188    0.77   0.15   0.60   0.13   0.19
inherited   357     0.60   0.13   0.64   0.06   0.32
organized   1504    0.85   0.38   0.65   0.18   0.07
rented      232     0.72   0.22   0.61   0.00   0.42
sketched    44      0.67   0.17   0.44   0.00   0.20
cleaned     160     0.83   0.47   0.21   0.05   0.21
packed      376     0.84   0.12   0.40   0.05   0.19
studied     901     0.66   0.17   0.57   0.05   0.11
swallowed   152     0.79   0.44   0.35   0.04   0.22
called      15063   0.56   0.22   0.72   0.24   0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.

Accuracy  SE   Features                    Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim    57.3      0.5  trans caus
69.8      0.5  trans vbn caus anim         57.3      0.5  pass vbn anim
67.3      0.6  trans pass vbn anim         56.7      0.5  pass caus anim
66.7      0.5  trans vbn anim              55.7      0.5  vbn caus
66.5      0.5  trans pass caus anim        55.7      0.1  caus
64.4      0.5  trans vbn caus              55.4      0.4  pass caus
63.2      0.6  trans pass vbn caus         55.0      0.6  trans pass vbn
63.0      0.5  trans pass anim             54.7      0.4  trans pass
62.9      0.4  trans caus anim             54.2      0.5  trans vbn
62.1      0.5  caus anim                   52.5      0.5  vbn
61.7      0.5  trans pass caus             50.9      0.5  pass anim
61.6      0.6  pass vbn caus anim          50.2      0.6  pass vbn
60.1      0.4  vbn caus anim               50.2      0.5  pass
59.5      0.6  trans anim                  47.1      0.4  trans
59.4      0.5  vbn anim                    35.3      0.5  anim
57.4      0.6  pass vbn caus
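The overlap test described above can be automated directly; the constants below (t = 2.01 for df = 49) are those given in the text:

```python
def ci95(accuracy, se, t=2.01):
    """95% confidence interval for an accuracy, with df = 49 (t = 2.01)."""
    return (accuracy - t * se, accuracy + t * se)

def significantly_different(acc1, se1, acc2, se2):
    """True if the two 95% confidence intervals do not overlap (p < .05)."""
    lo1, hi1 = ci95(acc1, se1)
    lo2, hi2 = ci95(acc2, se2)
    return hi1 < lo2 or hi2 < lo1

# Full feature set (69.8, SE 0.5) vs. the set without caus (67.3, SE 0.6):
print(significantly_different(69.8, 0.5, 67.3, 0.6))  # True
```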


Table 8
Percent accuracy and standard error of the verb classification task, using each feature individually, under a training methodology of 10-fold cross-validation repeated 50 times.

Feature   Accuracy   SE

caus      55.7       0.1
vbn       52.5       0.5
pass      50.2       0.5
trans     47.1       0.4
anim      35.3       0.5

Table 9
Percent accuracy and standard error of the verb classification task, using features in combination, under a training methodology of 10-fold cross-validation repeated 50 times.

     Features Used               Feature Not Used   Accuracy   SE

1    trans pass vbn caus anim    (none)             69.8       0.5
2    trans vbn caus anim         pass               69.8       0.5
3    trans pass vbn anim         caus               67.3       0.6
4    trans pass caus anim        vbn                66.5       0.5
5    trans pass vbn caus         anim               63.2       0.6
6    pass vbn caus anim          trans              61.6       0.6

the contribution of each feature to the performance of the classification process by comparing the performance of the subset without it to the performance using the full set of features. We see that the removal of pass (second line) has no effect on the results, while removal of the remaining features yields a 2–8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant at the p < .05 level, except for between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 3752] with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted by the individual feature behavior. For example, caus, which is the best individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, anim and trans, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.

The general behaviour in classification based on individual features and on size 4 and size 5 subsets of features is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance. Table 10 summarizes this information. In the first data column, the table illustrates the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size n subsets in which each feature occurs. For example, the second data cell in the second row (54.9) indicates the average accuracy of all subsets of size 2 that contain the feature vbn. The last row of the table indicates, for each feature, the average accuracy of all subsets containing that feature.
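The averages in Table 10 can be reproduced mechanically by grouping the subset results of Appendix B by subset size; the sketch below shows the grouping on a few illustrative entries:

```python
from collections import defaultdict

def mean_accuracy_by_size(results):
    """Average accuracy over all feature subsets of each size.
    `results` maps a frozenset of feature names to its accuracy (percent)."""
    by_size = defaultdict(list)
    for subset, acc in results.items():
        by_size[len(subset)].append(acc)
    return {size: sum(accs) / len(accs) for size, accs in sorted(by_size.items())}

# A few entries from Appendix B, for illustration:
results = {
    frozenset({"caus"}): 55.7,
    frozenset({"anim"}): 35.3,
    frozenset({"caus", "anim"}): 62.1,
    frozenset({"trans", "anim"}): 59.5,
}
print({s: round(a, 1) for s, a in mean_accuracy_by_size(results).items()})  # {1: 45.5, 2: 60.8}
```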


Table 10
Average percent accuracy of feature subsets, by subset size and by sets of each size including each feature.

Subset   Mean Accuracy    Mean Accuracy of Subsets that Include Each Feature
Size     by Subset Size   vbn    pass   trans  anim   caus

1        48.2             52.5   50.2   47.1   35.3   55.7
2        55.1             54.9   52.8   56.4   58.0   57.6
3        60.5             60.1   58.5   62.3   61.1   60.5
4        65.7             65.5   64.7   66.7   66.3   65.3
5        69.8             69.8   69.8   69.8   69.8   69.8

Mean Acc/Feature          60.6   59.2   60.5   58.1   61.8

The first observation—that more features perform better—is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that, on average, larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation—that the performance of individual features is not always a predictor of their performance in combination—is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above-average performance for all subsets of size 2 or greater.
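The subset averaging underlying Table 10 can be sketched as follows. This is an illustrative sketch only: `accuracy_of` is a hypothetical stand-in for training and testing the classifier on one feature subset, which in the actual experiments is a C5.0 run.

```python
from itertools import combinations
from statistics import mean

FEATURES = ("vbn", "pass", "trans", "anim", "caus")

def subset_averages(accuracy_of, features=FEATURES):
    """For every subset size, compute (a) the mean accuracy over all subsets
    of that size, and (b) the mean accuracy over the subsets of that size
    that contain each individual feature (the cells of Table 10)."""
    by_size = {}
    by_feature = {}
    for size in range(1, len(features) + 1):
        # accuracy_of maps a feature subset to a classification accuracy
        accs = {s: accuracy_of(s) for s in combinations(features, size)}
        by_size[size] = mean(accs.values())
        for f in features:
            by_feature[(size, f)] = mean(a for s, a in accs.items() if f in s)
    return by_size, by_feature
```

With 5 features this enumerates all 31 non-empty subsets, matching the exhaustive evaluation reported in Appendix B.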

4.3 Results Using Single Hold-Out Methodology
One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, to which we turn in Section 4.4.
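The single hold-out (leave-one-out) loop can be sketched as below. This is a sketch under stated assumptions: `MajorityClassifier` is a toy stand-in for the C5.0 learner, and `train_fn` abstracts over training on the 58 remaining verb vectors.

```python
from collections import Counter

class MajorityClassifier:
    """Toy stand-in for a real learner: predicts the most frequent training label."""
    def __init__(self, train_labels):
        self.label = Counter(train_labels).most_common(1)[0][0]

    def predict(self, vector):
        return self.label

def single_hold_out(vectors, labels, train_fn):
    """Leave-one-out evaluation: train on all verbs but one, test on the
    held-out verb, and repeat for every verb. Returns the per-verb
    predictions and the overall accuracy (mean over all trials)."""
    predictions = []
    for i in range(len(vectors)):
        train_X = vectors[:i] + vectors[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        clf = train_fn(train_X, train_y)
        predictions.append(clf.predict(vectors[i]))
    accuracy = sum(p == g for p, g in zip(predictions, labels)) / len(labels)
    return predictions, accuracy
```

Because every verb serves once as the test case, the per-verb predictions directly support the per-class analyses that follow.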

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance—in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being


Merlo and Stevenson: Statistical Verb Classification

Table 11
Percent accuracy of the verb classification task using features in combination, under a single hold-out training methodology.

     Features Used                Feature Not Used    Accuracy on All Verbs

1    trans pass vbn caus anim     (none)              69.5
2    trans vbn caus anim          pass                71.2
3    trans pass vbn anim          caus                62.7
4    trans pass caus anim         vbn                 61.0
5    trans pass vbn caus          anim                61.0
6    pass vbn caus anim           trans               64.4

Table 12
F score of classification within each class, under a single hold-out training methodology.

     Features Used                Feature     F score (%)   F score (%)   F score (%)
                                  Not Used    for Unergs    for Unaccs    for Objdrops

1    trans pass vbn caus anim     (none)      73.9          68.6          64.9
2    trans vbn caus anim          pass        76.2          75.7          61.6
3    trans pass vbn anim          caus        65.1          60.0          62.8
4    trans pass caus anim         vbn         66.7          65.0          51.3
5    trans pass vbn caus          anim        72.7          47.0          60.0
6    pass vbn caus anim           trans       78.1          51.5          61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5–8% negative effect on performance, again similar to the cross-validation results. (Note, though, that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9) than the unaccusative and object-drop verbs (F scores of 68.6 and 64.9, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs—that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results, there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence, here we use the balanced F score, which calculates an overall measure of performance as 2PR / (P + R), in which P (precision) is truePositives / (truePositives + falsePositives) and R (recall) is truePositives / (truePositives + falseNegatives).
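The per-class balanced F score of footnote 8 can be computed directly from true-positive, false-positive, and false-negative counts. A minimal sketch:

```python
def balanced_f(tp, fp, fn):
    """Balanced F score (beta = 1): F = 2PR / (P + R), with
    P = tp / (tp + fp) and R = tp / (tp + fn), as in footnote 8."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For instance, a class with 17 of its 20 members correctly labelled (3 false negatives) and 9 verbs of other classes wrongly assigned to it (9 false positives) receives F of roughly 73.9%, which corresponds to the unergative figure in the first line of Table 12.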


frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8) but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems for the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.
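The frequency-band analysis just described can be sketched as below. This is an illustrative sketch with hypothetical input data: each verb is represented as a (corpus frequency, correctly-classified?) pair, and the band edges are the log-frequency cut-offs of 2 and 3 used in the text.

```python
from math import log10

def error_rate_by_frequency_band(verbs, bands=(2.0, 3.0)):
    """Group (frequency, correct) pairs into log10-frequency bands and
    report the proportion of misclassified verbs in each band
    (band 0: low, 1: mid, 2: high)."""
    groups = {}
    for freq, correct in verbs:
        # Count how many band edges this verb's log frequency meets or exceeds
        band = sum(log10(freq) >= edge for edge in bands)
        errs, total = groups.get(band, (0, 0))
        groups[band] = (errs + (not correct), total + 1)
    return {band: errs / total for band, (errs, total) in groups.items()}
```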

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification
We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans, pass, vbn, caus, anim) with the class labels assigned using each size 4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature in going from the four-feature to the five-feature set. Thus we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2 and summarized here in Table 13.

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently "uncontrolled", and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.


Table 13
Expected class discriminations for each feature.

Feature            Expected Frequency Pattern

Transitivity       Unerg < Unacc < ObjDrop
Causativity        Unerg, ObjDrop < Unacc
Animacy            Unacc < Unerg, ObjDrop
Passive Voice      Unerg < Unacc < ObjDrop
VBN Tag            Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features compared to two of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops.

                                Assigned Class

                  All features      w/o caus          w/o anim
                  E    A    O       E    A    O       E    A    O

Actual    E       -    1    2       -    4    2       -    2    2
Class     A       4    -    3       5    -    2       5    -    6
          O       5    3    -       4    5    -       3    5    -

Table 15
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops.

                                  Assigned Class

                  All features    w/o trans      w/o pass       w/o vbn
                  E   A   O       E   A   O      E   A   O      E   A   O

Actual    E       -   1   2       -   2   2      -   1   3      -   1   5
Class     A       4   -   3       3   -   7      1   -   4      2   -   4
          O       5   3   -       2   5   -      5   3   -      4   6   -

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
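The confusability computation amounts to summing the two off-diagonal cells for a pair of classes. A minimal sketch, using the error counts of the all-features panel of Table 14 (only misclassifications are recorded; the diagonal is omitted):

```python
# Rows = actual class, inner keys = assigned class (all-features panel, Table 14)
ALL_FEATURES_ERRORS = {
    "E": {"A": 1, "O": 2},
    "A": {"E": 4, "O": 3},
    "O": {"E": 5, "A": 3},
}

def confusability(errors, class1, class2):
    """Errors between two classes: verbs of class1 labelled class2, plus
    verbs of class2 labelled class1 (the two cells opposite the diagonal)."""
    return errors[class1].get(class2, 0) + errors[class2].get(class1, 0)
```

With all five features, this gives 5 errors between unergatives and unaccusatives, 6 between unaccusatives and object-drops, and 7 between unergatives and object-drops, the figures discussed in the text.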

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features


we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives and between unaccusatives and object-drops is 9 and 7 errors, respectively, but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature compared to the errors with the anim feature added to the set), we see an even larger improvement in the discriminability of unaccusatives when the anim feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, the confusability of unergatives and object-drops is worse, with an increase in errors from 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated with trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans—that is, also making a three-way distinction among the classes—although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method in evaluating our approach to verb classification.


Table 16
Percent agreement (Agr) and pairwise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

           Program          E1              E2              E3
           Agr     K        Agr     K       Agr     K       Agr     K

E1         59      .36
E2         68      .50      75      .59
E3         66      .49      70      .53     77      .66
Levin      69.5    .54      71      .56     86.5    .80     83      .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results—provided by the non-forced-choice task—where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from the University of Chicago Press. The verbs were to be classified into the three target classes—unergative, unaccusative, and object-drop—which were described in the instructions.10 (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html.)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training—which yields an accuracy of 69.5%—because those results provide the classification for each of the individual verbs.


an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55–57) (using the z distribution to determine significance; p < .0001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
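For the two-rater case, the agreement-over-chance idea can be sketched as below, with chance agreement estimated from each rater's label proportions. This is a minimal illustrative sketch; the paper's exact computation follows Klauer (1987) and may differ in detail.

```python
from collections import Counter

def pairwise_kappa(labels1, labels2):
    """Agreement above chance between two annotators:
    K = (observed - expected) / (1 - expected), where expected chance
    agreement is derived from each annotator's label proportions."""
    assert len(labels1) == len(labels2)
    n = len(labels1)
    observed = sum(a == b for a, b in zip(labels1, labels2)) / n
    p1, p2 = Counter(labels1), Counter(labels2)
    # Probability that the two annotators agree by chance on a random verb
    expected = sum(p1[c] * p2[c] for c in p1) / (n * n)
    return (observed - expected) / (1 - expected)
```

Note that two annotator pairs with the same observed agreement get different K values if they use the class labels in different proportions, which is exactly the point made in the text.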

Evaluating the experts' performance, summarized in Table 16, we can remark on two things, which confirm our expectations. First, the task is difficult—i.e., not performed at 100% (or close) even by trained experts when compared to the gold standard—with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .60) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
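The majority-vote combination just described can be sketched as follows (a sketch; the class labels shown in the test are hypothetical examples):

```python
from collections import Counter

def majority_label(expert_labels, fallback):
    """Label assigned by at least two of the three experts; when all three
    disagree, fall back to the most accurate expert's label."""
    label, freq = Counter(expert_labels).most_common(1)[0]
    return label if freq >= 2 else fallback
```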

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task—86.5%—as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than one of the human experts, in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.

One interesting question is whether experts and program disagree on the same verbs and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard. However, in 9 cases, at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3) and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This


clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs into three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1,000), and those with log frequency over 3 (i.e., frequency over 1,000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops); the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops); and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote, verbs with frequency less than 100 yield an accuracy of 92% (K = .84); verbs with frequency between 100 and 1,000, accuracy 80% (K = .69); and verbs with frequency over 1,000, accuracy 90% (K = .82). For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5% (K = .74); verbs with frequency between 100 and 1,000, accuracy 84% (K = .76); and verbs with frequency over 1,000, accuracy 90% (K = .82).

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest-frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs taken from Levin's (1993) electronic index, as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = .64) with each other, and 86% (K = .80) and 69% (K = .57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = .56] with each other, and 68% [K = .55] and 60.5% [K = .46] with the gold standard, respectively.) These results show that the forced-


choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally in syntactic alternations, giving rise to differences in subcategorization frames, or in the distributions of their usage, or in the properties of the NP arguments of a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification
Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs—their argument structure—are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989] and Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy of our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics
Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a


"change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples, meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998, 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful for predicting semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993]; Sanfilippo and Poznanski [1992]; Manning [1993]; Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989) or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary, reusable resources such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed, which can work


Computational Linguistics Volume 27 Number 3

on a rough collection of texts and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multilingual texts. Like ours, their work goes beyond statistics over subcategorizations to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.



Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45%, with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
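The F-score quoted here is the balanced harmonic mean of precision and recall; it can be recomputed from the reported figures (illustrative arithmetic check only):

```python
def f_score(precision, recall, beta=1.0):
    """F-measure; with beta == 1, the harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Schulte im Walde (2000): precision 61%, recall 36%
print(round(f_score(0.61, 0.36), 2))  # -> 0.45
```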

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences based simply on head words yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.
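The slot-similarity idea can be sketched in miniature. This is not McCarthy's actual measure (she compares probability distributions over WordNet classes); the sketch below uses cosine similarity over hypothetical head-word counts for the two slots of interest, and all the counts are invented for illustration:

```python
from collections import Counter
from math import sqrt

def cosine(c1, c2):
    """Cosine similarity between two head-word count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Hypothetical counts: for a causative verb like "melt", intransitive
# subjects resemble transitive objects (ice melted / the sun melted the ice).
melt_intrans_subj = Counter({"ice": 5, "butter": 3, "snow": 2})
melt_trans_obj    = Counter({"ice": 4, "butter": 2, "metal": 1})

# For a non-alternating verb, the two slots draw on disjoint nouns.
eat_intrans_subj  = Counter({"man": 4, "dog": 3, "child": 2})
eat_trans_obj     = Counter({"apple": 4, "meat": 3, "bread": 2})

# The alternating verb's slots are far more similar than the non-alternating one's.
print(cosine(melt_intrans_subj, melt_trans_obj) >
      cosine(eat_intrans_subj, eat_trans_obj))  # -> True
```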

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition, we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and of Lapata and Brew (1999) in attempting to learn semantic notions from distributions of

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.



indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature), or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).
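The pronoun-based approximation of subject animacy amounts to a simple ratio over a verb's extracted subjects. A minimal sketch (the subject heads and the pronoun list are illustrative assumptions; the actual counts in this work come from tagged and automatically parsed corpora):

```python
# Personal pronouns used as a proxy for animate subjects (an assumed list;
# note "it" is deliberately excluded, since it is not animate).
ANIMATE_PRONOUNS = {"i", "we", "you", "she", "he", "they",
                    "me", "us", "her", "him", "them"}

def anim_feature(subjects):
    """Relative frequency of animate-pronoun subjects among all subjects of a verb."""
    if not subjects:
        return 0.0
    hits = sum(1 for s in subjects if s.lower() in ANIMATE_PRONOUNS)
    return hits / len(subjects)

# Hypothetical extracted subject heads for one verb
print(anim_feature(["he", "the dog", "they", "John", "it"]))  # -> 0.4
```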

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept which proposes that "regular" ambiguity in classification (i.e., sets of verbs that have the same multi-way classifications according to Levin [1993]) can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction, or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach, it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of



the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.
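These reductions follow directly from the accuracy (69.8%), chance baseline (33.9%), and expert upper bound (86.5%) reported above; a quick arithmetic check (illustrative only):

```python
def error_reduction(accuracy, baseline, ceiling=100.0):
    """Fraction of the achievable error (from baseline up to ceiling)
    that is eliminated by reaching the given accuracy."""
    return (accuracy - baseline) / (ceiling - baseline)

# 69.8% accuracy, 33.9% baseline, 86.5% expert-based upper bound
print(round(100 * error_reduction(69.8, 33.9)))        # -> 54
print(round(100 * error_reduction(69.8, 33.9, 86.5)))  # -> 68
```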

Acknowledgments

We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM), the United States NSF (grants 9702331 and 9818322 to SS), the Canadian NSERC (grant to SS), the University of Toronto, and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions, and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.

Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe Zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: An Overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1–2):217–222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.

Pinker, Steven. 1989. Learnability and Cognition: The Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.


Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs
            Freq   vbn   pass  trans  caus  anim

Min Value      8  0.00  0.00   0.00  0.00  0.00
Max Value   4088  1.00  0.39   0.74  0.00  1.00

floated      176  0.43  0.26   0.74  0.00  0.17
hurried       86  0.40  0.31   0.50  0.00  0.37
jumped      4088  0.09  0.00   0.03  0.00  0.20
leaped       225  0.09  0.00   0.05  0.00  0.13
marched      238  0.10  0.01   0.09  0.00  0.12
paraded       33  0.73  0.39   0.46  0.00  0.50
raced        123  0.01  0.00   0.06  0.00  0.15
rushed       467  0.22  0.12   0.20  0.00  0.10
vaulted       54  0.00  0.00   0.41  0.00  0.03
wandered      67  0.02  0.00   0.03  0.00  0.32
galloped      12  1.00  0.00   0.00  0.00  0.00
glided        14  0.00  0.00   0.08  0.00  0.50
hiked         25  0.28  0.12   0.29  0.00  0.40
hopped        29  0.00  0.00   0.21  0.00  1.00
jogged         8  0.29  0.00   0.29  0.00  0.33
scooted       10  0.00  0.00   0.43  0.00  0.00
scurried      21  0.00  0.00   0.00  0.00  0.14
skipped       82  0.22  0.02   0.64  0.00  0.16
tiptoed       12  0.17  0.00   0.00  0.00  0.00
trotted       37  0.19  0.17   0.07  0.00  0.18

Unaccusative Verbs
            Freq   vbn   pass  trans  caus  anim

Min Value     13  0.16  0.00   0.02  0.00  0.00
Max Value   5543  0.95  0.80   0.76  0.41  0.36

boiled        58  0.92  0.70   0.42  0.00  0.00
cracked      175  0.61  0.19   0.76  0.02  0.14
dissolved    226  0.51  0.58   0.71  0.05  0.11
exploded     409  0.34  0.02   0.66  0.37  0.04
flooded      235  0.47  0.57   0.44  0.04  0.03
fractured     55  0.95  0.76   0.51  0.00  0.00
hardened     123  0.92  0.55   0.56  0.12  0.00
melted        70  0.80  0.44   0.02  0.00  0.19
opened      3412  0.21  0.09   0.69  0.16  0.36
solidified    34  0.65  0.21   0.68  0.00  0.12
collapsed    950  0.16  0.00   0.16  0.01  0.02
cooled       232  0.85  0.21   0.29  0.13  0.11
folded       189  0.73  0.33   0.23  0.00  0.00
widened     1155  0.18  0.02   0.13  0.41  0.01
changed     5543  0.73  0.23   0.47  0.22  0.08
cleared     1145  0.58  0.40   0.50  0.31  0.06
divided     1539  0.93  0.80   0.17  0.10  0.05
simmered      13  0.83  0.00   0.09  0.00  0.00
stabilized   286  0.92  0.13   0.18  0.35  0.00


Object-Drop Verbs
            Freq   vbn   pass  trans  caus  anim

Min Value     39  0.10  0.04   0.21  0.00  0.00
Max Value  15063  0.95  0.99   1.00  0.24  0.42

carved       185  0.85  0.66   0.98  0.00  0.00
danced        88  0.22  0.14   0.37  0.00  0.00
kicked       308  0.30  0.18   0.97  0.00  0.33
knitted       39  0.95  0.99   0.93  0.00  0.00
painted      506  0.72  0.18   0.71  0.00  0.38
played      2689  0.38  0.16   0.24  0.00  0.00
reaped       172  0.56  0.05   0.90  0.00  0.22
typed         57  0.81  0.74   0.81  0.00  0.00
washed       137  0.79  0.60   1.00  0.00  0.00
yelled        74  0.10  0.04   0.38  0.00  0.00
borrowed    1188  0.77  0.15   0.60  0.13  0.19
inherited    357  0.60  0.13   0.64  0.06  0.32
organized   1504  0.85  0.38   0.65  0.18  0.07
rented       232  0.72  0.22   0.61  0.00  0.42
sketched      44  0.67  0.17   0.44  0.00  0.20
cleaned      160  0.83  0.47   0.21  0.05  0.21
packed       376  0.84  0.12   0.40  0.05  0.19
studied      901  0.66  0.17   0.57  0.05  0.11
swallowed    152  0.79  0.44   0.35  0.04  0.22
called     15063  0.56  0.22   0.72  0.24  0.16
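To give a concrete feel for how these feature vectors separate the classes, here is a toy classifier over a few rows of the tables above. This is not the paper's method (the experiments use decision-tree induction); it is a simple nearest-centroid sketch, with three sample verbs per class and one held-out test verb:

```python
from math import dist  # Euclidean distance (Python 3.8+)

# (vbn, pass, trans, caus, anim) rows taken from the Appendix A tables
train = {
    "unergative":   [(0.43, 0.26, 0.74, 0.00, 0.17),   # floated
                     (0.09, 0.00, 0.03, 0.00, 0.20),   # jumped
                     (0.22, 0.12, 0.20, 0.00, 0.10)],  # rushed
    "unaccusative": [(0.34, 0.02, 0.66, 0.37, 0.04),   # exploded
                     (0.21, 0.09, 0.69, 0.16, 0.36),   # opened
                     (0.18, 0.02, 0.13, 0.41, 0.01)],  # widened
    "object-drop":  [(0.85, 0.66, 0.98, 0.00, 0.00),   # carved
                     (0.72, 0.18, 0.71, 0.00, 0.38),   # painted
                     (0.56, 0.22, 0.72, 0.24, 0.16)],  # called
}

def centroid(rows):
    """Componentwise mean of a list of feature vectors."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

centroids = {label: centroid(rows) for label, rows in train.items()}

def classify(feats):
    """Label an unseen verb by its nearest class centroid."""
    return min(centroids, key=lambda label: dist(feats, centroids[label]))

# "hopped" (held out of the sample above): 0.00 0.00 0.21 0.00 1.00
print(classify((0.00, 0.00, 0.21, 0.00, 1.00)))  # -> unergative
```

With only three training rows per class this toy model is of course far weaker than the reported classifier; it merely illustrates the feature space.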

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.

Accuracy   SE   Features                      Accuracy   SE   Features

  69.8    0.5   trans pass vbn caus anim        57.3    0.5   trans caus
  69.8    0.5   trans vbn caus anim             57.3    0.5   pass vbn anim
  67.3    0.6   trans pass vbn anim             56.7    0.5   pass caus anim
  66.7    0.5   trans vbn anim                  55.7    0.5   vbn caus
  66.5    0.5   trans pass caus anim            55.7    0.1   caus
  64.4    0.5   trans vbn caus                  55.4    0.4   pass caus
  63.2    0.6   trans pass vbn caus             55.0    0.6   trans pass vbn
  63.0    0.5   trans pass anim                 54.7    0.4   trans pass
  62.9    0.4   trans caus anim                 54.2    0.5   trans vbn
  62.1    0.5   caus anim                       52.5    0.5   vbn
  61.7    0.5   trans pass caus                 50.9    0.5   pass anim
  61.6    0.6   pass vbn caus anim              50.2    0.6   pass vbn
  60.1    0.4   vbn caus anim                   50.2    0.5   pass
  59.5    0.6   trans anim                      47.1    0.4   trans
  59.4    0.5   vbn anim                        35.3    0.5   anim
  57.4    0.6   pass vbn caus
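The overlap test described in this appendix is easy to mechanize. The sketch below (the helper names are ours, added for illustration) applies it to rows of the table above:

```python
# 95% confidence-interval overlap test from Appendix B.
# Accuracies and standard errors are percentages, as in the table.

T_CRIT = 2.01  # two-tailed t value for df = 49 at the .05 level

def conf_interval(accuracy, se, t=T_CRIT):
    """Return the 95% confidence interval (low, high) for an accuracy."""
    return accuracy - t * se, accuracy + t * se

def significantly_different(acc1, se1, acc2, se2, t=T_CRIT):
    """True when the two 95% confidence ranges do not overlap."""
    lo1, hi1 = conf_interval(acc1, se1, t)
    lo2, hi2 = conf_interval(acc2, se2, t)
    return hi1 < lo2 or hi2 < lo1
```

For example, the full feature set (69.8, SE 0.5) differs significantly from trans alone (47.1, SE 0.4), whereas two adjacent rows such as 57.3 (SE 0.5) and 56.7 (SE 0.5) do not.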




Table 10
Average percent accuracy of feature subsets, by subset size and by sets of each size including each feature.

Subset   Mean Accuracy      Mean Accuracy of Subsets that Include Each Feature
Size     by Subset Size     vbn    pass   trans  anim   caus

1            48.2           52.5   50.2   47.1   35.3   55.7
2            55.1           54.9   52.8   56.4   58.0   57.6
3            60.5           60.1   58.5   62.3   61.1   60.5
4            65.7           65.5   64.7   66.7   66.3   65.3
5            69.8           69.8   69.8   69.8   69.8   69.8

Mean Acc/Feature            60.6   59.2   60.5   58.1   61.8

The first observation—that more features perform better—is confirmed overall in all subsets. Looking at the first data column of Table 10, we can observe that on average larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation—that the performance of individual features is not always a predictor of their performance in combination—is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature caus, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature anim, which is the worst if used alone, is very effective in combination, with above-average performance for all subsets of size 2 or greater.
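The averages in Table 10 can be recomputed directly from the per-subset accuracies listed in Appendix B. The sketch below (the function names are ours) reproduces several cells of the table, up to rounding:

```python
# Percent accuracies for all 31 feature subsets, transcribed from Appendix B.
ACCURACY = {
    "trans pass vbn caus anim": 69.8, "trans vbn caus anim": 69.8,
    "trans pass vbn anim": 67.3, "trans vbn anim": 66.7,
    "trans pass caus anim": 66.5, "trans vbn caus": 64.4,
    "trans pass vbn caus": 63.2, "trans pass anim": 63.0,
    "trans caus anim": 62.9, "caus anim": 62.1,
    "trans pass caus": 61.7, "pass vbn caus anim": 61.6,
    "vbn caus anim": 60.1, "trans anim": 59.5, "vbn anim": 59.4,
    "pass vbn caus": 57.4, "trans caus": 57.3, "pass vbn anim": 57.3,
    "pass caus anim": 56.7, "vbn caus": 55.7, "caus": 55.7,
    "pass caus": 55.4, "trans pass vbn": 55.0, "trans pass": 54.7,
    "trans vbn": 54.2, "vbn": 52.5, "pass anim": 50.9,
    "pass vbn": 50.2, "pass": 50.2, "trans": 47.1, "anim": 35.3,
}
SUBSETS = {frozenset(key.split()): acc for key, acc in ACCURACY.items()}

def mean_by_size(size):
    """Average accuracy over all feature subsets of the given size."""
    vals = [acc for s, acc in SUBSETS.items() if len(s) == size]
    return round(sum(vals) / len(vals), 1)

def mean_including(feature, size):
    """Average accuracy over the size-`size` subsets containing `feature`."""
    vals = [acc for s, acc in SUBSETS.items()
            if len(s) == size and feature in s]
    return round(sum(vals) / len(vals), 1)
```

For example, `mean_by_size(1)` gives 48.2 and `mean_including("caus", 3)` gives 60.5, matching the corresponding cells of Table 10.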

4.3 Results Using Single Hold-Out Methodology

One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system, but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, which we turn to in Section 4.4.
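The single hold-out (leave-one-out) protocol can be sketched as follows. The paper's classifier is a C5.0 decision tree, which we do not reproduce here; a simple 1-nearest-neighbour rule stands in for it, and the tiny feature vectors below are invented purely for illustration:

```python
# Leave-one-out sketch: hold out each verb vector in turn, "train" on the
# rest, and record the predicted class for the held-out item.

def nearest_neighbour_label(train, test_vector):
    """Label of the training vector closest (squared Euclidean) to test_vector."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(train, key=lambda item: dist(item[0], test_vector))[1]

def leave_one_out_accuracy(data):
    """Return (accuracy, per-item predictions) under single hold-out."""
    correct = 0
    predictions = []
    for i, (vector, label) in enumerate(data):
        train = data[:i] + data[i + 1:]   # all items except the held-out one
        predicted = nearest_neighbour_label(train, vector)
        predictions.append(predicted)
        correct += predicted == label
    return correct / len(data), predictions

# Invented 2-feature vectors (e.g. trans, caus) with verb-class labels:
toy_data = [
    ((0.1, 0.0), "unerg"), ((0.2, 0.1), "unerg"),
    ((0.5, 0.4), "unacc"), ((0.6, 0.3), "unacc"),
    ((0.9, 0.0), "objdrop"), ((0.8, 0.1), "objdrop"),
]
```

Because the held-out item is never seen during training, the per-verb predictions give exactly the per-class error data that the cross-validation setup cannot.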

We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of pass does not degrade performance—in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being


Merlo and Stevenson   Statistical Verb Classification

Table 11
Percent accuracy of the verb classification task using features in combination, under a single hold-out training methodology.

     Features Used              Feature Not Used   Accuracy on All Verbs

1    trans pass vbn caus anim                             69.5
2    trans vbn caus anim        pass                      71.2
3    trans pass vbn anim        caus                      62.7
4    trans pass caus anim       vbn                       61.0
5    trans pass vbn caus        anim                      61.0
6    pass vbn caus anim         trans                     64.4

Table 12
F score (%) of classification within each class, under a single hold-out training methodology.

     Features Used              Feature     F score (%)   F score (%)   F score (%)
                                Not Used    for Unergs    for Unaccs    for Objdrops

1    trans pass vbn caus anim                  73.9          68.6          64.9
2    trans vbn caus anim        pass           76.2          75.7          61.6
3    trans pass vbn anim        caus           65.1          60.0          62.8
4    trans pass caus anim       vbn            66.7          65.0          51.3
5    trans pass vbn caus        anim           72.7          47.0          60.0
6    pass vbn caus anim         trans          78.1          51.5          61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5–8% negative effect on performance, again similar to the cross-validation results. (Although note that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9) than the unaccusative and object-drop verbs (F scores of 68.6 and 64.9, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs—that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives / (truePositives + falseNegatives). In the earlier results there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence, here we use the balanced F score, which calculates an overall measure of performance as 2PR / (P + R), in which P (precision) is truePositives / (truePositives + falsePositives), and R (recall) is truePositives / (truePositives + falseNegatives).
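The footnote's definitions translate directly into code. This sketch (helper name ours, toy labels invented for illustration) computes the per-class balanced F score:

```python
# Balanced F score for one class, from gold and predicted label sequences,
# following the definitions in footnote 8: F = 2PR / (P + R).

def f_score(gold, predicted, cls):
    """Balanced F score (F1) for class `cls`."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, predicted))
    fp = sum(g != cls and p == cls for g, p in zip(gold, predicted))
    fn = sum(g == cls and p != cls for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented toy labels (E = unergative, A = unaccusative, O = object-drop):
gold_labels = ["E", "E", "A", "A", "O"]
pred_labels = ["E", "A", "A", "A", "O"]
```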



frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8) but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.
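The grouping used in this error analysis can be sketched as follows. The binning thresholds follow the text; the frequencies in the example are real appendix values (boiled 58, exploded 409, changed 5543), but the correctness flags attached to them are invented for illustration:

```python
# Bin verbs by base-10 log frequency (< 2, 2-3, > 3) and compute the
# per-bin error rate, as in the analysis above.

import math

def log_freq_bin(freq):
    lf = math.log10(freq)
    if lf < 2:
        return "low"    # frequency < 100
    if lf <= 3:
        return "mid"    # frequency 100-1000
    return "high"       # frequency > 1000

def error_rate_by_bin(results):
    """results: iterable of (frequency, correctly_classified) pairs."""
    errors, totals = {}, {}
    for freq, ok in results:
        b = log_freq_bin(freq)
        totals[b] = totals.get(b, 0) + 1
        errors[b] = errors.get(b, 0) + (not ok)
    return {b: errors[b] / totals[b] for b in totals}
```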

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification

We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans, pass, vbn, caus, anim) with the class labels assigned using each size 4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature, in going from the four-feature to five-feature set. Thus we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2 and summarized here in Table 13.

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors, but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently "uncontrolled", and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.



Table 13
Expected class discriminations for each feature.

Feature          Expected Frequency Pattern

Transitivity     Unerg < Unacc < ObjDrop
Causativity      Unerg, ObjDrop < Unacc
Animacy          Unacc < Unerg, ObjDrop
Passive Voice    Unerg < Unacc < ObjDrop
VBN Tag          Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating number of errors in classification, by verb class, for the full set of five features compared to two of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                                Assigned Class

                 All features        w/o caus          w/o anim
                 E    A    O        E    A    O       E    A    O

Actual    E      -    1    2        -    4    2       -    2    2
Class     A      4    -    3        5    -    2       5    -    6
          O      5    3    -        4    5    -       3    5    -

Table 15
Confusion matrix indicating number of errors in classification, by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                                     Assigned Class

                 All features      w/o trans        w/o pass         w/o vbn
                 E    A    O      E    A    O      E    A    O      E    A    O

Actual    E      -    1    2      -    2    2      -    1    3      -    1    5
Class     A      4    -    3      3    -    7      1    -    4      2    -    4
          O      5    3    -      2    5    -      5    3    -      4    6    -

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
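This confusability measure is straightforward to compute from a matrix. The sketch below (helper names ours) encodes the "All features" panel of Table 14 (rows = actual class, columns = assigned class, in the order E, A, O) and sums the two off-diagonal cells for a pair of classes:

```python
# Pairwise confusability: errors where class 1 is labelled as class 2,
# plus errors where class 2 is labelled as class 1.

CLASSES = ["E", "A", "O"]

# "All features" panel of Table 14 (diagonal cells are zero by construction,
# since only misclassified verbs are counted).
ALL_FEATURES = [
    [0, 1, 2],   # actual E assigned as E, A, O
    [4, 0, 3],   # actual A
    [5, 3, 0],   # actual O
]

def confusability(matrix, cls1, cls2, classes=CLASSES):
    i, j = classes.index(cls1), classes.index(cls2)
    return matrix[i][j] + matrix[j][i]
```

On this panel the measure reproduces the error counts discussed in the text: 5 for unergatives/unaccusatives, 6 for unaccusatives/object-drops, and 7 for unergatives/object-drops.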

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features



we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature, compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives, and between unaccusatives and object-drops, is 9 and 7 errors, respectively, but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature, compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors of 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated to trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans—that is, also making a three-way distinction among the classes—although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method to evaluating our approach to verb classification.



Table 16
Percent agreement (Agr) and pair-wise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

         Program        E1            E2            E3
         Agr     K      Agr    K      Agr    K      Agr    K

E1        59    .36
E2        68    .50     75    .59
E3        66    .49     70    .53     77    .66
Levin    69.5   .54     71    .56    86.5   .80     83    .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results—provided by the non-forced-choice task—where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from Chicago University Press. The verbs were to be classified into the three target classes—unergative, unaccusative, and object-drop—which were described in the instructions.10 (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject, and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training—which yields an accuracy of 69.5%—because those results provide the classification for each of the individual verbs.



an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55–57) (using the z distribution to determine significance; p < .0001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
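For two judges, the chance-corrected agreement just described can be sketched as follows. This is a textbook two-rater Kappa (our helper, not necessarily the exact Klauer [1987] calculation used in the paper): observed agreement, minus the agreement expected if each judge labelled at random in their own class proportions, normalized by the maximum possible improvement over chance:

```python
# Two-rater Kappa: K = (Po - Pe) / (1 - Pe), where Pe is derived from
# each judge's marginal class proportions.

from collections import Counter

def cohen_kappa(labels1, labels2):
    n = len(labels1)
    observed = sum(a == b for a, b in zip(labels1, labels2)) / n
    counts1, counts2 = Counter(labels1), Counter(labels2)
    expected = sum(counts1[c] * counts2[c] for c in counts1) / (n * n)
    return (observed - expected) / (1 - expected)
```

Note how the chance term depends on the judges' own label proportions, which is why two expert pairs with the same percent agreement can have different Kappa values.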

Evaluating the experts' performance, summarized in Table 16, we can remark two things, which confirm our expectations. First, the task is difficult—i.e., not performed at 100% (or close) even by trained experts when compared to the gold standard—with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = 0.6) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
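The majority-vote combination can be sketched as follows (helper names ours; the fallback index selects the most accurate expert, which by Table 16 is E2):

```python
# Majority vote over expert labels for one verb, with fallback to a
# designated expert when all three experts disagree.

from collections import Counter

def majority_label(expert_labels, fallback_index):
    """expert_labels: one label per expert for a single verb."""
    label, count = Counter(expert_labels).most_common(1)[0]
    if count >= 2:          # at least two experts agree
        return label
    return expert_labels[fallback_index]
```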

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task—86.5%—as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than one of the human experts, in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.

One interesting question is whether experts and program disagree on the same verbs, and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard. However, in 9 cases at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3), and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This



clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1000), and those with log frequency over 3 (i.e., frequency over 1000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops); the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops); and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification, nor the accuracy of the expert that had the best agreement with Levin, were linearly affected by frequency. For the majority vote: verbs with frequency less than 100 yield an accuracy of 92%, K = .84; verbs with frequency between 100 and 1000, accuracy 80%, K = .69; and verbs with frequency over 1000, accuracy 90%, K = .82. For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5%, K = .74; verbs with frequency between 100 and 1000, accuracy 84%, K = .76; and verbs with frequency over 1000, accuracy 90%, K = .82.

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest-frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs, taken from Levin's (1993) electronic index, as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = 0.64) with each other, and 86% (K = 0.80) and 69% (K = 0.57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = 0.56] with each other, and 68% [K = 0.55] and 60.5% [K = 0.46] with the gold standard, respectively.) These results show that the forced-



choice and non-forced-choice task are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally, in syntactic alternations, giving rise to differences in subcategorization frames, or the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs—their argument structure—are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989], Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a



"change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples, meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998; Siegel 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993], Sanfilippo and Poznanski [1992], Manning [1993], Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989) or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed, which can work


Computational Linguistics Volume 27 Number 3

on a rough collection of texts and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multilingual texts. Like ours, their work goes beyond statistics over subcategorizations to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.



Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45% with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
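The F-score cited here follows from the standard weighted harmonic mean of precision and recall; a quick worked check of the calculation (the function name is ours, the input values are those reported by Schulte im Walde):

```python
def f_score(precision, recall, beta=1.0):
    """F-measure: weighted harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Reported values: P = 0.61, R = 0.36.
print(round(f_score(0.61, 0.36), 2))  # → 0.45
```

With balanced precision and recall (P = R), the F-score simply equals that common value, which is why an F of 45% corresponds to balanced precision and recall of 45%.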

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences based simply on head words yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.
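McCarthy's actual measure compares probability distributions over WordNet classes; purely as an illustrative sketch of the simpler head-word variant mentioned above, one can compare the head nouns filling two slots directly. The similarity measure (cosine) and the toy word lists below are our own assumptions, not her data or metric:

```python
from collections import Counter
from math import sqrt

def slot_similarity(heads_a, heads_b):
    """Cosine similarity between two head-word frequency distributions."""
    a, b = Counter(heads_a), Counter(heads_b)
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# For a causative verb like "melt", intransitive subjects should resemble
# transitive objects; an agentive subject slot should resemble them less.
intrans_subj = ["butter", "ice", "snow", "butter"]
trans_obj = ["butter", "ice", "chocolate"]
agent_subj = ["cook", "sun", "chef"]

assert slot_similarity(intrans_subj, trans_obj) > slot_similarity(agent_subj, trans_obj)
```

Under the alternation hypothesis, a high similarity between the intransitive-subject and transitive-object distributions signals causative participation, while disjoint head-word sets (as in the agentive case) yield a low score.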

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and Lapata and Brew (1999) in attempting to learn semantic notions from distributions of

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.



indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and of the animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature), or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification (i.e., sets of verbs that have the same multi-way classifications according to Levin [1993]) can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of



the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.
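The error rate reductions reported above follow directly from the accuracy, baseline, and upper-bound figures: the reduction is the fraction of the available error (between the baseline and the ceiling) that the classifier eliminates. A worked check (the helper name is ours):

```python
def error_rate_reduction(accuracy, baseline, upper_bound=100.0):
    """Fraction of the error between baseline and upper bound eliminated."""
    return (accuracy - baseline) / (upper_bound - baseline)

# Figures from the text: 69.8% accuracy, 33.9% chance baseline,
# 86.5% expert-based upper bound.
print(round(100 * error_rate_reduction(69.8, 33.9), 1))        # → 54.3
print(round(100 * error_rate_reduction(69.8, 33.9, 86.5), 1))  # → 68.3
```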

Acknowledgments
We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM), the United States NSF (grants 9702331 and 9818322 to SS), the Canadian NSERC (grant to SS), the University of Toronto, and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: the Kappa statistics. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.

Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: an overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1–2):217–222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.

Pinker, Steven. 1989. Learnability and Cognition: the Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.



Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs

            Freq   vbn  pass  trans  caus  anim
Min Value      8  0.00  0.00   0.00  0.00  0.00
Max Value   4088  1.00  0.39   0.74  0.00  1.00

floated      176  0.43  0.26   0.74  0.00  0.17
hurried       86  0.40  0.31   0.50  0.00  0.37
jumped      4088  0.09  0.00   0.03  0.00  0.20
leaped       225  0.09  0.00   0.05  0.00  0.13
marched      238  0.10  0.01   0.09  0.00  0.12
paraded       33  0.73  0.39   0.46  0.00  0.50
raced        123  0.01  0.00   0.06  0.00  0.15
rushed       467  0.22  0.12   0.20  0.00  0.10
vaulted       54  0.00  0.00   0.41  0.00  0.03
wandered      67  0.02  0.00   0.03  0.00  0.32
galloped      12  1.00  0.00   0.00  0.00  0.00
glided        14  0.00  0.00   0.08  0.00  0.50
hiked         25  0.28  0.12   0.29  0.00  0.40
hopped        29  0.00  0.00   0.21  0.00  1.00
jogged         8  0.29  0.00   0.29  0.00  0.33
scooted       10  0.00  0.00   0.43  0.00  0.00
scurried      21  0.00  0.00   0.00  0.00  0.14
skipped       82  0.22  0.02   0.64  0.00  0.16
tiptoed       12  0.17  0.00   0.00  0.00  0.00
trotted       37  0.19  0.17   0.07  0.00  0.18

Unaccusative Verbs

            Freq   vbn  pass  trans  caus  anim
Min Value     13  0.16  0.00   0.02  0.00  0.00
Max Value   5543  0.95  0.80   0.76  0.41  0.36

boiled        58  0.92  0.70   0.42  0.00  0.00
cracked      175  0.61  0.19   0.76  0.02  0.14
dissolved    226  0.51  0.58   0.71  0.05  0.11
exploded     409  0.34  0.02   0.66  0.37  0.04
flooded      235  0.47  0.57   0.44  0.04  0.03
fractured     55  0.95  0.76   0.51  0.00  0.00
hardened     123  0.92  0.55   0.56  0.12  0.00
melted        70  0.80  0.44   0.02  0.00  0.19
opened      3412  0.21  0.09   0.69  0.16  0.36
solidified    34  0.65  0.21   0.68  0.00  0.12
collapsed    950  0.16  0.00   0.16  0.01  0.02
cooled       232  0.85  0.21   0.29  0.13  0.11
folded       189  0.73  0.33   0.23  0.00  0.00
widened     1155  0.18  0.02   0.13  0.41  0.01
changed     5543  0.73  0.23   0.47  0.22  0.08
cleared     1145  0.58  0.40   0.50  0.31  0.06
divided     1539  0.93  0.80   0.17  0.10  0.05
simmered      13  0.83  0.00   0.09  0.00  0.00
stabilized   286  0.92  0.13   0.18  0.35  0.00


Computational Linguistics Volume 27 Number 3

Object-Drop Verbs

Verb        Freq   vbn   pass  trans  caus  anim

Min Value     39   0.10  0.04  0.21   0.00  0.00
Max Value  15063   0.95  0.99  1.00   0.24  0.42

carved       185   0.85  0.66  0.98   0.00  0.00
danced        88   0.22  0.14  0.37   0.00  0.00
kicked       308   0.30  0.18  0.97   0.00  0.33
knitted       39   0.95  0.99  0.93   0.00  0.00
painted      506   0.72  0.18  0.71   0.00  0.38
played      2689   0.38  0.16  0.24   0.00  0.00
reaped       172   0.56  0.05  0.90   0.00  0.22
typed         57   0.81  0.74  0.81   0.00  0.00
washed       137   0.79  0.60  1.00   0.00  0.00
yelled        74   0.10  0.04  0.38   0.00  0.00
borrowed    1188   0.77  0.15  0.60   0.13  0.19
inherited    357   0.60  0.13  0.64   0.06  0.32
organized   1504   0.85  0.38  0.65   0.18  0.07
rented       232   0.72  0.22  0.61   0.00  0.42
sketched      44   0.67  0.17  0.44   0.00  0.20
cleaned      160   0.83  0.47  0.21   0.05  0.21
packed       376   0.84  0.12  0.40   0.05  0.19
studied      901   0.66  0.17  0.57   0.05  0.11
swallowed    152   0.79  0.44  0.35   0.04  0.22
called     15063   0.56  0.22  0.72   0.24  0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.

Accuracy  SE   Features                    Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim    57.3      0.5  trans caus
69.8      0.5  trans vbn caus anim         57.3      0.5  pass vbn anim
67.3      0.6  trans pass vbn anim         56.7      0.5  pass caus anim
66.7      0.5  trans vbn anim              55.7      0.5  vbn caus
66.5      0.5  trans pass caus anim        55.7      0.1  caus
64.4      0.5  trans vbn caus              55.4      0.4  pass caus
63.2      0.6  trans pass vbn caus         55.0      0.6  trans pass vbn
63.0      0.5  trans pass anim             54.7      0.4  trans pass
62.9      0.4  trans caus anim             54.2      0.5  trans vbn
62.1      0.5  caus anim                   52.5      0.5  vbn
61.7      0.5  trans pass caus             50.9      0.5  pass anim
61.6      0.6  pass vbn caus anim          50.2      0.6  pass vbn
60.1      0.4  vbn caus anim               50.2      0.5  pass
59.5      0.6  trans anim                  47.1      0.4  trans
59.4      0.5  vbn anim                    35.3      0.5  anim
57.4      0.6  pass vbn caus
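The overlap test described in Appendix B can be sketched in a few lines of Python; the function names are illustrative, not part of the original study.

```python
# Sketch of the significance test described in Appendix B: two accuracies
# differ at p < .05 only if their 95% confidence intervals do not overlap.
# 2.01 is the critical t-value for df = 49, as given in the text.

T_CRIT = 2.01

def conf_interval(accuracy, se):
    """95% confidence range: accuracy plus and minus 2.01 * standard error."""
    return (accuracy - T_CRIT * se, accuracy + T_CRIT * se)

def significantly_different(acc1, se1, acc2, se2):
    """True if the two 95% confidence intervals do not overlap."""
    lo1, hi1 = conf_interval(acc1, se1)
    lo2, hi2 = conf_interval(acc2, se2)
    return hi1 < lo2 or hi2 < lo1

# e.g. the full feature set (69.8, SE 0.5) differs significantly from
# anim alone (35.3, SE 0.5), but not from the tied subset at 69.8.
```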


Table 11
Percent accuracy of the verb classification task using features in combination, under a single hold-out training methodology.

     Features Used              Feature Not Used   Accuracy on All Verbs

1    trans pass vbn caus anim   –                  69.5
2    trans vbn caus anim        pass               71.2
3    trans pass vbn anim        caus               62.7
4    trans pass caus anim       vbn                61.0
5    trans pass vbn caus        anim               61.0
6    pass vbn caus anim         trans              64.4

Table 12
F score of classification within each class, under a single hold-out training methodology.

     Features Used              Feature     F score (%)   F score (%)   F score (%)
                                Not Used    for Unergs    for Unaccs    for Objdrops

1    trans pass vbn caus anim   –           73.9          68.6          64.9
2    trans vbn caus anim        pass        76.2          75.7          61.6
3    trans pass vbn anim        caus        65.1          60.0          62.8
4    trans pass caus anim       vbn         66.7          65.0          51.3
5    trans pass vbn caus        anim        72.7          47.0          60.0
6    pass vbn caus anim         trans       78.1          51.5          61.9

classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5–8% negative effect on performance, again similar to the cross-validation results. (Although note that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.)

Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall.8

The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9) than the unaccusative and object-drop verbs (F scores of 68.6 and 64.9, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.
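As a concrete illustration of the definitions given in footnote 8, a minimal sketch of the balanced F score (the function name is ours):

```python
def f_score(true_pos, false_pos, false_neg):
    """Balanced F score, 2PR/(P + R), with precision P and recall R
    defined from true/false positive and negative counts as in footnote 8."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

When precision and recall are equal, the balanced F score reduces to that common value.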

One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs, that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log

8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives/(truePositives + falseNegatives). In the earlier results there are no falsePositives or trueNegatives, since we are only considering, for each verb, whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence, here we use the balanced F score, which calculates an overall measure of performance as 2PR/(P + R), in which P (precision) is truePositives/(truePositives + falsePositives) and R (recall) is truePositives/(truePositives + falseNegatives).


frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8) but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.
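The frequency-band analysis above can be sketched as follows, assuming verbs are given as (frequency, correctly-classified) pairs; the function name and band boundaries are illustrative:

```python
import math

def error_rate_by_band(verbs, boundaries=(2, 3)):
    """Proportion of classification errors per log10-frequency band:
    band 0 is log frequency < 2, band 1 is between 2 and 3, band 2 is > 3.
    `verbs` is a list of (frequency, correctly_classified) pairs."""
    tallies = {}
    for freq, correct in verbs:
        # Count how many band boundaries this verb's log frequency exceeds.
        band = sum(math.log10(freq) >= b for b in boundaries)
        errs, total = tallies.get(band, (0, 0))
        tallies[band] = (errs + (not correct), total + 1)
    return {band: errs / total for band, (errs, total) in tallies.items()}
```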

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification

We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans pass vbn caus anim) with the class labels assigned using each size 4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature in going from the four-feature to five-feature set. Thus we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2 and summarized here in Table 13.

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently "uncontrolled", and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.


Table 13
Expected class discriminations for each feature.

Feature          Expected Frequency Pattern

Transitivity     Unerg < Unacc < ObjDrop
Causativity      Unerg, ObjDrop < Unacc
Animacy          Unacc < Unerg, ObjDrop
Passive Voice    Unerg < Unacc < ObjDrop
VBN Tag          Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating number of errors in classification by verb class for the full set of five features, compared to two of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                            Assigned Class

               All features     w/o caus        w/o anim
                E   A   O       E   A   O       E   A   O

Actual   E      –   1   2       –   4   2       –   2   2
Class    A      4   –   3       5   –   2       5   –   6
         O      5   3   –       4   5   –       3   5   –

Table 15
Confusion matrix indicating number of errors in classification by verb class for the full set of five features and for three of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                            Assigned Class

               All features   w/o trans      w/o pass       w/o vbn
                E   A   O      E   A   O      E   A   O      E   A   O

Actual   E      –   1   2      –   2   2      –   1   3      –   1   5
Class    A      4   –   3      3   –   7      1   –   4      2   –   4
         O      5   3   –      2   5   –      5   3   –      4   6   –

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
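The cell-pair reading just described can be made concrete with a small sketch (the names and data layout are ours); the example data are the off-diagonal error counts of the "All features" panel of Table 14:

```python
# Off-diagonal error counts from the "All features" panel of Table 14;
# diagonal cells (correct classifications) are set to 0 since only
# errors are tabulated. E = unergative, A = unaccusative, O = object-drop.
errors = {
    "E": {"E": 0, "A": 1, "O": 2},
    "A": {"E": 4, "A": 0, "O": 3},
    "O": {"E": 5, "A": 3, "O": 0},
}

def confusability(confusion, class1, class2):
    """Sum of the two cells opposite the diagonal: verbs of class1
    assigned the label class2, plus verbs of class2 assigned class1."""
    return confusion[class1][class2] + confusion[class2][class1]
```

For this panel, the unergative/unaccusative, unaccusative/object-drop, and unergative/object-drop confusabilities come out as 5, 6, and 7 errors, matching the counts cited in the text for the full feature set.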

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features


we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives and between unaccusatives and object-drops is 9 and 7 errors, respectively; but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors of 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated to trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans, that is, also making a three-way distinction among the classes, although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method to evaluating our approach to verb classification.


Table 16
Percent agreement (Agr) and pair-wise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

          Program         E1            E2            E3
        Agr     K      Agr    K      Agr    K      Agr    K

E1       59    .36
E2       68    .50      75   .59
E3       66    .49      70   .53      77   .66
Levin  69.5    .54      71   .56    86.5   .80      83   .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results, provided by the non-forced-choice task, where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from Chicago University Press. The verbs were to be classified into the three target classes (unergative, unaccusative, and object-drop), which were described in the instructions.10 (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training, which yields an accuracy of 69.5%, because those results provide the classification for each of the individual verbs.


an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55–57) (using the z distribution to determine significance; p < .0001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0, for no agreement above chance, to 1, for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
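A standard formulation of the pairwise Kappa statistic can be sketched as follows; this is the usual Cohen-style calculation, and is not guaranteed to match Klauer's (1987) variant in every detail:

```python
from collections import Counter

def kappa(labels1, labels2):
    """Agreement above chance between two annotators' label sequences.
    Expected chance agreement is computed from each annotator's own
    class proportions, so two pairs with equal percent agreement can
    still receive different Kappa values, as noted in the text."""
    n = len(labels1)
    observed = sum(a == b for a, b in zip(labels1, labels2)) / n
    counts1, counts2 = Counter(labels1), Counter(labels2)
    expected = sum(counts1[c] * counts2.get(c, 0) for c in counts1) / (n * n)
    return (observed - expected) / (1 - expected)
```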

Evaluating the experts' performance, summarized in Table 16, we can remark two things, which confirm our expectations. First, the task is difficult, i.e., not performed at 100% (or close) even by trained experts, when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .6) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
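The majority-vote combination just described can be sketched as follows (the function name and fallback convention are ours):

```python
from collections import Counter

def majority_vote(votes_per_verb, fallback_expert):
    """Assign each verb the label given by at least two experts; when all
    experts disagree, fall back to the label of the expert at index
    `fallback_expert` (the most accurate expert, as in the text)."""
    combined = []
    for votes in votes_per_verb:  # one tuple of expert labels per verb
        label, count = Counter(votes).most_common(1)[0]
        combined.append(label if count >= 2 else votes[fallback_expert])
    return combined
```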

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task, 86.5%, as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than one of the human experts, in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.

One interesting question is whether experts and program disagree on the same verbs and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard. However, in 9 cases, at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3), and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This


clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1000), and those with log frequency over 3 (i.e., frequency over 1000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops), the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops), and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification, nor the accuracy of the expert that had the best agreement with Levin, were linearly affected by frequency. For the majority vote: verbs with frequency less than 100 yield an accuracy of 92%, K = .84; verbs with frequency between 100 and 1000, accuracy 80%, K = .69; and verbs with frequency over 1000, accuracy 90%, K = .82. For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5%, K = .74; verbs with frequency between 100 and 1000, accuracy 84%, K = .76; and verbs with frequency over 1000, accuracy 90%, K = .82.

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs taken from Levin's (1993) electronic index, as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = .64) with each other, and 86% (K = .80) and 69% (K = .57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = .56] with each other, and 68% [K = .55] and 60.5% [K = .46] with the gold standard, respectively.) These results show that the forced-


choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally in syntactic alternations, giving rise to differences in subcategorization frames, or the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs, their argument structure, are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989] and Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a


"change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples, meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998, 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993]; Sanfilippo and Poznanski [1992]; Manning [1993]; Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary, reusable resources such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed, which can work on a rough collection of texts and do not require a carefully balanced corpus or time-consuming semantic tagging.

Computational Linguistics Volume 27 Number 3

7. Related Work

We conclude from the discussion above that our own work and the work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy, compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multilingual texts. Like ours, their work goes beyond statistics over subcategorizations to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.


Merlo and Stevenson Statistical Verb Classification

Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45% with balanced precision and recall.[12] The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
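The F-score cited above is the balanced harmonic mean of precision and recall; as a quick check of the arithmetic on the quoted figures (precision 61%, recall 36%):

```python
# Balanced F-score (harmonic mean of precision and recall).
# Illustrative arithmetic only, using the percentages quoted in the text.

def f_score(precision: float, recall: float) -> float:
    """F1: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f_score(0.61, 0.36), 2))  # → 0.45, i.e., the 45% F-score cited
```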

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83% respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.
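The slot-similarity test can be illustrated with a minimal sketch. This is not McCarthy's implementation (she uses probability distributions over WordNet classes, and a different similarity measure); it approximates the simpler head-word variant mentioned above, with hypothetical counts and a hypothetical threshold:

```python
# Illustrative sketch (not McCarthy's actual method): test whether a verb
# alternates by comparing the head nouns observed in its intransitive-subject
# slot with those in its transitive-object slot. All data and the 0.5
# threshold are hypothetical.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two head-word count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def slots_alternate(subj_heads: Counter, obj_heads: Counter,
                    threshold: float = 0.5) -> bool:
    # High similarity between the fillers of the two slots suggests the
    # verb participates in the causative alternation.
    return cosine(subj_heads, obj_heads) >= threshold

# "The door opened" / "She opened the door": largely the same fillers.
subj = Counter({"door": 5, "window": 3, "market": 1})
obj = Counter({"door": 4, "window": 2, "account": 1})
print(slots_alternate(subj, obj))  # → True for this hypothetical data
```

A verb like run, whose intransitive subjects (people) share little with its transitive objects (races, errands), would score low under the same test.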

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.[13] This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and of Lapata and Brew (1999) in attempting to learn semantic notions from distributions of indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

[12] A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

[13] Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.
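The kind of feature-ablation comparison just described (accuracy with a feature subset versus the full set) can be sketched as a harness over all non-empty subsets of the five features. The paper used C4.5; here a nearest-centroid classifier with leave-one-out scoring stands in for it, and the six feature vectors are hypothetical, not drawn from the experimental data:

```python
# Sketch of a feature-ablation harness: score a classifier on every
# non-empty subset of the five features. Nearest-centroid with leave-one-out
# scoring is a stand-in for the paper's C4.5 setup; the data is hypothetical.
from itertools import combinations

FEATURES = ["trans", "pass", "vbn", "caus", "anim"]

def nearest_centroid_loo(rows, labels, idx):
    """Leave-one-out accuracy of a nearest-centroid classifier restricted
    to the feature columns in idx."""
    correct = 0
    for held in range(len(rows)):
        by_class = {}
        for i, (r, y) in enumerate(zip(rows, labels)):
            if i != held:
                by_class.setdefault(y, []).append([r[j] for j in idx])
        centroids = {y: [sum(col) / len(col) for col in zip(*vs)]
                     for y, vs in by_class.items()}
        x = [rows[held][j] for j in idx]
        pred = min(centroids,
                   key=lambda y: sum((a - b) ** 2
                                     for a, b in zip(x, centroids[y])))
        correct += pred == labels[held]
    return correct / len(rows)

# Hypothetical normalized feature vectors, two verbs per class.
rows = [[0.1, 0.0, 0.2, 0.0, 0.3], [0.2, 0.1, 0.1, 0.0, 0.4],   # unergative
        [0.6, 0.5, 0.7, 0.3, 0.0], [0.7, 0.6, 0.8, 0.2, 0.1],   # unaccusative
        [0.8, 0.2, 0.7, 0.0, 0.2], [0.9, 0.3, 0.6, 0.1, 0.3]]   # object-drop
labels = ["unerg", "unerg", "unacc", "unacc", "objdrop", "objdrop"]

scores = {}
for k in range(1, len(FEATURES) + 1):
    for subset in combinations(range(len(FEATURES)), k):
        names = tuple(FEATURES[j] for j in subset)
        scores[names] = nearest_centroid_loo(rows, labels, subset)

print(len(scores))  # → 31 non-empty subsets of 5 features
```

Ranking `scores` by value reproduces the shape of Appendix B: one accuracy per feature subset, in decreasing order.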

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to the external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature), or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification, i.e., sets of verbs that have the same multi-way classifications according to Levin (1993), can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach, it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.

Acknowledgments
We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to P.M.); the United States NSF (grants 9702331 and 9818322 to S.S.); the Canadian NSERC (grant to S.S.); the University of Toronto; and the Information Sciences Council of Rutgers University. Much of this research was carried out while P.M. was a visiting scientist at IRCS, University of Pennsylvania, and while S.S. was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions, and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: the Kappa statistics. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.

Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe Zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: an overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1–2):217–222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.

Pinker, Steven. 1989. Learnability and Cognition: the Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.


Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs
             Freq   vbn   pass  trans  caus  anim

Min Value       8   0.00  0.00  0.00   0.00  0.00
Max Value    4088   1.00  0.39  0.74   0.00  1.00

floated       176   0.43  0.26  0.74   0.00  0.17
hurried        86   0.40  0.31  0.50   0.00  0.37
jumped       4088   0.09  0.00  0.03   0.00  0.20
leaped        225   0.09  0.00  0.05   0.00  0.13
marched       238   0.10  0.01  0.09   0.00  0.12
paraded        33   0.73  0.39  0.46   0.00  0.50
raced         123   0.01  0.00  0.06   0.00  0.15
rushed        467   0.22  0.12  0.20   0.00  0.10
vaulted        54   0.00  0.00  0.41   0.00  0.03
wandered       67   0.02  0.00  0.03   0.00  0.32
galloped       12   1.00  0.00  0.00   0.00  0.00
glided         14   0.00  0.00  0.08   0.00  0.50
hiked          25   0.28  0.12  0.29   0.00  0.40
hopped         29   0.00  0.00  0.21   0.00  1.00
jogged          8   0.29  0.00  0.29   0.00  0.33
scooted        10   0.00  0.00  0.43   0.00  0.00
scurried       21   0.00  0.00  0.00   0.00  0.14
skipped        82   0.22  0.02  0.64   0.00  0.16
tiptoed        12   0.17  0.00  0.00   0.00  0.00
trotted        37   0.19  0.17  0.07   0.00  0.18

Unaccusative Verbs
             Freq   vbn   pass  trans  caus  anim

Min Value      13   0.16  0.00  0.02   0.00  0.00
Max Value    5543   0.95  0.80  0.76   0.41  0.36

boiled         58   0.92  0.70  0.42   0.00  0.00
cracked       175   0.61  0.19  0.76   0.02  0.14
dissolved     226   0.51  0.58  0.71   0.05  0.11
exploded      409   0.34  0.02  0.66   0.37  0.04
flooded       235   0.47  0.57  0.44   0.04  0.03
fractured      55   0.95  0.76  0.51   0.00  0.00
hardened      123   0.92  0.55  0.56   0.12  0.00
melted         70   0.80  0.44  0.02   0.00  0.19
opened       3412   0.21  0.09  0.69   0.16  0.36
solidified     34   0.65  0.21  0.68   0.00  0.12
collapsed     950   0.16  0.00  0.16   0.01  0.02
cooled        232   0.85  0.21  0.29   0.13  0.11
folded        189   0.73  0.33  0.23   0.00  0.00
widened      1155   0.18  0.02  0.13   0.41  0.01
changed      5543   0.73  0.23  0.47   0.22  0.08
cleared      1145   0.58  0.40  0.50   0.31  0.06
divided      1539   0.93  0.80  0.17   0.10  0.05
simmered       13   0.83  0.00  0.09   0.00  0.00
stabilized    286   0.92  0.13  0.18   0.35  0.00

Object-Drop Verbs
             Freq   vbn   pass  trans  caus  anim

Min Value      39   0.10  0.04  0.21   0.00  0.00
Max Value   15063   0.95  0.99  1.00   0.24  0.42

carved        185   0.85  0.66  0.98   0.00  0.00
danced         88   0.22  0.14  0.37   0.00  0.00
kicked        308   0.30  0.18  0.97   0.00  0.33
knitted        39   0.95  0.99  0.93   0.00  0.00
painted       506   0.72  0.18  0.71   0.00  0.38
played       2689   0.38  0.16  0.24   0.00  0.00
reaped        172   0.56  0.05  0.90   0.00  0.22
typed          57   0.81  0.74  0.81   0.00  0.00
washed        137   0.79  0.60  1.00   0.00  0.00
yelled         74   0.10  0.04  0.38   0.00  0.00
borrowed     1188   0.77  0.15  0.60   0.13  0.19
inherited     357   0.60  0.13  0.64   0.06  0.32
organized    1504   0.85  0.38  0.65   0.18  0.07
rented        232   0.72  0.22  0.61   0.00  0.42
sketched       44   0.67  0.17  0.44   0.00  0.20
cleaned       160   0.83  0.47  0.21   0.05  0.21
packed        376   0.84  0.12  0.40   0.05  0.19
studied       901   0.66  0.17  0.57   0.05  0.11
swallowed     152   0.79  0.44  0.35   0.04  0.22
called      15063   0.56  0.22  0.72   0.24  0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.

Accuracy  SE   Features                    Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim    57.3      0.5  trans caus
69.8      0.5  trans vbn caus anim         57.3      0.5  pass vbn anim
67.3      0.6  trans pass vbn anim         56.7      0.5  pass caus anim
66.7      0.5  trans vbn anim              55.7      0.5  vbn caus
66.5      0.5  trans pass caus anim        55.7      0.1  caus
64.4      0.5  trans vbn caus              55.4      0.4  pass caus
63.2      0.6  trans pass vbn caus         55.0      0.6  trans pass vbn
63.0      0.5  trans pass anim             54.7      0.4  trans pass
62.9      0.4  trans caus anim             54.2      0.5  trans vbn
62.1      0.5  caus anim                   52.5      0.5  vbn
61.7      0.5  trans pass caus             50.9      0.5  pass anim
61.6      0.6  pass vbn caus anim          50.2      0.6  pass vbn
60.1      0.4  vbn caus anim               50.2      0.5  pass
59.5      0.6  trans anim                  47.1      0.4  trans
59.4      0.5  vbn anim                    35.3      0.5  anim
57.4      0.6  pass vbn caus
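The overlap test described above can be written as a small helper. The example comparisons below use accuracy/SE pairs taken from the table (full feature set 69.8 +/- 0.5; trans pass vbn 55.0 +/- 0.6; trans vbn caus 64.4 +/- 0.5):

```python
# Significance test from Appendix B: two accuracies differ at p < .05 only
# if their 95% confidence intervals (accuracy ± 2.01 × SE, df = 49) do not
# overlap.

T_CRIT = 2.01  # two-tailed t value for df = 49 at the .05 level

def significantly_different(acc1, se1, acc2, se2, t=T_CRIT):
    lo1, hi1 = acc1 - t * se1, acc1 + t * se1
    lo2, hi2 = acc2 - t * se2, acc2 + t * se2
    return hi1 < lo2 or hi2 < lo1  # True when the intervals do not overlap

# Full feature set (69.8 ± 0.5) vs. trans/pass/vbn only (55.0 ± 0.6):
print(significantly_different(69.8, 0.5, 55.0, 0.6))  # → True
# Full set vs. trans/vbn/caus (64.4 ± 0.5):
print(significantly_different(69.8, 0.5, 64.4, 0.5))  # → True
# Identical results trivially overlap:
print(significantly_different(69.8, 0.5, 69.8, 0.5))  # → False
```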


frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8), but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy. In particular, it is not the case that more frequent classes or verbs are more accurately classified.

One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging.9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the vbn feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.

4.4 Contribution of the Features to Classification
We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (trans pass vbn caus anim) with the class labels assigned using each size-4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature in going from the four-feature to the five-feature set. Thus we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2 and summarized here in Table 13.

We illustrate the data with a set of confusion matrices in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the

9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently "uncontrolled," and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail.


Merlo and Stevenson Statistical Verb Classification

Table 13
Expected class discriminations for each feature.

Feature          Expected Frequency Pattern

Transitivity     Unerg < Unacc < ObjDrop
Causativity      Unerg, ObjDrop < Unacc
Animacy          Unacc < Unerg, ObjDrop
Passive Voice    Unerg < Unacc < ObjDrop
VBN Tag          Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features compared to two of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops.

                          Assigned Class

              All features      w/o caus          w/o anim
              E    A    O       E    A    O       E    A    O

Actual   E    -    1    2       -    4    2       -    2    2
Class    A    4    -    3       5    -    2       5    -    6
         O    5    3    -       4    5    -       3    5    -

Table 15
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives; A = unaccusatives; O = object-drops.

                              Assigned Class

            All features    w/o trans      w/o pass       w/o vbn
            E   A   O       E   A   O      E   A   O      E   A   O

Actual  E   -   1   2       -   2   2      -   1   3      -   1   5
Class   A   4   -   3       3   -   7      1   -   4      2   -   4
        O   5   3   -       2   5   -      5   3   -      4   6   -

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
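The confusability computation just described can be sketched directly from a confusion matrix; the dict-of-dicts representation below is our own, with the error counts of the first (all-features) panel of Table 14 as data.

```python
# Confusability of two classes: the sum of the two off-diagonal cells,
# i.e., verbs of each class assigned the label of the other.

def confusability(matrix, class1, class2):
    return matrix[class1].get(class2, 0) + matrix[class2].get(class1, 0)

# Error counts from the first panel of Table 14 (all five features);
# E = unergative, A = unaccusative, O = object-drop.
all_features = {
    "E": {"A": 1, "O": 2},
    "A": {"E": 4, "O": 3},
    "O": {"E": 5, "A": 3},
}
```

With these counts, confusability(all_features, "E", "A") recovers the 5 unergative/unaccusative errors cited in the text for the full feature set.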

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features


we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives and between unaccusatives and object-drops is 9 and 7 errors, respectively; but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors from 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated with trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans (that is, also making a three-way distinction among the classes), although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method in evaluating our approach to verb classification.


Table 16
Percent agreement (Agr) and pair-wise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

          Program         E1              E2              E3
          Agr     K       Agr     K       Agr     K       Agr     K

E1        59      .36
E2        68      .50     75      .59
E3        66      .49     70      .53     77      .66
Levin     69.5    .54     71      .56     86.5    .80     83      .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results, provided by the non-forced-choice task, where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from Chicago University Press. The verbs were to be classified into the three target classes (unergative, unaccusative, and object-drop), which were described in the instructions.10 (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training (which yields an accuracy of 69.5%) because those results provide the classification for each of the individual verbs.


an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55–57) (using the z distribution to determine significance; p < .0001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
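As a rough illustration of the pairwise computation (a Cohen-style two-judge Kappa, not necessarily the exact formulation of Klauer [1987]), the chance correction can be sketched as:

```python
# Kappa: observed agreement corrected for chance agreement, where chance
# agreement depends on each judge's own proportions of class labels.
from collections import Counter

def kappa(labels1, labels2):
    n = len(labels1)
    observed = sum(a == b for a, b in zip(labels1, labels2)) / n
    p1, p2 = Counter(labels1), Counter(labels2)
    # Probability both judges pick the same class by chance, given their
    # individual labelling proportions.
    expected = sum((p1[c] / n) * (p2[c] / n) for c in p1)
    return (observed - expected) / (1 - expected)
```

Note that two pairs of judges with the same percent agreement but different label proportions get different Kappa values, which is exactly the point made above about expected chance agreement.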

Evaluating the experts' performance, summarized in Table 16, we can remark two things which confirm our expectations. First, the task is difficult, i.e., not performed at 100% (or close) even by trained experts when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .60) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
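The majority-vote combination can be sketched as follows; the tie-breaking fallback to the most accurate expert is simplified here to the first expert's label (an assumption of this sketch, not a detail from the paper).

```python
# Sketch of the majority-vote combination of three expert classifications.
from collections import Counter

def majority_vote(expert1, expert2, expert3):
    """Label each verb with the class chosen by at least two experts;
    fall back to expert1 (assumed most accurate) when all three disagree."""
    combined = []
    for labels in zip(expert1, expert2, expert3):
        label, count = Counter(labels).most_common(1)[0]
        combined.append(label if count >= 2 else labels[0])
    return combined
```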

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task (86.5%) as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than that of one of the human experts, in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.

One interesting question is whether experts and program disagree on the same verbs, and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard. However, in 9 cases at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3), and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This


clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1,000), and those with log frequency over 3 (i.e., frequency over 1,000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops), the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops), and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote, verbs with frequency less than 100 yield an accuracy of 92% (K = .84); verbs with frequency between 100 and 1,000, an accuracy of 80% (K = .69); and verbs with frequency over 1,000, an accuracy of 90% (K = .82). For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5% (K = .74); verbs with frequency between 100 and 1,000, an accuracy of 84% (K = .76); and verbs with frequency over 1,000, an accuracy of 90% (K = .82).

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest-frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs, taken from Levin's (1993) electronic index as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = .64) with each other, and 86% (K = .80) and 69% (K = .57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = .56] with each other, and 68% [K = .55] and 60.5% [K = .46] with the gold standard, respectively.) These results show that the forced-choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally in syntactic alternations, giving rise to differences in subcategorization frames, or the distributions of their usage, or in the properties of the NP arguments of a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification
Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs (their argument structure) are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989] and Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics
Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a


"change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998, 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora
Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993], Sanfilippo and Poznanski [1992], Manning [1993], Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989) or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed, which can work


on a rough collection of texts, and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and the work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multi-lingual texts. Like ours, their work goes beyond statistics over subcategorizations, to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.


Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45% with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.
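The intuition behind this test, in its simpler head-word variant, can be sketched as follows. This is our illustrative reconstruction, not McCarthy's implementation: we compare raw head-word frequency distributions with cosine similarity rather than probability distributions over WordNet classes, and the function names, threshold, and toy data are all our own assumptions.

```python
from collections import Counter
from math import sqrt

def cosine(c1: Counter, c2: Counter) -> float:
    """Cosine similarity between two head-word frequency distributions."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def looks_causative(intrans_subjects, trans_objects, trans_subjects, threshold=0.2):
    """Flag a verb as a causative-alternation candidate when the heads filling
    the intransitive-subject slot resemble those filling the transitive-object
    slot more than those filling the transitive-subject slot."""
    alternating = cosine(Counter(intrans_subjects), Counter(trans_objects))
    control = cosine(Counter(intrans_subjects), Counter(trans_subjects))
    return alternating - control > threshold

# "The soup boiled" / "She boiled the soup": heads of the intransitive subject
# slot reappear as heads of the transitive object slot.
print(looks_causative(
    ["soup", "water", "milk"],    # intransitive subject heads
    ["soup", "water", "egg"],     # transitive object heads
    ["cook", "she", "chef"]))     # transitive subject heads -> True
```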

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and of Lapata and Brew (1999) in attempting to learn semantic notions from distributions of

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.


Computational Linguistics, Volume 27, Number 3

indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and of the animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to the external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature), or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).
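As a rough illustration of how subject animacy can be approximated with pronoun counts alone, consider the following sketch. The pronoun inventory and the input format (one head word per subject occurrence) are our simplifying assumptions for exposition; the actual feature is computed over automatically annotated corpus data.

```python
# Animate personal pronouns used as a proxy for animate subjects.
ANIMATE_PRONOUNS = {"i", "we", "you", "he", "she", "they"}

def anim_score(subject_heads):
    """Relative frequency of animate-pronoun subjects among all subjects of a verb."""
    if not subject_heads:
        return 0.0
    hits = sum(1 for h in subject_heads if h.lower() in ANIMATE_PRONOUNS)
    return hits / len(subject_heads)

# Two of four subject occurrences are animate pronouns.
print(anim_score(["she", "boy", "they", "storm"]))  # 0.5
```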

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification (i.e., sets of verbs that have the same multi-way classifications according to Levin [1993]) can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).
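One simple way such a combination could be realized, offered here only as an illustrative sketch (the naive-independence assumption and all probability values are our own, not a description of the planned system), is to treat the type-level class distribution as a prior and the token's observed features as evidence:

```python
def classify_token(type_prior, feature_likelihoods, observed_features):
    """Return the class maximizing prior(class) * product of P(feature | class),
    together with the normalized posterior distribution."""
    scores = {}
    for cls, prior in type_prior.items():
        score = prior
        for f in observed_features:
            # Tiny floor for features unattested with this class.
            score *= feature_likelihoods[cls].get(f, 1e-6)
        scores[cls] = score
    total = sum(scores.values())
    return max(scores, key=scores.get), {c: s / total for c, s in scores.items()}

# Invented illustrative numbers: a type-level bias toward unergative,
# overridden by a transitive-use observation on this token.
prior = {"unerg": 0.6, "unacc": 0.3, "objdrop": 0.1}
likelihoods = {
    "unerg":   {"transitive_use": 0.1, "animate_subj": 0.8},
    "unacc":   {"transitive_use": 0.4, "animate_subj": 0.1},
    "objdrop": {"transitive_use": 0.7, "animate_subj": 0.5},
}
best, _ = classify_token(prior, likelihoods, ["transitive_use"])
print(best)  # unacc
```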

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and a learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach, it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of



the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.
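The reduction-in-error figures follow directly from the accuracy, the chance baseline, and (for the second figure) the expert-based upper bound; a quick check (the function name is ours):

```python
def error_reduction(accuracy, baseline, ceiling=100.0):
    """Proportion of the error between baseline and ceiling that is eliminated."""
    return (accuracy - baseline) / (ceiling - baseline)

# 69.8% accuracy against the 33.9% chance baseline:
print(round(100 * error_reduction(69.8, 33.9)))        # 54 (against perfect accuracy)
print(round(100 * error_reduction(69.8, 33.9, 86.5)))  # 68 (against the 86.5% upper bound)
```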

Acknowledgments
We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM); the United States NSF (grants 9702331 and 9818322 to SS); the Canadian NSERC (grant to SS); the University of Toronto; and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.



Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: An overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1–2):217–222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.



Pinker, Steven. 1989. Learnability and Cognition: The Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: An information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using Machine Learning Methods to Combine Corpus-Based Indicators for Aspectual Classification of Clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.



Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs

Verb        Freq   vbn   pass  trans  caus  anim

Min Value      8   0.00  0.00  0.00   0.00  0.00
Max Value   4088   1.00  0.39  0.74   0.00  1.00

floated      176   0.43  0.26  0.74   0.00  0.17
hurried       86   0.40  0.31  0.50   0.00  0.37
jumped      4088   0.09  0.00  0.03   0.00  0.20
leaped       225   0.09  0.00  0.05   0.00  0.13
marched      238   0.10  0.01  0.09   0.00  0.12
paraded       33   0.73  0.39  0.46   0.00  0.50
raced        123   0.01  0.00  0.06   0.00  0.15
rushed       467   0.22  0.12  0.20   0.00  0.10
vaulted       54   0.00  0.00  0.41   0.00  0.03
wandered      67   0.02  0.00  0.03   0.00  0.32
galloped      12   1.00  0.00  0.00   0.00  0.00
glided        14   0.00  0.00  0.08   0.00  0.50
hiked         25   0.28  0.12  0.29   0.00  0.40
hopped        29   0.00  0.00  0.21   0.00  1.00
jogged         8   0.29  0.00  0.29   0.00  0.33
scooted       10   0.00  0.00  0.43   0.00  0.00
scurried      21   0.00  0.00  0.00   0.00  0.14
skipped       82   0.22  0.02  0.64   0.00  0.16
tiptoed       12   0.17  0.00  0.00   0.00  0.00
trotted       37   0.19  0.17  0.07   0.00  0.18

Unaccusative Verbs

Verb        Freq   vbn   pass  trans  caus  anim

Min Value     13   0.16  0.00  0.02   0.00  0.00
Max Value   5543   0.95  0.80  0.76   0.41  0.36

boiled        58   0.92  0.70  0.42   0.00  0.00
cracked      175   0.61  0.19  0.76   0.02  0.14
dissolved    226   0.51  0.58  0.71   0.05  0.11
exploded     409   0.34  0.02  0.66   0.37  0.04
flooded      235   0.47  0.57  0.44   0.04  0.03
fractured     55   0.95  0.76  0.51   0.00  0.00
hardened     123   0.92  0.55  0.56   0.12  0.00
melted        70   0.80  0.44  0.02   0.00  0.19
opened      3412   0.21  0.09  0.69   0.16  0.36
solidified    34   0.65  0.21  0.68   0.00  0.12
collapsed    950   0.16  0.00  0.16   0.01  0.02
cooled       232   0.85  0.21  0.29   0.13  0.11
folded       189   0.73  0.33  0.23   0.00  0.00
widened     1155   0.18  0.02  0.13   0.41  0.01
changed     5543   0.73  0.23  0.47   0.22  0.08
cleared     1145   0.58  0.40  0.50   0.31  0.06
divided     1539   0.93  0.80  0.17   0.10  0.05
simmered      13   0.83  0.00  0.09   0.00  0.00
stabilized   286   0.92  0.13  0.18   0.35  0.00



Object-Drop Verbs

Verb        Freq   vbn   pass  trans  caus  anim

Min Value     39   0.10  0.04  0.21   0.00  0.00
Max Value  15063   0.95  0.99  1.00   0.24  0.42

carved       185   0.85  0.66  0.98   0.00  0.00
danced        88   0.22  0.14  0.37   0.00  0.00
kicked       308   0.30  0.18  0.97   0.00  0.33
knitted       39   0.95  0.99  0.93   0.00  0.00
painted      506   0.72  0.18  0.71   0.00  0.38
played      2689   0.38  0.16  0.24   0.00  0.00
reaped       172   0.56  0.05  0.90   0.00  0.22
typed         57   0.81  0.74  0.81   0.00  0.00
washed       137   0.79  0.60  1.00   0.00  0.00
yelled        74   0.10  0.04  0.38   0.00  0.00
borrowed    1188   0.77  0.15  0.60   0.13  0.19
inherited    357   0.60  0.13  0.64   0.06  0.32
organized   1504   0.85  0.38  0.65   0.18  0.07
rented       232   0.72  0.22  0.61   0.00  0.42
sketched      44   0.67  0.17  0.44   0.00  0.20
cleaned      160   0.83  0.47  0.21   0.05  0.21
packed       376   0.84  0.12  0.40   0.05  0.19
studied      901   0.66  0.17  0.57   0.05  0.11
swallowed    152   0.79  0.44  0.35   0.04  0.22
called     15063   0.56  0.22  0.72   0.24  0.16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.
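The overlap test just described can be mechanized as follows (a sketch; the helper name is ours, and the example rows are taken from the table in this appendix):

```python
def significantly_different(acc1, se1, acc2, se2, t=2.01):
    """True when the 95% confidence intervals (acc +/- t*SE, df = 49) do not overlap."""
    lo1, hi1 = acc1 - t * se1, acc1 + t * se1
    lo2, hi2 = acc2 - t * se2, acc2 + t * se2
    return hi1 < lo2 or hi2 < lo1

# Full feature set (69.8, SE 0.5) vs. trans alone (47.1, SE 0.4):
print(significantly_different(69.8, 0.5, 47.1, 0.4))  # True
# Full feature set vs. the set without pass (also 69.8, SE 0.5):
print(significantly_different(69.8, 0.5, 69.8, 0.5))  # False
```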

Accuracy  SE   Features                        Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim        57.3      0.5  trans caus
69.8      0.5  trans vbn caus anim             57.3      0.5  pass vbn anim
67.3      0.6  trans pass vbn anim             56.7      0.5  pass caus anim
66.7      0.5  trans vbn anim                  55.7      0.5  vbn caus
66.5      0.5  trans pass caus anim            55.7      0.1  caus
64.4      0.5  trans vbn caus                  55.4      0.4  pass caus
63.2      0.6  trans pass vbn caus             55.0      0.6  trans pass vbn
63.0      0.5  trans pass anim                 54.7      0.4  trans pass
62.9      0.4  trans caus anim                 54.2      0.5  trans vbn
62.1      0.5  caus anim                       52.5      0.5  vbn
61.7      0.5  trans pass caus                 50.9      0.5  pass anim
61.6      0.6  pass vbn caus anim              50.2      0.6  pass vbn
60.1      0.4  vbn caus anim                   50.2      0.5  pass
59.5      0.6  trans anim                      47.1      0.4  trans
59.4      0.5  vbn anim                        35.3      0.5  anim
57.4      0.6  pass vbn caus




Table 13
Expected class discriminations for each feature.

Feature          Expected Frequency Pattern

Transitivity     Unerg < Unacc < ObjDrop
Causativity      Unerg < ObjDrop < Unacc
Animacy          Unacc < Unerg < ObjDrop
Passive Voice    Unerg < Unacc < ObjDrop
VBN Tag          Unerg < Unacc < ObjDrop

Table 14
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features compared to two of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                          Assigned Class

              All features     w/o caus       w/o anim
               E   A   O      E   A   O      E   A   O

Actual   E         1   2          4   2          2   2
Class    A     4       3      5       2      5       6
         O     5   3          4   5          3   5

Table 15
Confusion matrix indicating number of errors in classification by verb class, for the full set of five features and for three of the four-feature sets. E = unergatives, A = unaccusatives, O = object-drops.

                                Assigned Class

              All features    w/o trans     w/o pass      w/o vbn
               E   A   O     E   A   O     E   A   O     E   A   O

Actual   E         1   2         2   2         1   3         1   5
Class    A     4       3     3       7     1       4     2       4
         O     5   3         2   5         5   3         4   6

opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.
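In other words, the confusability of two classes is just the sum of the two cells opposite the diagonal; a minimal sketch (the dictionary encoding and the error counts for the full feature set follow Tables 14 and 15):

```python
def confusability(matrix, c1, c2):
    """Sum of the two off-diagonal cells in which one class is mislabeled
    as the other; matrix[actual][assigned] holds error counts."""
    return matrix[c1][c2] + matrix[c2][c1]

# Errors of the full five-feature classifier (first panel of Tables 14/15);
# diagonal cells (correct classifications) are not error counts, so 0 here.
all_features = {
    "E": {"E": 0, "A": 1, "O": 2},
    "A": {"E": 4, "A": 0, "O": 3},
    "O": {"E": 5, "A": 3, "O": 0},
}
print(confusability(all_features, "E", "A"))  # 5
print(confusability(all_features, "A", "O"))  # 6
```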

An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features



we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives, and between unaccusatives and object-drops, is 9 and 7 errors, respectively; but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors from 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated to trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First, consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened, with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans (that is, also making a three-way distinction among the classes), although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method to evaluating our approach to verb classification.



Table 16
Percent agreement (Agr) and pair-wise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

          Program         E1            E2            E3
         Agr    K      Agr    K      Agr    K      Agr    K

E1        59   .36
E2        68   .50      75   .59
E3        66   .49      70   .53      77   .66
Levin   69.5   .54      71   .56    86.5   .80     83   .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results, provided by the non-forced-choice task, where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index available from Chicago University Press. The verbs were to be classified into the three target classes, unergative, unaccusative, and object-drop, which were described in the instructions.10 (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html.)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training (which yields an accuracy of 69.5%), because those results provide the classification for each of the individual verbs.


Computational Linguistics Volume 27 Number 3

an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55-57) (using the z distribution to determine significance; p < .0001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
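As a concrete illustration, the pairwise statistic can be sketched as follows. This is a minimal implementation of the standard two-rater kappa, with chance agreement estimated from each rater's own label proportions (which is why equal percent agreement can yield unequal kappas); the verb labels in the example are hypothetical, not data from the study.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Agreement above chance between two raters over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed proportion of items on which the two raters agree.
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical three-class labels for six verbs:
gold = ["unerg", "unacc", "odrop", "unerg", "unacc", "odrop"]
expert = ["unerg", "unacc", "unacc", "unerg", "odrop", "odrop"]
print(round(cohen_kappa(gold, expert), 2))  # 0.5
```

Here percent agreement is 4/6 but chance agreement is 1/3, so kappa credits only the agreement above that floor.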

Evaluating the experts' performance, summarized in Table 16, we can remark two things which confirm our expectations. First, the task is difficult, i.e., not performed at 100% (or close) even by trained experts when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .60) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
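The combination scheme just described can be sketched in a few lines (the votes shown are invented for illustration; the tie-break falls back to the label of the expert with the best agreement with the gold standard):

```python
from collections import Counter

def majority_label(votes, tiebreak_label):
    """Return the label given by at least two of three experts; when all
    three disagree, fall back to the most accurate expert's label."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else tiebreak_label

# Hypothetical votes by E1, E2, E3 for one verb; E2 breaks ties.
print(majority_label(["unacc", "odrop", "unacc"], tiebreak_label="odrop"))
# unacc
```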

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task, 86.5%, as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than that of one of the human experts in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.
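The 68% figure measures the program's progress from the chance baseline toward the expert-based ceiling, rather than toward perfect accuracy. Using the accuracies reported in the text:

```python
chance = 33.9    # baseline (chance) accuracy, %
ceiling = 86.5   # best expert agreement with the gold standard, %
program = 69.5   # classifier accuracy under single hold-out, %

# Fraction of the chance-to-ceiling gap closed by the program.
reduction = (program - chance) / (ceiling - chance)
print(f"{100 * reduction:.0f}%")  # 68%
```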

One interesting question is whether experts and program disagree on the same verbs, and show similar patterns of errors. The program makes 18 errors in total compared to the gold standard. However, in 9 cases, at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3), and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This



clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1,000), and those with log frequency over 3 (i.e., frequency over 1,000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops), the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops), and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote, verbs with frequency less than 100 yield an accuracy of 92%, K = .84; verbs with frequency between 100 and 1,000, accuracy 80%, K = .69; and for verbs with frequency over 1,000, accuracy 90%, K = .82. For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5%, K = .74; verbs with frequency between 100 and 1,000, accuracy 84%, K = .76; and verbs with frequency over 1,000, accuracy 90%, K = .82.
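The banding described above can be stated as a one-line decision on log10 frequency (a sketch; the corpus counts in the example are invented):

```python
import math

def freq_band(corpus_frequency):
    """Band a verb by log10 corpus frequency: low (< 100 occurrences),
    mid (100 to 1,000), high (> 1,000)."""
    logf = math.log10(corpus_frequency)
    if logf < 2:
        return "low"
    if logf <= 3:
        return "mid"
    return "high"

print([freq_band(f) for f in [40, 250, 4200]])  # ['low', 'mid', 'high']
```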

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs, taken from Levin's (1993) electronic index as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = .64) with each other, and 86% (K = .80) and 69% (K = .57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = .56] with each other, and 68% [K = .55] and 60.5% [K = .46] with the gold standard, respectively.) These results show that the forced-



choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally, in syntactic alternations, giving rise to differences in subcategorization frames or the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs, their argument structure, are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989], Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy in our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a



"change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples, meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998; Siegel 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that, if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993], Sanfilippo and Poznanski [1992], Manning [1993], Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed which can work



on a rough collection of texts, and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition, or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly, as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE, and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy, compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multilingual texts. Like ours, their work goes beyond statistics over subcategorizations, to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.



Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45% with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
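The F-score we calculate is the balanced (harmonic-mean) combination of the reported precision and recall:

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Schulte im Walde's best reported result: P = 61%, R = 36%.
print(round(f_score(0.61, 0.36), 2))  # 0.45
```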

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.
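The intuition behind the slot-comparison test can be sketched as follows. This is not McCarthy's WordNet-based measure but an illustration in the spirit of the simpler head-word variant she reports, using cosine similarity over raw head-word counts; the verb and all counts are invented:

```python
from collections import Counter
import math

def cosine(c1, c2):
    """Cosine similarity between two head-word count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = math.sqrt(sum(v * v for v in c1.values())) \
         * math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

# Hypothetical head-word counts for a causative verb such as "melt":
intrans_subj = Counter({"ice": 5, "snow": 3, "butter": 2})  # "the ice melted"
trans_obj = Counter({"ice": 4, "butter": 3, "snow": 2})     # "the sun melted the ice"
# A non-alternating slot pairing would show little overlap:
no_alt_obj = Counter({"report": 4, "letter": 3, "book": 2})

# Alternating slots share fillers, so their similarity is higher.
print(cosine(intrans_subj, trans_obj) > cosine(intrans_subj, no_alt_obj))  # True
```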

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge, such as the LDOCE, and in addition, we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and Lapata and Brew (1999) in attempting to learn semantic notions from distributions of

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.



indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised, that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and of animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at relative frequency of subcategorization frames (as with our trans feature), or relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations, across a text, between thematic arguments in different alternations (with our caus feature).

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus, her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations, would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets, introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification, i.e., sets of verbs that have the same multi-way classifications according to Levin (1993), can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens, by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999], Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction, or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated – in our approach it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.
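The two reductions follow directly from the accuracy figures quoted above: relative to the 33.9% chance baseline, the classifier's 69.8% accuracy covers roughly 54% of the distance to perfect accuracy, and roughly 68% of the distance to the 86.5% expert-based upper bound:

```latex
\[
\frac{69.8 - 33.9}{100 - 33.9} \approx 0.54
\qquad\qquad
\frac{69.8 - 33.9}{86.5 - 33.9} \approx 0.68
\]
```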

Acknowledgments

We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM); the United States NSF (grants 9702331 and 9818322 to SS); the Canadian NSERC (grant to SS); the University of Toronto; and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: the Kappa statistic. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.

Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness – an overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1–2):217–222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.

Pinker, Steven. 1989. Learnability and Cognition: the Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.


Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs

             Freq   vbn   pass  trans  caus  anim
Min Value       8   0.00  0.00  0.00   0.00  0.00
Max Value    4088   1.00  0.39  0.74   0.00  1.00

floated       176   0.43  0.26  0.74   0.00  0.17
hurried        86   0.40  0.31  0.50   0.00  0.37
jumped       4088   0.09  0.00  0.03   0.00  0.20
leaped        225   0.09  0.00  0.05   0.00  0.13
marched       238   0.10  0.01  0.09   0.00  0.12
paraded        33   0.73  0.39  0.46   0.00  0.50
raced         123   0.01  0.00  0.06   0.00  0.15
rushed        467   0.22  0.12  0.20   0.00  0.10
vaulted        54   0.00  0.00  0.41   0.00  0.03
wandered       67   0.02  0.00  0.03   0.00  0.32
galloped       12   1.00  0.00  0.00   0.00  0.00
glided         14   0.00  0.00  0.08   0.00  0.50
hiked          25   0.28  0.12  0.29   0.00  0.40
hopped         29   0.00  0.00  0.21   0.00  1.00
jogged          8   0.29  0.00  0.29   0.00  0.33
scooted        10   0.00  0.00  0.43   0.00  0.00
scurried       21   0.00  0.00  0.00   0.00  0.14
skipped        82   0.22  0.02  0.64   0.00  0.16
tiptoed        12   0.17  0.00  0.00   0.00  0.00
trotted        37   0.19  0.17  0.07   0.00  0.18

Unaccusative Verbs

             Freq   vbn   pass  trans  caus  anim
Min Value      13   0.16  0.00  0.02   0.00  0.00
Max Value    5543   0.95  0.80  0.76   0.41  0.36

boiled         58   0.92  0.70  0.42   0.00  0.00
cracked       175   0.61  0.19  0.76   0.02  0.14
dissolved     226   0.51  0.58  0.71   0.05  0.11
exploded      409   0.34  0.02  0.66   0.37  0.04
flooded       235   0.47  0.57  0.44   0.04  0.03
fractured      55   0.95  0.76  0.51   0.00  0.00
hardened      123   0.92  0.55  0.56   0.12  0.00
melted         70   0.80  0.44  0.02   0.00  0.19
opened       3412   0.21  0.09  0.69   0.16  0.36
solidified     34   0.65  0.21  0.68   0.00  0.12
collapsed     950   0.16  0.00  0.16   0.01  0.02
cooled        232   0.85  0.21  0.29   0.13  0.11
folded        189   0.73  0.33  0.23   0.00  0.00
widened      1155   0.18  0.02  0.13   0.41  0.01
changed      5543   0.73  0.23  0.47   0.22  0.08
cleared      1145   0.58  0.40  0.50   0.31  0.06
divided      1539   0.93  0.80  0.17   0.10  0.05
simmered       13   0.83  0.00  0.09   0.00  0.00
stabilized    286   0.92  0.13  0.18   0.35  0.00


Object-Drop Verbs

             Freq   vbn   pass  trans  caus  anim
Min Value      39   0.10  0.04  0.21   0.00  0.00
Max Value   15063   0.95  0.99  1.00   0.24  0.42

carved        185   0.85  0.66  0.98   0.00  0.00
danced         88   0.22  0.14  0.37   0.00  0.00
kicked        308   0.30  0.18  0.97   0.00  0.33
knitted        39   0.95  0.99  0.93   0.00  0.00
painted       506   0.72  0.18  0.71   0.00  0.38
played       2689   0.38  0.16  0.24   0.00  0.00
reaped        172   0.56  0.05  0.90   0.00  0.22
typed          57   0.81  0.74  0.81   0.00  0.00
washed        137   0.79  0.60  1.00   0.00  0.00
yelled         74   0.10  0.04  0.38   0.00  0.00
borrowed     1188   0.77  0.15  0.60   0.13  0.19
inherited     357   0.60  0.13  0.64   0.06  0.32
organized    1504   0.85  0.38  0.65   0.18  0.07
rented        232   0.72  0.22  0.61   0.00  0.42
sketched       44   0.67  0.17  0.44   0.00  0.20
cleaned       160   0.83  0.47  0.21   0.05  0.21
packed        376   0.84  0.12  0.40   0.05  0.19
studied       901   0.66  0.17  0.57   0.05  0.11
swallowed     152   0.79  0.44  0.35   0.04  0.22
called      15063   0.56  0.22  0.72   0.24  0.16
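These feature vectors are exactly what a supervised learner consumes. The experiments in the paper use decision-tree induction (C4.5; Quinlan 1992); purely as an illustration, the sketch below substitutes a simple nearest-neighbour rule over a handful of the vectors above and still separates clear cases of the three classes.

```python
# Illustration only: the paper's classifier is a C4.5 decision tree, not a
# nearest-neighbour rule. This sketch just shows that the five normalized
# features (vbn, pass, trans, caus, anim) from Appendix A carry class signal.
import math

TRAIN = {
    "jumped":  ([0.09, 0.00, 0.03, 0.00, 0.20], "unergative"),
    "rushed":  ([0.22, 0.12, 0.20, 0.00, 0.10], "unergative"),
    "opened":  ([0.21, 0.09, 0.69, 0.16, 0.36], "unaccusative"),
    "cleared": ([0.58, 0.40, 0.50, 0.31, 0.06], "unaccusative"),
    "painted": ([0.72, 0.18, 0.71, 0.00, 0.38], "object-drop"),
    "washed":  ([0.79, 0.60, 1.00, 0.00, 0.00], "object-drop"),
}

def classify(vector):
    """Label a held-out verb by the class of its nearest training vector."""
    _, label = min(TRAIN.values(), key=lambda vl: math.dist(vector, vl[0]))
    return label

# Held-out vectors, also taken from Appendix A:
print(classify([0.85, 0.66, 0.98, 0.00, 0.00]))  # carved -> object-drop
print(classify([0.92, 0.70, 0.42, 0.00, 0.00]))  # boiled -> unaccusative
```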

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.

Accuracy  SE   Features                      Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim      57.3      0.5  trans caus
69.8      0.5  trans vbn caus anim           57.3      0.5  pass vbn anim
67.3      0.6  trans pass vbn anim           56.7      0.5  pass caus anim
66.7      0.5  trans vbn anim                55.7      0.5  vbn caus
66.5      0.5  trans pass caus anim          55.7      0.1  caus
64.4      0.5  trans vbn caus                55.4      0.4  pass caus
63.2      0.6  trans pass vbn caus           55.0      0.6  trans pass vbn
63.0      0.5  trans pass anim               54.7      0.4  trans pass
62.9      0.4  trans caus anim               54.2      0.5  trans vbn
62.1      0.5  caus anim                     52.5      0.5  vbn
61.7      0.5  trans pass caus               50.9      0.5  pass anim
61.6      0.6  pass vbn caus anim            50.2      0.6  pass vbn
60.1      0.4  vbn caus anim                 50.2      0.5  pass
59.5      0.6  trans anim                    47.1      0.4  trans
59.4      0.5  vbn anim                      35.3      0.5  anim
57.4      0.6  pass vbn caus
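The overlap test described above can be written down in a few lines. The function names below are ours, but the constant (t = 2.01 at df = 49) and the accuracy/SE pairs come from the table:

```python
# Sketch of the significance check from Appendix B: two results differ
# significantly (p < .05) only if their 95% confidence intervals
# (accuracy +/- 2.01 * SE, df = 49) do not overlap.

T_CRIT = 2.01  # t value for a 95% interval with df = 49

def conf_interval(accuracy, se):
    """Return the (low, high) 95% confidence range for an accuracy score."""
    margin = T_CRIT * se
    return accuracy - margin, accuracy + margin

def significantly_different(acc1, se1, acc2, se2):
    """True if the two 95% confidence ranges do not overlap."""
    lo1, hi1 = conf_interval(acc1, se1)
    lo2, hi2 = conf_interval(acc2, se2)
    return hi1 < lo2 or hi2 < lo1

# Full feature set (69.8, SE 0.5) versus trans alone (47.1, SE 0.4):
print(significantly_different(69.8, 0.5, 47.1, 0.4))  # True
# ... but not versus trans vbn caus anim, which ties at (69.8, SE 0.5):
print(significantly_different(69.8, 0.5, 69.8, 0.5))  # False
```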


we counted worked largely for the reasons we had hypothesized. We expected caus and anim to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the caus feature compared to the errors with the caus feature added to the set). We see that without the caus feature, the confusability between unaccusatives and unergatives, and between unaccusatives and object-drops, is 9 and 7 errors respectively; but when caus is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the anim feature. Comparing the third to the first panel of Table 14 (the errors without the anim feature compared to the errors with the anim feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the anim feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors of 5 to 7.

We had predicted that the trans feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that pass and vbn would behave similarly, since these features are correlated to trans. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of trans, pass, or vbn. We find that these predictions are confirmed in part.

First consider the trans feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of trans to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6. However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened with the addition of the trans feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of trans is most apparent in the reduced confusion of unaccusative and object-drop verbs.

Our initial prediction was that pass and vbn would behave similarly to trans – that is, also making a three-way distinction among the classes – although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the pass feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The vbn feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate vbn data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method to evaluating our approach to verb classification.


Table 16
Percent agreement (Agr) and pairwise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

          Program       E1            E2            E3
          Agr    K      Agr    K      Agr    K      Agr    K

E1        59     .36
E2        68     .50    75     .59
E3        66     .49    70     .53    77     .66
Levin     69.5   .54    71     .56    86.5   .80    83     .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results – provided by the non-forced-choice task – where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from Chicago University Press. The verbs were to be classified into the three target classes – unergative, unaccusative, and object-drop – which were described in the instructions.10 (All materials and instructions are available at http://www.latl.unige.ch/latl/personal/paola.html)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training – which yields an accuracy of 69.5% – because those results provide the classification for each of the individual verbs.


an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55–57) (using the z distribution to determine significance; p < 0.001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
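Concretely, the pairwise values in Table 16 have the shape of a standard Cohen's Kappa, as sketched below with invented toy labels (not the actual expert data; Klauer's [1987] procedure additionally supplies the significance test):

```python
# Pairwise Kappa sketch: agreement corrected for the agreement expected by
# chance, where chance depends on each judge's per-class proportions.
from collections import Counter

def kappa(labels_a, labels_b):
    """Cohen's Kappa for two judges' class assignments over the same verbs."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the judges' marginal proportions per class.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(labels_a) | set(labels_b)
    )
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical judgments over six verbs (class labels abbreviated):
gold = ["unerg", "unacc", "odrop", "unerg", "unacc", "odrop"]
expert = ["unerg", "unacc", "odrop", "unerg", "odrop", "odrop"]
print(round(kappa(gold, expert), 2))  # 0.75
```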

Evaluating the experts' performance, summarized in Table 16, we can remark two things, which confirm our expectations. First, the task is difficult – i.e., not performed at 100% (or close) even by trained experts when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .60) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
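A minimal sketch of this majority-vote combination (the verb labels below are invented; in our experiment the fallback applied to only three verbs):

```python
# Majority vote over aligned per-expert label lists; when all three experts
# disagree, fall back to the most accurate expert's label.
from collections import Counter

def combine_votes(expert_labels, fallback_index):
    """expert_labels: list of per-expert label lists, aligned by verb."""
    combined = []
    for votes in zip(*expert_labels):
        label, count = Counter(votes).most_common(1)[0]
        # No majority: defer to the expert with the best gold-standard accuracy.
        combined.append(label if count >= 2 else votes[fallback_index])
    return combined

e1 = ["unerg", "unacc", "odrop", "unerg"]
e2 = ["unerg", "odrop", "unerg", "odrop"]
e3 = ["unacc", "unacc", "odrop", "unacc"]
# Expert 3 (index 2) plays the role of the most accurate expert here.
print(combine_votes([e1, e2, e3], fallback_index=2))
# ['unerg', 'unacc', 'odrop', 'unacc']  (last verb decided by the fallback)
```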

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task – 86.5% – as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than one of the human experts, in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.

One interesting question is whether experts and program disagree on the same verbs and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard. However, in 9 cases at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3) and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in (Levin 1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1000), and those with log frequency over 3 (i.e., frequency over 1000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drop); the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops); and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote: verbs with frequency less than 100 yield an accuracy of 92% (K = .84); verbs with frequency between 100 and 1000, accuracy 80% (K = .69); and verbs with frequency over 1000, accuracy 90% (K = .82). For the "best" expert the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5% (K = .74); verbs with frequency between 100 and 1000, accuracy 84% (K = .76); and verbs with frequency over 1000, accuracy 90% (K = .82).
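The grouping itself is a simple log10 binning; the helper below is our own sketch, with frequencies taken from Appendix A:

```python
# Bin a verb into the three log-frequency bands used in the analysis.
import math

def frequency_group(freq):
    """Return 'low' (freq < 100), 'mid' (100-1000), or 'high' (> 1000)."""
    log_f = math.log10(freq)
    if log_f < 2:
        return "low"
    if log_f <= 3:
        return "mid"
    return "high"

# Frequencies from Appendix A: jogged -> low, rushed -> mid, jumped -> high
for verb, freq in [("jogged", 8), ("rushed", 467), ("jumped", 4088)]:
    print(verb, frequency_group(freq))
```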

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs taken from Levin's (1993) electronic index, as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = 0.64) with each other, and 86% (K = 0.80) and 69% (K = 0.57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = 0.56] with each other, and 68% [K = 0.55] and 60.5% [K = 0.46] with the gold standard, respectively.) These results show that the forced-choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.
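
Agreement figures like those above can be reproduced from raw annotation data with a kappa statistic, which corrects observed agreement for the agreement expected by chance from the judges' label distributions. The paper reports K following Carletta (1996), which uses pooled marginals (Siegel and Castellan 1988); the sketch below shows the closely related Cohen variant with per-judge marginals, and the label names in the test data are hypothetical, not those of the actual questionnaire:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges labelling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: proportion of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over classes of the product of each
    # judge's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

A kappa of 1 indicates perfect agreement, 0 indicates chance-level agreement.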

6. Discussion

The work presented here contributes to some central issues in computational linguistics by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally in syntactic alternations, giving rise to differences in subcategorization frames, or the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs (their argument structure) are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular Pinker [1989]; Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy of our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a "change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998, 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993]; Sanfilippo and Poznanski [1992]; Manning [1993]; Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989) or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed which can work on a rough collection of texts and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and the work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy, compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multilingual texts. Like ours, their work goes beyond statistics over subcategorizations to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.
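
The error-rate-reduction figures quoted throughout this section follow from a single formula: the fraction of the baseline error that the classifier eliminates. A quick sketch, checked against the Lapata and Brew numbers above:

```python
def error_rate_reduction(accuracy, baseline):
    """Fraction of baseline error eliminated: (acc - baseline) / (1 - baseline)."""
    return (accuracy - baseline) / (1.0 - baseline)

# Lapata and Brew (1999): 91.8% accuracy over a 65.7% baseline.
print(round(error_rate_reduction(0.918, 0.657), 2))  # 0.76, i.e., 76%
```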

Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45%, with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
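
The F-score we compute for Schulte im Walde's results is the balanced (beta = 1) harmonic mean of precision and recall:

```python
def f_score(precision, recall, beta=1.0):
    """F-measure; beta = 1 weighs precision and recall equally."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision 61%, recall 36% (Schulte im Walde 2000):
print(round(f_score(0.61, 0.36) * 100))  # 45
```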

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.
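
The simpler head-word variant of McCarthy's test can be sketched as comparing the head-word profiles of the two candidate slots. The profiles and the cosine measure below are our own illustration, not McCarthy's implementation; her actual method abstracts the head words to probability distributions over WordNet classes:

```python
from collections import Counter
from math import sqrt

def cosine(p, q):
    """Cosine similarity between two head-word frequency profiles."""
    dot = sum(p[w] * q.get(w, 0) for w in p)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# For a causative verb like "break", intransitive subjects and transitive
# objects draw on similar nouns, so the two profiles should be close.
intrans_subj = Counter({"door": 4, "window": 2, "meeting": 1})
trans_obj = Counter({"door": 3, "window": 2, "box": 1})
```

For a verb that does not alternate, the two profiles share few head words and the similarity stays low.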

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition, we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and Lapata and Brew (1999) in attempting to learn semantic notions from distributions of indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to the external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at the relative frequency of subcategorization frames (as with our trans feature), or the relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).
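
Approximating subject animacy by pronoun counts can be sketched as below. The pronoun inventory and the helper name are our own, for illustration only; the paper's actual anim feature is computed over subjects extracted from the annotated corpus:

```python
# Personal pronouns, which as subjects are overwhelmingly animate
# (inventory chosen here for illustration).
ANIMATE_PRONOUNS = {"i", "we", "you", "he", "she", "they"}

def anim(subject_heads):
    """Fraction of a verb's subject heads realized as personal pronouns."""
    if not subject_heads:
        return 0.0
    hits = sum(head.lower() in ANIMATE_PRONOUNS for head in subject_heads)
    return hits / len(subject_heads)

print(anim(["He", "the door", "she", "it"]))  # 0.5
```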

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8. Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification (i.e., sets of verbs that have the same multi-way classifications according to Levin [1993]) can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction, or of the words in the construction). Thus, the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9. Conclusions

In this paper, we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach, it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered) using only five features that are readily extractable from automatically annotated corpora.

Acknowledgments
We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM), the United States NSF (grants 9702331 and 9818322 to SS), the Canadian NSERC (grant to SS), the University of Toronto, and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions, and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: the Kappa statistic. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal: Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.

Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relations. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe Zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal: Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal: Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: An overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1–2):217–222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.

Pinker Steven 1989 Learnability andCognition the Acquisition of ArgumentStructure MIT Press Cambridge MA

Quinlan J Ross 1992 C45 Programs forMachine Learning Series in MachineLearning Morgan Kaufmann San MateoCA

Ratnaparkhi Adwait 1996 A maximumentropy part-of-speech tagger InProceedings of the Empirical Methods inNatural Language Processing Conferencepages 133ndash142 Philadelphia PA

Resnik Philip 1996 Selectional constraintsan information-theoretic model and itscomputational realization Cognition61(1ndash2)127ndash160

Riloff Ellen and Mark Schmelzenbach 1998An empirical approach to conceptual caseframe acquisition In Proceedings of the SixthWorkshop on Very Large Corporapages 49ndash56

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbruecken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using Machine Learning Methods to Combine Corpus-based Indicators for Aspectual Classification of Clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL'99, pages 112–119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: an approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex'99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.


Merlo and Stevenson: Statistical Verb Classification

Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs

Verb         Freq   vbn   pass  trans  caus  anim
Min Value       8   .00   .00    .00   .00   .00
Max Value    4088  1.00   .39    .74   .00  1.00

floated       176   .43   .26    .74   .00   .17
hurried        86   .40   .31    .50   .00   .37
jumped       4088   .09   .00    .03   .00   .20
leaped        225   .09   .00    .05   .00   .13
marched       238   .10   .01    .09   .00   .12
paraded        33   .73   .39    .46   .00   .50
raced         123   .01   .00    .06   .00   .15
rushed        467   .22   .12    .20   .00   .10
vaulted        54   .00   .00    .41   .00   .03
wandered       67   .02   .00    .03   .00   .32
galloped       12  1.00   .00    .00   .00   .00
glided         14   .00   .00    .08   .00   .50
hiked          25   .28   .12    .29   .00   .40
hopped         29   .00   .00    .21   .00  1.00
jogged          8   .29   .00    .29   .00   .33
scooted        10   .00   .00    .43   .00   .00
scurried       21   .00   .00    .00   .00   .14
skipped        82   .22   .02    .64   .00   .16
tiptoed        12   .17   .00    .00   .00   .00
trotted        37   .19   .17    .07   .00   .18

Unaccusative Verbs

Verb         Freq   vbn   pass  trans  caus  anim
Min Value      13   .16   .00    .02   .00   .00
Max Value    5543   .95   .80    .76   .41   .36

boiled         58   .92   .70    .42   .00   .00
cracked       175   .61   .19    .76   .02   .14
dissolved     226   .51   .58    .71   .05   .11
exploded      409   .34   .02    .66   .37   .04
flooded       235   .47   .57    .44   .04   .03
fractured      55   .95   .76    .51   .00   .00
hardened      123   .92   .55    .56   .12   .00
melted         70   .80   .44    .02   .00   .19
opened       3412   .21   .09    .69   .16   .36
solidified     34   .65   .21    .68   .00   .12
collapsed     950   .16   .00    .16   .01   .02
cooled        232   .85   .21    .29   .13   .11
folded        189   .73   .33    .23   .00   .00
widened      1155   .18   .02    .13   .41   .01
changed      5543   .73   .23    .47   .22   .08
cleared      1145   .58   .40    .50   .31   .06
divided      1539   .93   .80    .17   .10   .05
simmered       13   .83   .00    .09   .00   .00
stabilized    286   .92   .13    .18   .35   .00


Object-Drop Verbs

Verb         Freq   vbn   pass  trans  caus  anim
Min Value      39   .10   .04    .21   .00   .00
Max Value   15063   .95   .99   1.00   .24   .42

carved        185   .85   .66    .98   .00   .00
danced         88   .22   .14    .37   .00   .00
kicked        308   .30   .18    .97   .00   .33
knitted        39   .95   .99    .93   .00   .00
painted       506   .72   .18    .71   .00   .38
played       2689   .38   .16    .24   .00   .00
reaped        172   .56   .05    .90   .00   .22
typed          57   .81   .74    .81   .00   .00
washed        137   .79   .60   1.00   .00   .00
yelled         74   .10   .04    .38   .00   .00
borrowed     1188   .77   .15    .60   .13   .19
inherited     357   .60   .13    .64   .06   .32
organized    1504   .85   .38    .65   .18   .07
rented        232   .72   .22    .61   .00   .42
sketched       44   .67   .17    .44   .00   .20
cleaned       160   .83   .47    .21   .05   .21
packed        376   .84   .12    .40   .05   .19
studied       901   .66   .17    .57   .05   .11
swallowed     152   .79   .44    .35   .04   .22
called      15063   .56   .22    .72   .24   .16

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.
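The overlap check described above can be sketched as follows (our own illustration, not code from the paper; the function name is ours):

```python
def ci_overlap(acc1, se1, acc2, se2, t=2.01):
    """Return True if the 95% confidence intervals of two accuracies
    overlap, i.e. the difference is NOT significant at p < .05.
    t = 2.01 corresponds to df = 49, as in the text."""
    lo1, hi1 = acc1 - t * se1, acc1 + t * se1
    lo2, hi2 = acc2 - t * se2, acc2 + t * se2
    return lo1 <= hi2 and lo2 <= hi1

# Full feature set (69.8 +/- 0.5) vs. trans alone (47.1 +/- 0.4):
print(ci_overlap(69.8, 0.5, 47.1, 0.4))  # → False (no overlap: significant)
# Two close subsets (57.3 +/- 0.5 vs. 56.7 +/- 0.5):
print(ci_overlap(57.3, 0.5, 56.7, 0.5))  # → True (overlap: not significant)
```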

Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim
69.8      0.5  trans vbn caus anim
67.3      0.6  trans pass vbn anim
66.7      0.5  trans vbn anim
66.5      0.5  trans pass caus anim
64.4      0.5  trans vbn caus
63.2      0.6  trans pass vbn caus
63.0      0.5  trans pass anim
62.9      0.4  trans caus anim
62.1      0.5  caus anim
61.7      0.5  trans pass caus
61.6      0.6  pass vbn caus anim
60.1      0.4  vbn caus anim
59.5      0.6  trans anim
59.4      0.5  vbn anim
57.4      0.6  pass vbn caus
57.3      0.5  trans caus
57.3      0.5  pass vbn anim
56.7      0.5  pass caus anim
55.7      0.5  vbn caus
55.7      0.1  caus
55.4      0.4  pass caus
55.0      0.6  trans pass vbn
54.7      0.4  trans pass
54.2      0.5  trans vbn
52.5      0.5  vbn
50.9      0.5  pass anim
50.2      0.6  pass vbn
50.2      0.5  pass
47.1      0.4  trans
35.3      0.5  anim


Table 16
Percent agreement (Agr) and pairwise agreement (K) of three experts (E1, E2, E3) and the program, compared to each other and to a gold standard (Levin).

         Program        E1            E2            E3
         Agr    K       Agr    K     Agr    K      Agr    K
E1       59     .36
E2       68     .50     75     .59
E3       66     .49     70     .53   77     .66
Levin    69.5   .54     71     .56   86.5   .80    83     .74

5. Establishing the Upper Bound for the Task

In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult: it is not likely to be performed at 100% accuracy even by experts, and it is also likely to show differences in classification between experts. We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task. However, we are also interested in slightly more natural results (provided by the non-forced-choice task), where the experts can assign the verbs to an "others" category.

We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from Chicago University Press. The verbs were to be classified into the three target classes (unergative, unaccusative, and object-drop), which were described in the instructions.10 (All materials and instructions are available at http://www.latl.unige.ch/latl/personal/paola.html)

Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program.11 Assessing the percentage of verbs on which the experts agree gives us

10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the subject in the intransitive. If it is able to occur transitively, it can have a causative meaning. Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject and a patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning.

11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training, which yields an accuracy of 69.5%, because those results provide the classification for each of the individual verbs.


an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55–57) (using the z distribution to determine significance; p < .0001 for all reported results). The Kappa value measures the experts' and our classifier's degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0 for no agreement above chance to 1 for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis, indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement. We can observe that only one of our agreement figures comes close to reaching what would be considered "good" under this interpretation. Given the very high level of expertise of our human experts, we suspect, then, that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.
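As a concrete illustration (our own sketch, using the standard two-rater formulation rather than Klauer's exact calculation), pairwise Kappa can be computed from two annotators' label sequences as follows; the labels shown are hypothetical:

```python
from collections import Counter

def kappa(labels_a, labels_b):
    """Two-rater kappa: observed agreement corrected for the chance
    agreement implied by each rater's own label proportions."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement depends on the proportions of categories each
    # rater uses, as noted in the text.
    p_chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (p_observed - p_chance) / (1 - p_chance)

# Toy example over the three target classes (hypothetical judgments):
e1 = ["unerg", "unacc", "odrop", "unerg", "unacc", "odrop"]
e2 = ["unerg", "unacc", "unerg", "unerg", "odrop", "odrop"]
print(round(kappa(e1, e2), 2))  # → 0.5
```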

Evaluating the experts' performance, summarized in Table 16, we can remark on two things which confirm our expectations. First, the task is difficult, i.e., not performed at 100% (or close to it) even by trained experts when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = .60) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.

The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases, we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.
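The majority-vote combination can be sketched as follows (our own illustration with hypothetical labels; the function name and tie-breaking argument are ours):

```python
from collections import Counter

def majority_classification(expert_labels, best_expert_index):
    """Combine per-verb labels from three experts by majority vote; when
    no label gets at least two votes, fall back to the label given by
    the expert with the best agreement with the gold standard."""
    combined = []
    for votes in expert_labels:          # one tuple of three labels per verb
        label, count = Counter(votes).most_common(1)[0]
        if count >= 2:
            combined.append(label)
        else:                            # three-way disagreement
            combined.append(votes[best_expert_index])
    return combined

# Hypothetical labels for three verbs from experts (E1, E2, E3):
votes = [("unerg", "unerg", "odrop"),   # majority: unerg
         ("unacc", "unacc", "unacc"),   # unanimous: unacc
         ("unerg", "unacc", "odrop")]   # no majority: defer to E2
print(majority_classification(votes, best_expert_index=1))
# → ['unerg', 'unacc', 'unacc']
```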

The evaluation is also informative with respect to the performance of the program. On the one hand, we observe that if we take the best performance achieved by an expert in this task, 86.5%, as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than that of one of the human experts in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.
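The 68% figure can be reconstructed as a quick back-of-the-envelope check (our own sketch; the 34% chance baseline is the one reported earlier in the paper):

```python
# The program covers (accuracy - baseline) of the (upper_bound - baseline)
# span between chance performance and the best expert.
baseline = 34.0       # chance accuracy (%), from the paper's task setup
upper_bound = 86.5    # best expert accuracy against the gold standard (%)
program = 69.5        # classifier accuracy under single hold-out training (%)

reduction = (program - baseline) / (upper_bound - baseline)
print(round(100 * reduction))  # → 68
```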

One interesting question is whether experts and program disagree on the same verbs, and show similar patterns of errors. The program makes 18 errors in total, compared to the gold standard. However, in 9 cases at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3) and comparably many on unaccusatives and object-drops (7 and 8, respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This clearly indicates that the object-drop class is the most difficult for the human experts to define. This class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the "unexpressed object alternation" class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.

On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs in three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1000), and those with log frequency over 3 (i.e., frequency over 1000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops); the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops); and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than the verbs with mid frequency. Neither the accuracy of the majority classification nor the accuracy of the expert that had the best agreement with Levin was linearly affected by frequency. For the majority vote, verbs with frequency less than 100 yield an accuracy of 92% (K = .84); verbs with frequency between 100 and 1000, accuracy 80% (K = .69); and verbs with frequency over 1000, accuracy 90% (K = .82). For the "best" expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5% (K = .74); verbs with frequency between 100 and 1000, accuracy 84% (K = .76); and verbs with frequency over 1000, accuracy 90% (K = .82).
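The grouping described above is a simple log-frequency binning; a minimal sketch (our own code, with frequencies taken from Appendix A):

```python
import math

def frequency_group(freq):
    """Assign a verb to a log-frequency group: low (< 100),
    mid (100-1000), high (> 1000)."""
    logf = math.log10(freq)
    if logf < 2:
        return "low"
    elif logf <= 3:
        return "mid"
    else:
        return "high"

# A few verbs with their overall corpus frequencies (from Appendix A):
for verb, freq in [("jogged", 8), ("rushed", 467), ("opened", 3412)]:
    print(verb, frequency_group(freq))
```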

We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency group: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest frequency verbs), again indicating qualitative differences in performance between the program and the experts.

Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of "other" was allowed. Materials consisted of individually randomized lists of 119 target and filler verbs, taken from Levin's (1993) electronic index, as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = .64) with each other, and 86% (K = .80) and 69% (K = .57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = .56] with each other, and 68% [K = .55] and 60.5% [K = .46] with the gold standard, respectively.) These results show that the forced-choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for the computational experiments.

6. Discussion

The work presented here contributes to some central issues in computational linguistics, by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:

1. Argument structure is the relevant level of representation for verb classification.

2. Argument structure is manifested distributionally in syntactic alternations, giving rise to differences in subcategorization frames, or in the distributions of their usage, or in the properties of the NP arguments to a verb.

3. This information is detectable in a corpus and can be learned automatically.

We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.

6.1 Argument Structure and Verb Classification

Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs (their argument structure) are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker [1989] and Rosch [1978]).

Our results confirm the primary role of argument structure in verb classification. Our experimental focus is particularly clear in this regard, because we deal with verbs that are "minimal pairs" with respect to argument structure. By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work, created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy of our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.

6.2 Argument Structure and Distributional Statistics

Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985, 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a "change of possession" meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples, meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.

Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998, 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost. Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful to predict semantic classification, at least for the argument structures that have been investigated.

6.3 Detection of Argument Structure in Corpora

Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning of subcategorization from unannotated or syntactically annotated text (e.g., Brent [1993]; Sanfilippo and Poznanski [1992]; Manning [1993]; Collins [1997]). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features to tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically-guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.

Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed which can work on a rough collection of texts, and do not require a carefully balanced corpus or time-consuming semantic tagging.

7. Related Work

We conclude from the discussion above that our own work and the work of others support our hypotheses concerning the importance of the relation between classes of verbs and the syntactic expression of argument structure in corpora. In light of this, it is instructive to evaluate our results in the context of other work that shares this view. Some related work requires either exact exemplars for acquisition, or external pre-compiled resources. For example, Dorr (1997) summarizes a number of automatic classification experiments based on encoding Levin's alternations directly as symbolic properties of a verb (Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996). Each verb is represented as the binary settings of a vector of possible alternations, acquired through a large corpus analysis yielding exemplars of the alternation. To cope with sparse data, the corpus information is supplemented by syntactic information obtained from the LDOCE, and semantic information obtained from WordNet. This procedure classifies 95 unknown verbs with 61% accuracy. Dorr also remarks that this result could be improved to 83% if missing LDOCE codes were added. While Dorr's work requires finding exact exemplars of the alternation, Oishi and Matsumoto (1997) present a method that, like ours, uses surface indicators to approximate underlying properties. From a dictionary of dependency relations, they extract case-marking particles as indicators of the grammatical function properties of the verbs (which they call thematic properties), such as subject and object. Adverbials indicate aspectual properties. The combination of these two orthogonal dimensions gives rise to a classification of Japanese verbs.

Other work has sought to combine corpus-based extraction of verbal properties with statistical methods for classifying verbs. Siegel's work on automatic aspectual classification (1998, 1999) also reveals a close relationship between verb-related syntactic and semantic information. In this work, experiments to learn aspectual classification from linguistically-based numerical indicators are reported. Using combinations of seven statistical indicators (some morphological and some reflecting syntactic co-occurrences), it is possible to learn the distinction between events and states for 739 verb tokens, with an improvement of 10% over the baseline (error rate reduction of 74%), and to learn the distinction between culminated and non-culminated events for 308 verb tokens, with an improvement of 11% (error rate reduction of 29%) (Siegel 1999).

In work on lexical semantic verb classification, Lapata and Brew (1999) further support the thesis of a predictive correlation between syntax and semantics in a statistical framework, showing that the frequency distributions of subcategorization frames within and across classes can disambiguate the usages of a verb with more than one known lexical semantic class. On 306 verbs that are disambiguated by subcategorization frame, they achieve 91.8% accuracy on a task with a 65.7% baseline, for a 76% reduction in error rate. On 31 verbs that can take the same subcategorization(s) in different classes (more similar to our situation, in that subcategorization alone cannot distinguish the classes), they achieve 83.9% accuracy compared to a 61.3% baseline, for a 58% reduction in error. Aone and McKee (1996), working with a much coarser-grained classification of verbs, present a technique for predicate-argument extraction from multi-lingual texts. Like ours, their work goes beyond statistics over subcategorizations, to include counts over the more directly semantic feature of animacy. No numerical evaluation of their results is provided.

400

Merlo and Stevenson Statistical Verb Classication

Schulte im Walde (2000) applies two clustering methods to two types of frequency data for 153 verbs from 30 Levin (1993) classes. One set of experiments uses verb subcategorization frequencies, and the other uses subcategorization frequencies plus selectional preferences (a numerical measure based on an adaptation of the relative entropy method of Resnik [1996]). The best results achieved are a correct classification of 58 verbs out of 153, with a precision of 61% and recall of 36%, obtained using only subcategorization frequencies. We calculate that this corresponds to an F-score of 45% with balanced precision and recall.12 The use of selectional preference information decreases classification performance under either clustering algorithm. The results are somewhat difficult to evaluate further, as there is no description of the classes included. Also, the method of counting correctness entails that some "correct" classes may be split across distant clusters (this level of detail is not reported), so it is unclear how coherent the class behaviour actually is.
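The balanced F-score we cite can be recomputed from the reported precision and recall; this is the standard harmonic-mean calculation, sketched here for transparency (our own code, not from either paper):

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Schulte im Walde's (2000) best result: precision 61%, recall 36%.
print(round(100 * f_score(0.61, 0.36)))  # → 45
```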

McCarthy (2000) proposes a method to identify diathesis alternations. After learning subcategorization frames based on a parsed corpus, selectional preferences are acquired for slots of the subcategorization frames, using probability distributions over WordNet classes. Alternations are detected by testing the hypothesis that, given any verb, the selectional preferences for arguments occurring in alternating slots will be more similar to each other than those for slots that do not alternate. For instance, given a verb participating in the causative alternation, its selectional preferences for the subject in an intransitive use and for the object in a transitive use will be more similar to each other than the selectional preferences for these two slots of a verb that does not participate in the causative alternation. This method achieves the best accuracy for the causative and the conative alternations (73% and 83%, respectively), despite sparseness of data. McCarthy reports that a simpler measure of selectional preferences, based simply on head words, yields a lower 63% accuracy. Since this latter measure is very similar to our caus feature, we think that our results would also improve by adopting a similar method of abstracting from head words to classes.

Our work extends each of these approaches in some dimension, thereby providing additional support for the hypothesis that syntax and semantics are correlated in a systematic and predictive way. We extend Dorr's alternation-based automatic classification to a statistical setting. By using distributional approximations of indicators of alternations, we solve the sparse data problem without recourse to external sources of knowledge such as the LDOCE, and in addition, we are able to learn argument structure alternations using exclusively positive examples. We improve on the approach of Oishi and Matsumoto (1997) by learning argument structure properties, which, unlike grammatical functions, are not marked morphologically, and by not relying on external sources of knowledge. Furthermore, in contrast to Siegel (1998) and Lapata and Brew (1999), our method applies successfully to previously unseen words, i.e., test cases that were not represented in the training set.13 This is a very important property of lexical acquisition algorithms to be used for lexicon organization, as their main interest lies in being applied to unknown words.

On the other hand, our approach is similar to the approaches of Siegel and of Lapata and Brew (1999) in attempting to learn semantic notions from distributions of

12 A baseline of 5% is reported, based on a closest-neighbor pairing of verbs, but it is not straightforward to compare this task to the proposed clustering algorithm. Determining a meaningful baseline for unsupervised clustering is clearly a challenge, but this gives an indication that the clustering task is indeed difficult.

13 Siegel (1998) reports two experiments over verb types with disjoint training and test sets, but the results were not significantly different from the baseline.


Computational Linguistics Volume 27 Number 3

indicators that can be gleaned from a text. In our case, we are trying to learn argument structure, a finer-grained classification than the dichotomic distinctions studied by Siegel. Like Lapata and Brew, three of our indicators (trans, vbn, pass) are based on the assumption that distributional differences in subcategorization frames are related to underlying verb class distinctions. However, we also show that other syntactic indicators (caus and anim) can be devised that tap directly into the argument structure of a verb. Unlike Schulte im Walde (2000), we find the use of these semantic features helpful in classification: using only trans and its related features vbn and pass, we achieve only 55% accuracy, in comparison to 69.8% using the full set of features. This can perhaps be seen as support for our hypothesis that argument structure is the right level of representation for verb class distinctions, since it appears that our features that capture thematic differences are useful in classification, while Schulte im Walde's selectional restriction features were not.

Aone and McKee (1996) also use features that are intended to tap into both subcategorization and thematic role distinctions: frequencies of the transitive use and of the animate subject use. In our task, we show that subject animacy can be profitably approximated solely with pronoun counts, avoiding the need for reference to the external sources of semantic information used by Aone and McKee. In addition, our work extends theirs in investigating much finer-grained verb classes, and in classifying verbs that have multiple argument structures. While Aone and McKee define each of their classes according to a single argument structure, we demonstrate the usefulness of syntactic features that capture relations across different argument structures of a single verb. Furthermore, while Aone and McKee and others look at the relative frequency of subcategorization frames (as with our trans feature), or the relative frequency of a property of NPs within a particular grammatical function (as with our anim feature), we also look at the paradigmatic relations across a text between thematic arguments in different alternations (with our caus feature).
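As a minimal illustration of this pronoun-based approximation of animacy (with invented subject lists; the paper's actual anim counts come from corpus extraction), the feature is simply a relative frequency over a verb's subjects:

```python
# Animate personal pronouns used as the proxy for subject animacy.
PRONOUNS = {"i", "we", "you", "he", "she", "they"}

def anim(subjects):
    """Fraction of a verb's subject tokens realized as animate pronouns."""
    if not subjects:
        return 0.0
    hits = sum(1 for s in subjects if s.lower() in PRONOUNS)
    return hits / len(subjects)

# Invented subject lists for illustration:
assert anim(["He", "the crowd", "she", "They"]) == 0.75   # mostly animate
assert anim(["the price", "rates"]) == 0.0                # inanimate subjects
```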

McCarthy (2000) shows that a method very similar to ours can be used for identifying alternations. Her qualitative results confirm, however, what was argued in Section 2 above: counts that tap directly into the thematic assignments are necessary to fully identify a diathesis alternation. In fact, on close inspection, McCarthy's method does not distinguish between the induced-action alternation (which the unergatives exhibit) and the causative/inchoative alternation (which the unaccusatives exhibit); thus her method does not discriminate two of our classes. It is likely that a combination of our method, which makes the necessary thematic distinctions, and her more sophisticated method of detecting alternations would give very good results.

8 Limitations and Future Work

The classification results show that our method is powerful and suited to the classification of unknown verbs. However, we have not yet addressed the problem of verbs that can have multiple classifications. We think that many cases of ambiguous classification of the lexical entry for a verb can be addressed with the notion of intersective sets, introduced by Dang et al. (1998). This is an important concept, which proposes that "regular" ambiguity in classification, i.e., sets of verbs that have the same multi-way classifications according to Levin (1993), can be captured with a finer-grained notion of lexical semantic classes. Thus, subsets of verbs that occur in the intersection of two or more Levin classes form in themselves a coherent semantic (sub)class. Extending our work to exploit this idea requires only defining the classes appropriately; the basic approach will remain the same. Given the current demonstration of our method on fine-grained classes that share subcategorization alternations, we are optimistic regarding its future performance on intersective sets.

Merlo and Stevenson    Statistical Verb Classification
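The intersective-set idea can be sketched as a simple grouping operation. The class labels below are invented placeholders, not actual Levin class numbers: verbs sharing the same multi-way classification fall into the same intersective (sub)class.

```python
from collections import defaultdict

# Hypothetical multi-way Levin classifications (labels A, B, C are invented).
levin_classes = {
    "roll":  {"A", "B"},
    "float": {"A", "B"},
    "carry": {"B", "C"},
}

# Verbs with exactly the same set of classes form one intersective set.
intersective_sets = defaultdict(set)
for verb, classes in levin_classes.items():
    intersective_sets[frozenset(classes)].add(verb)

# "roll" and "float" share a multi-way classification, so they group together.
assert intersective_sets[frozenset({"A", "B"})] == {"roll", "float"}
```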

Because we assume that thematic properties are reflected in alternations of argument structure, our features require searching for relations across occurrences of each verb. This motivated our initial experimental focus on verb types. However, when we turn to consider ambiguity, we must also address the problem that individual instances of verbs may come from different classes, and we may (like Lapata and Brew [1999]) want to classify the individual tokens of a verb. In future research, we plan to extend our method to the case of ambiguous tokens by experimenting with the combination of several sources of information: the classification of each instance will be a function of a bias for the verb type (using the cross-corpus statistics we collect), but also of features of the usage of the instance being classified (cf. Lapata and Brew [1999]; Siegel [1998]).
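One simple way such a combination could work is a naive-Bayes-style product of a per-type prior with a per-instance likelihood. This is only a hypothetical sketch with invented numbers, not the combination scheme the paper commits to:

```python
def classify_token(type_prior, instance_likelihood):
    """Return the class maximizing prior(class) * likelihood(class)."""
    scores = {c: type_prior[c] * instance_likelihood.get(c, 0.0) for c in type_prior}
    return max(scores, key=scores.get)

# Invented numbers: the type bias favours unergative, but the features of
# this particular usage pull the token toward object-drop.
prior      = {"unergative": 0.6, "unaccusative": 0.1, "object-drop": 0.3}
likelihood = {"unergative": 0.1, "unaccusative": 0.2, "object-drop": 0.5}
assert classify_token(prior, likelihood) == "object-drop"
```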

Finally, corpus-based learning techniques collect statistical information related to language use, and are a good starting point for studying human linguistic performance. This opens the way to investigating the relation of linguistic data in text to people's linguistic behaviour and use. For example, Merlo and Stevenson (1998) show that, contrary to the naive assumption, speakers' preferences in syntactic disambiguation are not simply directly related to frequency (i.e., a speaker's preference for one construction over another is not simply modelled by the frequency of the construction or of the words in the construction). Thus the kind of corpus investigation we are advocating, founded on in-depth linguistic analysis, holds promise for building more natural NLP systems, which go beyond the simplest assumptions and tie together statistical computational linguistic results with experimental psycholinguistic data.

9 Conclusions

In this paper we have presented an in-depth case study in which we investigate machine learning techniques for automatically classifying a set of verbs into classes determined by their argument structures. We focus on the three major classes of optionally intransitive verbs in English, which cannot be discriminated by their subcategorizations, and therefore require distinctive features that are sensitive to the thematic properties of the verbs. We develop such features and automatically extract them from very large syntactically annotated corpora. Results show that a small number of linguistically motivated lexical features are sufficient to achieve a 69.8% accuracy rate in a three-way classification task with a baseline (chance) performance of 33.9%, for which the best performance achieved by a human expert is 86.5%.

Returning to our original questions of what can and need be learned about the relational properties of verbs, we conclude that argument structure is both a highly useful and learnable aspect of verb knowledge. We observe that relevant semantic properties of verb classes (such as causativity, or animacy of subject) may be successfully approximated through countable syntactic features. In spite of noisy data (arising from diverse sources, such as tagging errors or limitations of our extraction patterns), the lexical properties of interest are reflected in the corpora robustly enough to positively contribute to classification.

We remark, however, that deep linguistic analysis cannot be eliminated; in our approach, it is embedded in the selection of the features to count. Specifically, our features are derived through a detailed analysis of the differences in thematic role assignments across the verb classes under investigation. Thus, an important contribution of the work is the proposed mapping between the thematic assignment properties of


the verb classes and the statistical distributions of their surface syntactic properties. We think that using such linguistically motivated features makes the approach very effective and easily scalable: we report a 54% reduction in error rate (a 68% reduction when the human expert-based upper bound is considered), using only five features that are readily extractable from automatically annotated corpora.

Acknowledgments
We gratefully acknowledge the financial support of the following organizations: the Swiss NSF (fellowship 8210-46569 to PM), the United States NSF (grants 9702331 and 9818322 to SS), the Canadian NSERC (grant to SS), the University of Toronto, and the Information Sciences Council of Rutgers University. Much of this research was carried out while PM was a visiting scientist at IRCS, University of Pennsylvania, and while SS was a faculty member at Rutgers University, both of whose generous and supportive environments were of great benefit to us. We thank Martha Palmer, Michael Collins, Natalia Kariaeva, Kamin Whitehouse, Julie Boland, Kiva Dickinson, and three anonymous reviewers for their helpful comments and suggestions and for their contributions to this research. We also greatly thank our experts for the gracious contribution of their time in answering our electronic questionnaire.

References

Abney, Steven. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of the Workshop on Robust Parsing at the Eighth Summer School on Logic, Language and Information, number 435 in CSRP, pages 8–15, University of Sussex, Brighton.

Aone, Chinatsu and Douglas McKee. 1996. Acquiring predicate-argument mapping information in multilingual texts. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 191–202.

Basili, Roberto, Maria-Teresa Pazienza, and Paola Velardi. 1996. A context-driven conceptual clustering method for verb classification. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 117–142.

Boguraev, Branimir and Ted Briscoe. 1989. Utilising the LDOCE grammar codes. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing. Longman, London, pages 85–116.

Boguraev, Branimir and James Pustejovsky. 1996. Issues in text-based lexicon acquisition. In Branimir Boguraev and James Pustejovsky, editors, Corpus Processing for Lexical Acquisition. MIT Press, pages 3–20.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

Briscoe, Ted and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Applied Natural Language Processing Conference, pages 356–363.

Brousseau, Anne-Marie and Elizabeth Ritter. 1991. A non-unified analysis of agentive verbs. In West Coast Conference on Formal Linguistics, number 20, pages 53–64.

Burzio, Luigi. 1986. Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht.

Carletta, Jean. 1996. Assessing agreement on classification tasks: the Kappa statistics. Computational Linguistics, 22(2):249–254.

Collins, Michael John. 1997. Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23, Madrid, Spain.

Cruse, D. A. 1972. A note on English causatives. Linguistic Inquiry, 3(4):520–528.

Dang, Hoa Trang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 293–299, Montreal. Université de Montréal.

Dixon, Robert M. W. 1994. Ergativity. Cambridge University Press, Cambridge.

Dorr, Bonnie. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):1–55.

Dorr, Bonnie, Joe Garman, and Amy Weinberg. 1995. From syntactic encodings to thematic roles: Building lexical entries for interlingual MT. Journal of Machine Translation, 9(3):71–100.

Dorr, Bonnie and Doug Jones. 1996. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322–327, Copenhagen.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Greenberg, Joseph H. 1966. Language Universals. Mouton, The Hague, Paris.

Gruber, Jeffrey. 1965. Studies in Lexical Relation. MIT Press, Cambridge, MA.

Hale, Ken and Jay Keyser. 1993. On argument structure and the lexical representation of syntactic relations. In K. Hale and J. Keyser, editors, The View from Building 20. MIT Press, pages 53–110.

Jakobson, Roman. 1971. Signe Zéro. In Selected Writings, volume 2, 2d ed. Mouton, The Hague, pages 211–219.

Kariaeva, Natalia. 1999. Discriminating between unaccusative and object-drop verbs: Animacy factor. Ms., Rutgers University, New Brunswick, NJ.

Klavans, Judith and Martin Chodorow. 1992. Degrees of stativity: The lexical representation of verb aspect. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), pages 1126–1131, Nantes, France.

Klavans, Judith and Min-Yen Kan. 1998. Role of verbs in document analysis. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 680–686, Montreal. Université de Montréal.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 397–404, College Park, MD.

Lapata, Maria and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266–274, College Park, MD.

Levin, Beth. 1985. Introduction. In Beth Levin, editor, Lexical Semantics in Review, number 1 in Lexicon Project Working Papers. Centre for Cognitive Science, MIT, Cambridge, MA, pages 1–62.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press, Chicago, IL.

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity. MIT Press, Cambridge, MA.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Ohio State University.

McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of ANLP-NAACL 2000, pages 256–263, Seattle, WA.

McCarthy, Diana and Anna Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1493–1495, Montreal. Université de Montréal.

Merlo, Paola and Suzanne Stevenson. 1998. What grammars tell us about corpora: the case of reduced relative clauses. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 134–142, Montreal.

Merlo, Paola and Suzanne Stevenson. 2000a. Establishing the upper-bound and inter-judge agreement in a verb classification task. In Second International Conference on Language Resources and Evaluation (LREC-2000), volume 3, pages 1659–1664.

Merlo, Paola and Suzanne Stevenson. 2000b. Lexical syntax and parsing architecture. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing. Cambridge University Press, Cambridge, pages 161–188.

Moravcsik, Edith and Jessica Wirth. 1983. Markedness: an overview. In Fred Eckman, Edith Moravcsik, and Jessica Wirth, editors, Markedness. Plenum Press, New York, NY, pages 1–13.

Oishi, Akira and Yuji Matsumoto. 1997. Detecting the organization of semantic subclasses of Japanese verbs. International Journal of Corpus Linguistics, 2(1):65–89.

Palmer, Martha. 2000. Consistent criteria for sense distinctions. Special Issue of Computers and the Humanities: SENSEVAL98: Evaluating Word Sense Disambiguation Systems, 34(1–2):217–222.

Perlmutter, David. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 4, pages 157–189.

Pinker, Steven. 1989. Learnability and Cognition: the Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Quinlan, J. Ross. 1992. C4.5: Programs for Machine Learning. Series in Machine Learning. Morgan Kaufmann, San Mateo, CA.

Ratnaparkhi, Adwait. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, Philadelphia, PA.

Resnik, Philip. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61(1–2):127–160.

Riloff, Ellen and Mark Schmelzenbach. 1998. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49–56.

Rosch, Eleanor. 1978. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Sanfilippo, Antonio and Victor Poznanski. 1992. The acquisition of lexical knowledge from combined machine-readable dictionary sources. In Proceedings of the Third Applied Natural Language Processing Conference, pages 80–87, Trento, Italy.

Schulte im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747–753, Saarbrücken, Germany.

Siegel, Eric. 1998. Linguistic Indicators for Language Understanding: Using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. thesis, Dept. of Computer Science, Columbia University.

Siegel, Eric. 1999. Corpus-based linguistic indicators for aspectual classification. In Proceedings of ACL '99, pages 112–119, College Park, MD. University of Maryland.

Siegel, Sidney and John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

Silverstein, Michael. 1976. Hierarchy of features and ergativity. In Robert Dixon, editor, Grammatical Categories in Australian Languages. Australian Institute of Aboriginal Studies, Canberra, pages 112–171.

Srinivas, Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Stede, Manfred. 1998. A generative perspective on verb alternations. Computational Linguistics, 24(3):401–430.

Stevenson, Suzanne and Paola Merlo. 1997a. Architecture and experience in sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 715–720.

Stevenson, Suzanne and Paola Merlo. 1997b. Lexical structure and processing complexity. Language and Cognitive Processes, 12(1–2):349–399.

Stevenson, Suzanne, Paola Merlo, Natalia Kariaeva, and Kamin Whitehouse. 1999. Supervised learning of lexical semantic verb classes using frequency distributions. In Proceedings of SigLex99: Standardizing Lexical Resources (SigLex '99), pages 15–21, College Park, MD.

Trubetzkoy, Nicolaj S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague, Prague.

Webster, Mort and Mitch Marcus. 1989. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Vancouver, Canada.


Appendix A

The following three tables contain the overall frequency and the normalized feature values for each of the 59 verbs in our experimental set.

Unergative Verbs

            Freq   vbn   pass  trans  caus  anim

Min Value      8   0.00  0.00  0.00   0.00  0.00
Max Value   4088   1.00  0.39  0.74   0.00  1.00

floated      176   0.43  0.26  0.74   0.00  0.17
hurried       86   0.40  0.31  0.50   0.00  0.37
jumped      4088   0.09  0.00  0.03   0.00  0.20
leaped       225   0.09  0.00  0.05   0.00  0.13
marched      238   0.10  0.01  0.09   0.00  0.12
paraded       33   0.73  0.39  0.46   0.00  0.50
raced        123   0.01  0.00  0.06   0.00  0.15
rushed       467   0.22  0.12  0.20   0.00  0.10
vaulted       54   0.00  0.00  0.41   0.00  0.03
wandered      67   0.02  0.00  0.03   0.00  0.32
galloped      12   1.00  0.00  0.00   0.00  0.00
glided        14   0.00  0.00  0.08   0.00  0.50
hiked         25   0.28  0.12  0.29   0.00  0.40
hopped        29   0.00  0.00  0.21   0.00  1.00
jogged         8   0.29  0.00  0.29   0.00  0.33
scooted       10   0.00  0.00  0.43   0.00  0.00
scurried      21   0.00  0.00  0.00   0.00  0.14
skipped       82   0.22  0.02  0.64   0.00  0.16
tiptoed       12   0.17  0.00  0.00   0.00  0.00
trotted       37   0.19  0.17  0.07   0.00  0.18

Unaccusative Verbs

            Freq   vbn   pass  trans  caus  anim

Min Value     13   0.16  0.00  0.02   0.00  0.00
Max Value   5543   0.95  0.80  0.76   0.41  0.36

boiled        58   0.92  0.70  0.42   0.00  0.00
cracked      175   0.61  0.19  0.76   0.02  0.14
dissolved    226   0.51  0.58  0.71   0.05  0.11
exploded     409   0.34  0.02  0.66   0.37  0.04
flooded      235   0.47  0.57  0.44   0.04  0.03
fractured     55   0.95  0.76  0.51   0.00  0.00
hardened     123   0.92  0.55  0.56   0.12  0.00
melted        70   0.80  0.44  0.02   0.00  0.19
opened      3412   0.21  0.09  0.69   0.16  0.36
solidified    34   0.65  0.21  0.68   0.00  0.12
collapsed    950   0.16  0.00  0.16   0.01  0.02
cooled       232   0.85  0.21  0.29   0.13  0.11
folded       189   0.73  0.33  0.23   0.00  0.00
widened     1155   0.18  0.02  0.13   0.41  0.01
changed     5543   0.73  0.23  0.47   0.22  0.08
cleared     1145   0.58  0.40  0.50   0.31  0.06
divided     1539   0.93  0.80  0.17   0.10  0.05
simmered      13   0.83  0.00  0.09   0.00  0.00
stabilized   286   0.92  0.13  0.18   0.35  0.00

Object-Drop Verbs

            Freq   vbn   pass  trans  caus  anim

Min Value     39   0.10  0.04  0.21   0.00  0.00
Max Value  15063   0.95  0.99  1.00   0.24  0.42

carved       185   0.85  0.66  0.98   0.00  0.00
danced        88   0.22  0.14  0.37   0.00  0.00
kicked       308   0.30  0.18  0.97   0.00  0.33
knitted       39   0.95  0.99  0.93   0.00  0.00
painted      506   0.72  0.18  0.71   0.00  0.38
played      2689   0.38  0.16  0.24   0.00  0.00
reaped       172   0.56  0.05  0.90   0.00  0.22
typed         57   0.81  0.74  0.81   0.00  0.00
washed       137   0.79  0.60  1.00   0.00  0.00
yelled        74   0.10  0.04  0.38   0.00  0.00
borrowed    1188   0.77  0.15  0.60   0.13  0.19
inherited    357   0.60  0.13  0.64   0.06  0.32
organized   1504   0.85  0.38  0.65   0.18  0.07
rented       232   0.72  0.22  0.61   0.00  0.42
sketched      44   0.67  0.17  0.44   0.00  0.20
cleaned      160   0.83  0.47  0.21   0.05  0.21
packed       376   0.84  0.12  0.40   0.05  0.19
studied      901   0.66  0.17  0.57   0.05  0.11
swallowed    152   0.79  0.44  0.35   0.04  0.22
called     15063   0.56  0.22  0.72   0.24  0.16
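To give a feel for how these normalized feature vectors separate the classes, here is a toy nearest-centroid check. The centroid vectors are rough, hand-rounded approximations of the class means suggested by the tables above (feature order: vbn, pass, trans, caus, anim), not values reported in the paper:

```python
# Rough, hand-approximated class centroids (vbn, pass, trans, caus, anim).
centroids = {
    "unergative":   (0.17, 0.07, 0.22, 0.00, 0.24),
    "unaccusative": (0.63, 0.33, 0.40, 0.12, 0.07),
    "object-drop":  (0.65, 0.31, 0.62, 0.04, 0.16),
}

def nearest(v):
    """Class whose centroid has the smallest squared Euclidean distance."""
    return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))

# "jumped" (unergative) and "washed" (object-drop), using their table rows:
assert nearest((0.09, 0.00, 0.03, 0.00, 0.20)) == "unergative"
assert nearest((0.79, 0.60, 1.00, 0.00, 0.00)) == "object-drop"
```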

Appendix B

Performance of all the subsets of features, in order of decreasing accuracy. To determine whether the difference between any two results is statistically significant, the 95% confidence interval can be calculated for each of the two results, and the two ranges checked to see whether they overlap. To do this, take each accuracy plus and minus 2.01 times its associated standard error to get the 95% confidence range (df = 49, t = 2.01). If the two ranges overlap, then the difference in accuracy is not significant at the p < .05 level.

Accuracy  SE   Features

69.8      0.5  trans pass vbn caus anim
69.8      0.5  trans vbn caus anim
67.3      0.6  trans pass vbn anim
66.7      0.5  trans vbn anim
66.5      0.5  trans pass caus anim
64.4      0.5  trans vbn caus
63.2      0.6  trans pass vbn caus
63.0      0.5  trans pass anim
62.9      0.4  trans caus anim
62.1      0.5  caus anim
61.7      0.5  trans pass caus
61.6      0.6  pass vbn caus anim
60.1      0.4  vbn caus anim
59.5      0.6  trans anim
59.4      0.5  vbn anim
57.4      0.6  pass vbn caus
57.3      0.5  trans caus
57.3      0.5  pass vbn anim
56.7      0.5  pass caus anim
55.7      0.5  vbn caus
55.7      0.1  caus
55.4      0.4  pass caus
55.0      0.6  trans pass vbn
54.7      0.4  trans pass
54.2      0.5  trans vbn
52.5      0.5  vbn
50.9      0.5  pass anim
50.2      0.6  pass vbn
50.2      0.5  pass
47.1      0.4  trans
35.3      0.5  anim
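The significance check described above can be expressed directly, using two results from the table (the full feature set at 69.8 +/- 0.5 and trans alone at 47.1 +/- 0.4):

```python
T = 2.01  # critical t value for df = 49, two-sided p < .05

def ci(accuracy, se):
    """95% confidence interval: accuracy plus/minus 2.01 standard errors."""
    return (accuracy - T * se, accuracy + T * se)

def significantly_different(acc1, se1, acc2, se2):
    """Two results differ significantly iff their 95% CIs do not overlap."""
    lo1, hi1 = ci(acc1, se1)
    lo2, hi2 = ci(acc2, se2)
    return hi1 < lo2 or hi2 < lo1

# Full feature set vs. trans alone: intervals are disjoint, so significant.
assert significantly_different(69.8, 0.5, 47.1, 0.4)
# Two identical results trivially overlap: not significant.
assert not significantly_different(69.8, 0.5, 69.8, 0.5)
```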
