
Parse correction with specialized models for difficult

attachment types

Enrique Henestroza Anguiano, Marie Candito

To cite this version:

Enrique Henestroza Anguiano, Marie Candito. Parse correction with specialized models for difficult attachment types. EMNLP 2011 - The 2011 Conference on Empirical Methods in Natural Language Processing, Jul 2011, Edinburgh, United Kingdom. To appear, 2011. <hal-00602083>

HAL Id: hal-00602083

https://hal.archives-ouvertes.fr/hal-00602083

Submitted on 21 Jun 2011



Parse Correction with Specialized Models for Difficult Attachment Types

Enrique Henestroza Anguiano and Marie Candito
Alpage (Université Paris Diderot / INRIA)

Paris, [email protected], [email protected]

Abstract

This paper develops a framework for syntactic dependency parse correction. Dependencies in an input parse tree are revised by selecting, for a given dependent, the best governor from within a small set of candidates. We use a discriminative linear ranking model to select the best governor from a group of candidates for a dependent, and our model includes a rich feature set that encodes syntactic structure in the input parse tree. The parse correction framework is parser-agnostic, and can correct attachments using either a generic model or specialized models tailored to difficult attachment types like coordination and pp-attachment. Our experiments show that parse correction, combining a generic model with specialized models for difficult attachment types, can successfully improve the quality of predicted parse trees output by several representative state-of-the-art dependency parsers for French.

1 Introduction

In syntactic dependency parse correction, attachments in an input parse tree are revised by selecting, for a given dependent, the best governor from within a small set of candidates. The motivation behind parse correction is that attachment decisions, especially traditionally difficult ones like pp-attachment and coordination, may require substantial contextual information in order to be made accurately. Because syntactic dependency parsers predict the parse tree for an entire sentence, they may not be able to take into account sufficient context when making attachment decisions, due to computational complexity. Assuming nonetheless that a predicted parse tree is mostly accurate, parse correction can revise difficult attachments by using the predicted tree's syntactic structure to restrict the set of candidate governors and extract a rich set of features to help select among them. Parse correction is also appealing because it is parser-agnostic: it can be trained to correct the output of any dependency parser.

In Section 2 we discuss work related to parse correction, pp-attachment and coordination resolution. In Section 3 we discuss dependency structure and various statistical dependency parsing approaches. In Section 4 we introduce the parse correction framework, and Section 5 describes the features and learning model used in our implementation. In Section 6 we present experiments in which parse correction revises the predicted parse trees of four state-of-the-art dependency parsers for French. We provide concluding remarks in Section 7.

2 Related Work

Previous research directly concerning parse correction includes that of Attardi and Ciaramita (2007), working on English and Swedish, who use an approach that considers a fixed set of revision rules: each rule describes movements in the parse tree leading from a dependent's original governor to a new governor, and a classifier is trained to select the correct revision rule for a given dependent. One drawback of this approach is that the classes lack semantic coherence: a sequence of movements does not necessarily have the same meaning across different syntactic trees.


Hall and Novák (2005), working on Czech, define a neighborhood of candidate governors centered around the original governor of a dependent, and a Maximum Entropy model determines the probability of each candidate-dependent attachment. We follow primarily from their work in our use of neighborhoods to delimit the set of candidate governors. Our main contributions are: specialized corrective models for difficult attachment types (coordination and pp-attachment) in addition to a general corrective model; more sophisticated features, feature combinations, and feature selection; and a ranking model trained directly to select the true governor from among a set of candidates.

There has also been other work on techniques similar to parse correction. Attardi and Dell'Orletta (2009) investigate reverse revision: a left-to-right transition-based model is first used to parse a sentence, then a right-to-left transition-based model is run with additional features taken from the left-to-right model's predicted parse. This approach leads to improved parsing results on a number of languages. While their approach is similar to parse correction in that it uses a predicted parse to inform a subsequent processing step, this information is used to improve a second parser rather than a model for correcting errors. McDonald and Pereira (2006) consider a method for recovering non-projective attachments from a graph representation of a sentence, in which an optimal projective parse tree has been identified. The parse tree's edges are allowed to be rearranged in ways that introduce non-projectivity in order to increase its overall score. This rearrangement approach resembles parse correction because it is a second step that can revise attachments made in the first step, but it differs in a number of ways: it is dependent on a graph-based parsing approach, it does not model errors made by the parser, and it can only output non-projective variants of the predicted parse tree.

As a process that revises the output of a syntactic parser, parse reranking is also similar to parse correction. A well-studied subject (e.g. the work of Charniak and Johnson (2005) and of Collins and Koo (2005)), parse reranking is concerned with the reordering of n-best ranked parse trees output by a syntactic parser. Parse correction has a number of advantages compared to reranking: it can be used with parsers that do not output n-best ranked parses, it can be easily restricted to specific attachment types, and its output space of parse trees is not limited to those appearing in an n-best list. However, parse reranking has the advantage of selecting the globally optimal parse for a sentence from an n-best list, while parse correction makes only locally optimal revisions in the predicted parse for a sentence.

2.1 Difficult Attachment Types

Research on pp-attachment traditionally formulates the problem in isolation, as in the work of Pantel and Lin (2000) and of Olteanu and Moldovan (2005). Examples consist of tuples of the form (v, n1, p, n2), where either v or n1 is the true governor of the pp comprising p and n2, and the task is to choose between v and n1. Recently, Atterer and Schütze (2007) have criticized this formulation as unrealistic because it uses an oracle to select candidate governors, and they find that successful approaches for the isolated problem perform no better than state-of-the-art parsers on pp-attachment when evaluated on full sentences. With parse correction, candidate governors are identified automatically with no (v, n1, p, n2) restriction, and for several representative parsers we find that parse correction improves pp-attachment performance.

Research on coordination resolution has also often formulated the problem in isolation. Resnik (1999) uses semantic similarity to resolve noun-phrase coordination of the form (n1, cc, n2, n3), where the coordinating conjunction cc coordinates either the heads n1 and n2 or the heads n1 and n3. The same criticism as the one made by Atterer and Schütze (2007) for pp-attachment might be applied to this approach to coordination resolution. In another formulation, the input consists of a raw sentence, and coordination structure is then detected and disambiguated using discriminative learning models (Shimbo and Hara, 2007) or coordination-specific parsers (Hara et al., 2009). Finally, other work has focused on introducing specialized features for coordination into existing syntactic parsing models (Hogan, 2007). Our approach is novel with respect to previous work by directly modeling the correction of coordination errors made by general-purpose dependency parsers.


[Figure 1 (tree drawing not reproduced): An unlabeled dependency tree for Elle ouvrit la porte avec la clé. (She opened the door with the key.)]

3 Dependency Parsing

Dependency syntax involves the representation of syntactic information for a sentence in the form of a directed graph, whose edges encode word-to-word relationships. An edge from a governor to a dependent indicates, roughly, that the presence of the dependent is syntactically legitimated by the governor. An important property of dependency syntax is that each word, except for the root of the sentence, has exactly one governor; dependency syntax is thus represented by trees. Figure 1 shows an example of an unlabeled dependency tree.¹ For languages like English or French, most sentences can be represented with a projective dependency tree: for any edge from word g to word d, g dominates any intervening word between g and d.

Dependency trees are appealing syntactic representations, closer than constituency trees to the semantic representations useful for NLP applications. This is true even with the projectivity requirement, which occasionally creates syntax-semantics mismatches. Dependency trees have recently seen a surge of interest, particularly with the introduction of supervised models for dependency parsing using linear classifiers. Such parsers fall into two main categories: transition-based parsing and graph-based parsing. Additionally, an alternative method for obtaining the dependency parse for a sentence is to parse the sentence with a constituency-based parser and then use an automatic process to convert the output into dependency structure.

¹Edges are generally labeled with the surface grammatical function that the dependent bears with respect to its governor. In this paper we focus on unlabeled dependency parsing, setting aside labeling as a separate task.
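Since the correction framework operates directly on dependency trees, a concrete representation helps ground the pseudo-code in later sections. The sketch below is our own illustration, not the paper's implementation: a tree stored as one governor index per token, instantiated with the Figure 1 example (with avec attached to the verb, as drawn). The class name DepTree and its methods are hypothetical.

```python
# Minimal unlabeled dependency tree: one governor index per token.
# Names and structure are illustrative, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class DepTree:
    tokens: list  # index 0 is an artificial ROOT
    gov: list     # gov[i] = index of the governor of token i (0 = root attachment)

    def dependents(self, i):
        """Indices of the direct dependents of token i."""
        return [d for d in range(1, len(self.tokens)) if self.gov[d] == i]

    def dominates(self, g, d):
        """True if g is an ancestor of d (used to rule out cycle-creating updates)."""
        while d != 0:
            d = self.gov[d]
            if d == g:
                return True
        return False

# The Figure 1 example: "Elle ouvrit la porte avec la clé".
tree = DepTree(
    tokens=["ROOT", "Elle", "ouvrit", "la", "porte", "avec", "la", "clé"],
    gov=[0, 2, 0, 4, 2, 2, 7, 5],
)
print(tree.dependents(2))  # [1, 4, 5]
```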

3.1 Transition-Based Parsing

In transition-based dependency parsing, whose seminal works are those of Yamada and Matsumoto (2003) and Nivre (2003), the parsing process applies a sequence of incremental actions, which typically manipulate a buffer position in the sentence and a stack for built sub-structures. Actions are of the type "read word from buffer", "build a dependency from the node on top of the stack to the node that begins the buffer", etc. In a greedy version of this process, the action to apply at each step is deterministically chosen to be the best-scoring action according to a classifier, which is trained on a dependency treebank converted into sequences of actions. The strengths of this framework are O(n) time complexity and a lack of restrictions on the locality of features. A major drawback is its greedy behavior: it can potentially make difficult attachment decisions early in the processing of a sentence, without being able to reconsider them when more information becomes available. Beamed versions of the algorithm (Johansson and Nugues, 2006) partially address this problem, but still do not provide a global optimization for selecting the output parse tree.

3.2 Graph-Based Parsing

In graph-based dependency parsing, whose seminal work is that of McDonald et al. (2005), the parsing process selects the globally optimal parse tree from a graph containing attachments (directed edges) between each pair of words (nodes) in a sentence. It finds the k-best scoring parse trees, both during training and at parse time, where the score of a tree is the sum of the scores of its factors (consisting of one or more linked edges). While large factors are desirable for capturing sophisticated linguistic constraints, they come at the cost of time complexity: for the projective case, adaptations of Eisner's algorithm (Eisner, 1996) are O(n³) for 1-edge factors (McDonald et al., 2005) or sibling 2-edge factors (McDonald and Pereira, 2006), and O(n⁴) for general 2-edge factors (Carreras, 2007) or 3-edge factors (Koo and Collins, 2010).

3.3 Constituency-Based Parsing

Beyond the two main approaches to dependency parsing, there is also the approach of constituency-based parsing followed by a conversion step to dependency structure.


We use the three-step parsing architecture previously tested for French by Candito et al. (2010a): (i) a constituency parse tree is output by the BerkeleyParser, which has been trained to learn a probabilistic context-free grammar with latent annotations (Petrov et al., 2006) that has parsing time complexity O(n³) (Matsuzaki et al., 2005); (ii) a functional role labeler using a Maximum Entropy model adds functional annotations to links between a verb and its dependents; (iii) constituency trees are automatically converted into projective dependency trees, with remaining unlabeled dependencies assigned labels using a rule-based approach.

3.4 Baseline Parsers

In this paper, we use the following baseline parsers: MaltParser (Nivre et al., 2007) for transition-based parsing; MSTParser (McDonald et al., 2005) (with sibling 2-edge factors) and BohnetParser (Bohnet, 2010) (with general 2-edge factors) for graph-based parsing; and BerkeleyParser (Petrov et al., 2006) for constituency-based parsing.

For MaltParser and MSTParser, we use the best settings from a benchmarking of parsers for French (Candito et al., 2010b), except that we remove unsupervised word clusters as features. The parsing models are thus trained using features including predicted part-of-speech tags, lemmas and morphological features. For BohnetParser, we trained a new model using these same predicted features. For BerkeleyParser, which was included in the benchmarking experiments, we trained a model using the so-called "desinflection" process that addresses data sparseness due to morphological variation: both at training and parsing time, terminal symbols are word forms in which redundant morphological suffixes are removed, provided the original part-of-speech ambiguities are kept (Candito et al., 2010b).

All models are trained on the French Treebank (FTB) (Abeillé and Barrier, 2004), consisting of 12,351 sentences from the Le Monde newspaper, either "desinflected" for the BerkeleyParser, or converted to projective dependency trees (Candito et al., 2010a) for the three dependency-native parsers.²

²The projectivity constraint is linguistically valid for most French parses: the authors report < 2% non-projective edges in a hand-corrected subset of the converted FTB.

INPUT: predicted parse tree T
LOOP: for each chosen dependent d ∈ D
  • Identify candidates C_d from T
  • Predict c = argmax_{c ∈ C_d} S(c, d, T)
  • Update T{gov(d) ← c}
OUTPUT: corrected version of parse tree T

Figure 2: The parse correction algorithm.

For the dependency-native models, features include predicted part-of-speech (POS) tags from the MElt tagger (Denis and Sagot, 2009), as well as predicted lemmas and morphological features from the Lefff lexicon (Sagot, 2010). These models constitute the state of the art for French dependency parsing: unlabeled attachment scores (UAS) on the FTB test set are 89.78% for MaltParser, 91.04% for MSTParser, 91.78% for BohnetParser, and 90.73% for BerkeleyParser.

4 Parse Correction

The parse correction algorithm is a post-processing step to dependency parsing, where attachments from the predicted parse tree of a sentence are corrected by considering alternative candidate governors for each dependent. This process can be useful for attachments made too early in transition-based parsing, or with features that are too local in MST-based parsing.

The input is the predicted parse T of a sentence. From T a set D of dependent nodes is chosen for attachment correction. For each d ∈ D in left-to-right sentence order, a set C_d of candidate governors from T is identified, and then the highest scoring c ∈ C_d, using a function S(c, d, T), is assigned as the new governor of d in T. Pseudo-code for parse correction is shown in Figure 2.³

³Contrary to Hall and Novák (2005), our iterative algorithm (along with the fact that C_d never includes nodes that are dominated by d) ensures that corrected structures are trees, so it does not require additional processing to eliminate cycles and preserve connectivity.
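Assuming the DepTree sketch from Section 3 and helper functions developed in the following subsections (identify_candidates for Section 4.2, score for Section 4.3; both names are ours), the loop of Figure 2 might be rendered as follows.

```python
def correct_parse(tree, dependents, identify_candidates, score):
    """One left-to-right corrective pass over a predicted parse tree (Figure 2).

    dependents:          token indices chosen for correction (Section 4.1)
    identify_candidates: (tree, d) -> candidate governor indices (Section 4.2)
    score:               (c, d, tree) -> float, the function S(c, d, T) (Section 4.3)
    """
    for d in dependents:  # left-to-right sentence order
        candidates = identify_candidates(tree, d)
        if candidates:
            best = max(candidates, key=lambda c: score(c, d, tree))
            tree.gov[d] = best  # the update T{gov(d) <- c}
    return tree
```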


4.1 Choosing Dependents

Various criteria may be used to choose the set D of dependents to correct. In the work of Hall and Novák (2005) and of Attardi and Ciaramita (2007), D contains all nodes in the input parse tree. However, one advantage of parse correction is its ability to focus on specific attachment types, so an additional criterion for choosing dependents is to look separately at those dependents that correspond to difficult attachment types.

Analyzing errors made by the dependency parsers introduced in Section 3 on the development set of the FTB, we observe that two major sources of error across different parsers are coordination and pp-attachment. Coordination accounts for around 10% of incorrect attachments and has an error rate ranging from 30% to 40%, while pp-attachment accounts for around 30% of incorrect attachments and has an error rate of around 15%.

In this paper, we pay special attention to coordination and pp-attachment. Given the FTB annotation scheme, coordination can be corrected by changing the governor (first conjunct) of the coordinating conjunction that governs the second conjunct, and pp-attachment can be corrected by changing the governor of the preposition that heads the pp.⁴ We thus train specialized corrective models for when the dependents are coordinating conjunctions and prepositions, in addition to a generic corrective model that can be applied to any dependent.⁵

4.2 Identifying Candidate Governors

The set of candidate governors C_d for a dependent d can be chosen in different ways. One method is to let every other node in T be a candidate governor for d. However, parser error analysis has shown that errors often occur in local contexts. Hall and Novák (2005) define a neighborhood as a set of nodes N_m(d) around the original predicted governor c_o of d, where N_m(d) includes all nodes in the parse tree T within graph distance m of d whose path passes through c_o.

⁴The FTB handles pp-attachment in a typical fashion, but coordination may be handled differently by other schemes (e.g. the coordinating conjunction governs both conjuncts).

⁵In our experiments, we never revise punctuation and clitic dependents. Since punctuation attachments mostly carry little meaning, they are often annotated inconsistently and ignored in parsing evaluations (including ours). Clitics are not revised because they have a very low attachment error rate (2%).

They find that around 2/3 of the incorrect attachments in the output of Czech parses can be corrected by selecting the best governor from within N_3(d). Similarly, in oracle experiments reported in Section 6, we find that around 1/2 of coordination and pp-attachments in the output of French parses can be corrected by selecting the best governor from within N_3(d). We thus use neighborhoods to delimit the set of candidate governors.

While one can simply assign C_d ← N_m(d), we add additional restrictions. First, in order to preserve projectivity within T, we keep in C_d only those c such that the update T{gov(d) ← c} would result in a projective tree.⁶ Additionally, we discard candidates with certain POS categories that are very unlikely to be governors: clitics and punctuation are always discarded, while determiners are discarded if the dependent is a preposition.
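A possible rendering of neighborhood-based candidate identification, reusing the DepTree sketch from Section 3. The BFS and the cycle filter follow the description above; the projectivity and POS filters are noted but not implemented, so this is a sketch rather than the paper's exact procedure.

```python
from collections import deque

def neighborhood_candidates(tree, d, m=3):
    """Candidates N_m(d): nodes within graph distance m of dependent d whose
    path passes through the original governor c_o (Hall and Novák, 2005).
    Projectivity and POS filtering from Section 4.2 would follow."""
    c_o = tree.gov[d]
    # Undirected adjacency over the tree's edges.
    adj = {i: set() for i in range(len(tree.tokens))}
    for dep in range(1, len(tree.tokens)):
        adj[dep].add(tree.gov[dep])
        adj[tree.gov[dep]].add(dep)
    # BFS from c_o; marking d as seen forces all paths through c_o.
    seen = {d, c_o}
    queue = deque([(c_o, 1)])
    found = []
    while queue:
        node, dist = queue.popleft()
        found.append(node)
        if dist < m:
            for nxt in adj[node] - seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    # Never propose a node dominated by d (the update would create a cycle).
    return [c for c in found if not tree.dominates(d, c)]
```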

4.3 Scoring Candidate Governors

A new governor c for a dependent d is predicted by selecting the highest scoring candidate c ∈ C_d according to a function S(c, d, T), which takes into account features over c, d, and the parse tree T. We use a linear model for our scoring function, which allows for relatively fast training and prediction. Our scoring function uses a weight vector w ∈ F, where F is the feature space for the dependents we wish to correct (either generic, or specialized for prepositions or for coordinating conjunctions), as well as the mapping Φ : C × D × T → F from combinations of candidate c ∈ C, dependent d ∈ D, and parse tree T ∈ T, to vectors in the feature space F. The scoring function returns the inner product of w and Φ(c, d, T):

S(c, d, T) = w · Φ(c, d, T)    (1)
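With binary indicator features, Equation (1) amounts to a sparse dot product. A minimal sketch follows; the dict-based weight representation and the function names are our choices, not the paper's.

```python
def make_scorer(weights, phi):
    """S(c, d, T) = w . Phi(c, d, T) as a sparse dot product.

    weights: dict mapping feature name -> learned weight (the vector w)
    phi:     (c, d, tree) -> iterable of active binary feature names
    """
    def score(c, d, tree):
        return sum(weights.get(f, 0.0) for f in phi(c, d, tree))
    return score
```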

4.4 Algorithm Complexity

The time complexity of our algorithm is O(n) in the length n of the input sentence, which is consistent with past work on parse correction by Hall and Novák (2005) and by Attardi and Ciaramita (2007).

⁶We also keep candidates that would lead to a non-projective tree, as long as it would be projective if we ignored punctuation. This relaxation of the projectivity constraint leads to better oracle scores while retaining the key linguistic properties of projectivity.


Attachments for up to n dependents in a sentence are deterministically corrected in one pass. For each such dependent d, the algorithm uses a linear model to select a new governor after extracting features for a local set of candidate governors C_d, whose size does not depend on n in the average case.⁷ Locality in candidate governor identification and feature extraction preserves linear time complexity in the overall algorithm.

5 Model Learning

We now discuss our training setup, features, and learning approach for obtaining the weight vector w.

5.1 Training Setup

The parse correction training set pairs gold parse trees with corresponding predicted parse trees output by a syntactic parser, and it is obtained using a jackknifing procedure to automatically parse the gold-annotated training section of a dependency treebank with a syntactic dependency parser.

We extract separate training sets for each type of dependent we wish to correct (generic, prepositions, coordinating conjunctions). Given a parser p, for each token d we wish to correct in a sentence in the training section, we note its true governor g_d in the gold parse tree of the sentence, identify a set of candidate governors C_d in the predicted parse T, and get feature vectors {Φ(c, d, T) : c ∈ C_d}.
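A sketch of how these training instances could be assembled from jackknifed predictions. The function names (choose_dependents, identify_candidates, phi) are placeholders for the components described elsewhere in the paper; the instance format matches the ranking setup of Section 5.3.

```python
def extract_training_instances(gold_trees, predicted_trees, choose_dependents,
                               identify_candidates, phi):
    """Build ranking instances from jackknifed predictions paired with gold trees.

    Each instance is ({candidate: active features}, true governor); instances
    whose true governor falls outside the candidate set are discarded, as in
    Section 5.3."""
    instances = []
    for gold, pred in zip(gold_trees, predicted_trees):
        for d in choose_dependents(pred):
            candidates = identify_candidates(pred, d)
            g = gold.gov[d]  # true governor, read off the gold tree
            if g in candidates:
                vectors = {c: list(phi(c, d, pred)) for c in candidates}
                instances.append((vectors, g))
    return instances
```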

5.2 Feature Space

In order to learn an effective scoring function, we use a rich feature space F that encodes syntactic context surrounding a candidate-dependent pair (c, d) within a parse tree T. Our primary features are indicator functions for realizations of linguistic or tree-based feature classes.⁸ From these primary features we generate more complex feature combinations (combos) of length up to P, which are then added to F. Each combo represents a set of one or more primary features, and is an indicator function that fires if and only if all of its members do.

⁷Degenerate parse trees (e.g. flat trees) could lead to cases where |C_d| = n, but for linguistically coherent parse trees |C_d| is rather O(k^m), where k is the average arity of syntactic parse trees and m is the neighborhood distance used.

⁸For instance, there is a binary feature that is 1 if the feature class "POS of c" takes on the value "verb", and 0 otherwise.
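Because a combo fires iff all of its member features fire, combos can be generated directly from the active primary features of an instance. A minimal sketch, with illustrative feature names of our own:

```python
from itertools import combinations

def expand_combos(active_primary, max_len=3):
    """Combos up to length P over the *active* primary features of an instance:
    only combinations of active features can be active themselves."""
    active = sorted(active_primary)  # canonical order gives canonical combo names
    return ["&".join(combo)
            for r in range(1, max_len + 1)
            for combo in combinations(active, r)]

print(expand_combos({"cand_pos=verb", "dep_pos=prep"}, max_len=2))
# ['cand_pos=verb', 'dep_pos=prep', 'cand_pos=verb&dep_pos=prep']
```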

5.2.1 Primary Feature Classes

The primary feature classes we use are listed below, grouped into categories corresponding to their use in different corrective models (d_obj is the object of the dependent, c_gov is the governor of the candidate, and c_{d-1} and c_{d+1} are the closest dependents of c linearly to the left and right, respectively, of d); a short code sketch of two example extractors follows the lists.

Generic features (always included)

− POS, lemma, and number of dependents of c
− POS and dependency label of c_{d-1}
− POS and dependency label of c_{d+1}
− POS of c_gov
− POS and lemma of d
− POS of d_obj, and whether d_obj has a determiner
− Whether c is the predicted governor of d
− Binned linear distance between c and d
− Linear direction of c with respect to d
− POS sequence for nodes on the path from c to d
− Graph distance between c and d
− Whether there is punctuation between c and d

Features exclusive to coordination

Whether d would coordinate two conjuncts that:
− Have the same POS
− Have the same word form
− Have number agreement
− Are both nouns with the same cardinality
− Are both proper nouns or both common nouns
− Are both prepositions with the same word form
− Are both prepositions with objects of the same POS

Features exclusive to pp-attachment

− Whether d immediately follows a punctuation mark
− Whether d heads a pp likely to be the agent of a passive verb
− If c is a coordinating conjunction: whether c would coordinate two prepositions with the same word form, and whether there is at least one open-category word linearly between c and d (in which case c is an unlikely governor)


− If c is linearly after d: whether there exists a plausible rival candidate to the left of d (implemented as whether there is a noun or adjective linearly before d, without any intervening finite verb)
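To make the feature classes concrete, here is a hedged sketch of two of the generic extractors, reusing the earlier DepTree. The distance bin boundaries and the ancestor-only path walk are our simplifications; the paper does not specify either in this detail.

```python
def binned_distance(c, d):
    """Binned linear distance between candidate and dependent; the bin
    boundaries here are invented for illustration."""
    dist = abs(c - d)
    for bound in (1, 2, 3, 5, 10):
        if dist <= bound:
            return "dist<=%d" % bound
    return "dist>10"

def pos_path(tree, pos, c, d):
    """POS sequence on the tree path from c to d, walking up from d through
    its governors; this covers only the case where c is an ancestor of d,
    which holds for neighbors reached via the original governor's subtree top."""
    path, node = [], d
    while node != c and node != 0:
        node = tree.gov[node]
        path.append(pos[node])
    return "path=" + "-".join(path)
```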

5.2.2 Feature Selection

Feature combos allow our models to effectively sidestep linearity constraints, at the cost of an exponential increase in the size of the feature space F. In order to accommodate combos, we use feature selection to help reduce the resulting space.

Our first feature selection technique is to apply a frequency threshold: if a feature or a combo appears less than K times among instances in our training set, we remove it from F. In addition to making the feature space more tractable, frequency thresholding makes our scoring function less reliant on rare features and combos.

Following frequency thresholding, we employ an additional technique using conditional entropy (CE) that we term CE-reduction. Let Y be a random variable for whether or not an attachment is true, and let A be a random variable for the different combos that can appear in an attachment. We calculate the CE of a combo a with respect to Y as follows:

H(Y | A = a) = − Σ_{y ∈ Y} p(y|a) log p(y|a)    (2)

where the probability p(y|a) is approximated from the training set as freq(a, y)/freq(a), with example balancing used here to account for more false attachments (Y = 0) than true ones (Y = 1) in our training set. Having calculated the CE of each combo, we remove from F those combos for which a subset combo (or feature) exists with equal or lesser CE. This eliminates any overly specific combo a when the extra features encoded in a, compared to some subset b, do not help a explain Y any better than b.
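A sketch of CE-reduction over the training instances built earlier. The example balancing mentioned above is omitted, and the quadratic subset scan is kept naive for clarity; neither choice reflects the paper's actual implementation.

```python
import math
from collections import Counter

def ce_reduce(instances):
    """Prune combos by conditional entropy: drop any combo for which a subset
    combo (or single feature) has equal or lesser H(Y | A=a)."""
    joint, total = Counter(), Counter()
    for vectors, g in instances:
        for c, feats in vectors.items():
            y = int(c == g)  # 1 for the true governor, 0 otherwise
            for a in feats:
                joint[(a, y)] += 1
                total[a] += 1

    def ce(a):  # H(Y | A=a), with p(y|a) estimated as freq(a, y) / freq(a)
        h = 0.0
        for y in (0, 1):
            p = joint[(a, y)] / total[a]
            if p > 0.0:
                h -= p * math.log(p)
        return h

    kept = set()
    for a in total:
        members = set(a.split("&"))
        subsets = [b for b in total if b != a and set(b.split("&")) <= members]
        if all(ce(b) > ce(a) for b in subsets):
            kept.add(a)
    return kept
```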

5.3 Ranking Model

The ranking setting for learning is used when a model needs to discriminate between mutually exclusive candidates that vary from instance to instance. This is typically the case in parse reranking (Charniak and Johnson, 2005), where for each sentence the model must select the correct parse from within an n-best list.

INPUT: aggressiveness C, rounds R
INITIALIZE: w_0 ← (0, ..., 0), w_avg ← (0, ..., 0)
REPEAT R times:
  LOOP: for t = 1, 2, ..., |X|
    · Get feature vectors {x_{t,c} : c ∈ C_{d_t}}
    · Get true governor g_t ∈ C_{d_t}
    · Let h_t = argmax_{c ∈ C_{d_t} − {g_t}} (w_{t−1} · x_{t,c})
    · Let m_t = (w_{t−1} · x_{t,g_t}) − (w_{t−1} · x_{t,h_t})
    IF m_t < 1:
      · Let τ_t = min{ C, (1 − m_t) / ‖x_{t,g_t} − x_{t,h_t}‖² }
      · Set w_t ← w_{t−1} + τ_t (x_{t,g_t} − x_{t,h_t})
    ELSE:
      · Set w_t ← w_{t−1}
    · Set w_avg ← w_avg + w_t
  · Set w_0 ← w_{|X|}
OUTPUT: w_avg / (R · |X|)

Figure 3: Averaged PA-Ranking training algorithm.

Denis and Baldridge (2007) also show that ranking outperforms a binary classification approach to pronoun resolution (using a Maximum Entropy model), where for each pronominal anaphor the model must select the correct antecedent among candidates in a text.⁹

In our ranking approach to parse correction (PA-Ranking), the weight vector is trained to select the true governor from a set of candidates C_d for a dependent d. The training set X is defined such that the t-th instance is a collection of feature vectors {x_{t,c} = Φ(c, d_t, T_t) : c ∈ C_{d_t}}, where C_{d_t} is the candidate set for the dependent d_t within the predicted parse T_t, and the class is the true governor g_t. Instances in which g_t ∉ C_{d_t} are discarded.

PA-Ranking training is carried out using a variation of the Passive-Aggressive algorithm (Crammer et al., 2006), which has been adapted to the ranking setting and implemented using the Polka library.¹⁰


⁹We considered a binary training approach to parse correction in which the model is trained to independently classify candidates as true or false governors, as used by Hall and Novák (2005). However, we found that this approach performs no better (and often worse) than the ranking approach, and is less appropriate from a modeling standpoint.

¹⁰http://polka.gforge.inria.fr/


For each training iteration t, the margin is defined as m_t = (w_{t−1} · x_{t,g_t}) − (w_{t−1} · x_{t,h_t}), where h_t is the highest scoring incorrect candidate. The algorithm is passive because an update to the weight vector is made if and only if m_t < 1, either for incorrect predictions (m_t < 0) or for correct predictions with insufficient margin (0 ≤ m_t < 1). The new weight vector w_t is as close as possible to w_{t−1}, subject to the aggressive constraint that the new margin be greater than 1. We use weight averaging, so the final output w_avg is the average over the weight vectors after each training step. Pseudo-code for the training algorithm is shown in Figure 3. The rounds parameter R determines the number of times to run through the training set, and the aggressiveness parameter C sets an upper limit on the update magnitude.
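A compact rendering of Figure 3 with sparse dict weight vectors, which is an implementation choice of ours (the paper uses the Polka library). Instances follow the format built in Section 5.1, with binary features.

```python
from collections import Counter

def pa_rank_train(instances, C=1.0, R=10):
    """Averaged PA-Ranking (Figure 3) over sparse binary feature vectors.
    Each instance is ({candidate: active features}, true governor)."""
    w, w_avg, steps = {}, {}, 0

    def dot(feats):
        return sum(w.get(f, 0.0) for f in feats)

    for _ in range(R):
        for vectors, g in instances:
            rivals = [c for c in vectors if c != g]
            if not rivals:
                continue
            h = max(rivals, key=lambda c: dot(vectors[c]))  # best wrong candidate h_t
            margin = dot(vectors[g]) - dot(vectors[h])      # m_t
            if margin < 1.0:                                # passive otherwise
                diff = Counter(vectors[g])                  # x_{t,g} - x_{t,h}
                diff.subtract(vectors[h])
                norm_sq = sum(v * v for v in diff.values())
                if norm_sq > 0:
                    tau = min(C, (1.0 - margin) / norm_sq)  # aggressiveness cap
                    for f, v in diff.items():
                        if v:
                            w[f] = w.get(f, 0.0) + tau * v
            for f, v in w.items():                          # running sum for averaging
                w_avg[f] = w_avg.get(f, 0.0) + v
            steps += 1
    return {f: v / steps for f, v in w_avg.items()} if steps else w
```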

6 Experiments

We present experiments where we applied parse correction to the output of four state-of-the-art dependency parsers for French. We conducted our evaluation on the FTB using the standard training, development (dev), and test splits (containing 9,881, 1,235 and 1,235 sentences, respectively). To train our parse correction models, we generated specialized training sets corresponding to each parser by doing 10-fold jackknifing on the FTB training set (cf. Section 5.1). Each parser was run on the FTB dev and test sets, providing baseline unlabeled attachment score (UAS) results and output parse trees to be corrected.

6.1 Oracles and Neighborhood Size

To determine candidate neighborhood size, we considered an oracle scoring function that always selects the true governor of a dependent if it appears in the set of candidate governors, and otherwise selects the predicted governor. Results for this oracle on the dev set are shown in Table 1. The baseline corresponds to m = 1, where the oracle just selects the predicted governor. Incrementing m to 2 and to 3 resulted in substantial gains in oracle UAS, but further incrementing m to 4 resulted in a relatively small additional gain. We found that average candidate set size increases about linearly in m, so we decided to use m = 3 in order to have a high UAS upper bound without adding candidates that are very unlikely to be true governors.

                      Neighborhood size (m)
Parser     Type      Base     2      3      4
Berkeley   Coords    67.2   76.5   82.8   84.8
           Preps     82.9   88.5   92.2   93.2
           Overall   90.1   94.0   96.0   96.5
Bohnet     Coords    70.1   80.6   85.6   87.7
           Preps     85.4   89.4   93.4   94.5
           Overall   91.2   94.4   96.1   96.6
Malt       Coords    60.9   72.2   78.2   80.5
           Preps     82.6   88.1   92.6   93.7
           Overall   89.3   93.2   95.1   95.8
MST        Coords    63.6   73.7   80.7   84.4
           Preps     84.7   89.4   93.4   94.4
           Overall   90.2   93.7   95.6   96.2
MST        Overall, reranking top-100 parses: 95.4

Table 1: Parse correction oracle UAS (%) for different neighborhood sizes (Base corresponds to m = 1), by dependent type (coordinating conjunctions, prepositions, or all dependents). Also shown: a reranking oracle for MSTParser using the top-100 parses.

We also compared the oracle for parse correction with an oracle for parse reranking, in which the parse with the highest UAS for a sentence is selected from the top-100 parses output by MSTParser. We found that for MSTParser, the oracle for parse correction using neighborhood size m = 3 (95.6% UAS) is comparable to the oracle for parse reranking using the top-100 parses (95.4% UAS). This is an encouraging result, showing that parse correction is capable of the same improvement as parse reranking without needing to process an n-best list of parses.
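The oracle used in these comparisons plugs directly into the correction loop sketched in Section 4: it returns the true governor whenever it is a candidate, and otherwise keeps the prediction. A minimal sketch; running correct_parse with this scorer over N_3 neighborhoods corresponds to the oracle setting of Table 1.

```python
def make_oracle_scorer(gold_tree):
    """Oracle S(c, d, T): the true governor wins whenever it is a candidate;
    otherwise the predicted governor is kept."""
    def score(c, d, tree):
        if c == gold_tree.gov[d]:
            return 2.0  # true governor always outranks the rest
        if c == tree.gov[d]:
            return 1.0  # fall back to the predicted governor
        return 0.0
    return score
```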

6.2 Feature Space Parameters

For the feature space F, we performed a grid search to find good values for the parameters K (frequency threshold), P (combo length), and CE-reduction. We found that P = 3 with CE-reduction allowed for the most compactness without sacrificing correction performance, for all of our corrective models. Additionally, K = 2 worked well for the coordinating conjunction models, while K = 10 worked well for the preposition and generic models. CE-reduction proved useful in greatly reducing the feature space without lowering correction performance: it reduced the size of the coordinating conjunction models from 400k to 65k features each, the preposition models from 400k to 75k features each, and the generic models from 800k to 200k features each.


                           Corrective UAS (%)
Parser     Configuration   Coords   Preps   Overall
Berkeley   Baseline         68.3    83.8    90.73
           Generic          69.4    84.9*   91.13*
           Specialized      71.5*   85.1*   91.23*
Bohnet     Baseline         70.5    86.1    91.78
           Generic          71.2    86.4    91.88
           Specialized      72.7*   86.2    91.88
Malt       Baseline         59.8    83.2    89.78
           Generic          63.2*   84.5*   90.39*
           Specialized      64.0*   85.0*   90.47*
MST        Baseline         60.5    85.9    91.04
           Generic          64.2*   86.2    91.25*
           Specialized      68.0*   86.2    91.36*

Table 2: Coordinating conjunction, preposition, and overall UAS (%) by corrective configuration on the test set. Significant improvements over the baseline are starred.

6.3 Corrective Configurations

For our evaluation of parse correction, we compared two different configurations: generic (corrects all dependents using the generic model) and specialized (corrects coordinating conjunctions and prepositions using their respective specialized models, and corrects other dependents using the generic model). The PA-Ranking aggressiveness parameter C was set to 1 for our experiments, while the rounds parameter R was tuned separately for each corrective model using the dev set. For our final tests, we applied each combination of parser + corrective configuration by sequentially revising all dependents in the output parse that had a relevant POS tag given the corrective configuration. In the FTB test set, this amounted to an evaluation over 5,706 preposition tokens, 801 coordinating conjunction tokens, and 31,404 overall (non-punctuation) tokens.¹¹

6.4 Results

Final results for the test set are shown in Table 2. The overall UAS of each parser (except BohnetParser) was significantly improved under both corrective configurations.¹²

¹¹Since the MElt tagger and BerkeleyParser POS tagging accuracies were around 97%, the sets of tokens considered for revision differed slightly from the sets of tokens (with gold POS tags) used to calculate UAS scores.

¹²We used McNemar's chi-squared test with p = 0.05 for all significance tests.

The specialized configuration performed as well as, and in most cases better than, the generic configuration, indicating the usefulness of specialized models and features for difficult attachment types. Interestingly, the lower the baseline parser's UAS, the larger the overall improvement from parse correction under the specialized configuration: MaltParser had the lowest baseline and the highest error reduction (6.8%), BerkeleyParser had the second-lowest baseline and the second-highest error reduction (5.4%), MSTParser had the third-lowest baseline and the third-highest error reduction (3.6%), and BohnetParser had the highest baseline and the lowest error reduction (1.2%). It may be that the additional errors made by a low-baseline parser, compared to a high-baseline parser, involve relatively simpler attachments that parse correction can better model.

Parse correction achieved significant improvements for coordination resolution under the specialized configuration for each parser. MaltParser and MSTParser had very low baseline coordinating conjunction UAS (around 60%), while BerkeleyParser and BohnetParser had higher baselines (around 70%). The highest error reduction was achieved by MSTParser (19.0%), followed by MaltParser (10.4%), BerkeleyParser (10.1%), and finally BohnetParser (7.5%). The result for MSTParser was surprising: although it had the second-highest baseline overall UAS, it shared the lowest baseline coordinating conjunction UAS and had the highest error reduction with parse correction. An explanation for this result is that the annotation scheme for coordination structure in the dependency FTB has the first conjunct governing the coordinating conjunction, which governs the second conjunct. Since MSTParser is limited to sibling 2-edge factors (cf. Section 3), it is unable to jointly consider a full coordination structure. BohnetParser, which uses general 2-edge factors, can consider full coordination structures and consequently has a much higher baseline coordinating conjunction UAS than MSTParser.

Parse correction achieved significant but modest improvements in pp-attachment performance under the specialized configuration for MaltParser and BerkeleyParser. However, parse correction did not significantly improve pp-attachment performance for MSTParser or BohnetParser, the two parsers that had the highest baseline preposition UAS (around 86%).


                      Modification type
Parser     Type      w→c   c→w   w→w    Mods
Berkeley   Coords     40    14    33    10.9%
           Preps     118    39    41     3.5%
           Overall   228    67   104     1.3%
Bohnet     Coords     32    15    33    10.0%
           Preps      52    46    32     2.3%
           Overall   150   121   130     1.1%
Malt       Coords     55    21    56    16.5%
           Preps     149    50    76     4.8%
           Overall   390   172   293     2.4%
MST        Coords     80    20    51    18.9%
           Preps      64    45    26     2.4%
           Overall   183    88   117     1.1%

Table 3: Breakdown of modifications made under the specialized configuration for each parser, by dependent type. w→c is wrong-to-correct, c→w is correct-to-wrong, w→w is wrong-to-wrong, and Mods is the percentage of tokens modified.

These results are a bit disappointing, but they suggest that there may be a performance ceiling for pp-attachment beyond which rich lexical information (syntactic and semantic) or full sentence contexts are needed. For English, the average human performance on pp-attachment for the (v, n1, p, n2) problem formulation is just 88.2% when given only the four head words, but increases to 93.2% when given the full sentence (Ratnaparkhi et al., 1994). If similar levels of human performance exist for French, additional sources of information may be needed to improve pp-attachment performance.

In addition to evaluating UAS improvements for parse correction, we took a closer look at the best corrective configuration (specialized) and analyzed the types of attachment modifications made (Table 3). In most cases there were around 2–3 times as many error-correcting modifications (w→c) as error-creating modifications (c→w), and the percentage of tokens modified was very low (around 1–2%). Parse correction is thus conservative in the number of modifications made, and rather accurate when it does decide to modify an attachment.

Finally, we compared the running times of the four parsers, as well as that of parse correction, on the test set using a 2.66 GHz Intel Core 2 Duo machine. BerkeleyParser took 600s, BohnetParser took 450s using both cores (800s using a single core), MaltParser took 45s, and MSTParser took 1000s. A rough version of parse correction in the specialized configuration took around 200s (for each parser). An interesting result is that parse correction improves MaltParser the most while retaining an overall time complexity of O(n), compared to O(n³) or higher for the other parsers. This suggests that linear-time transition-based parsing and parse correction could combine to form an attractive system that improves parsing performance while retaining high speed.

7 Conclusion

We have developed a parse correction framework for syntactic dependency parsing that uses specialized models for difficult attachment types. Candidate governors for a given dependent are identified in a neighborhood around the predicted governor, and a scoring function selects the best governor. We used discriminative linear ranking models with features encoding syntactic context, and we tested parse correction on coordination, pp-attachment, and generic dependencies in the outputs of four representative statistical dependency parsers for French. Parse correction achieved improvements in unlabeled attachment score for three out of the four parsers, with MaltParser seeing the greatest improvement. Since both MaltParser and parse correction run in O(n) time, a combined system could prove useful in situations where high parsing speed is required.

Future work on parse correction might focus on developing specialized models for other difficult attachment types, such as verb-phrase attachment (verb dependents account for around 15% of incorrect attachments across all four parsers). Also, selectional preferences and subcategorization frames (from hand-built resources or extracted using distributional methods) could make for useful features in the pp-attachment corrective model; we suspect that richer lexical information is needed in order to increase the currently modest improvements achieved by parse correction on pp-attachment.

Acknowledgments

We would like to thank Pascal Denis for his help using the Polka library, and Alexis Nasr for his advice and comments. This work was partially funded by the ANR project Sequoia (ANR-08-EMER-013).


References

A. Abeillé and N. Barrier. 2004. Enriching a French treebank. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, Lisbon, Portugal, May.

G. Attardi and M. Ciaramita. 2007. Tree revision learning for dependency parsing. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, pages 388–395, Rochester, New York, April.

G. Attardi and F. Dell'Orletta. 2009. Reverse revision and linear tree combination for dependency parsing. In Proceedings of the 2009 Conference of the North American Chapter of the Association for Computational Linguistics, pages 261–264, Boulder, Colorado, June.

M. Atterer and H. Schütze. 2007. Prepositional phrase attachment without oracles. Computational Linguistics, 33(4):469–476.

B. Bohnet. 2010. Very high accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 89–97, Beijing, China, August.

M. Candito, B. Crabbé, and P. Denis. 2010a. Statistical French dependency parsing: Treebank conversion and first results. In Proceedings of the Seventh International Conference on Language Resources and Evaluation, Valetta, Malta, May.

M. Candito, J. Nivre, P. Denis, and E. Henestroza Anguiano. 2010b. Benchmarking of statistical dependency parsers for French. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 108–116, Beijing, China, August.

X. Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL, pages 957–961, Prague, Czech Republic, June.

E. Charniak and M. Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 173–180, Ann Arbor, Michigan, June.

M. Collins and T. Koo. 2005. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25–70.

K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. 2006. Online passive-aggressive algorithms. The Journal of Machine Learning Research, 7:551–585.

P. Denis and J. Baldridge. 2007. A ranking approach to pronoun resolution. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1588–1593, Hyderabad, India, January.

P. Denis and B. Sagot. 2009. Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort. In Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Hong Kong, China, December.

J. M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the 16th Conference on Computational Linguistics, Volume 1, pages 340–345, Santa Cruz, California, August.

K. Hall and V. Novák. 2005. Corrective modeling for non-projective dependency parsing. In Proceedings of the Ninth International Workshop on Parsing Technologies, pages 42–52, Vancouver, British Columbia, October.

K. Hara, M. Shimbo, H. Okuma, and Y. Matsumoto. 2009. Coordinate structure analysis with global structural constraints and alignment-based local features. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 967–975, Suntec, Singapore, August.

D. Hogan. 2007. Coordinate noun phrase disambiguation in a generative parsing model. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, page 680, Prague, Czech Republic, June.

R. Johansson and P. Nugues. 2006. Investigating multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning, pages 206–210, New York City, New York, June.

T. Koo and M. Collins. 2010. Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1–11, Uppsala, Sweden, July.

T. Matsuzaki, Y. Miyao, and J. Tsujii. 2005. Probabilistic CFG with latent annotations. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 75–82, Ann Arbor, Michigan, June.

R. McDonald and F. Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 81–88, Trento, Italy, April.

R. McDonald, K. Crammer, and F. Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 91–98, Ann Arbor, Michigan, June.


J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryiğit, S. Kübler, S. Marinov, and E. Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2):95–135.

J. Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies, pages 149–160, Nancy, France, April.

M. Olteanu and D. Moldovan. 2005. PP-attachment disambiguation using large context. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 273–280, Vancouver, British Columbia, October.

P. Pantel and D. Lin. 2000. An unsupervised approach to prepositional phrase attachment using contextually similar words. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 101–108, Hong Kong, October.

S. Petrov, L. Barrett, R. Thibaux, and D. Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 433–440, Sydney, Australia, July.

A. Ratnaparkhi, J. Reynar, and S. Roukos. 1994. A maximum entropy model for prepositional phrase attachment. In Proceedings of the Workshop on Human Language Technology, pages 250–255, Plainsboro, New Jersey, March.

P. Resnik. 1999. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11:95–130.

B. Sagot. 2010. The Lefff, a freely available, accurate and large-coverage lexicon for French. In Proceedings of the Seventh International Conference on Language Resources and Evaluation, Valetta, Malta, May.

M. Shimbo and K. Hara. 2007. A discriminative learning model for coordinate conjunctions. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 610–619, Prague, Czech Republic, June.

H. Yamada and Y. Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proceedings of the 8th International Workshop on Parsing Technologies, pages 195–206, Nancy, France, April.